This comprehensive guide covers the most frequently asked interview questions for AI and LLM engineering positions at startups and tech companies in 2025. From transformer architecture to RAG systems, prompt engineering to production deployment – these are the questions that separate candidates who get offers from those who don’t.
LLM Fundamentals
1. What is the Transformer architecture and how does it work?
A Transformer is a deep learning architecture that processes sequences using self-attention instead of sequential processing. It has encoder and decoder blocks, self-attention layers, feed-forward networks, and positional encodings. Unlike RNNs, Transformers process all tokens in parallel, making them faster and better at capturing long-range dependencies. GPT uses decoder-only architecture, while BERT uses encoder-only.
2. Explain the self-attention mechanism.
Self-attention allows the model to weigh the importance of different tokens when processing each element. For each token, it creates three vectors: Query (Q), Key (K), and Value (V). Attention scores come from multiplying Q and K, applying softmax, then multiplying by V.
The formula is:
Attention(Q,K,V) = softmax(QK^T / √d_k) × V
This captures relationships between any two tokens regardless of distance.
3. What is multi-head attention and why is it used?
Multi-head attention splits the attention mechanism into multiple parallel “heads,” each learning different representations. Each head independently computes Q, K, V attention on a subspace of the embedding. One head might focus on syntax while another captures semantics. The outputs are concatenated and transformed. Typical models use 8-16 attention heads per layer, enabling richer contextual understanding.
4. What is tokenization and why does it matter for LLMs?
Tokenization breaks text into smaller units (tokens) that can be words, subwords, or characters. LLMs use subword methods like Byte-Pair Encoding (BPE) or WordPiece to handle rare words by splitting them into known subunits. This ensures even unseen words can be processed. Tokenization choice affects model performance and how efficiently the context window is utilized.
5. How do positional encodings work?
Positional encodings add sequence order information to token embeddings because self-attention treats tokens as a set without inherent order. Without them, “The cat chased the dog” would be identical to “The dog chased the cat.” They use sinusoidal functions or learned vectors to assign unique positions. Modern techniques like RoPE (Rotary Position Embedding) help models extrapolate to longer sequences.
6. What is the difference between encoder-only, decoder-only, and encoder-decoder architectures?
Encoder-only (BERT): Processes input bidirectionally, best for understanding tasks like classification and NER.
Decoder-only (GPT, Claude, Llama): Generates text left-to-right, best for text generation.
Encoder-Decoder (T5, BART): Encoder processes input, decoder generates output; best for translation and summarization.
Most modern LLMs use decoder-only because it excels at generation while achieving strong understanding through scale.
Training and Fine-Tuning
7. What is the difference between pre-training and fine-tuning?
Pre-training is the foundational phase where an LLM learns general language patterns from massive datasets using self-supervised objectives like next-token prediction. Fine-tuning takes this pre-trained model and further trains it on a smaller, task-specific dataset. Pre-training takes weeks and requires enormous resources; fine-tuning can be done in hours with much less data and compute.
8. What is Supervised Fine-Tuning (SFT)?
SFT trains a pre-trained LLM on a labeled dataset with input-output pairs for a specific task. The model adjusts weights based on prediction errors. Use it when you have high-quality labeled data for tasks like classification or instruction-following. SFT is typically the first step after pre-training and before RLHF, teaching the model to follow instructions reliably.
9. Explain RLHF and its purpose.
RLHF (Reinforcement Learning from Human Feedback) trains models using human guidance to align outputs with human values. It involves three phases:
- Supervised Fine-Tuning on high-quality examples
- Training a Reward Model where humans rank outputs
- PPO optimization to maximize reward scores
RLHF makes responses safer, more helpful, and ethically aligned – critical for chatbots like ChatGPT.
10. What is LoRA and how does it enable efficient fine-tuning?
LoRA (Low-Rank Adaptation) adds small, trainable low-rank matrices alongside frozen pre-trained weights. It decomposes weight updates into two smaller matrices, reducing trainable parameters to ~0.2-0.3% of the original model.
Benefits:
- 70% less memory
- Faster training
- Reduced overfitting
- Ability to store multiple task-specific adapters separately
11. When would you use QLoRA vs full fine-tuning?
QLoRA combines LoRA with 4-bit quantization – base weights are stored in 4-bit while LoRA adapters train in 16-bit.
Use QLoRA when:
- Fine-tuning large models (70B+) on limited hardware like a single 24GB GPU
- Memory is critically constrained
Use full fine-tuning when:
- Complex domains (math, code) where precise parameter adjustments are critical
- You have sufficient resources
12. What is PEFT and what techniques does it include?
PEFT (Parameter-Efficient Fine-Tuning) adapts LLMs by updating only a small subset of parameters while freezing most weights.
Key techniques:
- Adapter Layers: Small trainable layers between existing layers
- LoRA: Low-rank matrix injection
- Prompt Tuning: Learning soft prompts
- Prefix Tuning: Prepending trainable vectors
PEFT checkpoints are only a few MBs compared to 40GB+ for full models.
Prompt Engineering
13. What is Chain-of-Thought (CoT) prompting?
Chain-of-Thought prompting encourages the LLM to show reasoning step-by-step before giving a final answer. Instead of jumping to conclusions, the model “thinks aloud,” breaking complex problems into smaller logical steps. This significantly improves performance on math, logic, and multi-step reasoning tasks. Trigger it by adding “Let’s think step by step” or providing reasoning examples.
14. Explain zero-shot vs few-shot prompting.
Zero-shot gives the model a task without any examples – just describe what you want.
Few-shot includes 2-5 examples demonstrating the desired pattern before the actual question.
Use zero-shot for simple tasks. Switch to few-shot when zero-shot fails, or when task complexity requires demonstrations for consistent, accurate outputs and format adherence.
15. What are system prompts and how do they differ from user prompts?
A system prompt sets the LLM’s overall behavior, role, and constraints for an entire conversation. It defines persona (“You are a helpful customer support agent”), boundaries (“Only answer questions about our products”), and output format.
A user prompt is the actual query. System prompts are processed first and establish context before any user interaction.
16. What are best practices for prompt optimization?
- Be specific and concise with clear instructions
- Use examples (few-shot) to show what you want
- Prefer instructions over constraints (“Summarize in 3 bullet points” beats “Don’t be verbose”)
- Control temperature – lower (0.1-0.3) for accuracy, higher (0.7-1.0) for creativity
- Structure output with specific formats
- Iterate and A/B test variations
17. What prompting strategies help reduce hallucinations?
- Use RAG to ground responses in retrieved factual data
- Add explicit constraints like “Only use information provided” or “Say ‘I don’t know’ if unsure”
- Request citations
- Lower temperature to reduce randomness
- Use Chain-of-Thought to force reasoning that reveals errors
- Apply structured outputs with JSON/schema formats
- For critical applications, combine prompt engineering with retrieval and human review
RAG Systems
18. What is RAG and why is it useful?
RAG (Retrieval-Augmented Generation) combines a retrieval system with a language model. First, the retriever searches external knowledge sources to find relevant information. Then, the LLM uses this retrieved context to produce accurate answers.
RAG is useful because it:
- Reduces hallucinations
- Keeps responses current
- Allows LLMs to access domain-specific or private data they weren’t trained on
19. What are vector embeddings and how are they used in RAG?
Embeddings are numerical representations (arrays of floats) that capture semantic meaning. Models like BERT or OpenAI create dense vectors where similar concepts are positioned close together. In RAG, both documents and queries are converted to embeddings, then similarity search finds relevant documents by comparing vector distances. This enables semantic search that understands meaning, not just keywords.
20. Explain different chunking strategies for RAG systems.
- Fixed-size: Splits text into uniform token counts – simple but may break context
- Sentence/paragraph-based: Keeps natural boundaries for coherence
- Semantic chunking: Uses embeddings to group by meaning – most accurate but expensive
- Sliding window: Creates overlapping chunks to preserve context
- Hierarchical: Creates multi-level chunks (summaries + details)
Semantic chunking generally performs best for accuracy.
21. What is a vector database and why is it used in RAG?
Vector databases store and search high-dimensional embeddings efficiently using indexing like HNSW (Hierarchical Navigable Small World) for fast approximate nearest neighbor search. When a query comes in, it’s converted to an embedding and matched against document vectors.
Popular options: Pinecone, Weaviate, Milvus, FAISS, Chroma
They’re essential for semantic search at scale.
22. What similarity metrics are used in vector search?
- Cosine similarity: Measures angle between vectors (0 to 1). Most common for text because it ignores magnitude and focuses on meaning
- Euclidean distance (L2): Measures straight-line distance; good when magnitude matters
- Dot product: Similar to cosine but doesn’t normalize
For text/semantic search, cosine similarity is typically preferred.
23. What’s the difference between sparse and dense retrieval?
Sparse retrieval (BM25, TF-IDF): Uses keyword matching with high-dimensional sparse vectors – fast and good for exact matching but misses semantic meaning.
Dense retrieval (BERT, DPR): Uses neural embeddings capturing semantic meaning but requires more compute.
Hybrid search combines both: sparse retrieval for initial speed, then dense methods re-rank for accuracy.
Model Evaluation
24. What metrics do you use to evaluate LLM outputs?
- Perplexity: Measures prediction confidence
- BLEU: Measures n-gram precision for translation
- ROUGE: Recall-based for summarization
- BERTScore: Measures semantic similarity using embeddings
- Accuracy/F1: For classification tasks
- Human evaluation: Assesses coherence, relevance, and helpfulness
No single metric captures everything – combine multiple metrics based on task requirements.
25. How do you detect hallucinations in LLM outputs?
- SelfCheckGPT: Samples multiple responses and checks consistency – hallucinations show contradictions
- RAG-based fact-checking: Verifies claims against retrieved documents
- NLI models: Detect entailment/contradiction with ground truth
- Semantic entropy: Identifies high uncertainty indicating potential hallucination
- LLM-as-Judge: Uses another LLM to evaluate faithfulness
26. How do you evaluate bias in language models?
- Counterfactual testing: Change demographic attributes while keeping context same, check if outputs differ
- Use benchmarks: CrowS-Pairs and StereoSet measure stereotypical associations
- Apply sentiment analysis across groups to detect differential treatment
- Use diverse human evaluation panels
- Test for both explicit bias and subtle stereotyping
27. What are common LLM benchmarks and what do they measure?
- MMLU/MMLU-Pro: Tests multitask language understanding across 57 subjects
- HumanEval: Measures code generation accuracy
- GPQA: Tests expert-level QA requiring domain knowledge
- TruthfulQA: Measures factual accuracy and misconception avoidance
- HellaSwag: Tests commonsense reasoning
- SQuAD: Evaluates reading comprehension
Use multiple benchmarks for comprehensive assessment.
Deployment and Inference
28. What is model quantization and why use it?
Quantization reduces weight precision from higher (FP32) to lower (INT8, INT4, FP16). It decreases model size by 2-4x, speeds up inference, reduces memory usage, and lowers energy consumption. This enables deployment on edge devices and reduces GPU costs.
Trade-off: Some accuracy loss, especially below INT8.
29. What is the difference between INT8 and FP16 quantization?
FP16 (half-precision float): Maintains floating-point representation but halves memory from FP32 – good balance of speed and accuracy.
INT8 (8-bit integer): Provides greater compression and faster integer arithmetic but may lose more accuracy and requires calibration.
FP16 is safer for most models; INT8 needs careful tuning.
30. What is KV cache and why does it matter for LLM inference?
KV cache stores computed Key and Value vectors from attention layers during autoregressive generation, avoiding recomputation for previous tokens. This changes attention complexity from O(n²) to O(n) per step. Without it, each new token would require reprocessing the entire sequence.
Trade-off: KV cache grows linearly with sequence length, consuming significant GPU memory.
31. How do you reduce LLM inference latency?
Key techniques:
- Quantization (INT8/FP16) to reduce compute
- KV caching to avoid recomputation
- Continuous batching to maximize GPU utilization
- Flash Attention for memory-efficient attention
- Speculative decoding using a smaller draft model
- Model parallelism across GPUs
- Optimized serving frameworks like vLLM or TensorRT-LLM
32. What are vLLM and TensorRT-LLM, and when would you use each?
vLLM: Uses PagedAttention for efficient KV cache management – achieves 2-4x throughput improvement, easy Hugging Face integration, cloud-agnostic.
TensorRT-LLM: NVIDIA’s optimized library using kernel fusion – maximum performance on NVIDIA GPUs but requires compilation and is hardware-specific.
Choose vLLM for flexibility; TensorRT-LLM for peak NVIDIA performance.
33. How do you scale LLM serving for production?
- Use horizontal scaling with load balancers across multiple instances
- Implement auto-scaling (Kubernetes) for traffic spikes
- Apply model parallelism for large models
- Enable continuous batching for throughput
- Use KV cache sharing for common prefixes
- Add caching layers for repeated queries
- Implement rate limiting and circuit breakers for stability
AI Safety and Alignment
34. What are AI guardrails and how do you implement them?
Guardrails are safety mechanisms monitoring LLM inputs and outputs in real-time.
Implementation includes:
- Rule-based filters using keyword lists and regex
- LLM-as-judge systems evaluating policy violations
- Content moderation layers checking for toxicity, bias, and PII leakage
- Input validation preventing prompt injection
Tools like Guardrails AI and NeMo Guardrails provide production frameworks.
35. How do you handle prompt injection attacks?
Prompt injection exploits occur when malicious input overrides developer instructions.
Defenses:
- Separate user input from system prompts and validate inputs
- Use output filtering and moderation layers
- Implement “least privilege” – limit what the LLM can access
- Use adversarial testing/red teaming before deployment
- Apply layered defenses since no single solution is perfect
36. What is LLM red teaming and why is it important?
Red teaming systematically tests LLMs with adversarial prompts to find vulnerabilities before deployment. It tests for bias, toxicity, prompt injection susceptibility, jailbreaking, PII leakage, and hallucinations. Generate adversarial inputs, run through the LLM, evaluate responses, and document weaknesses.
Tools like DeepTeam and Promptfoo automate this. Essential because LLMs have wide attack surfaces.
37. What are best practices for responsible AI?
- Transparency: Document model limitations and training data
- Bias mitigation: Use diverse data, test for fairness across groups
- Human-in-the-loop: Include human review for critical decisions
- Data privacy: Implement PII detection and governance
- Continuous monitoring: Track outputs for drift and harmful content
- Clear accountability: Assign ownership for AI decisions
Practical Implementation
38. What is LangChain and what are its core components?
LangChain is a modular framework for LLM applications.
Core components:
- Models: Interfaces to LLMs like OpenAI, Anthropic
- Prompts: Templates with variables
- Chains: Sequences of LLM calls
- Agents: Systems using LLMs to decide which tools to invoke
- Memory: Persists state across interactions
- Indexes: Work with vector stores for RAG
Simplifies building chatbots and Q&A systems.
39. What is function calling in LLMs and how does it work?
Function calling lets LLMs respond with structured JSON specifying function names and arguments instead of plain text. You define tools in JSON Schema, the model decides which function to call based on the prompt, then returns structured arguments you parse and execute.
Important: The model does NOT execute functions – it generates parameters. This enables reliable API and database interactions.
40. What is the difference between Pinecone, Weaviate, and Chroma?
Pinecone: Fully managed cloud, best for production at scale with sub-50ms latency at billions of vectors, higher cost but zero infrastructure.
Weaviate: Open-source, supports hybrid search (vector + BM25), GraphQL API, can self-host, best for complex data relationships.
Chroma: Lightweight, Python-native, best for prototyping – runs embedded with your app but limited for production scale.
41. How do you implement a RAG system step by step?
- Ingest: Load documents and split into chunks
- Embed: Convert chunks to vectors using models like OpenAI embeddings
- Store: Index embeddings in a vector database
- Retrieve: At query time, find semantically similar chunks using cosine similarity
- Generate: Pass retrieved context plus user query to LLM to produce grounded responses
Deep Learning Fundamentals
42. Explain backpropagation.
Backpropagation trains neural networks by computing gradients of the loss function with respect to each weight.
Forward pass: Input flows through layers to produce output; loss is calculated.
Backward pass: Error propagates backward using the chain rule to compute gradients layer-by-layer. These gradients indicate how to adjust weights to reduce error. Weights are then updated using an optimizer like gradient descent.
43. How does gradient descent work?
Gradient descent minimizes the loss function by iteratively adjusting parameters in the direction that reduces loss. It calculates the gradient of loss with respect to each weight, then updates:
w_new = w_old - learning_rate × gradient
Three variants:
- Batch GD (entire dataset)
- Stochastic GD (one sample)
- Mini-batch GD (small batches – most common)
Learning rate controls step size.
44. What are common loss functions and when to use them?
Classification:
- Cross-entropy loss measures difference between predicted probabilities and actual labels
- Binary Cross-Entropy for 2 classes, Categorical for multi-class
Regression:
- MSE penalizes larger errors more
- MAE is robust to outliers
Generative models:
- KL Divergence measures distribution differences
Choose based on task: cross-entropy for classification, MSE for regression.
45. Explain vanishing and exploding gradients. How do you address them?
Vanishing gradients: Gradients become tiny during backpropagation, causing early layers to stop learning – common with sigmoid/tanh.
Exploding gradients: Gradients become excessively large, causing unstable updates.
Solutions:
- Use ReLU activation
- Proper weight initialization (Xavier, He)
- Batch normalization
- Gradient clipping
- Residual connections
- Adaptive optimizers like Adam
46. Why do we need activation functions?
Activation functions introduce non-linearity, enabling neural networks to learn complex patterns beyond linear relationships. Without them, stacking layers would just produce linear transformations.
- ReLU (max(0,x)): Most popular, computationally efficient
- Sigmoid (0 to 1): For binary output
- Softmax: Probability distribution for multi-class
- Leaky ReLU: Addresses “dying ReLU” problem
47. What is overfitting and how do you prevent it?
Overfitting occurs when a model memorizes training data patterns including noise, performing well on training but poorly on unseen data.
Prevention:
- Dropout: Randomly disables neurons (20-50%)
- Regularization (L1/L2): Penalizes large weights
- Early stopping: Halts when validation loss increases
- Data augmentation: Increases variety
- Reduce model complexity
- Add more training data
Current Trends 2025
48. How do AI agents work and what are their key components?
AI agents are autonomous systems that perceive environments, make decisions, and take actions to achieve goals.
Components:
- LLM Core: Powers reasoning
- Memory Systems: Short-term (conversation) and long-term (vector store)
- Tool Integration: APIs, databases, functions
- Planning Module: Breaks goals into subtasks
- Orchestration: Coordinates execution
Unlike chatbots, agents maintain state, plan multi-step workflows, and adapt based on feedback.
49. What are reasoning models and how do they differ from standard LLMs?
Reasoning models (OpenAI o1/o3, DeepSeek-R1) generate “thinking tokens” before outputs, showing step-by-step reasoning.
Key differences:
- Use Chain-of-Thought internally
- Better at complex multi-step problems and math
- Require different prompting patterns (less explicit instructions needed)
- Higher latency and token usage
- Provide more interpretable decisions
Used for complex problem-solving rather than simple generation.
50. How do multimodal models process different types of inputs?
Multimodal models process text, images, audio by:
- Input Encoding: Each modality uses specialized encoders (text tokenizer, vision encoder like CLIP/ViT)
- Embedding Alignment: Converting modalities into a shared embedding space
- Cross-Attention: Attending across modality representations
- Unified Processing: Transformer processes combined representations
Models like GPT-4V, Claude 3, and Gemini reason across text and images together.
Quick Reference by Difficulty Level
Entry-Level Topics: Transformers, tokenization, basic training, prompting basics (Questions 1-6, 7-8, 13-17, 42-47)
Mid-Level Topics: Fine-tuning methods, RAG implementation, deployment (Questions 9-12, 18-23, 28-33, 38-41)
Advanced Topics: Evaluation, safety, optimization, current trends (Questions 24-27, 34-37, 48-50)
Key Takeaways
- Master the fundamentals: Transformer architecture, attention mechanisms, and tokenization are foundational to every interview.
- Know your fine-tuning options: Understand when to use SFT, RLHF, LoRA, QLoRA, and full fine-tuning based on resources and requirements.
- RAG is everywhere: Most production LLM applications use RAG. Know chunking strategies, vector databases, and hybrid search.
- Deployment matters: Quantization, KV cache, and inference optimization separate candidates who can ship from those who can’t.
- Safety is non-negotiable: Guardrails, prompt injection defense, and responsible AI practices are expected knowledge.
- Stay current: AI agents, reasoning models, and multimodal architectures are the 2025 frontier.
Good luck with your interview.