50 AI & LLM Engineer Interview Questions 2025 Amir Teymoori

This comprehensive guide covers the most frequently asked interview questions for AI and LLM engineering positions at startups and tech companies in 2025. From transformer architecture to RAG systems, prompt engineering to production deployment – these are the questions that separate candidates who get offers from those who don’t.

LLM Fundamentals

1. What is the Transformer architecture and how does it work?

A Transformer is a deep learning architecture that processes sequences using self-attention instead of sequential processing. The original design stacks encoder and decoder blocks, each built from self-attention layers, feed-forward networks, residual connections, and layer normalization, with positional encodings added to the input embeddings. Unlike RNNs, Transformers process all tokens in parallel, making them faster to train and better at capturing long-range dependencies. GPT uses a decoder-only architecture, while BERT uses encoder-only.

2. Explain the self-attention mechanism.

Self-attention allows the model to weigh the importance of different tokens when processing each element. For each token, it creates three vectors: Query (Q), Key (K), and Value (V). Attention scores come from the dot product of Q and K, scaled by √d_k and passed through softmax; the resulting weights are then used to take a weighted sum of V.

The formula is:

Attention(Q,K,V) = softmax(QK^T / √d_k) × V

This captures relationships between any two tokens regardless of distance.
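
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function names and the tiny 2-dimensional Q, K, V matrices are made up for the example, and real models compute this with batched tensor operations.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy inputs (illustrative values only)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each output row is a blend of the rows of V, pulled toward the value whose key best matches the query.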

3. What is multi-head attention and why is it used?

Multi-head attention splits the attention mechanism into multiple parallel “heads,” each learning different representations. Each head independently computes Q, K, V attention on a subspace of the embedding. One head might focus on syntax while another captures semantics. The outputs are concatenated and passed through a final linear projection. Head counts vary widely: the original Transformer used 8 heads per layer, while large LLMs commonly use 32 or more.

4. What is tokenization and why does it matter for LLMs?

Tokenization breaks text into smaller units (tokens) that can be words, subwords, or characters. LLMs use subword methods like Byte-Pair Encoding (BPE) or WordPiece to handle rare words by splitting them into known subunits. This ensures even unseen words can be processed. Tokenization choice affects model performance and how efficiently the context window is utilized.
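
To make the BPE idea concrete, here is a toy sketch of a single merge step: count adjacent symbol pairs across a small word-frequency table and merge the most frequent pair into one symbol. The corpus and helper names are invented for illustration; production tokenizers repeat this step thousands of times and handle bytes, whitespace, and special tokens.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to how often that word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a sequence of characters.
corpus = {tuple("lower"): 2, tuple("lowest"): 3, tuple("newer"): 1}
pair = most_frequent_pair(corpus)        # ("w", "e") is the most frequent pair here
corpus_after = merge_pair(corpus, pair)  # "w" + "e" becomes the single symbol "we"
```

After enough merges, frequent words become single tokens while rare words remain split into known subunits.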

5. How do positional encodings work?

Positional encodings add sequence order information to token embeddings because self-attention treats tokens as a set without inherent order. Without them, “The cat chased the dog” would be identical to “The dog chased the cat.” They use sinusoidal functions or learned vectors to assign unique positions. Modern techniques like RoPE (Rotary Position Embedding) help models extrapolate to longer sequences.
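
As a sketch of the sinusoidal variant, the function below follows the scheme from the original Transformer paper: even dimensions use sine, odd dimensions use cosine, and the wavelength grows with the dimension index. The function name and the small d_model are illustrative choices.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Return the positional encoding vector for one position."""
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index, so the exponent is i / d_model.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))  # even dimension
        enc.append(math.cos(angle))  # odd dimension
    return enc[:d_model]

pe0 = sinusoidal_encoding(0, 4)  # [0.0, 1.0, 0.0, 1.0]
pe1 = sinusoidal_encoding(1, 4)
pe2 = sinusoidal_encoding(2, 4)
```

Each position gets a distinct vector, which is added to the token embedding before the first attention layer.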

6. What is the difference between encoder-only, decoder-only, and encoder-decoder architectures?

Encoder-only (BERT): Processes input bidirectionally, best for understanding tasks like classification and NER.

Decoder-only (GPT, Claude, Llama): Generates text left-to-right, best for text generation.

Encoder-Decoder (T5, BART): Encoder processes input, decoder generates output; best for translation and summarization.

Most modern LLMs use decoder-only because it excels at generation while achieving strong understanding through scale.

Training and Fine-Tuning

7. What is the difference between pre-training and fine-tuning?

Pre-training is the foundational phase where an LLM learns general language patterns from massive datasets using self-supervised objectives like next-token prediction. Fine-tuning takes this pre-trained model and further trains it on a smaller, task-specific dataset. Pre-training takes weeks and requires enormous resources; fine-tuning can be done in hours with much less data and compute.

8. What is Supervised Fine-Tuning (SFT)?

SFT trains a pre-trained LLM on a labeled dataset with input-output pairs for a specific task. The model adjusts weights based on prediction errors. Use it when you have high-quality labeled data for tasks like classification or instruction-following. SFT is typically the first step after pre-training and before RLHF, teaching the model to follow instructions reliably.

9. Explain RLHF and its purpose.

RLHF (Reinforcement Learning from Human Feedback) trains models using human guidance to align outputs with human values. It involves three phases:

Supervised Fine-Tuning on high-quality examples
Training a Reward Model where humans rank outputs
PPO optimization to maximize reward scores

RLHF makes responses safer, more helpful, and ethically aligned – critical for chatbots like ChatGPT.

10. What is LoRA and how does it enable efficient fine-tuning?

LoRA (Low-Rank Adaptation) adds small, trainable low-rank matrices alongside frozen pre-trained weights. It decomposes weight updates into two smaller matrices, reducing trainable parameters to well under 1% of the original model in typical setups (for example ~0.2-0.3%, depending on the rank and which layers are adapted).

Benefits:

Substantially lower memory use, since gradients and optimizer state are kept only for the adapters
Faster training
Reduced overfitting
Ability to store multiple task-specific adapters separately
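
The parameter savings follow from simple arithmetic. A full update to a d_out × d_in weight matrix has d_out × d_in parameters, while a rank-r LoRA update uses two matrices with r × (d_out + d_in) parameters in total. The sketch below uses an assumed 4096 × 4096 projection and rank 8 purely as an example.

```python
def lora_param_counts(d_out, d_in, rank):
    full = d_out * d_in            # parameters in the frozen weight matrix
    lora = rank * (d_out + d_in)   # parameters in the two low-rank factors
    return full, lora

# Assumed sizes for illustration only
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
fraction = lora / full  # 65,536 / 16,777,216 = 0.00390625, about 0.39%
```

This is the ratio for one adapted matrix. The whole-model fraction is usually lower still, because adapters are typically attached to only some of the weight matrices.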

11. When would you use QLoRA vs full fine-tuning?

QLoRA combines LoRA with 4-bit quantization – base weights are stored in 4-bit while LoRA adapters train in 16-bit.

Use QLoRA when:

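Fine-tuning large models on limited hardware (the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU; a 70B model's 4-bit weights alone take about 35GB, so it will not fit on a 24GB card)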
Memory is critically constrained

Use full fine-tuning when:

Complex domains (math, code) where precise parameter adjustments are critical
You have sufficient resources

12. What is PEFT and what techniques does it include?

PEFT (Parameter-Efficient Fine-Tuning) adapts LLMs by updating only a small subset of parameters while freezing most weights.

Key techniques:

Adapter Layers: Small trainable layers between existing layers
LoRA: Low-rank matrix injection
Prompt Tuning: Learning soft prompts
Prefix Tuning: Prepending trainable vectors

PEFT checkpoints are often only a few MB to a few hundred MB, compared to tens of GB for full model checkpoints.

Prompt Engineering

13. What is Chain-of-Thought (CoT) prompting?

Chain-of-Thought prompting encourages the LLM to show reasoning step-by-step before giving a final answer. Instead of jumping to conclusions, the model “thinks aloud,” breaking complex problems into smaller logical steps. This significantly improves performance on math, logic, and multi-step reasoning tasks. Trigger it by adding “Let’s think step by step” or providing reasoning examples.

14. Explain zero-shot vs few-shot prompting.

Zero-shot gives the model a task without any examples – just describe what you want.

Few-shot includes 2-5 examples demonstrating the desired pattern before the actual question.

Use zero-shot for simple tasks. Switch to few-shot when zero-shot fails, or when task complexity requires demonstrations for consistent, accurate outputs and format adherence.
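
The difference is easy to see when the prompt is assembled in code. The helper below is a hypothetical sketch: the "Input:/Output:" layout is just one common convention, not a required format.

```python
def build_prompt(task, question, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    lines = [task, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    lines += [f"Input: {question}", "Output:"]
    return "\n".join(lines)

task = "Classify the sentiment of the review as positive or negative."

zero_shot = build_prompt(task, "The battery died after a week.")

few_shot = build_prompt(
    task,
    "The battery died after a week.",
    examples=[
        ("Great screen and fast shipping.", "positive"),
        ("Stopped working on day two.", "negative"),
    ],
)
```

The few-shot version shows the model both the label set and the exact output format before it sees the real input.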

15. What are system prompts and how do they differ from user prompts?

A system prompt sets the LLM’s overall behavior, role, and constraints for an entire conversation. It defines persona (“You are a helpful customer support agent”), boundaries (“Only answer questions about our products”), and output format.

A user prompt is the actual query. System prompts are processed first and establish context before any user interaction.

16. What are best practices for prompt optimization?

Be specific and concise with clear instructions
Use examples (few-shot) to show what you want
Prefer instructions over constraints (“Summarize in 3 bullet points” beats “Don’t be verbose”)
Control temperature – lower (0.1-0.3) for accuracy, higher (0.7-1.0) for creativity
Structure output with specific formats
Iterate and A/B test variations
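
The effect of temperature can be illustrated with a small sketch: logits are divided by the temperature before softmax, so low values concentrate probability on the top token and high values flatten the distribution. The logits below are made-up numbers.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # illustrative next-token scores
sharp = softmax_with_temperature(logits, 0.2)  # top token dominates
flat = softmax_with_temperature(logits, 1.0)   # probability is more spread out
```

With these numbers the top token gets over 99% of the probability at temperature 0.2, but under 70% at temperature 1.0.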

17. What prompting strategies help reduce hallucinations?

Use RAG to ground responses in retrieved factual data
Add explicit constraints like “Only use information provided” or “Say ‘I don’t know’ if unsure”
Request citations
Lower temperature to reduce randomness
Use Chain-of-Thought to force reasoning that reveals errors
Apply structured outputs with JSON/schema formats
For critical applications, combine prompt engineering with retrieval and human review

RAG Systems

18. What is RAG and why is it useful?

RAG (Retrieval-Augmented Generation) combines a retrieval system with a language model. First, the retriever searches external knowledge sources to find relevant information. Then, the LLM uses this retrieved context to produce accurate answers.

RAG is useful because it:

Reduces hallucinations
Keeps responses current
Allows LLMs to access domain-specific or private data they weren’t trained on

19. What are vector embeddings and how are they used in RAG?

Embeddings are numerical representations (arrays of floats) that capture semantic meaning. Embedding models, such as BERT-based encoders or OpenAI's embedding models, create dense vectors where similar concepts are positioned close together. In RAG, both documents and queries are converted to embeddings, then similarity search finds relevant documents by comparing vector distances. This enables semantic search that understands meaning, not just keywords.

20. Explain different chunking strategies for RAG systems.

Fixed-size: Splits text into uniform token counts – simple but may break context
Sentence/paragraph-based: Keeps natural boundaries for coherence
Semantic chunking: Uses embeddings to group by meaning – most accurate but expensive
Sliding window: Creates overlapping chunks to preserve context
Hierarchical: Creates multi-level chunks (summaries + details)

Semantic chunking often gives the best retrieval accuracy, but the right choice depends on the documents and should be validated on your own queries.
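
As a minimal sketch of the fixed-size and sliding-window strategies, the function below splits a token list into chunks of a given size with a configurable overlap (an overlap of 0 gives plain fixed-size chunking). Whitespace splitting stands in for a real tokenizer here.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Split tokens into chunks of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = sliding_window_chunks(tokens, chunk_size=4, overlap=2)
```

Here each chunk repeats the last two tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.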

21. What is a vector database and why is it used in RAG?

Vector databases store and search high-dimensional embeddings efficiently using indexing like HNSW (Hierarchical Navigable Small World) for fast approximate nearest neighbor search. When a query comes in, it’s converted to an embedding and matched against document vectors.

Popular options: Pinecone, Weaviate, Milvus, FAISS, Chroma

They’re essential for semantic search at scale.

22. What similarity metrics are used in vector search?

Cosine similarity: Measures the angle between vectors (range -1 to 1; 1 means same direction). Most common for text because it ignores magnitude and focuses on meaning
Euclidean distance (L2): Measures straight-line distance; good when magnitude matters
Dot product: Equals cosine similarity when vectors are normalized to unit length; otherwise it also reflects magnitude

For text/semantic search, cosine similarity is typically preferred.
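
A short sketch shows how the three metrics treat magnitude differently. The vectors are toy values chosen so that b points in the same direction as a but is twice as long.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)    # ~1.0: direction is identical
dist_ab = euclidean_distance(a, b)  # sqrt(5): the vectors are still far apart
dot_ab = dot(a, b)                  # 10.0: grows with magnitude
```

Cosine similarity treats a and b as equivalent, while Euclidean distance and the raw dot product both react to the difference in length.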

23. What’s the difference between sparse and dense retrieval?

Sparse retrieval (BM25, TF-IDF): Uses keyword matching with high-dimensional sparse vectors – fast and good for exact matching but misses semantic meaning.

Dense retrieval (BERT, DPR): Uses neural embeddings capturing semantic meaning but requires more compute.

Hybrid search combines both. Common setups either run sparse and dense retrieval in parallel and fuse the two result lists, or use a fast sparse pass to select candidates that a dense model then re-ranks.
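
One common way to merge a sparse and a dense result list is reciprocal rank fusion (RRF), which the text above does not name; it is used here as an assumed example of a fusion method. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The document IDs are made up, and k = 60 is a conventional default rather than a required value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_results = ["doc_a", "doc_b", "doc_c"]  # e.g. from BM25
dense_results = ["doc_b", "doc_d", "doc_a"]   # e.g. from an embedding search
fused = reciprocal_rank_fusion([sparse_results, dense_results])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

Documents that rank well in both lists rise to the top, and no score normalization between the two retrievers is needed.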

Model Evaluation

24. What metrics do you use to evaluate LLM outputs?

Perplexity: Exponentiated average negative log-likelihood of the reference text; lower means the model predicts the text better
BLEU: Measures n-gram precision for translation
ROUGE: Recall-based for summarization
BERTScore: Measures semantic similarity using embeddings
Accuracy/F1: For classification tasks
Human evaluation: Assesses coherence, relevance, and helpfulness

No single metric captures everything – combine multiple metrics based on task requirements.
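
Of these metrics, perplexity is the easiest to compute by hand: it is the exponential of the average negative log-probability the model assigned to each actual token. The probabilities below are invented for illustration.

```python
import math

def perplexity(token_probs):
    """token_probs: probability the model gave to each token that actually occurred."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])  # ~4.0: like choosing among 4 options
confident = perplexity([0.9, 0.8, 0.95, 0.85])        # close to 1: text was well predicted
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were picking uniformly among four tokens.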

25. How do you detect hallucinations in LLM outputs?

SelfCheckGPT: Samples multiple responses and checks consistency – hallucinations show contradictions
RAG-based fact-checking: Verifies claims against retrieved documents
NLI models: Detect entailment/contradiction with ground truth
Semantic entropy: Identifies high uncertainty indicating potential hallucination
LLM-as-Judge: Uses another LLM to evaluate faithfulness

26. How do you evaluate bias in language models?

Counterfactual testing: Change demographic attributes while keeping context same, check if outputs differ
Use benchmarks: CrowS-Pairs and StereoSet measure stereotypical associations
Apply sentiment analysis across groups to detect differential treatment
Use diverse human evaluation panels
Test for both explicit bias and subtle stereotyping

27. What are common LLM benchmarks and what do they measure?

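MMLU/MMLU-Pro: Tests multitask language understanding; MMLU spans 57 subjects, and MMLU-Pro is a harder follow-up with more reasoning-focused questions and ten answer options instead of four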
HumanEval: Measures code generation accuracy
GPQA: Tests expert-level QA requiring domain knowledge
TruthfulQA: Measures factual accuracy and misconception avoidance
HellaSwag: Tests commonsense reasoning
SQuAD: Evaluates reading comprehension

Use multiple benchmarks for comprehensive assessment.

Deployment and Inference

28. What is model quantization and why use it?

Quantization reduces weight precision from higher (FP32) to lower (FP16, INT8, INT4). Relative to FP32, model size shrinks by roughly 2x (FP16), 4x (INT8), or 8x (INT4), which cuts memory usage and typically speeds up inference and lowers energy consumption. This enables deployment on edge devices and reduces GPU costs.

Trade-off: Some accuracy loss, especially below INT8.
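
A minimal sketch of symmetric INT8 quantization shows where the accuracy loss comes from. Each float is divided by a scale and rounded to an integer in [-127, 127]; dequantizing multiplies back, leaving a rounding error of at most half the scale. This per-tensor scheme with made-up weights is a simplification; real libraries use per-channel or per-group scales and calibration data.

```python
def quantize_int8(values):
    # One scale for the whole tensor, chosen so the largest value maps to 127.
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.013, 1.0]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each stored value now needs 1 byte instead of 4, at the cost of the small rounding error captured in max_error.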

29. What is the difference between INT8 and FP16 quantization?

FP16 (half-precision float): Maintains floating-point representation but halves memory from FP32 – good balance of speed and accuracy.

INT8 (8-bit integer): Provides greater compression and faster integer arithmetic but may lose more accuracy and requires calibration.

FP16 is safer for most models; INT8 needs careful tuning.

30. What is KV cache and why does it matter for LLM inference?

KV cache stores computed Key and Value vectors from attention layers during autoregressive generation, avoiding recomputation for previous tokens. This cuts the attention work per generated token from O(n²) (recomputing attention for the whole sequence) to O(n) (one new query against n cached keys). Without it, each new token would require reprocessing the entire sequence.

Trade-off: KV cache grows linearly with sequence length, consuming significant GPU memory.
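
The memory cost can be estimated with a back-of-the-envelope formula: two tensors (K and V) per layer, each holding one vector per head per token. The configuration below (32 layers, 32 KV heads, head dimension 128, FP16 values) is an assumed example, not the spec of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor of 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
size_gib = size / 1024**3  # 2.0 GiB for a single 4096-token sequence
```

Because the size scales with both sequence length and batch size, the cache often limits how many concurrent requests a GPU can serve.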

31. How do you reduce LLM inference latency?

Key techniques:

Quantization (INT8/FP16) to reduce compute
KV caching to avoid recomputation
Continuous batching to maximize GPU utilization
Flash Attention for memory-efficient attention
Speculative decoding using a smaller draft model
Model parallelism across GPUs
Optimized serving frameworks like vLLM or TensorRT-LLM

32. What are vLLM and TensorRT-LLM, and when would you use each?

vLLM: Uses PagedAttention for efficient KV cache management – its authors report 2-4x higher throughput than earlier serving systems, with easy Hugging Face integration and no tie to a specific cloud.

TensorRT-LLM: NVIDIA’s optimized library using kernel fusion – maximum performance on NVIDIA GPUs but requires compilation and is hardware-specific.

Choose vLLM for flexibility; TensorRT-LLM for peak NVIDIA performance.

33. How do you scale LLM serving for production?

Use horizontal scaling with load balancers across multiple instances
Implement auto-scaling (Kubernetes) for traffic spikes
Apply model parallelism for large models
Enable continuous batching for throughput
Use KV cache sharing for common prefixes
Add caching layers for repeated queries
Implement rate limiting and circuit breakers for stability

AI Safety and Alignment

34. What are AI guardrails and how do you implement them?

Guardrails are safety mechanisms monitoring LLM inputs and outputs in real-time.

Implementation includes:

Rule-based filters using keyword lists and regex
LLM-as-judge systems evaluating policy violations
Content moderation layers checking for toxicity, bias, and PII leakage
Input validation preventing prompt injection

Tools like Guardrails AI and NeMo Guardrails provide production frameworks.

35. How do you handle prompt injection attacks?

Prompt injection exploits occur when malicious input overrides developer instructions.

Defenses:

Separate user input from system prompts and validate inputs
Use output filtering and moderation layers
Implement “least privilege” – limit what the LLM can access
Use adversarial testing/red teaming before deployment
Apply layered defenses since no single solution is perfect

36. What is LLM red teaming and why is it important?

Red teaming systematically tests LLMs with adversarial prompts to find vulnerabilities before deployment. It tests for bias, toxicity, prompt injection susceptibility, jailbreaking, PII leakage, and hallucinations. Generate adversarial inputs, run through the LLM, evaluate responses, and document weaknesses.

Tools like DeepTeam and Promptfoo automate this. Essential because LLMs have wide attack surfaces.

37. What are best practices for responsible AI?

Transparency: Document model limitations and training data
Bias mitigation: Use diverse data, test for fairness across groups
Human-in-the-loop: Include human review for critical decisions
Data privacy: Implement PII detection and governance
Continuous monitoring: Track outputs for drift and harmful content
Clear accountability: Assign ownership for AI decisions

Practical Implementation

38. What is LangChain and what are its core components?

LangChain is a modular framework for LLM applications.

Core components:

Models: Interfaces to LLMs like OpenAI, Anthropic
Prompts: Templates with variables
Chains: Sequences of LLM calls
Agents: Systems using LLMs to decide which tools to invoke
Memory: Persists state across interactions
Indexes: Work with vector stores for RAG

Simplifies building chatbots and Q&A systems.

39. What is function calling in LLMs and how does it work?

Function calling lets LLMs respond with structured JSON specifying function names and arguments instead of plain text. You define tools in JSON Schema, the model decides which function to call based on the prompt, then returns structured arguments you parse and execute.

Important: The model does NOT execute functions – it generates parameters. This enables reliable API and database interactions.
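
The application side of this loop can be sketched as follows. The JSON shape with "name" and "arguments" keys, the get_weather tool, and the hard-coded model output are all hypothetical; each provider defines its own response format, so treat this as an outline of the parse-and-dispatch step only.

```python
import json

def get_weather(city):
    # Hypothetical tool; a real one would call a weather API.
    return {"city": city, "forecast": "sunny"}

# Registry of functions the application is willing to execute.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output):
    """Parse the model's structured output and run the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Stand-in for what a model might return when asked about the weather in Paris.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
result = dispatch(model_output)
```

Checking the function name against an explicit registry keeps the model from triggering anything the application did not intend to expose.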

40. What is the difference between Pinecone, Weaviate, and Chroma?

Pinecone: Fully managed cloud service, best for production at scale; marketed for low-latency search over very large indexes, with higher cost but zero infrastructure to manage.

Weaviate: Open-source, supports hybrid search (vector + BM25), GraphQL API, can self-host, best for complex data relationships.

Chroma: Lightweight, Python-native, best for prototyping – runs embedded with your app but limited for production scale.

41. How do you implement a RAG system step by step?

Ingest: Load documents and split into chunks
Embed: Convert chunks to vectors using models like OpenAI embeddings
Store: Index embeddings in a vector database
Retrieve: At query time, find semantically similar chunks using cosine similarity
Generate: Pass retrieved context plus user query to LLM to produce grounded responses
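
The five steps can be sketched end to end with the standard library. Everything model-related is a stand-in: a bag-of-words counter replaces the embedding model, a Python list replaces the vector database, and the final LLM call is left out so that only the prompt it would receive is built.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, top_k=1):
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query, context_chunks):
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Ingest: documents already split into chunks (toy data)
chunks = [
    "The KV cache stores keys and values from earlier tokens.",
    "LoRA trains small low-rank matrices next to frozen weights.",
    "BM25 is a sparse keyword-based retrieval method.",
]
# Embed + Store: keep (chunk, vector) pairs in a list acting as the index
index = [(chunk, embed(chunk)) for chunk in chunks]
# Retrieve: find the most similar chunk for the query
query = "What does LoRA train?"
top = retrieve(query, index)
# Generate: this prompt would be sent to the LLM
prompt = build_prompt(query, top)
```

Swapping in a real embedding model and vector store changes the embed function and the index, but the overall flow stays the same.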

Deep Learning Fundamentals

42. Explain backpropagation.

Backpropagation trains neural networks by computing gradients of the loss function with respect to each weight.

Forward pass: Input flows through layers to produce output; loss is calculated.

Backward pass: Error propagates backward using the chain rule to compute gradients layer-by-layer. These gradients indicate how to adjust weights to reduce error. Weights are then updated using an optimizer like gradient descent.
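
A two-weight network is enough to show the chain rule at work. The sketch below uses an assumed toy model, y = w2 · ReLU(w1 · x) with squared-error loss, and checks the hand-derived gradients against a numerical estimate.

```python
def forward(w1, w2, x, target):
    z = w1 * x
    h = max(0.0, z)            # ReLU
    y = w2 * h
    loss = (y - target) ** 2
    return z, h, y, loss

def backward(w1, w2, x, target):
    z, h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)          # derivative of the squared error
    dloss_dw2 = dloss_dy * h             # y = w2 * h
    dloss_dh = dloss_dy * w2
    dh_dz = 1.0 if z > 0 else 0.0        # derivative of ReLU
    dloss_dw1 = dloss_dh * dh_dz * x     # z = w1 * x
    return dloss_dw1, dloss_dw2

def numeric_grad(f, w, eps=1e-6):
    # Central difference, used only to sanity-check the analytic gradient.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, target)  # g1 = 15.0, g2 = -5.0
n1 = numeric_grad(lambda w: forward(w, w2, x, target)[3], w1)
n2 = numeric_grad(lambda w: forward(w1, w, x, target)[3], w2)
```

Each gradient is a product of local derivatives along the path from the loss back to the weight, which is exactly what frameworks automate with autograd.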

43. How does gradient descent work?

Gradient descent minimizes the loss function by iteratively adjusting parameters in the direction that reduces loss. It calculates the gradient of loss with respect to each weight, then updates:

w_new = w_old - learning_rate × gradient

Three variants:

Batch GD (entire dataset)
Stochastic GD (one sample)
Mini-batch GD (small batches – most common)

Learning rate controls step size.
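
The update rule can be run directly on a simple function. The sketch below minimizes f(w) = (w − 3)², whose gradient is 2(w − 3); the function and learning rates are chosen only to illustrate convergence and divergence.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # w_new = w_old - learning_rate * gradient
    return w

def grad_f(w):
    return 2 * (w - 3)  # gradient of f(w) = (w - 3)^2

w_good = gradient_descent(grad_f, w0=0.0, learning_rate=0.1, steps=100)  # approaches 3
w_bad = gradient_descent(grad_f, w0=0.0, learning_rate=1.1, steps=50)    # overshoots and diverges
```

With a learning rate of 0.1 the distance to the minimum shrinks by 20% per step; at 1.1 each step overshoots by more than it corrects, so the iterate moves further away.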

44. What are common loss functions and when to use them?

Classification:

Cross-entropy loss measures difference between predicted probabilities and actual labels
Binary Cross-Entropy for 2 classes, Categorical for multi-class

Regression:

MSE penalizes larger errors more
MAE is more robust to outliers than MSE

Generative models:

KL Divergence measures distribution differences

Choose based on task: cross-entropy for classification, MSE for regression.
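
The behaviour of these losses is easy to check numerically. In the sketch below, one prediction is off by 10; the values are made up to show how differently MSE and MAE react to a single outlier.

```python
import math

def cross_entropy(probs, true_index):
    # Negative log-probability assigned to the correct class.
    return -math.log(probs[true_index])

def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mae(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

confident_loss = cross_entropy([0.7, 0.2, 0.1], true_index=0)  # low loss
unsure_loss = cross_entropy([0.4, 0.3, 0.3], true_index=0)     # higher loss

preds, targets = [1.0, 2.0, 3.0], [1.0, 2.0, 13.0]  # one outlier target
mse_value = mse(preds, targets)  # 100 / 3: the squared error dominates
mae_value = mae(preds, targets)  # 10 / 3: the error counts linearly
```

The single outlier contributes ten times more to MSE than to MAE here, which is why MAE is the more robust choice when outliers should not dominate training.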

45. Explain vanishing and exploding gradients. How do you address them?

Vanishing gradients: Gradients become tiny during backpropagation, causing early layers to stop learning – common with sigmoid/tanh.

Exploding gradients: Gradients become excessively large, causing unstable updates.

Solutions:

Use ReLU activation
Proper weight initialization (Xavier, He)
Batch normalization
Gradient clipping
Residual connections
Adaptive optimizers like Adam
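
Gradient clipping is the simplest of these fixes to show in code. A common form rescales the whole gradient vector when its L2 norm exceeds a threshold, which preserves direction while capping step size. The function below is an illustrative sketch; deep learning frameworks ship their own equivalents.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [3.0, 4.0]                            # L2 norm is 5.0
clipped = clip_by_norm(exploding, max_norm=1.0)   # approximately [0.6, 0.8]
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)
```

The clipped gradient points the same way as the original but has norm 1, so a single bad batch cannot blow up the weights.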

46. Why do we need activation functions?

Activation functions introduce non-linearity, enabling neural networks to learn complex patterns beyond linear relationships. Without them, stacking layers would just produce linear transformations.

ReLU (max(0,x)): Most popular, computationally efficient
Sigmoid (0 to 1): For binary output
Softmax: Probability distribution for multi-class
Leaky ReLU: Addresses “dying ReLU” problem
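
The claim that stacked linear layers collapse into one can be checked with scalar weights. The sketch below uses made-up numbers: without an activation, two layers behave exactly like a single layer with weight w2 · w1, while inserting ReLU between them changes the result.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

w1, w2, x = 3.0, -2.0, 1.5

# Two linear layers with no activation equal one linear layer.
stacked_linear = w2 * (w1 * x)   # -9.0
single_linear = (w2 * w1) * x    # -9.0

# With ReLU in between, a negative pre-activation is zeroed out,
# so no single linear layer can reproduce the behaviour.
with_relu = w2 * relu(w1 * -x)       # relu(-4.5) is 0, so the output is 0
linear_on_negative = (w2 * w1) * -x  # 9.0
```

That difference on negative inputs is the non-linearity that lets deeper networks represent functions a single layer cannot.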

47. What is overfitting and how do you prevent it?

Overfitting occurs when a model memorizes training data patterns including noise, performing well on training but poorly on unseen data.

Prevention:

Dropout: Randomly disables neurons during training (typical rates 20-50%)
Regularization (L1/L2): Penalizes large weights
Early stopping: Halts training when validation loss stops improving
Data augmentation: Increases variety
Reduce model complexity
Add more training data

Current Trends 2025

48. How do AI agents work and what are their key components?

AI agents are autonomous systems that perceive environments, make decisions, and take actions to achieve goals.

Components:

LLM Core: Powers reasoning
Memory Systems: Short-term (conversation) and long-term (vector store)
Tool Integration: APIs, databases, functions
Planning Module: Breaks goals into subtasks
Orchestration: Coordinates execution

Unlike chatbots, agents maintain state, plan multi-step workflows, and adapt based on feedback.

49. What are reasoning models and how do they differ from standard LLMs?

Reasoning models (OpenAI o1/o3, DeepSeek-R1) generate “thinking tokens” before outputs, showing step-by-step reasoning.

Key differences:

Use Chain-of-Thought internally
Better at complex multi-step problems and math
Require different prompting patterns (less explicit instructions needed)
Higher latency and token usage
Can expose intermediate reasoning, though some providers hide or summarize it

Used for complex problem-solving rather than simple generation.

50. How do multimodal models process different types of inputs?

Multimodal models process text, images, audio by:

Input Encoding: Each modality uses specialized encoders (text tokenizer, vision encoder like CLIP/ViT)
Embedding Alignment: Converting modalities into a shared embedding space
Cross-Attention: Attending across modality representations
Unified Processing: Transformer processes combined representations

Models like GPT-4V, Claude 3, and Gemini reason across text and images together.

Quick Reference by Difficulty Level

Entry-Level Topics: Transformers, tokenization, basic training, prompting basics, deep learning fundamentals (Questions 1-8, 13-17, 42-47)

Mid-Level Topics: Fine-tuning methods, RAG implementation, deployment (Questions 9-12, 18-23, 28-33, 38-41)

Advanced Topics: Evaluation, safety, optimization, current trends (Questions 24-27, 34-37, 48-50)

Key Takeaways

Master the fundamentals: Transformer architecture, attention mechanisms, and tokenization are foundational to every interview.

Know your fine-tuning options: Understand when to use SFT, RLHF, LoRA, QLoRA, and full fine-tuning based on resources and requirements.

RAG is everywhere: Most production LLM applications use RAG. Know chunking strategies, vector databases, and hybrid search.

Deployment matters: Quantization, KV cache, and inference optimization separate candidates who can ship from those who can’t.

Safety is non-negotiable: Guardrails, prompt injection defense, and responsible AI practices are expected knowledge.

Stay current: AI agents, reasoning models, and multimodal architectures are the 2025 frontier.

Good luck with your interview.