Looking for a fast, accurate guide to the most important words in AI and large language models? Here’s a clean, up-to-date glossary of 120 core terms, ordered roughly by how often people search, discuss, and use them in 2025.
Core LLM Concepts (1-10)
- Large Language Model (LLM): A Transformer-based neural network trained to predict the next token and perform language tasks.
- Transformer: The dominant neural architecture using attention instead of recurrence or convolutions.
- RAG (Retrieval-Augmented Generation): Adds external documents to prompts so models can ground answers in retrieved evidence (sketched in code after this list).
- Vector Database: A database optimized for similarity search over embeddings (ANN indexes like HNSW/IVF, scalar filters, hybrid search).
- Embedding: Numeric vector representation of text, images, or audio used for search, clustering, or retrieval.
- Prompt: The input instruction and context given to an LLM to steer its behavior.
- System Prompt: High-priority instructions, often hidden from end users, that set overall behavior and guardrails.
- Agents: LLM programs that plan, call tools/APIs, and iterate toward a goal with feedback.
- Tool/Function Calling: Structured requests where a model outputs JSON arguments to call external tools reliably.
- Context Window: The maximum number of tokens the model attends to (input plus generated tokens).
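A minimal sketch of the retrieval-then-prompt step behind RAG, in Python with NumPy. It assumes the embeddings were already computed by some embedding model; the function names are illustrative, not any specific library's API.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_vec: np.ndarray, chunk_vecs: list[np.ndarray],
             chunks: list[str], k: int = 3) -> list[str]:
    # Rank chunks by cosine similarity to the query embedding.
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_rag_prompt(question: str, evidence: list[str]) -> str:
    # Grounding: retrieved passages go into the prompt so the model can
    # answer from evidence rather than parametric memory alone.
    context = "\n\n".join(evidence)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```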
Tokens & Text Processing (11-22)
- Token: The unit a model reads/writes (sub-words, symbols, or bytes, depending on the tokenizer).
- Tokenization (BPE/SentencePiece/WordPiece): Algorithms that split text into tokens for efficient modeling.
- Self-Attention: Mechanism letting each token attend to others to compute contextualized representations.
- Cross-Attention: Attention across two sequences (e.g., decoder attending to retrieved passages or image features).
- Softmax: Converts logits into a probability distribution; temperature scales it to control randomness.
- Logits: Raw scores before softmax; higher means more probable next tokens.
- Temperature: Scales logits to control randomness; lower is more deterministic, higher is more diverse (see the sampling sketch after this list).
- Top-p (Nucleus Sampling): Sample from the smallest set of tokens whose cumulative probability exceeds p.
- Top-k Sampling: Sample from the k most probable tokens after renormalization.
- Greedy Decoding: Always pick the highest-probability next token (most deterministic, least diverse).
- Beam Search: Keep the best few partial sequences to approximate the overall best output.
- Hallucination: Fluent but false or ungrounded output; mitigated by RAG, citations, and better evaluation.
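The softmax, temperature, top-k, and top-p entries above compose into a single decoding step. A minimal NumPy sketch (function name and defaults are illustrative; production samplers handle ties and edge cases more carefully):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    # Temperature divides logits before softmax; as it approaches 0,
    # sampling approaches greedy decoding (handle that case separately).
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()

    if top_k > 0:
        # Top-k: zero out everything below the k-th largest probability.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p < 1.0:
        # Top-p (nucleus): keep the smallest set of tokens whose
        # cumulative probability reaches p, then renormalize.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked / masked.sum()

    return int(np.random.choice(len(probs), p=probs))
```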
Performance & Optimization (23-35)
- Grounding: Tying answers to retrieved or trusted sources (documents, tools, databases) to improve accuracy.
- KV Cache (Past Key-Values): Stores prior attention keys/values to speed up autoregressive decoding.
- Speculative Decoding: A small “draft” model proposes tokens the target model verifies in parallel for speedups.
- FlashAttention: Memory-efficient exact attention kernel that reduces HBM traffic and accelerates training/inference.
- Mixture of Experts (MoE): Sparse layers route tokens to a few specialized experts, increasing parameters without proportional compute.
- Router/Gating: Learns which expert(s) handle each token in MoE layers while balancing load.
- Perplexity: Exponential of average negative log-likelihood; lower implies better next-token prediction (see the sketch after this list).
- MMLU (Massive Multitask Language Understanding): A 57-subject multiple-choice benchmark used to evaluate academic knowledge and reasoning.
- GSM8K: Grade-school math word-problem benchmark for reasoning and step-by-step calculation.
- HumanEval: Code-generation benchmark that executes tests against generated solutions.
- Chain-of-Thought (CoT): Prompting or training to produce step-by-step reasoning before final answers.
- Tree-of-Thought (ToT): Explore multiple reasoning branches, then select or vote on the best.
- ReAct: Interleaves reasoning traces with actions (tool calls) to solve tasks transparently.
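Perplexity in particular is a one-liner once you have the model's probability for each actual next token; a toy computation:

```python
import numpy as np

def perplexity(token_probs: list[float]) -> float:
    # Exponential of the average negative log-likelihood.
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model assigning each true token probability 0.25 has perplexity 4:
# it is as "confused" as a uniform guess over 4 tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```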
Training & Fine-Tuning (36-49)
- Instruction Tuning: Supervised fine-tuning on instruction-response pairs to follow natural requests.
- SFT (Supervised Fine-Tuning): General term for supervised training on labeled input-output pairs.
- RLHF: Aligns models with human preferences using a reward model and reinforcement learning.
- DPO (Direct Preference Optimization): Preference learning without an explicit reward model; simpler than RLHF.
- RLAIF: Uses AI feedback instead of (or alongside) human feedback to scale preference learning.
- LoRA: Low-rank adapters that fine-tune a small set of parameters on top of a frozen base model (sketched after this list).
- QLoRA: Memory-efficient fine-tuning with 4-bit quantization + LoRA adapters.
- PEFT (Parameter-Efficient Fine-Tuning): Family of techniques (LoRA, adapters, prefix/prompt tuning) to update few parameters.
- Prompt Tuning / Prefix Tuning: Learnable prompt vectors prepended to inputs instead of updating full weights.
- Few-Shot / Zero-Shot: Solve tasks with a handful of examples—or none—purely from instructions.
- Evaluation Harness: Scripts and datasets that run consistent, repeatable LLM evaluations.
- Guardrails/Moderation: Policies and filters that block unsafe, copyrighted, or private content.
- Function/Schema Validation: Forcing well-formed JSON or typed outputs (e.g., via constrained decoding or grammar-based sampling) to reduce parsing errors.
- Context Overflow: When input exceeds the window; requires truncation, summarization, or long-context models.
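A minimal sketch of the LoRA idea with assumed shapes and hyperparameters; real fine-tuning uses a framework (e.g., Hugging Face's peft), but the core math is this small:

```python
import numpy as np

d_model, rank, alpha = 768, 8, 16     # assumed sizes; rank << d_model

W = np.random.randn(d_model, d_model) * 0.02  # frozen pretrained weight
A = np.random.randn(d_model, rank) * 0.01     # trainable down-projection
B = np.zeros((rank, d_model))                 # trainable up-projection,
                                              # zero-init so training starts
                                              # exactly at the base model

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Only A and B receive gradients; the effective weight is
    # W + (alpha / rank) * A @ B, which can be merged after training.
    return x @ W + (x @ A @ B) * (alpha / rank)
```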
Model Architecture & Optimization (50-66)
- Long-Context Techniques: RoPE scaling, ALiBi, attention optimizations, and memory-efficient kernels.
- Quantization (INT8/INT4/NF4): Lower precision weights/activations to reduce memory and speed up inference.
- Distillation: Train a smaller “student” to mimic a larger “teacher” model’s behavior.
- Pruning: Remove weights or neurons with minimal impact to compress models.
- AdamW: Widely used optimizer combining Adam with decoupled weight decay.
- Learning Rate Schedule: Warmup and decay strategies for stable training.
- Batch Size / Micro-Batching: Number of examples per update; micro-batches split a batch to fit memory, with gradient accumulation simulating the larger batch.
- Weight Decay: L2-style regularization to reduce overfitting during training.
- Dropout: Randomly zero activations during training to improve generalization.
- Label Smoothing: Softens targets to improve calibration and reduce overconfidence.
- Residual Connections: Skip paths that ease optimization and stabilize deep networks.
- LayerNorm / RMSNorm: Normalization layers; RMSNorm removes mean-centering for efficiency (sketched after this list).
- Positional Encoding (Sinusoidal/Learned): Injects position information so attention can model order.
- RoPE (Rotary Positional Embeddings): Encodes relative positions via rotations applied inside attention.
- ALiBi: Adds distance-based linear bias to attention for train-short, test-long extrapolation.
- FFN (Feed-Forward Network): Per-token MLP sublayer inside each Transformer block.
- SwiGLU/GEGLU: Gated activation variants used in modern FFNs for quality and efficiency.
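RMSNorm is small enough to show in full; a NumPy sketch following the usual conventions (learned gain `weight`, small `eps` for numerical stability):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Rescale by the root-mean-square of the features. Unlike LayerNorm,
    # there is no mean subtraction and no bias term, saving compute.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```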
Multimodal & Image Generation (67-79)
- Multimodal (VLM): Models that accept or produce multiple modalities (text, image, audio, video).
- Vision Transformer (ViT): Applies Transformer blocks to image patches for vision tasks.
- Diffusion Model: Learns to denoise noise step-by-step to synthesize images or audio.
- Latent Diffusion: Runs diffusion in a compressed latent space (via VAE) for speed and quality.
- VAE (Variational Autoencoder): Encoder-decoder that learns a latent space used by latent diffusion.
- U-Net: Encoder-decoder CNN backbone common in diffusion denoisers.
- DiT (Diffusion Transformer): Transformer-based denoiser replacing U-Nets in diffusion pipelines.
- Classifier-Free Guidance: Combines conditional and unconditional predictions to steer images toward prompts (one-line sketch after this list).
- Guidance Scale: Strength of prompt adherence versus diversity in diffusion sampling.
- Negative Prompt: Terms the image should avoid (e.g., “blurry, watermark, extra fingers”).
- Sampler (DDIM/DPM-Solver): Numerical solvers that trade steps for speed vs. fidelity in diffusion.
- Seed: Random initialization controlling reproducibility of generated outputs.
- Inference Steps: Number of denoising steps; fewer are faster, more can improve detail (up to a point).
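Classifier-free guidance reduces to one line per denoising step; a sketch (the function name is illustrative):

```python
import numpy as np

def cfg_step(eps_uncond: np.ndarray, eps_cond: np.ndarray,
             guidance_scale: float) -> np.ndarray:
    # Extrapolate from the unconditional noise prediction toward the
    # conditional one. A scale of 1 is plain conditional sampling;
    # larger values follow the prompt harder at the cost of diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```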
Vector Search & Retrieval (80-91)
- HNSW: Hierarchical Navigable Small-World graph index for fast approximate nearest neighbor search.
- FAISS: Popular similarity-search library for vector indexes on CPU/GPU.
- Cosine Similarity: Measures angle between vectors; common for text embeddings.
- Dot Product (Inner Product): Similarity via unnormalized vector projection; common in ANN indexes.
- Euclidean Distance (L2): Straight-line distance between vectors; used in many ANN backends.
- Hybrid Search (Sparse + Dense): Combine BM25 keywords with embeddings for recall and precision.
- BM25: Classic keyword ranking function; strong baseline for sparse retrieval.
- Reranker (Cross-Encoder): Re-scores candidates with a more accurate but slower model.
- ColBERT (Late Interaction): Multi-vector retrieval that matches tokens for fine-grained relevance.
- Chunking: Split long documents into overlap-aware segments for better retrieval and context fit.
- Overlap / Stride: Overlap is the text shared between neighboring chunks to preserve continuity; stride is the step between chunk starts (chunk size minus overlap).
- MMR (Maximal Marginal Relevance): Diversifies retrieved passages by balancing relevance and novelty.
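MMR is easy to sketch once you have embeddings; the `lam` trade-off below is an assumed default, and real systems run this over a candidate set from an ANN index rather than the full corpus:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def mmr(query_vec: np.ndarray, doc_vecs: list[np.ndarray],
        k: int = 5, lam: float = 0.7) -> list[int]:
    # Greedily pick documents that are relevant to the query but
    # dissimilar to those already selected (relevance vs. redundancy).
    selected: list[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```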
Tools & Frameworks (92-100)
- Embedding Model: The specific model used to compute vectors (domain-specific choices matter).
- Semantic Caching: Store previous prompts/answers/vectors to speed up repeated or similar queries (toy sketch after this list).
- Vibe Coding: Iterative, AI-paired coding style—describe intent, let the agent build, then refine based on results.
- Claude Code: Anthropic’s agentic coding tool that edits files, runs commands, and opens PRs from your terminal/IDE.
- OpenAI Codex: OpenAI’s 2021 code-generation model that powered the original GitHub Copilot; superseded by newer GPT-4-class coding models, with the name later revived for OpenAI’s agentic coding tools.
- DSPy: Programmatic framework to define, evaluate, and optimize LLM pipelines with signatures, modules, and optimizers.
- LangChain: Toolkit to compose LLM workflows (chains, tools, agents) with integrations to models and stores.
- LlamaIndex: Data framework for RAG (ingestion, indexing, querying, evaluation) across vector stores.
- Hugging Face: Open ecosystem for models, datasets, Spaces, and libraries like Transformers and Diffusers.
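A toy sketch of semantic caching with a cosine threshold; the 0.92 cutoff is an assumed tuning knob, and production systems use a vector index instead of a linear scan:

```python
import numpy as np

cache: list[tuple[np.ndarray, str]] = []   # (prompt embedding, cached answer)

def cached_answer(query_emb: np.ndarray, threshold: float = 0.92) -> str | None:
    # Return a stored answer if a past prompt is similar enough;
    # otherwise the caller runs the LLM and appends to the cache.
    for emb, answer in cache:
        sim = float(emb @ query_emb /
                    (np.linalg.norm(emb) * np.linalg.norm(query_emb) + 1e-9))
        if sim >= threshold:
            return answer
    return None
```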
Advanced Inference & Serving (101-112)
- Multi-Query Attention (MQA): Shares keys/values across heads to cut KV memory and speed decoding with minimal quality loss.
- Grouped-Query Attention (GQA): Shares K/V within head groups for quality close to MHA with MQA-like memory savings; used in modern LLMs (see the KV-size arithmetic after this list).
- Sliding-Window Attention (SWA): Limits attention to a moving window, enabling long contexts with linear memory/runtime growth.
- PagedAttention (vLLM): Memory manager that pages KV cache blocks to keep GPUs highly utilized and reduce fragmentation.
- Continuous Batching: Dynamically merges requests of different lengths to maximize throughput in serving.
- Tensor Parallelism: Splits large matrix ops across GPUs (e.g., by columns/rows) to fit and accelerate massive models.
- Pipeline Parallelism: Partitions layers across devices and streams micro-batches to keep all stages busy.
- FSDP / ZeRO Sharding: Shards parameters, gradients, and optimizer states across workers to train models beyond single-GPU memory.
- FP8: 8-bit floating-point formats (E4M3/E5M2) that reduce bandwidth and memory while preserving training quality with calibration.
- AWQ: Activation-aware weight quantization that preserves salient channels for accurate 4-bit inference on GPUs.
- GPTQ: Post-training quantization method that minimizes layer-wise error for accurate 4-bit weights.
- KV Cache Quantization: Compresses past keys/values (e.g., 8-bit/4-bit) to boost throughput and extend context length.
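Back-of-the-envelope KV cache arithmetic shows why GQA and KV cache quantization matter; the shapes below are assumed, roughly 7B-class:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    # Keys and values: 2 tensors per layer, each [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed shapes: 32 layers, head_dim 128, 4k context, fp16 cache.
mha = kv_cache_bytes(32, 32, 128, 4096)   # full multi-head attention
gqa = kv_cache_bytes(32, 8, 128, 4096)    # grouped-query, 8 KV heads
print(mha // 2**20, "MiB vs", gqa // 2**20, "MiB per sequence")  # 2048 vs 512
```

Quantizing the same cache to 8-bit (`dtype_bytes=1`) halves those numbers again, which is exactly the KV cache quantization entry above.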
Emerging Techniques & Methods (113-120)
- Mamba (SSM): Selective state-space model with linear-time sequence processing and strong long-context performance.
- GraphRAG: Builds/uses knowledge graphs with retrieval to ground answers in entities and relations, improving factuality.
- HyDE: Generates hypothetical passages from the query, embeds them, then retrieves real documents matching those pseudo-answers.
- Self-Consistency: Samples multiple reasoning paths (CoT) and votes—often boosts math and logic accuracy.
- Reflexion: Agent loop that critiques its own steps, writes takeaways, and retries to improve results.
- ORPO: Odds-ratio preference optimization—single-stage alignment alternative to RLHF with simpler training.
- IPO (Identity Preference Optimization): A DPO variant that replaces the log-sigmoid loss with a squared objective to reduce overfitting to preference data.
- Structured Outputs (JSON/Schema): Constrains generations to schemas for reliable tool calls and API-safe responses.
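A minimal post-hoc validation sketch for structured outputs using the jsonschema library; note that provider-side structured-output features enforce the schema during generation, while this only checks it afterward:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

SCHEMA = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
    "required": ["city", "days"],
}

def parse_tool_args(raw: str) -> dict | None:
    # Validate model output against the schema before calling the tool;
    # on failure, the usual move is to re-prompt or repair the output.
    try:
        args = json.loads(raw)
        validate(instance=args, schema=SCHEMA)
        return args
    except (json.JSONDecodeError, ValidationError):
        return None
```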
Tip: Bookmark this page. As stacks evolve—new adapters, longer contexts, faster attention kernels—these fundamentals stay useful no matter which model or provider you pick.