
AI/LLM Glossary: 120 Essential Terms

Looking for a fast, accurate guide to the most important words in AI and large language models? Here’s a clean, up-to-date glossary of 120 core terms, ordered roughly by how often people search, discuss, and use them in 2025.

Core LLM Concepts (1-10)

  1. Large Language Model (LLM): A Transformer-based neural network trained to predict the next token and perform language tasks.
  2. Transformer: The dominant neural architecture using attention instead of recurrence or convolutions.
  3. RAG (Retrieval-Augmented Generation): Adds external documents to prompts so models can ground answers in retrieved evidence.
  4. Vector Database: A database optimized for similarity search over embeddings (ANN indexes like HNSW/IVF, scalar filters, hybrid search).
  5. Embedding: Numeric vector representation of text, images, or audio used for search, clustering, or retrieval.
  6. Prompt: The input instruction and context given to an LLM to steer its behavior.
  7. System Prompt: Hidden, high-priority instructions that set overall behavior and guardrails.
  8. Agents: LLM programs that plan, call tools/APIs, and iterate toward a goal with feedback.
  9. Tool/Function Calling: Structured requests where a model outputs JSON arguments to call external tools reliably (see the sketch after this list).
  10. Context Window: The maximum number of tokens the model can attend to at once (input plus generated tokens).
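
Tool calling is easiest to see in code. Below is a minimal, library-free sketch of the pattern: the model is assumed to emit a JSON object naming a function and its arguments, and the host program parses and dispatches the call. The get_weather tool and the model output string are invented for illustration.

```python
import json

# Hypothetical tool the model is allowed to call (illustrative stub).
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temp": 21, "unit": unit}

TOOLS = {"get_weather": get_weather}

# Assume the model returned this string as its structured tool-call output.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

call = json.loads(model_output)      # parse the structured request
fn = TOOLS[call["name"]]             # look up the requested tool
result = fn(**call["arguments"])     # execute with the model-supplied arguments
print(result)                        # in practice, fed back to the model as a tool message
```

In real provider APIs the tool's JSON schema is declared up front, and the result is appended to the conversation for the model's next turn.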

Tokens & Text Processing (11-22)

  1. Token: The unit a model reads/writes (sub-words, symbols, or bytes, depending on the tokenizer).
  2. Tokenization (BPE/SentencePiece/WordPiece): Algorithms that split text into tokens for efficient modeling.
  3. Self-Attention: Mechanism letting each token attend to others to compute contextualized representations.
  4. Cross-Attention: Attention across two sequences (e.g., decoder attending to retrieved passages or image features).
  5. Softmax: Converts logits into a probability distribution; dividing the logits by a temperature before the softmax controls randomness.
  6. Logits: Raw scores before softmax; higher means more probable next tokens.
  7. Temperature: Scales logits to control randomness; lower is more deterministic, higher is more diverse (see the sampling sketch after this list).
  8. Top-p (Nucleus Sampling): Sample from the smallest set of tokens whose cumulative probability exceeds p.
  9. Top-k Sampling: Sample from the k most probable tokens after renormalization.
  10. Greedy Decoding: Always pick the highest-probability next token (most deterministic, least diverse).
  11. Beam Search: Keep the best few partial sequences to approximate the overall best output.
  12. Hallucination: Fluent but false or ungrounded output; mitigated by RAG, citations, and better evaluation.
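
Several of the decoding terms above (softmax, temperature, top-p) reduce to a few lines of arithmetic. A minimal NumPy sketch with made-up logits for four candidate tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])   # made-up scores for 4 candidate tokens

def softmax(x, temperature=1.0):
    z = x / temperature                    # temperature rescales the logits
    z = z - z.max()                        # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def top_p_sample(probs, p=0.9):
    order = np.argsort(probs)[::-1]        # tokens from most to least probable
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]   # smallest set with mass >= p
    kept = probs[keep] / probs[keep].sum()        # renormalize over the nucleus
    return rng.choice(keep, p=kept)

probs = softmax(logits, temperature=0.7)   # lower temperature -> sharper distribution
print(probs, top_p_sample(probs))
```

Greedy decoding would simply be `np.argmax(probs)`; top-k would truncate `order` to its first k entries before renormalizing.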

Performance & Optimization (23-35)

  1. Grounding: Tying answers to retrieved or trusted sources (documents, tools, databases) to improve accuracy.
  2. KV Cache (Past Key-Values): Stores prior attention keys/values to speed up autoregressive decoding.
  3. Speculative Decoding: A small “draft” model proposes tokens the target model verifies in parallel for speedups.
  4. FlashAttention: Memory-efficient exact attention kernel that reduces HBM traffic and accelerates training/inference.
  5. Mixture of Experts (MoE): Sparse layers route tokens to a few specialized experts, increasing parameters without proportional compute.
  6. Router/Gating: Learns which expert(s) handle each token in MoE layers while balancing load.
  7. Perplexity: Exponential of the average negative log-likelihood; lower implies better next-token prediction (worked example after this list).
  8. MMLU (Massive Multitask Language Understanding): A broad multiple-choice benchmark used to evaluate a model's academic knowledge and reasoning.
  9. GSM8K: Grade-school math word-problem benchmark for reasoning and step-by-step calculation.
  10. HumanEval: Code-generation benchmark that executes tests against generated solutions.
  11. Chain-of-Thought (CoT): Prompting or training to produce step-by-step reasoning before final answers.
  12. Tree-of-Thought (ToT): Explore multiple reasoning branches, then select or vote on the best.
  13. ReAct: Interleaves reasoning traces with actions (tool calls) to solve tasks transparently.
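
Perplexity, from the list above, is just the exponential of the average negative log-likelihood the model assigns to the observed tokens. A worked example with made-up per-token probabilities:

```python
import math

# Probabilities the model (hypothetically) assigned to the actual next tokens.
token_probs = [0.50, 0.25, 0.10, 0.40]

nll = [-math.log(p) for p in token_probs]   # negative log-likelihood per token
avg_nll = sum(nll) / len(nll)               # average NLL (cross-entropy, in nats)
perplexity = math.exp(avg_nll)              # lower is better

print(f"avg NLL = {avg_nll:.3f} nats, perplexity = {perplexity:.2f}")
```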

Training & Fine-Tuning (36-49)

  1. Instruction Tuning: Supervised fine-tuning on instruction-response pairs to follow natural requests.
  2. SFT (Supervised Fine-Tuning): General term for supervised training on labeled input-output pairs.
  3. RLHF (Reinforcement Learning from Human Feedback): Aligns models with human preferences using a reward model and reinforcement learning.
  4. DPO (Direct Preference Optimization): Preference learning without an explicit reward model; simpler than RLHF.
  5. RLAIF (Reinforcement Learning from AI Feedback): Uses AI feedback instead of (or alongside) human feedback to scale preference learning.
  6. LoRA (Low-Rank Adaptation): Low-rank adapters that fine-tune a small set of parameters on top of a frozen base model (see the sketch after this list).
  7. QLoRA: Memory-efficient fine-tuning with 4-bit quantization + LoRA adapters.
  8. PEFT (Parameter-Efficient Fine-Tuning): Family of techniques (LoRA, adapters, prefix/prompt tuning) to update few parameters.
  9. Prompt Tuning / Prefix Tuning: Learnable prompt vectors prepended to inputs instead of updating full weights.
  10. Few-Shot / Zero-Shot: Solve tasks with a handful of examples—or none—purely from instructions.
  11. Evaluation Harness: Scripts and datasets that run consistent, repeatable LLM evaluations.
  12. Guardrails/Moderation: Policies and filters that block unsafe, copyrighted, or private content.
  13. Function/Schema Validation: Forcing well-formed JSON or typed outputs (e.g., via JSON Schemas or grammar-constrained decoding) to reduce parsing errors.
  14. Context Overflow: When input exceeds the window; requires truncation, summarization, or long-context models.
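
LoRA, mentioned above, replaces a full weight update with a low-rank product B·A added to a frozen layer. A minimal PyTorch sketch of a LoRA-wrapped linear layer; the dimensions, rank, and scaling follow common defaults but are illustrative, not a complete training recipe:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update (x A^T) B^T.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
y = layer(torch.randn(2, 1024))                          # only A and B receive gradients
```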

Model Architecture & Optimization (50-66)

  1. Long-Context Techniques: RoPE scaling, ALiBi, attention optimizations, and memory-efficient kernels.
  2. Quantization (INT8/INT4/NF4): Lower precision weights/activations to reduce memory and speed up inference.
  3. Distillation: Train a smaller “student” to mimic a larger “teacher” model’s behavior.
  4. Pruning: Remove weights or neurons with minimal impact to compress models.
  5. AdamW: Widely used optimizer combining Adam with decoupled weight decay.
  6. Learning Rate Schedule: Warmup and decay strategies for stable training.
  7. Batch Size / Micro-Batching: Number of examples per update; micro-batches simulate larger batches under memory limits.
  8. Weight Decay: L2-style regularization to reduce overfitting during training.
  9. Dropout: Randomly zero activations during training to improve generalization.
  10. Label Smoothing: Softens targets to improve calibration and reduce overconfidence.
  11. Residual Connections: Skip paths that ease optimization and stabilize deep networks.
  12. LayerNorm / RMSNorm: Normalization layers; RMSNorm removes mean-centering for efficiency (see the sketch after this list).
  13. Positional Encoding (Sinusoidal/Learned): Injects position information so attention can model order.
  14. RoPE (Rotary Positional Embeddings): Encodes relative positions via rotations applied inside attention.
  15. ALiBi: Adds distance-based linear bias to attention for train-short, test-long extrapolation.
  16. FFN (Feed-Forward Network): Per-token MLP sublayer inside each Transformer block.
  17. SwiGLU/GEGLU: Gated activation variants used in modern FFNs for quality and efficiency.
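
Two of the components above are short enough to write out: RMSNorm and a SwiGLU feed-forward block. A minimal PyTorch sketch with arbitrary dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Like LayerNorm, but without mean-centering or a bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Gated feed-forward sublayer: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)     # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)     # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)     # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)                              # (batch, tokens, model dim)
y = SwiGLUFFN(512, 1376)(RMSNorm(512)(x))                # pre-norm, then gated FFN
```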

Multimodal & Image Generation (67-79)

  1. Multimodal (VLM): Models that accept or produce multiple modalities (text, image, audio, video).
  2. Vision Transformer (ViT): Applies Transformer blocks to image patches for vision tasks.
  3. Diffusion Model: Learns to denoise noise step-by-step to synthesize images or audio.
  4. Latent Diffusion: Runs diffusion in a compressed latent space (via VAE) for speed and quality.
  5. VAE (Variational Autoencoder): Encoder-decoder that learns a latent space used by latent diffusion.
  6. U-Net: Encoder-decoder CNN backbone common in diffusion denoisers.
  7. DiT (Diffusion Transformer): Transformer-based denoiser replacing U-Nets in diffusion pipelines.
  8. Classifier-Free Guidance: Combines conditional and unconditional predictions to steer images toward prompts (see the sketch after this list).
  9. Guidance Scale: Strength of prompt adherence versus diversity in diffusion sampling.
  10. Negative Prompt: Terms for the image generator to avoid (e.g., “text, watermark, extra fingers”).
  11. Sampler (DDIM/DPM-Solver): Numerical solvers that trade steps for speed vs. fidelity in diffusion.
  12. Seed: Random initialization controlling reproducibility of generated outputs.
  13. Inference Steps: Number of denoising steps; fewer are faster, more can improve detail (up to a point).
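
Classifier-free guidance, from the list above, combines two denoiser predictions at every sampling step. A schematic NumPy sketch; `denoise` is a purely hypothetical stand-in for a real diffusion model's noise predictor:

```python
import numpy as np

def denoise(latent, prompt_embedding):
    """Hypothetical stand-in for a diffusion model's noise prediction."""
    return latent * 0.1 + (0.0 if prompt_embedding is None else 0.01)

def cfg_step(latent, prompt_embedding, guidance_scale=7.5):
    eps_uncond = denoise(latent, None)               # unconditional prediction
    eps_cond = denoise(latent, prompt_embedding)     # prompt-conditioned prediction
    # Push the estimate away from "unconditional" and toward "matches the prompt".
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

latent = np.random.default_rng(0).normal(size=(4, 4))   # toy latent
eps = cfg_step(latent, prompt_embedding=np.ones(8))     # guided noise estimate for one step
```

The guidance scale in the list above is exactly the `guidance_scale` factor here: larger values follow the prompt more closely at the cost of diversity.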

Vector Search & Retrieval (80-91)

  1. HNSW: Hierarchical Navigable Small-World graph index for fast approximate nearest neighbor search.
  2. FAISS: Popular similarity-search library for vector indexes on CPU/GPU.
  3. Cosine Similarity: Measures angle between vectors; common for text embeddings.
  4. Dot Product (Inner Product): Similarity via unnormalized vector projection; common in ANN indexes.
  5. Euclidean Distance (L2): Straight-line distance between vectors; used in many ANN backends.
  6. Hybrid Search (Sparse + Dense): Combine BM25 keywords with embeddings for recall and precision.
  7. BM25: Classic keyword ranking function; strong baseline for sparse retrieval.
  8. Reranker (Cross-Encoder): Re-scores candidates with a more accurate but slower model.
  9. ColBERT (Late Interaction): Multi-vector retrieval that matches tokens for fine-grained relevance.
  10. Chunking: Split long documents into overlapping segments for better retrieval and context fit (see the sketch after this list).
  11. Overlap / Stride: The sliding context shared between neighboring chunks to preserve continuity.
  12. MMR (Maximal Marginal Relevance): Diversifies retrieved passages by balancing relevance and novelty.
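
Chunking with overlap and cosine-similarity scoring, both defined above, are simple to implement. A minimal sketch using character windows and made-up vectors; a real pipeline would chunk by tokens and call an embedding model:

```python
import numpy as np

def chunk(text: str, size: int = 200, overlap: int = 50):
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors standing in for real embeddings.
query_vec = np.array([0.1, 0.9, 0.0, 0.2])
chunk_vecs = [np.array([0.1, 0.8, 0.1, 0.3]), np.array([0.9, 0.0, 0.1, 0.0])]

scores = [cosine(query_vec, v) for v in chunk_vecs]
best = int(np.argmax(scores))                  # index of the most similar chunk
```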

Tools & Frameworks (92-100)

  1. Embedding Model: The specific model used to compute vectors (domain-specific choices matter).
  2. Semantic Caching: Store previous prompts/answers/vectors to speed up repeated or similar queries (see the sketch after this list).
  3. Vibe Coding: Iterative, AI-paired coding style—describe intent, let the agent build, then refine based on results.
  4. Claude Code: Anthropic’s agentic coding tool that edits files, runs commands, and opens PRs from your terminal/IDE.
  5. OpenAI Codex: OpenAI’s early code-generation model (2021) that powered initial GitHub Copilot; foundational but largely superseded by newer GPT-4-class coding models.
  6. DSPy: Programmatic framework to define, evaluate, and optimize LLM pipelines with signatures, modules, and optimizers.
  7. LangChain: Toolkit to compose LLM workflows (chains, tools, agents) with integrations to models and stores.
  8. LlamaIndex: Data framework for RAG (ingestion, indexing, querying, evaluation) across vector stores.
  9. Hugging Face: Open ecosystem for models, datasets, Spaces, and libraries like Transformers and Diffusers.
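
Semantic caching, listed above, can be sketched in a few lines: store embeddings of past queries and reuse a stored answer when a new query is similar enough. The `embed` function below is a hypothetical stand-in for a real embedding model, and the 0.9 threshold is arbitrary:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in; a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

cache = []                                     # list of (query embedding, cached answer)

def lookup(query: str, threshold: float = 0.9):
    q = embed(query)
    for vec, answer in cache:
        if float(q @ vec) >= threshold:        # cosine similarity of unit vectors
            return answer                      # cache hit: skip the LLM call
    return None                                # cache miss: call the LLM, then store()

def store(query: str, answer: str):
    cache.append((embed(query), answer))
```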

Advanced Inference & Serving (101-112)

  1. Multi-Query Attention (MQA): Shares keys/values across heads to cut KV memory and speed decoding with minimal quality loss.
  2. Grouped-Query Attention (GQA): Shares K/V within head groups; quality close to full multi-head attention (MHA) with MQA-like memory savings; used in modern LLMs (see the memory sketch after this list).
  3. Sliding-Window Attention (SWA): Limits attention to a moving window, enabling long contexts with linear memory/runtime growth.
  4. PagedAttention (vLLM): Memory manager that pages KV cache blocks to keep GPUs highly utilized and reduce fragmentation.
  5. Continuous Batching: Dynamically merges requests of different lengths to maximize throughput in serving.
  6. Tensor Parallelism: Splits large matrix ops across GPUs (e.g., by columns/rows) to fit and accelerate massive models.
  7. Pipeline Parallelism: Partitions layers across devices and streams micro-batches to keep all stages busy.
  8. FSDP / ZeRO Sharding: Shards parameters, gradients, and optimizer states across workers to train models beyond single-GPU memory.
  9. FP8: 8-bit floating-point formats (e4m3/e5m2) that reduce bandwidth and memory while preserving training quality with calibration.
  10. AWQ: Activation-aware weight quantization that preserves salient channels for accurate 4-bit inference on GPUs.
  11. GPTQ: Post-training quantization method that minimizes layer-wise error for accurate 4-bit weights.
  12. KV Cache Quantization: Compresses past keys/values (e.g., 8-bit/4-bit) to boost throughput and extend context length.
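
The memory impact of GQA and KV cache quantization is plain arithmetic: the cache holds one key and one value vector per layer, per KV head, per token. A back-of-the-envelope estimator; the example configuration is invented, not a specific published model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for keys plus values; fp16 = 2 bytes per value, int8 cache = 1 byte.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

cfg = dict(layers=32, head_dim=128, seq_len=32_000)      # invented example configuration

mha = kv_cache_bytes(kv_heads=32, **cfg)                 # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **cfg)                  # grouped-query: 4x fewer KV heads
gqa_int8 = kv_cache_bytes(kv_heads=8, bytes_per_value=1, **cfg)  # plus an 8-bit KV cache

for name, b in [("MHA fp16", mha), ("GQA fp16", gqa), ("GQA int8", gqa_int8)]:
    print(f"{name}: {b / 2**30:.1f} GiB per 32k-token sequence")
```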

Emerging Techniques & Methods (113-120)

  1. Mamba (SSM): Selective state-space model with linear-time sequence processing and strong long-context performance.
  2. GraphRAG: Builds/uses knowledge graphs with retrieval to ground answers in entities and relations, improving factuality.
  3. HyDE: Generates hypothetical passages from the query, embeds them, then retrieves real documents matching those pseudo-answers.
  4. Self-Consistency: Samples multiple reasoning paths (CoT) and takes a majority vote on the final answers; often boosts math and logic accuracy (see the sketch after this list).
  5. Reflexion: Agent loop that critiques its own steps, writes takeaways, and retries to improve results.
  6. ORPO: Odds-ratio preference optimization—single-stage alignment alternative to RLHF with simpler training.
  7. IPO (Identity Preference Optimization): DPO-style preference learning with a regularized objective that resists overfitting to preference data; no separate reward model needed.
  8. Structured Outputs (JSON/Schema): Constrains generations to schemas for reliable tool calls and API-safe responses.
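
Self-consistency, from the list above, is just sampling several reasoning paths and taking a majority vote over their final answers. A minimal sketch; `sample_answer` is a hypothetical stand-in for one temperature-sampled chain-of-thought completion:

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one sampled chain-of-thought run of an LLM."""
    return random.choice(["42", "42", "41"])   # noisy but mostly-correct final answers

def self_consistency(question: str, n_samples: int = 5) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority vote on final answers

print(self_consistency("What is 6 * 7?"))
```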

Tip: Bookmark this page. As stacks evolve—new adapters, longer contexts, faster attention kernels—these fundamentals stay useful no matter which model or provider you pick.