
AI/LLM Glossary: 120 Essential Terms

Looking for a fast, accurate guide to the most important words in AI and large language models? Here’s a clean, up-to-date glossary of 120 core terms, ordered roughly by how often people search, discuss, and use them in 2025.

Core LLM Concepts (1-10)

  1. Large Language Model (LLM): A Transformer-based neural network trained to predict the next token and perform language tasks.
  2. Transformer: The dominant neural architecture using attention instead of recurrence or convolutions.
  3. RAG (Retrieval-Augmented Generation): Adds external documents to prompts so models can ground answers in retrieved evidence.
  4. Vector Database: A database optimized for similarity search over embeddings (ANN indexes like HNSW/IVF, scalar filters, hybrid search).
  5. Embedding: Numeric vector representation of text, images, or audio used for search, clustering, or retrieval.
  6. Prompt: The input instruction and context given to an LLM to steer its behavior.
  7. System Prompt: Hidden, high-priority instructions that set overall behavior and guardrails.
  8. Agents: LLM programs that plan, call tools/APIs, and iterate toward a goal with feedback.
  9. Tool/Function Calling: Structured requests where a model outputs JSON arguments to call external tools reliably (see the sketch after this list).
  10. Context Window: The maximum number of tokens the model can attend to at once (input plus generated tokens).
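
Tool calling is easiest to see in code. Below is a minimal, library-free sketch of the pattern: the model is assumed to emit a JSON object naming a function and its arguments, and the host program parses and dispatches the call. The get_weather tool and the model output string are invented for illustration.

```python
import json

# Hypothetical tool the model is allowed to call (illustrative stub).
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temp": 21, "unit": unit}

TOOLS = {"get_weather": get_weather}

# Assume the model returned this string as its structured tool-call output.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'

call = json.loads(model_output)      # parse the structured request
fn = TOOLS[call["name"]]             # look up the requested tool
result = fn(**call["arguments"])     # execute with the model-supplied arguments
print(result)                        # in practice, fed back to the model as a tool message
```

In real provider APIs the tool's JSON schema is declared up front, and the result is appended to the conversation for the model's next turn.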

Tokens & Text Processing (11-22)

  1. Token: The unit a model reads/writes (sub-words, symbols, or bytes, depending on the tokenizer).
  2. Tokenization (BPE/SentencePiece/WordPiece): Algorithms that split text into tokens for efficient modeling.
  3. Self-Attention: Mechanism letting each token attend to others to compute contextualized representations.
  4. Cross-Attention: Attention across two sequences (e.g., decoder attending to retrieved passages or image features).
  5. Softmax: Converts logits into a probability distribution; dividing the logits by a temperature before the softmax controls randomness.
  6. Logits: Raw scores before softmax; higher means more probable next tokens.
  7. Temperature: Scales logits to control randomness; lower is more deterministic, higher is more diverse (see the sampling sketch after this list).
  8. Top-p (Nucleus Sampling): Sample from the smallest set of tokens whose cumulative probability exceeds p.
  9. Top-k Sampling: Sample from the k most probable tokens after renormalization.
  10. Greedy Decoding: Always pick the highest-probability next token (most deterministic, least diverse).
  11. Beam Search: Keep the best few partial sequences to approximate the overall best output.
  12. Hallucination: Fluent but false or ungrounded output; mitigated by RAG, citations, and better evaluation.
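
Several of the decoding terms above (softmax, temperature, top-p) reduce to a few lines of arithmetic. A minimal NumPy sketch with made-up logits for four candidate tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])   # made-up scores for 4 candidate tokens

def softmax(x, temperature=1.0):
    z = x / temperature                    # temperature rescales the logits
    z = z - z.max()                        # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def top_p_sample(probs, p=0.9):
    order = np.argsort(probs)[::-1]        # tokens from most to least probable
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]   # smallest set with mass >= p
    kept = probs[keep] / probs[keep].sum()        # renormalize over the nucleus
    return rng.choice(keep, p=kept)

probs = softmax(logits, temperature=0.7)   # lower temperature -> sharper distribution
print(probs, top_p_sample(probs))
```

Greedy decoding would simply be `np.argmax(probs)`; top-k would truncate `order` to its first k entries before renormalizing.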

Performance & Optimization (23-35)

  1. Grounding: Tying answers to retrieved or trusted sources (documents, tools, databases) to improve accuracy.
  2. KV Cache (Past Key-Values): Stores prior attention keys/values to speed up autoregressive decoding.
  3. Speculative Decoding: A small “draft” model proposes tokens the target model verifies in parallel for speedups.
  4. FlashAttention: Memory-efficient exact attention kernel that reduces HBM traffic and accelerates training/inference.
  5. Mixture of Experts (MoE): Sparse layers route tokens to a few specialized experts, increasing parameters without proportional compute.
  6. Router/Gating: Learns which expert(s) handle each token in MoE layers while balancing load.
  7. Perplexity: Exponential of the average negative log-likelihood; lower implies better next-token prediction (worked example after this list).
  8. MMLU (Massive Multitask Language Understanding): A broad multiple-choice benchmark used to evaluate a model's academic knowledge and reasoning.
  9. GSM8K: Grade-school math word-problem benchmark for reasoning and step-by-step calculation.
  10. HumanEval: Code-generation benchmark that executes tests against generated solutions.
  11. Chain-of-Thought (CoT): Prompting or training to produce step-by-step reasoning before final answers.
  12. Tree-of-Thought (ToT): Explore multiple reasoning branches, then select or vote on the best.
  13. ReAct: Interleaves reasoning traces with actions (tool calls) to solve tasks transparently.
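
Perplexity, from the list above, is just the exponential of the average negative log-likelihood the model assigns to the observed tokens. A worked example with made-up per-token probabilities:

```python
import math

# Probabilities the model (hypothetically) assigned to the actual next tokens.
token_probs = [0.50, 0.25, 0.10, 0.40]

nll = [-math.log(p) for p in token_probs]   # negative log-likelihood per token
avg_nll = sum(nll) / len(nll)               # average NLL (cross-entropy, in nats)
perplexity = math.exp(avg_nll)              # lower is better

print(f"avg NLL = {avg_nll:.3f} nats, perplexity = {perplexity:.2f}")
```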

Training & Fine-Tuning (36-49)

  1. Instruction Tuning: Supervised fine-tuning on instruction-response pairs to follow natural requests.
  2. SFT (Supervised Fine-Tuning): General term for supervised training on labeled input-output pairs.
  3. RLHF (Reinforcement Learning from Human Feedback): Aligns models with human preferences using a reward model and reinforcement learning.
  4. DPO (Direct Preference Optimization): Preference learning without an explicit reward model; simpler than RLHF.
  5. RLAIF (Reinforcement Learning from AI Feedback): Uses AI feedback instead of (or alongside) human feedback to scale preference learning.
  6. LoRA (Low-Rank Adaptation): Low-rank adapters that fine-tune a small set of parameters on top of a frozen base model (see the sketch after this list).
  7. QLoRA: Memory-efficient fine-tuning with 4-bit quantization + LoRA adapters.
  8. PEFT (Parameter-Efficient Fine-Tuning): Family of techniques (LoRA, adapters, prefix/prompt tuning) to update few parameters.
  9. Prompt Tuning / Prefix Tuning: Learnable prompt vectors prepended to inputs instead of updating full weights.
  10. Few-Shot / Zero-Shot: Solve tasks with a handful of examples—or none—purely from instructions.
  11. Evaluation Harness: Scripts and datasets that run consistent, repeatable LLM evaluations.
  12. Guardrails/Moderation: Policies and filters that block unsafe, copyrighted, or private content.
  13. Function/Schema Validation: Forcing well-formed JSON or typed outputs (e.g., via JSON Schemas or grammar-constrained decoding) to reduce parsing errors.
  14. Context Overflow: When input exceeds the window; requires truncation, summarization, or long-context models.
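
LoRA, mentioned above, replaces a full weight update with a low-rank product B·A added to a frozen layer. A minimal PyTorch sketch of a LoRA-wrapped linear layer; the dimensions, rank, and scaling follow common defaults but are illustrative, not a complete training recipe:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update (x A^T) B^T.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
y = layer(torch.randn(2, 1024))                          # only A and B receive gradients
```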

Model Architecture & Optimization (50-66)

  1. Long-Context Techniques: RoPE scaling, ALiBi, attention optimizations, and memory-efficient kernels.
  2. Quantization (INT8/INT4/NF4): Lower precision weights/activations to reduce memory and speed up inference.
  3. Distillation: Train a smaller “student” to mimic a larger “teacher” model’s behavior.
  4. Pruning: Remove weights or neurons with minimal impact to compress models.
  5. AdamW: Widely used optimizer combining Adam with decoupled weight decay.
  6. Learning Rate Schedule: Warmup and decay strategies for stable training.
  7. Batch Size / Micro-Batching: Number of examples per update; micro-batches simulate larger batches under memory limits.
  8. Weight Decay: L2-style regularization to reduce overfitting during training.
  9. Dropout: Randomly zero activations during training to improve generalization.
  10. Label Smoothing: Softens targets to improve calibration and reduce overconfidence.
  11. Residual Connections: Skip paths that ease optimization and stabilize deep networks.
  12. LayerNorm / RMSNorm: Normalization layers; RMSNorm removes mean-centering for efficiency (see the sketch after this list).
  13. Positional Encoding (Sinusoidal/Learned): Injects position information so attention can model order.
  14. RoPE (Rotary Positional Embeddings): Encodes relative positions via rotations applied inside attention.
  15. ALiBi: Adds distance-based linear bias to attention for train-short, test-long extrapolation.
  16. FFN (Feed-Forward Network): Per-token MLP sublayer inside each Transformer block.
  17. SwiGLU/GEGLU: Gated activation variants used in modern FFNs for quality and efficiency.
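
Two of the components above are short enough to write out: RMSNorm and a SwiGLU feed-forward block. A minimal PyTorch sketch with arbitrary dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Like LayerNorm, but without mean-centering or a bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLUFFN(nn.Module):
    """Gated feed-forward sublayer: W2(SiLU(W1 x) * W3 x)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)     # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)     # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)     # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)                              # (batch, tokens, model dim)
y = SwiGLUFFN(512, 1376)(RMSNorm(512)(x))                # pre-norm, then gated FFN
```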

Multimodal & Image Generation (67-79)

  1. Multimodal (VLM): Models that accept or produce multiple modalities (text, image, audio, video).
  2. Vision Transformer (ViT): Applies Transformer blocks to image patches for vision tasks.
  3. Diffusion Model: Learns to denoise noise step-by-step to synthesize images or audio.
  4. Latent Diffusion: Runs diffusion in a compressed latent space (via VAE) for speed and quality.
  5. VAE (Variational Autoencoder): Encoder-decoder that learns a latent space used by latent diffusion.
  6. U-Net: Encoder-decoder CNN backbone common in diffusion denoisers.
  7. DiT (Diffusion Transformer): Transformer-based denoiser replacing U-Nets in diffusion pipelines.
  8. Classifier-Free Guidance: Combines conditional and unconditional predictions to steer images toward prompts (see the sketch after this list).
  9. Guidance Scale: Strength of prompt adherence versus diversity in diffusion sampling.
  10. Negative Prompt: Terms for the image generator to avoid (e.g., “text, watermark, extra fingers”).
  11. Sampler (DDIM/DPM-Solver): Numerical solvers that trade steps for speed vs. fidelity in diffusion.
  12. Seed: Random initialization controlling reproducibility of generated outputs.
  13. Inference Steps: Number of denoising steps; fewer are faster, more can improve detail (up to a point).
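
Classifier-free guidance, from the list above, combines two denoiser predictions at every sampling step. A schematic NumPy sketch; `denoise` is a purely hypothetical stand-in for a real diffusion model's noise predictor:

```python
import numpy as np

def denoise(latent, prompt_embedding):
    """Hypothetical stand-in for a diffusion model's noise prediction."""
    return latent * 0.1 + (0.0 if prompt_embedding is None else 0.01)

def cfg_step(latent, prompt_embedding, guidance_scale=7.5):
    eps_uncond = denoise(latent, None)               # unconditional prediction
    eps_cond = denoise(latent, prompt_embedding)     # prompt-conditioned prediction
    # Push the estimate away from "unconditional" and toward "matches the prompt".
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

latent = np.random.default_rng(0).normal(size=(4, 4))   # toy latent
eps = cfg_step(latent, prompt_embedding=np.ones(8))     # guided noise estimate for one step
```

The guidance scale in the list above is exactly the `guidance_scale` factor here: larger values follow the prompt more closely at the cost of diversity.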

Vector Search & Retrieval (80-91)

  1. HNSW: Hierarchical Navigable Small-World graph index for fast approximate nearest neighbor search.
  2. FAISS: Popular similarity-search library for vector indexes on CPU/GPU.
  3. Cosine Similarity: Measures angle between vectors; common for text embeddings.
  4. Dot Product (Inner Product): Similarity via unnormalized vector projection; common in ANN indexes.
  5. Euclidean Distance (L2): Straight-line distance between vectors; used in many ANN backends.
  6. Hybrid Search (Sparse + Dense): Combine BM25 keywords with embeddings for recall and precision.
  7. BM25: Classic keyword ranking function; strong baseline for sparse retrieval.
  8. Reranker (Cross-Encoder): Re-scores candidates with a more accurate but slower model.
  9. ColBERT (Late Interaction): Multi-vector retrieval that matches tokens for fine-grained relevance.
  10. Chunking: Split long documents into overlapping segments for better retrieval and context fit (see the sketch after this list).
  11. Overlap / Stride: The sliding context shared between neighboring chunks to preserve continuity.
  12. MMR (Maximal Marginal Relevance): Diversifies retrieved passages by balancing relevance and novelty.
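
Chunking with overlap and cosine-similarity scoring, both defined above, are simple to implement. A minimal sketch using character windows and made-up vectors; a real pipeline would chunk by tokens and call an embedding model:

```python
import numpy as np

def chunk(text: str, size: int = 200, overlap: int = 50):
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors standing in for real embeddings.
query_vec = np.array([0.1, 0.9, 0.0, 0.2])
chunk_vecs = [np.array([0.1, 0.8, 0.1, 0.3]), np.array([0.9, 0.0, 0.1, 0.0])]

scores = [cosine(query_vec, v) for v in chunk_vecs]
best = int(np.argmax(scores))                  # index of the most similar chunk
```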

Tools & Frameworks (92-100)

  1. Embedding Model: The specific model used to compute vectors (domain-specific choices matter).
  2. Semantic Caching: Store previous prompts/answers/vectors to speed up repeated or similar queries (see the sketch after this list).
  3. Vibe Coding: Iterative, AI-paired coding style—describe intent, let the agent build, then refine based on results.
  4. Claude Code: Anthropic’s agentic coding tool that edits files, runs commands, and opens PRs from your terminal/IDE.
  5. OpenAI Codex: OpenAI’s early code-generation model (2021) that powered initial GitHub Copilot; foundational but largely superseded by newer GPT-4-class coding models.
  6. DSPy: Programmatic framework to define, evaluate, and optimize LLM pipelines with signatures, modules, and optimizers.
  7. LangChain: Toolkit to compose LLM workflows (chains, tools, agents) with integrations to models and stores.
  8. LlamaIndex: Data framework for RAG (ingestion, indexing, querying, evaluation) across vector stores.
  9. Hugging Face: Open ecosystem for models, datasets, Spaces, and libraries like Transformers and Diffusers.
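
Semantic caching, listed above, can be sketched in a few lines: store embeddings of past queries and reuse a stored answer when a new query is similar enough. The `embed` function below is a hypothetical stand-in for a real embedding model, and the 0.9 threshold is arbitrary:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in; a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

cache = []                                     # list of (query embedding, cached answer)

def lookup(query: str, threshold: float = 0.9):
    q = embed(query)
    for vec, answer in cache:
        if float(q @ vec) >= threshold:        # cosine similarity of unit vectors
            return answer                      # cache hit: skip the LLM call
    return None                                # cache miss: call the LLM, then store()

def store(query: str, answer: str):
    cache.append((embed(query), answer))
```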

Advanced Inference & Serving (101-112)

  1. Multi-Query Attention (MQA): Shares keys/values across heads to cut KV memory and speed decoding with minimal quality loss.
  2. Grouped-Query Attention (GQA): Shares K/V within head groups; quality close to full multi-head attention (MHA) with MQA-like memory savings; used in modern LLMs (see the memory sketch after this list).
  3. Sliding-Window Attention (SWA): Limits attention to a moving window, enabling long contexts with linear memory/runtime growth.
  4. PagedAttention (vLLM): Memory manager that pages KV cache blocks to keep GPUs highly utilized and reduce fragmentation.
  5. Continuous Batching: Dynamically merges requests of different lengths to maximize throughput in serving.
  6. Tensor Parallelism: Splits large matrix ops across GPUs (e.g., by columns/rows) to fit and accelerate massive models.
  7. Pipeline Parallelism: Partitions layers across devices and streams micro-batches to keep all stages busy.
  8. FSDP / ZeRO Sharding: Shards parameters, gradients, and optimizer states across workers to train models beyond single-GPU memory.
  9. FP8: 8-bit floating-point formats (e4m3/e5m2) that reduce bandwidth and memory while preserving training quality with calibration.
  10. AWQ: Activation-aware weight quantization that preserves salient channels for accurate 4-bit inference on GPUs.
  11. GPTQ: Post-training quantization method that minimizes layer-wise error for accurate 4-bit weights.
  12. KV Cache Quantization: Compresses past keys/values (e.g., 8-bit/4-bit) to boost throughput and extend context length.
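
The memory impact of GQA and KV cache quantization is plain arithmetic: the cache holds one key and one value vector per layer, per KV head, per token. A back-of-the-envelope estimator; the example configuration is invented, not a specific published model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for keys plus values; fp16 = 2 bytes per value, int8 cache = 1 byte.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

cfg = dict(layers=32, head_dim=128, seq_len=32_000)      # invented example configuration

mha = kv_cache_bytes(kv_heads=32, **cfg)                 # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **cfg)                  # grouped-query: 4x fewer KV heads
gqa_int8 = kv_cache_bytes(kv_heads=8, bytes_per_value=1, **cfg)  # plus an 8-bit KV cache

for name, b in [("MHA fp16", mha), ("GQA fp16", gqa), ("GQA int8", gqa_int8)]:
    print(f"{name}: {b / 2**30:.1f} GiB per 32k-token sequence")
```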

Emerging Techniques & Methods (113-120)

  1. Mamba (SSM): Selective state-space model with linear-time sequence processing and strong long-context performance.
  2. GraphRAG: Builds/uses knowledge graphs with retrieval to ground answers in entities and relations, improving factuality.
  3. HyDE: Generates hypothetical passages from the query, embeds them, then retrieves real documents matching those pseudo-answers.
  4. Self-Consistency: Samples multiple reasoning paths (CoT) and takes a majority vote on the final answers; often boosts math and logic accuracy (see the sketch after this list).
  5. Reflexion: Agent loop that critiques its own steps, writes takeaways, and retries to improve results.
  6. ORPO: Odds-ratio preference optimization—single-stage alignment alternative to RLHF with simpler training.
  7. IPO (Identity Preference Optimization): DPO-style preference learning with a regularized objective that resists overfitting to preference data; no separate reward model needed.
  8. Structured Outputs (JSON/Schema): Constrains generations to schemas for reliable tool calls and API-safe responses.
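
Self-consistency, from the list above, is just sampling several reasoning paths and taking a majority vote over their final answers. A minimal sketch; `sample_answer` is a hypothetical stand-in for one temperature-sampled chain-of-thought completion:

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for one sampled chain-of-thought run of an LLM."""
    return random.choice(["42", "42", "41"])   # noisy but mostly-correct final answers

def self_consistency(question: str, n_samples: int = 5) -> str:
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # majority vote on final answers

print(self_consistency("What is 6 * 7?"))
```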

Tip: Bookmark this page. As stacks evolve—new adapters, longer contexts, faster attention kernels—these fundamentals stay useful no matter which model or provider you pick.