DSPy 3: Build, Evaluate, and Optimize LLM Pipelines

[Header image: LLM Transformer architecture diagram]

If you’ve ever hand‑tuned prompts that break when you change models, DSPy 3 is for you. DSPy is a framework from the Stanford NLP group that lets you build LLM pipelines as code, then evaluate and optimize them against measurable objectives. Instead of fragile prompt crafting, you define signatures, wire modules together, and let optimizers improve your program.

What DSPy solves

  • Brittle prompts → replace ad‑hoc prompt text with typed, reusable components.
  • Inconsistent outputs → enforce structured outputs (JSON, Pydantic types) with adapters.
  • Guess‑and‑check tuning → use optimizers (formerly “teleprompters”) to improve quality against metrics.
  • Hard‑to‑repro pipelines → one code path that compiles across different LMs and vendors.
  • Evaluation gaps → built‑in evaluators and metrics for quick, reliable feedback loops.

Who built it

DSPy is developed by the Stanford NLP group, with contributions from researchers including Omar Khattab, Matei Zaharia, and Christopher Potts, and from collaborators in the open‑source community.

Core ideas (v3)

  • Signatures: typed I/O contracts (instructions + fields) that describe what a step should do.
  • Modules: building blocks like Predict, ChainOfThought, ReAct, and custom classes that use a signature.
  • Adapters & Types: choose how the LM communicates (ChatAdapter, JSONAdapter) and use typed fields (e.g., JSON/Pydantic, images, tool calls).
  • Optimizers: program improvers such as MIPROv2, GEPA, SIMBA, BootstrapFewShot/Finetune—they search prompts, examples, and parameters to optimize your metric.
  • Evaluation: quick evaluators and metrics (exact match, F1, passage match) for regression‑style feedback.
  • Observability: optional logging/trace export (e.g., MLflow) for experiments and comparisons (see the sketch after this list).
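
Observability doesn’t get its own example later, so here is a minimal sketch of exporting DSPy traces to MLflow. It assumes an MLflow version that ships the DSPy autologging integration (mlflow.dspy.autolog); the experiment name is illustrative.

import mlflow
import dspy

mlflow.set_experiment("dspy-pipeline-demo")  # illustrative experiment name
mlflow.dspy.autolog()                        # assumes MLflow's DSPy integration is installed

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# From here on, module calls and optimizer runs are traced and logged to MLflow.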

Deep dive: Signatures (the contract for a step)

  • A Signature defines inputs, outputs, and a short instruction. Treat it like the interface of your step.
  • Use Python classes with dspy.InputField/dspy.OutputField, or concise string forms like "question, context -> answer" (see the sketch after this list).
  • Signatures are typed: strings, lists, dicts, enums (via Literal[...]), and even Pydantic models for strict JSON.
  • Keep the instruction direct and short; add 1‑line desc on tricky fields for clarity.
  • You can evolve a signature (e.g., add confidence: float) without changing calling code—DSPy handles propagation.
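
As a quick illustration of the string form, a minimal sketch (it assumes dspy.configure(lm=...) has already been called; the field names are illustrative):

import dspy

# Inline string signature: input fields on the left of "->", output fields on the right.
qa = dspy.Predict("question, context -> answer")
pred = qa(question="What is Sweden's capital?", context="Stockholm is Sweden's capital.")
print(pred.answer)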

Example (typed signature)

from typing import Literal
import dspy

class Classify(dspy.Signature):
    """Classify the sentiment of a short sentence."""
    text: str = dspy.InputField(desc="short sentence")
    label: Literal["positive","negative"] = dspy.OutputField(desc="sentiment label")

clf = dspy.Predict(Classify)
print(clf(text="Great coffee!").label)

When to use DSPy

  • You’re building multi‑step LLM workflows (RAG, extraction, classification, agents).
  • You need metric‑driven improvements, not just prettier prompts.
  • You want portability across OpenAI, Anthropic, local models, etc.
  • You have a small dev/test set and want the system to self‑improve.

Quickstart (easiest path)

1) Install and configure the LM

# Shell: pip install -U dspy
import dspy

# Example: set your provider/model once
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

2) Define a signature and a simple module

class AnswerSig(dspy.Signature):
    """Answer the user’s question using the given context."""
    context: str = dspy.InputField(desc="relevant facts")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class Answerer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.step = dspy.Predict(AnswerSig)
    def forward(self, context, question):
        return self.step(context=context, question=question)

3) Add a metric and compile with an optimizer

def exact_match_metric(example, pred, trace=None) -> float:
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

program = Answerer()
optimizer = dspy.MIPROv2(metric=exact_match_metric, auto="light")
trainset = [
    dspy.Example(context="Stockholm is Sweden’s capital.", question="What is Sweden’s capital?", answer="Stockholm").with_inputs("context", "question"),
]
compiled = optimizer.compile(program, trainset=trainset)

4) Run and evaluate

pred = compiled(context="Stockholm is Sweden’s capital.", question="Sweden’s capital?")
print(pred.answer)

evaluator = dspy.Evaluate(devset=trainset, metric=exact_match_metric, display_progress=True)
score = evaluator(compiled)
print({"exact_match": score})

Structured outputs (the 5‑minute win)

Use an adapter to force well‑formed JSON and avoid brittle string parsing.

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

class ExtractSig(dspy.Signature):
    """Extract the person’s name and job title from the text."""
    text: str = dspy.InputField()
    fields: dict = dspy.OutputField(desc="{name: str, title: str}")

extract = dspy.Predict(ExtractSig)
pred = extract(text="Hi, I’m Amir Hossein Teymoori, AI Engineer at SBAB.")
print(pred.fields["name"], pred.fields["title"])  # structured, predictable

Notes on adapters

  • ChatAdapter (default): general‑purpose; formats fields clearly and parses outputs.
  • JSONAdapter: enforces valid JSON and is ideal for APIs/dashboards.
  • Set globally with dspy.configure(adapter=...) or temporarily inside a with dspy.context(adapter=...): block (see the sketch after this list).
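
A minimal sketch of both patterns, reusing the extract module from the example above:

# Global default for the whole program:
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

# Temporary override for a single call site:
with dspy.context(adapter=dspy.ChatAdapter()):
    pred = extract(text="Hi, I'm Amir Hossein Teymoori, AI Engineer at SBAB.")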

Common use cases

  • RAG: retrieve passages, then answer with a Predict or ChainOfThought step.
  • Information Extraction: use JSON/Pydantic signatures for clean downstream consumption.
  • Classification: define a label schema and optimize for accuracy or F1.
  • Agents & Tools: ReAct/tool‑calling modules to plan, call tools, and verify steps (see the sketch after this list).
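
For the agents‑and‑tools case, a minimal sketch using dspy.ReAct with a plain Python function as a tool. The search function is a hypothetical stand‑in; swap in your own retrieval or API call.

def search_wikipedia(query: str) -> str:
    """Hypothetical tool: return a short snippet relevant to the query."""
    return "Stockholm is the capital and largest city of Sweden."  # replace with a real lookup

agent = dspy.ReAct("question -> answer", tools=[search_wikipedia], max_iters=5)
pred = agent(question="What is the capital of Sweden?")
print(pred.answer)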

Optimizers in practice

  • MIPROv2 (Multiprompt Instruction PRoposal Optimizer, v2)
      • Jointly optimizes instructions and few‑shot demos per module.
      • Typical flow: bootstrap demos → propose instructions → Bayesian optimization over combinations.
      • Works well with tiny dev sets; supports auto="light" | "medium" | "heavy", mini‑batches, and threads.
      • A great default when you want strong gains without heavy manual tuning.
  • GEPA (feedback‑driven prompt evolution)
      • Uses reflection + textual feedback to evolve prompts; maintains a Pareto frontier of candidates.
      • Mutates the weakest module, keeps changes that help any example, and can merge good modules from different candidates.
      • Shines when your metric can return helpful messages (e.g., schema errors, unit‑test diffs), not just a number.
  • SIMBA: black‑box search that explores the instruction space aggressively; helpful when other optimizers plateau.
  • BootstrapFewShot / BootstrapFinetune: rapidly collect demos and, when you have labeled data, optionally finetune the LM (see the sketch after this list).
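
As a quick illustration, a minimal BootstrapFewShot sketch that reuses the Answerer program, exact_match_metric, and trainset from the quickstart:

# Bootstrap few-shot demos from the trainset, keeping only those the metric accepts.
bootstrap = dspy.BootstrapFewShot(metric=exact_match_metric, max_bootstrapped_demos=4)
bootstrapped_program = bootstrap.compile(Answerer(), trainset=trainset)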

Deep dive: MIPROv2

def metric(example, pred, trace=None) -> float:
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

opt = dspy.MIPROv2(metric=metric, auto="medium", num_threads=8)
compiled = opt.compile(program, trainset=trainset,
                       max_bootstrapped_demos=4, max_labeled_demos=4)

Under the hood

  • Bootstrap: run on trainset; keep high‑scoring trajectories as demo candidates.
  • Propose: generate instruction candidates using summaries + demos.
  • Search: evaluate combinations on mini‑batches; Bayesian Optimization picks better mixes; return the best program.

Deep dive: GEPA (feedback‑driven)

def feedback_metric(example, pred, trace=None, pred_name=None, pred_trace=None) -> dspy.Prediction:
    score = float(pred.answer.strip().lower() == example.answer.strip().lower())
    fb = []
    if not score:
        fb.append("answer mismatch; ensure exact entity")
    if hasattr(pred, "reasoning") and len(pred.reasoning) > 300:
        fb.append("reasoning too long; be concise")
    return dspy.Prediction(score=score, feedback="; ".join(fb))

gepa = dspy.GEPA(metric=feedback_metric, auto="light",
                 reflection_lm=dspy.LM("openai/gpt-4o"))  # a capable model for reflection
compiled = gepa.compile(program, trainset=trainset)

GEPA intuition

  • Keep multiple frontier candidates instead of one best.
  • Reflect on traces + feedback, mutate one module, and keep improvements that help any example.
  • Periodically merge good modules across candidates to avoid regressions.

A slightly richer example (RAG‑style)

class QA(dspy.Signature):
    """Answer the question using the retrieved passages."""
    question: str = dspy.InputField()
    passages: list[str] = dspy.InputField(desc="retrieved context")
    answer: str = dspy.OutputField()

class QAProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought(QA)
    def forward(self, question, passages):
        return self.answer(question=question, passages=passages)

program = QAProgram()
metric = dspy.evaluate.answer_exact_match
optimizer = dspy.MIPROv2(metric=metric, auto="light")
compiled = optimizer.compile(program, trainset=[
    dspy.Example(question="Capital of Sweden?", passages=["... Stockholm ..."], answer="Stockholm").with_inputs("question", "passages"),
])

Evaluate and iterate like an engineer

  • Start with a tiny dev set (10–50 diverse examples); keep a held‑out test set for final reporting.
  • Use one simple metric first (exact match or F1). Add secondary checks after the pipeline stabilizes.
  • Make your metric program‑aware: metrics receive a trace; read intermediate outputs to penalize long reasoning, missing citations, or schema errors (see the sketch after this list).
  • Use dspy.Evaluate for quick loops; log failures and turn them into new examples for the next compile.
  • Save the compiled program and version your data so results are reproducible across model changes.
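
A minimal sketch of a program‑aware metric plus save/load, reusing the RAG‑style names from above. It assumes the trace is passed as a list of (predictor, inputs, outputs) steps during compilation; the length threshold is illustrative.

def program_aware_metric(example, pred, trace=None) -> float:
    score = float(example.answer.strip().lower() == pred.answer.strip().lower())
    if trace is not None:
        for _predictor, _inputs, outputs in trace:
            reasoning = getattr(outputs, "reasoning", "")
            if reasoning and len(reasoning) > 300:  # penalize overly long reasoning
                score *= 0.5
    return score

# Persist the compiled program and reload it later for reproducible runs.
compiled.save("compiled_qa.json")
reloaded = QAProgram()
reloaded.load("compiled_qa.json")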

Best practices

  • Keep signatures tight: minimal instructions; name outputs clearly; add short desc fields.
  • Use adapters: JSONAdapter for APIs/dashboards; ChatAdapter for open‑ended answers.
  • Start small: 10–50 diverse examples often beat massive, redundant sets.
  • Measure first: pick one simple metric (exact match, F1, etc.).
  • Guardrails via types: prefer JSON/Pydantic outputs for stability.
  • Separate dev/test: compile on dev, report on held‑out test.
  • Version your programs: save/load compiled programs; log runs for comparison.
  • Portability: switch dspy.configure to try other providers or local models without changing program code (see the sketch after this list).
  • Prefer MIPROv2 to start: begin with auto="light", increase to "medium" if gains flatten.
  • Add GEPA when you have feedback: if you can surface textual errors (schema violations, test diffs), GEPA improves faster.
  • Freeze stable modules: once a step is solid, avoid changing its signature; optimize neighbors instead.
  • Keep top‑k variants: ensemble compiled programs when diversity helps robustness.
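
A minimal sketch of the portability point: the program stays the same and only the configuration changes (model identifiers are illustrative; use whatever your provider exposes):

# Same compiled program, different backends.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
# dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20240620"))
# dspy.configure(lm=dspy.LM("ollama_chat/llama3", api_base="http://localhost:11434"))

pred = compiled(question="Capital of Sweden?", passages=["... Stockholm ..."])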

Summary

DSPy 3 turns LLM apps from “prompt art” into measurable, optimizable programs. Define what each step should do (signatures), compose steps (modules), pick an objective (metric), and let optimizers improve your pipeline. It’s cleaner, more reliable, and easier to maintain as your models, data, and requirements evolve.