If you’ve ever hand‑tuned prompts that break when you change models, DSPy 3 is for you. DSPy is a framework from the Stanford NLP Group that lets you build LLM pipelines as code, then evaluate and optimize them against measurable objectives. Instead of fragile prompt crafting, you define signatures, wire up modules, and let optimizers improve your program.
What DSPy solves
- Brittle prompts → replace ad‑hoc prompt text with typed, reusable components.
- Inconsistent outputs → enforce structured outputs (JSON, Pydantic types) with adapters.
- Guess‑and‑check tuning → use optimizers (formerly “teleprompters”) to improve quality against metrics.
- Hard‑to‑repro pipelines → one code path that compiles across different LMs and vendors.
- Evaluation gaps → built‑in evaluators and metrics for quick, reliable feedback loops.
Who built it
DSPy is developed by the Stanford NLP Group, with contributions from researchers including Omar Khattab, Matei Zaharia, Christopher Potts, and collaborators in the open‑source community.
Core ideas (v3)
- Signatures: typed I/O contracts (instructions + fields) that describe what a step should do.
- Modules: building blocks like Predict, ChainOfThought, ReAct, and custom classes that use a signature.
- Adapters & Types: choose how the LM communicates: ChatAdapter, JSONAdapter, plus typed fields (e.g., JSON/Pydantic, images, tool calls).
- Optimizers: program improvers such as MIPROv2, GEPA, SIMBA, and BootstrapFewShot/Finetune; they search over prompts, examples, and parameters to optimize your metric.
- Evaluation: quick evaluators and metrics (exact match, F1, passage match) for regression‑style feedback.
- Observability: optional logging/trace export (e.g., MLflow) for experiments and comparisons.
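A minimal sketch of how these pieces fit together, assuming an OpenAI key in your environment (the model name is just an example). MLflow tracing is optional; dspy.inspect_history gives quick visibility with no extra dependencies:

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any LiteLLM-style model id

qa = dspy.ChainOfThought("question -> answer")    # string-form signature
print(qa(question="What is the capital of Sweden?").answer)

dspy.inspect_history(n=1)  # print the most recent LM call (prompt + completion)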
Deep dive: Signatures (the contract for a step)
- A Signature defines inputs, outputs, and a short instruction. Treat it like the interface of your step.
- Use Python classes with dspy.InputField/dspy.OutputField, or concise string forms like "question, context -> answer".
- Signatures are typed: strings, lists, dicts, enums (via Literal[...]), and even Pydantic models for strict JSON.
- Keep the instruction direct and short; add a 1‑line desc on tricky fields for clarity.
- You can evolve a signature (e.g., add confidence: float) without changing calling code; DSPy handles propagation.
Example (typed signature)
from typing import Literal
import dspy
class Classify(dspy.Signature):
    """Classify the sentiment of the text."""
    text: str = dspy.InputField(desc="short sentence")
    label: Literal["positive", "negative"] = dspy.OutputField(desc="sentiment label")

clf = dspy.Predict(Classify)
print(clf(text="Great coffee!").label)
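The same contract can also be written in the concise string form, and extra outputs can be added later without touching callers that only read .label. A small sketch along those lines:

# String-form signature: same step, minimal ceremony.
quick_clf = dspy.Predict("text -> label")
print(quick_clf(text="Great coffee!").label)

# Evolving the typed version: add a confidence output alongside the label.
class ClassifyWithConfidence(dspy.Signature):
    """Classify the sentiment of the text."""
    text: str = dspy.InputField(desc="short sentence")
    label: Literal["positive", "negative"] = dspy.OutputField(desc="sentiment label")
    confidence: float = dspy.OutputField(desc="confidence between 0 and 1")

clf = dspy.Predict(ClassifyWithConfidence)
print(clf(text="Great coffee!").label)  # existing call sites still work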
When to use DSPy
- You’re building multi‑step LLM workflows (RAG, extraction, classification, agents).
- You need metric‑driven improvements, not just prettier prompts.
- You want portability across OpenAI, Anthropic, local models, etc.
- You have a small dev/test set and want the system to self‑improve.
Quickstart (easiest path)
1) Install and configure the LM
pip install -U dspy
import dspy
# Example: set your provider/model once
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
2) Define a signature and a simple module
class AnswerSig(dspy.Signature):
    """Answer the user's question using the given context."""
    context: str = dspy.InputField(desc="relevant facts")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class Answerer(dspy.Module):
    def __init__(self):
        super().__init__()
        self.step = dspy.Predict(AnswerSig)

    def forward(self, context, question):
        return self.step(context=context, question=question)
3) Add a metric and compile with an optimizer
def exact_match_metric(example, pred, trace=None) -> float:
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

program = Answerer()
optimizer = dspy.MIPROv2(metric=exact_match_metric, auto="light")

trainset = [
    dspy.Example(
        context="Stockholm is Sweden's capital.",
        question="What is Sweden's capital?",
        answer="Stockholm",
    ).with_inputs("context", "question"),
]

compiled = optimizer.compile(program, trainset=trainset)
4) Run and evaluate
pred = compiled(context="Stockholm is Sweden’s capital.", question="Sweden’s capital?")
print(pred.answer)
evaluator = dspy.Evaluate(devset=trainset, metric=exact_match_metric, display_progress=True)
score = evaluator(compiled)
print({"exact_match": score})
Structured outputs (the 5‑minute win)
Use an adapter to force well‑formed JSON and avoid brittle string parsing.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.JSONAdapter())

class ExtractSig(dspy.Signature):
    """Extract the person's name and job title from the text."""
    text: str = dspy.InputField()
    fields: dict = dspy.OutputField(desc="{name: str, title: str}")

extract = dspy.Predict(ExtractSig)
pred = extract(text="Hi, I'm Amir Hossein Teymoori, AI Engineer at SBAB.")
print(pred.fields["name"], pred.fields["title"])  # structured, predictable
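If you want stricter guarantees than a free‑form dict, a Pydantic model can be used directly as the output type. A minimal sketch, where the Person model is illustrative:

from pydantic import BaseModel

class Person(BaseModel):
    name: str
    title: str

class ExtractPerson(dspy.Signature):
    """Extract the person's name and job title from the text."""
    text: str = dspy.InputField()
    person: Person = dspy.OutputField()

extract_typed = dspy.Predict(ExtractPerson)
pred = extract_typed(text="Hi, I'm Amir Hossein Teymoori, AI Engineer at SBAB.")
print(pred.person.name, pred.person.title)  # validated against the Person schema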
Notes on adapters
- ChatAdapter (default): general‑purpose; formats fields clearly and parses outputs.
- JSONAdapter: enforces valid JSON and is ideal for APIs/dashboards.
- Set the adapter globally with dspy.configure(adapter=...), or temporarily inside a with dspy.context(adapter=...): block (sketch below).
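For example, you can keep ChatAdapter as the app‑wide default and enforce JSON only around the extraction step. A minimal sketch, reusing the extract predictor from above:

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"), adapter=dspy.ChatAdapter())

# Temporarily enforce JSON for one call, then fall back to the default adapter.
with dspy.context(adapter=dspy.JSONAdapter()):
    pred = extract(text="Hi, I'm Amir Hossein Teymoori, AI Engineer at SBAB.")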
Common use cases
- RAG: retrieve passages, then answer with a Predict or ChainOfThought step.
- Information Extraction: use JSON/Pydantic signatures for clean downstream consumption.
- Classification: define a label schema and optimize for accuracy or F1.
- Agents & Tools: use ReAct / tool‑calling modules to plan, call tools, and verify steps.
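For the agent case, dspy.ReAct accepts plain Python functions as tools. A minimal sketch, where search_web is a stand‑in for your own implementation:

def search_web(query: str) -> str:
    """Placeholder tool: return text relevant to the query."""
    return "Stockholm is the capital and largest city of Sweden."

agent = dspy.ReAct("question -> answer", tools=[search_web])
print(agent(question="What is the capital of Sweden?").answer)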
Optimizers in practice
- MIPROv2 (Multiprompt Instruction Proposal, v2)
- Jointly optimizes instructions and few‑shot demos per module.
- Typical flow: bootstrap demos → propose instructions → Bayesian optimization over combinations.
- Works well with tiny devsets; supports
auto="light" | "medium" | "heavy", mini‑batches, and threads. - Great default when you want strong gains without heavy manual tuning.
- GEPA (feedback‑driven prompt evolution)
- Uses reflection + textual feedback to evolve prompts; maintains a Pareto frontier of candidates.
- Mutates the weakest module, keeps changes that help any example, and can merge good modules from different candidates.
- Shines when your metric can return helpful messages (e.g., schema errors, unit‑test diffs), not just a number.
- SIMBA: black‑box search that explores instruction space aggressively; helpful when other optimizers plateau.
- BootstrapFewShot / BootstrapFinetune: rapidly collect demos and optionally finetune the LM when you have labeled data.
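BootstrapFewShot is the lightest of these. A minimal sketch, assuming the Answerer program, exact_match_metric, and trainset from the quickstart:

bootstrap = dspy.BootstrapFewShot(
    metric=exact_match_metric,
    max_bootstrapped_demos=4,   # demos generated by running the program
    max_labeled_demos=4,        # demos taken directly from the trainset
)
compiled_fs = bootstrap.compile(Answerer(), trainset=trainset)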
Deep dive: MIPROv2
def metric(example, pred, trace=None) -> float:
    return float(example.answer.strip().lower() == pred.answer.strip().lower())

opt = dspy.MIPROv2(metric=metric, auto="medium", num_threads=8)
compiled = opt.compile(
    program,
    trainset=trainset,
    max_bootstrapped_demos=4,
    max_labeled_demos=4,
)
Under the hood
- Bootstrap: run on trainset; keep high‑scoring trajectories as demo candidates.
- Propose: generate instruction candidates using summaries + demos.
- Search: evaluate combinations on mini‑batches; Bayesian Optimization picks better mixes; return the best program.
Deep dive: GEPA (feedback‑driven)
def feedback_metric(example, pred, trace=None, pred_name=None, pred_trace=None) -> dspy.Prediction:
    score = float(pred.answer.strip().lower() == example.answer.strip().lower())
    fb = []
    if not score:
        fb.append("answer mismatch; ensure exact entity")
    if hasattr(pred, "reasoning") and len(pred.reasoning) > 300:
        fb.append("reasoning too long; be concise")
    return dspy.Prediction(score=score, feedback="; ".join(fb))

gepa = dspy.GEPA(metric=feedback_metric, auto="light")
compiled = gepa.compile(program, trainset=trainset)
GEPA intuition
- Keep multiple frontier candidates instead of one best.
- Reflect on traces + feedback, mutate one module, and keep improvements that help any example.
- Periodically merge good modules across candidates to avoid regressions.
A slightly richer example (RAG‑style)
class QA(dspy.Signature):
    question: str = dspy.InputField()
    passages: list[str] = dspy.InputField(desc="retrieved context")
    answer: str = dspy.OutputField()

class QAProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought(QA)

    def forward(self, question, passages):
        return self.answer(question=question, passages=passages)

program = QAProgram()
metric = dspy.evaluate.answer_exact_match
optimizer = dspy.MIPROv2(metric=metric, auto="light")
compiled = optimizer.compile(program, trainset=[
    dspy.Example(
        question="Capital of Sweden?",
        passages=["... Stockholm ..."],
        answer="Stockholm",
    ).with_inputs("question", "passages"),
])
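To make this end‑to‑end, retrieval can live inside the module so callers pass only a question. A minimal sketch, where retrieve_passages is a placeholder for your vector store or search API:

def retrieve_passages(question: str, k: int = 3) -> list[str]:
    """Placeholder retriever: swap in your own search backend."""
    return ["Stockholm is the capital of Sweden."]

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.ChainOfThought(QA)

    def forward(self, question):
        passages = retrieve_passages(question)
        return self.answer(question=question, passages=passages)

print(RAG()(question="Capital of Sweden?").answer)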
Evaluate and iterate like an engineer
- Start with a tiny dev set (10–50 diverse examples); keep a held‑out test set for final reporting.
- Use one simple metric first (exact match or F1). Add secondary checks after the pipeline stabilizes.
- Make your metric program‑aware: metrics receive a trace; read intermediate outputs to penalize long reasoning, missing citations, or schema errors.
- Use dspy.Evaluate for quick loops; log failures and turn them into new examples for the next compile.
- Save the compiled program and version your data so results are reproducible across model changes.
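Saving and reloading keeps a compiled program reproducible across sessions. A minimal sketch, assuming the Answerer class and compiled program from the quickstart:

compiled.save("answerer_v1.json")   # persists instructions, demos, and settings

reloaded = Answerer()
reloaded.load("answerer_v1.json")
print(reloaded(context="Stockholm is Sweden's capital.", question="Sweden's capital?").answer)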
Best practices
- Keep signatures tight: minimal instructions; name outputs clearly; add short desc fields.
- Use adapters: JSONAdapter for APIs/dashboards; ChatAdapter for open‑ended answers.
- Start small: 10–50 diverse examples often beat massive, redundant sets.
- Measure first: pick one simple metric (exact match, F1, etc.).
- Guardrails via types: prefer JSON/Pydantic outputs for stability.
- Separate dev/test: compile on dev, report on held‑out test.
- Version your programs: save/load compiled programs; log runs for comparison.
- Portability: switch dspy.configure to try other providers or local models without changing program code (see the sketch after this list).
- Prefer MIPROv2 to start: begin with auto="light", increase to "medium" if gains flatten.
- Add GEPA when you have feedback: if you can surface textual errors (schema violations, test diffs), GEPA improves faster.
- Freeze stable modules: once a step is solid, avoid changing its signature; optimize neighbors instead.
- Keep top‑k variants: ensemble compiled programs when diversity helps robustness.
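Portability in practice means swapping providers through configuration only. A minimal sketch, assuming the compiled program from the quickstart; the model identifiers are examples and API keys come from your environment:

# Default provider for the whole app.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Try the same compiled program on another provider, with no code changes.
with dspy.context(lm=dspy.LM("anthropic/claude-3-5-haiku-latest")):
    print(compiled(context="Stockholm is Sweden's capital.",
                   question="Sweden's capital?").answer)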
Summary
DSPy 3 turns LLM apps from “prompt art” into measurable, optimizable programs. Define what each step should do (signatures), compose steps (modules), pick an objective (metric), and let optimizers improve your pipeline. It’s cleaner, more reliable, and easier to maintain as your models, data, and requirements evolve.