AI costs are crushing startups. One company I talked to was spending $47,000/month on LLM API calls—more than their entire engineering payroll. Another saw their OpenAI bill jump from $2,000 to $31,000 in a single month after going viral.
But here’s the thing: most companies are overspending by 60-80% without realizing it. With smart optimization, you can serve the same number of requests for a fraction of the cost—without users noticing any difference in quality.
Here’s how.
Understanding Your Current Costs
Before optimizing, you need to know where money is going. Most teams don’t have visibility into:
- Which features use the most tokens
- What percentage of requests use max context
- How many requests could use a cheaper model
- Which users generate the most API calls
Start by logging every LLM request with:
- Timestamp
- Model used
- Input tokens
- Output tokens
- Cost
- Feature/endpoint
- User ID
- Latency
After a week, you’ll see patterns like “80% of our costs come from the summarization feature” or “10% of users generate 60% of requests.”
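A thin wrapper around your client makes this painless. Here's a minimal sketch against the OpenAI Python SDK; the price table and the `log_llm_call` helper (which writes to whatever analytics store you use) are illustrative, not prescribed:

```python
import time
from openai import OpenAI

client = OpenAI()

# Illustrative prices in $ per 1M tokens; substitute your provider's current rate card.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def logged_chat(model: str, messages: list, feature: str, user_id: str, **kwargs):
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    usage = response.usage
    cost = (usage.prompt_tokens * PRICES[model]["input"]
            + usage.completion_tokens * PRICES[model]["output"]) / 1_000_000
    log_llm_call(                      # hypothetical helper: write to your analytics store
        timestamp=start,
        model=model,
        input_tokens=usage.prompt_tokens,
        output_tokens=usage.completion_tokens,
        cost=cost,
        feature=feature,
        user_id=user_id,
        latency_seconds=time.time() - start,
    )
    return response
```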
Strategy 1: Model Routing (Saves 40-60%)
Not every request needs your most expensive model. Use a cascade approach:
The Cascade Pattern
- Try the cheap model first (e.g., Claude Haiku at $0.25/1M input tokens)
- Check if the response is good enough (use a classifier or simple heuristics)
- If not, retry with a mid-tier model (e.g., GPT-4 Turbo)
- Only use premium model for truly complex queries (e.g., Claude Opus)
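A minimal sketch of that loop, assuming a generic `call_model` wrapper and an `is_good_enough` check you define yourself (both hypothetical; the model names are illustrative):

```python
# Illustrative tier list, cheapest first; escalate only when the answer fails a quality check.
MODEL_TIERS = ["claude-3-5-haiku-latest", "gpt-4-turbo", "claude-3-opus-latest"]

def cascade(prompt: str) -> tuple[str, str]:
    answer = ""
    for model in MODEL_TIERS:
        answer = call_model(model, prompt)   # hypothetical wrapper around your provider SDK
        if is_good_enough(answer, prompt):   # your classifier or heuristics
            return answer, model
    return answer, MODEL_TIERS[-1]           # fell through: keep the premium model's answer
```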
Real example from a customer support chatbot:
| Query Type | % of Requests | Model Used | Blended Cost per 1M Tokens |
|---|---|---|---|
| Simple FAQ | 60% | Claude Haiku | $1 |
| Medium complexity | 30% | GPT-4 Turbo | $12.50 |
| Complex reasoning | 10% | Claude Opus | $18 |
Result: the weighted average cost dropped from $18 per 1M tokens (all Opus) to roughly $6.15 per 1M tokens, a 66% reduction.
How to Decide Which Model to Use
Train a small classifier (or use simple rules) to route requests:
- Cheap model: Short queries (<50 words), FAQ-like questions, greetings
- Mid model: Code generation, data extraction, medium-length content
- Expensive model: Multi-step reasoning, creative writing, complex analysis
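Before investing in a trained classifier, simple rules get you most of the way. A sketch (the patterns and thresholds are illustrative, not tuned):

```python
import re

# Illustrative patterns; replace with signals from your own traffic.
FAQ_PATTERN = re.compile(r"\b(hi|hello|thanks|price|pricing|refund|hours|shipping)\b", re.I)
MID_KEYWORDS = ("code", "function", "json", "extract", "convert", "summarize")

def pick_tier(query: str) -> str:
    if len(query.split()) < 50 and FAQ_PATTERN.search(query):
        return "cheap"       # short, FAQ-like questions and greetings
    if any(word in query.lower() for word in MID_KEYWORDS):
        return "mid"         # code generation, data extraction, medium-length content
    return "expensive"       # multi-step reasoning, creative writing, complex analysis
```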
Strategy 2: Prompt Caching (Saves 20-40%)
If you’re sending the same context repeatedly (like system instructions or knowledge base articles), use prompt caching.
How Prompt Caching Works
Anthropic's Claude models (3.5 and later) and OpenAI's recent GPT models both support prompt caching. With Anthropic you explicitly mark parts of the prompt as cacheable, and follow-up requests read that cached prefix at roughly 90% off the normal input price; OpenAI caches long, repeated prefixes automatically at around a 50% discount.
Example: A RAG application with a 50K token knowledge base
- Without caching: 50K input tokens × $3/1M = $0.15 per request
- With caching: the first request writes the cache (slightly more than $0.15, since cache writes carry a small premium), and every request within the cache window after that reads it for about $0.015
Savings: if 100 more requests arrive within the 5-minute cache window, they cost about $1.50 instead of $15.00, roughly a 90% reduction on those requests.
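With Anthropic's API, for example, you attach `cache_control` to the long, stable prefix. A sketch, assuming the knowledge base lives in a local file:

```python
import anthropic

client = anthropic.Anthropic()
knowledge_base = open("kb.md").read()   # the ~50K-token prefix that never changes per request

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": knowledge_base,
            # Cached for ~5 minutes; later requests read this prefix at a fraction of the input price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does the refund policy say about hardware?"}],
)
print(response.content[0].text)
```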
What to Cache
- System instructions
- Few-shot examples
- Knowledge base documents
- Code snippets (for code review/analysis)
- Long-lived conversation context
What NOT to Cache
- User-specific data (changes every request)
- Real-time information
- Short prompts (overhead not worth it)
Strategy 3: Context Compression (Saves 30-50%)
Most developers send way more context than necessary. Two techniques help:
3a. Intelligent Retrieval
Instead of dumping your entire knowledge base into context, retrieve only the most relevant 3-5 chunks.
Before: 50 document chunks, 100K tokens, $0.30 per request
After: 5 relevant chunks, 10K tokens, $0.03 per request
Use hybrid search (vector + keyword) + reranking to find the best chunks.
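The shape of the code is simple; here `vector_search`, `keyword_search`, and `rerank` are hypothetical stand-ins for whatever retrieval stack you run:

```python
def build_context(query: str, top_k: int = 5) -> str:
    # Pull candidates from both retrievers, then let a reranker pick the best few.
    candidates = vector_search(query, limit=20) + keyword_search(query, limit=20)
    best = rerank(query, candidates)[:top_k]   # dedupe candidates first in a real system
    # Only the handful of relevant chunks goes into the prompt, not the whole knowledge base.
    return "\n\n".join(chunk.text for chunk in best)
```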
3b. Summarize Old Context
In long conversations, summarize older messages instead of including them verbatim.
Example: A 20-turn conversation
- Naive approach: Send all 20 turns every time → grows to 50K tokens
- Smart approach: Summarize turns 1-15 into 2K tokens, keep last 5 turns verbatim → 12K tokens
Savings: 76% reduction in context size.
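One way to implement the rolling summary, using a cheap model for the compression step (the `call_model` helper, the model name, and the 300-word budget are illustrative):

```python
def compress_history(turns: list[dict], keep_last: int = 5) -> list[dict]:
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in old)
    summary = call_model(               # hypothetical wrapper around a cheap model
        "claude-3-5-haiku-latest",
        f"Summarize this conversation in under 300 words, keeping key facts:\n{transcript}",
    )
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```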
Strategy 4: Output Token Management (Saves 15-30%)
Output tokens cost 3-5× more than input tokens. Control them strictly.
Set Maximum Tokens
Always set max_tokens to the smallest value that works:
- Classification: 10 tokens
- Short answer: 50 tokens
- Paragraph: 200 tokens
- Full article: 2000 tokens
The default cap is often 4,096 tokens or higher. max_tokens is a ceiling rather than a target, so you only pay for what the model actually generates, but verbose models will happily fill whatever room you give them; if 100 tokens is enough, a tight cap keeps you from paying for a 4,000-token ramble.
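In practice this can be a per-task lookup passed straight into the API call. A sketch with the OpenAI SDK (the `TASK_LIMITS` table mirrors the list above and is illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Output budgets per task type; tune these to the smallest value that works.
TASK_LIMITS = {"classification": 10, "short_answer": 50, "paragraph": 200, "article": 2000}

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=TASK_LIMITS["classification"],  # cap the output, don't inherit the default
        messages=[{"role": "user", "content": f"Label the sentiment of: {text}"}],
    )
    return response.choices[0].message.content
```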
Use Structured Outputs
JSON responses are more token-efficient than natural language.
Bad (verbose):
"The sentiment is positive. The category is Product Feedback.
The priority is high. The customer seems satisfied but has suggestions."
Good (structured):
```json
{
  "sentiment": "positive",
  "category": "product_feedback",
  "priority": "high"
}
```
Same decision-relevant information in noticeably fewer tokens, and far easier to parse downstream; the savings grow with longer, chattier outputs.
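With the OpenAI SDK you can enforce this with JSON mode rather than hoping the model stays terse (a sketch; the ticket text is a placeholder):

```python
import json
from openai import OpenAI

client = OpenAI()
ticket_text = "Love the product, but the export button is hard to find."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # model must return valid JSON
    max_tokens=60,
    messages=[
        {"role": "system",
         "content": "Return only JSON with keys: sentiment, category, priority."},
        {"role": "user", "content": ticket_text},
    ],
)
result = json.loads(response.choices[0].message.content)
```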
Strategy 5: Batch Processing (Saves 20-40%)
If you have non-urgent requests, batch them and use OpenAI’s Batch API (50% cheaper) or similar offerings.
Good candidates for batching:
- Email summaries (overnight batch)
- Content moderation (delay okay)
- Data enrichment
- Report generation
Example: A company processes 1M customer reviews nightly for sentiment analysis.
- Real-time API: $3,000/day
- Batch API: $1,500/day
Savings: $1,500/day = $45K/month
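The OpenAI flow is: write one JSONL line per request, upload the file, and create a batch with a 24-hour completion window. A sketch, assuming `requests.jsonl` is already prepared:

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one line per review, e.g.
# {"custom_id": "review-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [...], "max_tokens": 10}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # results within 24 hours at half the real-time price
)
print(batch.id, batch.status)
```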
Strategy 6: Self-Hosting Open Source Models (Saves 50-90%)
For high-volume, predictable workloads, self-hosting can be dramatically cheaper.
Break-Even Analysis
| Metric | API (GPT-4 Turbo) | Self-Host (Llama 70B) |
|---|---|---|
| Cost per 1M tokens | $12.50 | ~$2-4 |
| Infrastructure cost | $0 | $2,000/month |
| Break-even volume | – | ~200M tokens/month |
If you process 200M+ tokens/month, self-hosting saves thousands.
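The break-even point in the table is just arithmetic you can rerun with your own numbers (the figures here are the illustrative ones above):

```python
def breakeven_tokens_per_month(api_price: float, selfhost_price: float, infra_monthly: float) -> float:
    """Monthly volume (in millions of tokens) at which self-hosting becomes cheaper."""
    return infra_monthly / (api_price - selfhost_price)

# $12.50 vs ~$3 per 1M tokens, $2,000/month of GPU infrastructure
print(breakeven_tokens_per_month(12.50, 3.0, 2000))  # ≈ 210, i.e. the ~200M tokens/month above
```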
When Self-Hosting Makes Sense
- High volume (>100M tokens/month)
- Predictable load
- Data privacy requirements
- Team has ML infrastructure expertise
When to Stick with APIs
- Low/variable volume
- Small team
- Need latest models
- Fast iteration important
Strategy 7: Aggressive Caching of Final Responses (Saves 30-70%)
Many user queries are repetitive. Cache the actual LLM responses.
Example: “What are your pricing plans?”
This question might be asked 1000 times/day. Generate the answer once, cache it for 24 hours, and serve from cache.
Implementation
- Hash the user query
- Check Redis for cached response
- If cache hit, return instantly (free)
- If cache miss, call LLM and cache result
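A minimal version with Redis; `generate_answer` is a hypothetical wrapper around your LLM call, and exact-match hashing only catches identical queries (normalize or embed queries if you want fuzzier hits):

```python
import hashlib
import redis

cache = redis.Redis()

def cached_answer(query: str, ttl_seconds: int = 86400) -> str:
    key = "llm:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()                  # cache hit: no API call, no cost
    answer = generate_answer(query)          # cache miss: hypothetical LLM call
    cache.set(key, answer, ex=ttl_seconds)   # cache for 24 hours
    return answer
```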
Real results: A documentation chatbot went from $8K/month to $2.4K/month with response caching (70% reduction).
Strategy 8: Reduce Unnecessary Requests
Sometimes the best optimization is not calling the API at all.
Techniques:
- Debouncing: Wait 500ms after user stops typing before calling API
- Client-side validation: Check input before sending to LLM
- Progressive loading: Show partial results from cache while LLM generates full response
- User throttling: Limit requests per user (e.g., 50/hour)
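The per-user throttle, for example, can be a fixed-window counter in Redis (the 50-requests-per-hour limit mirrors the example above):

```python
import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int = 50, window_seconds: int = 3600) -> bool:
    key = f"llm_quota:{user_id}"
    count = r.incr(key)                  # atomically bump this user's counter
    if count == 1:
        r.expire(key, window_seconds)    # first request starts a fresh one-hour window
    return count <= limit
```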
Real-World Case Study: 80% Cost Reduction
Company: AI-powered email assistant
Original cost: $31K/month
Final cost: $6.2K/month
Savings: 80%
What they did:
- Model routing: Used Haiku for 70% of requests → Saved $12K
- Prompt caching: Cached email templates → Saved $5K
- Response caching: Cached common email types → Saved $4K
- Context compression: Reduced email thread context → Saved $2.5K
- Output limits: Tighter max_tokens → Saved $1.3K
No quality degradation. Users didn’t notice any difference.
Monitoring and Optimization Workflow
Weekly:
- Review top 10 most expensive features
- Check cache hit rates
- Analyze failed requests (may indicate wasteful retries)
Monthly:
- Re-evaluate model routing rules
- Test if cheaper models have improved
- Review user feedback for quality issues
Quarterly:
- Consider self-hosting for high-volume features
- Benchmark against new model releases
- Audit for new optimization opportunities
The Checklist
Here’s what to implement, in order of impact:
- ☐ Set max_tokens appropriately (5 mins, 15-30% savings)
- ☐ Enable prompt caching (30 mins, 20-40% savings)
- ☐ Implement response caching (2 hours, 30-70% savings)
- ☐ Add model routing (1 day, 40-60% savings)
- ☐ Optimize context size (2 days, 30-50% savings)
- ☐ Use structured outputs (1 day, 10-20% savings)
- ☐ Evaluate batch processing (1 day, 20-40% for async)
- ☐ Consider self-hosting (1 week, 50-90% at scale)
Start with the quick wins. Even just #1-3 can cut costs in half.
Don’t Optimize Quality Away
The goal isn’t to minimize costs at all costs (pun intended). It’s to maximize value per dollar.
Always measure:
- User satisfaction
- Task success rate
- Response quality metrics
If an optimization hurts quality significantly, roll it back. But you’ll find that 80% of optimizations have zero quality impact—they just eliminate waste.
Your AI bill doesn’t have to be scary. With these strategies, you can scale to millions of users while keeping costs under control.
Now go optimize.