AI costs are crushing startups. One company I talked to was spending $47,000/month on LLM API calls—more than their entire engineering payroll. Another saw their OpenAI bill jump from $2,000 to $31,000 in a single month after going viral.
But here’s the thing: most companies are overspending by 60-80% without realizing it. With smart optimization, you can serve the same number of requests for a fraction of the cost—without users noticing any difference in quality.
Here’s how.
Understanding Your Current Costs
Before optimizing, you need to know where money is going. Most teams don’t have visibility into:
- Which features use the most tokens
- What percentage of requests use max context
- How many requests could use a cheaper model
- Which users generate the most API calls
Start by logging every LLM request with:
- Timestamp
- Model used
- Input tokens
- Output tokens
- Cost
- Feature/endpoint
- User ID
- Latency
After a week, you’ll see patterns like “80% of our costs come from the summarization feature” or “10% of users generate 60% of requests.”
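A thin wrapper around your client makes this painless. Here's a minimal sketch against the OpenAI Python SDK; the price table and the `log_llm_call` helper (which writes to whatever analytics store you use) are illustrative, not prescribed:

```python
import time
from openai import OpenAI

client = OpenAI()

# Illustrative prices in $ per 1M tokens; substitute your provider's current rate card.
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}

def logged_chat(model: str, messages: list, feature: str, user_id: str, **kwargs):
    start = time.time()
    response = client.chat.completions.create(model=model, messages=messages, **kwargs)
    usage = response.usage
    cost = (usage.prompt_tokens * PRICES[model]["input"]
            + usage.completion_tokens * PRICES[model]["output"]) / 1_000_000
    log_llm_call(                      # hypothetical helper: write to your analytics store
        timestamp=start,
        model=model,
        input_tokens=usage.prompt_tokens,
        output_tokens=usage.completion_tokens,
        cost=cost,
        feature=feature,
        user_id=user_id,
        latency_seconds=time.time() - start,
    )
    return response
```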
Strategy 1: Model Routing (Saves 40-60%)
Not every request needs your most expensive model. Use a cascade approach:
The Cascade Pattern
- Try the cheap model first (e.g., Claude Haiku at $0.25/1M input tokens)
- Check if the response is good enough (use a classifier or simple heuristics)
- If not, retry with a mid-tier model (e.g., GPT-4 Turbo)
- Only use premium model for truly complex queries (e.g., Claude Opus)
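A minimal sketch of that loop, assuming a generic `call_model` wrapper and an `is_good_enough` check you define yourself (both hypothetical; the model names are illustrative):

```python
# Illustrative tier list, cheapest first; escalate only when the answer fails a quality check.
MODEL_TIERS = ["claude-3-5-haiku-latest", "gpt-4-turbo", "claude-3-opus-latest"]

def cascade(prompt: str) -> tuple[str, str]:
    answer = ""
    for model in MODEL_TIERS:
        answer = call_model(model, prompt)   # hypothetical wrapper around your provider SDK
        if is_good_enough(answer, prompt):   # your classifier or heuristics
            return answer, model
    return answer, MODEL_TIERS[-1]           # fell through: keep the premium model's answer
```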
Real example from a customer support chatbot:
| Query Type | % of Requests | Model Used | Blended Cost per 1M Tokens |
|---|---|---|---|
| Simple FAQ | 60% | Claude Haiku | $1 |
| Medium complexity | 30% | GPT-4 Turbo | $12.50 |
| Complex reasoning | 10% | Claude Opus | $18 |
Result: the weighted average cost dropped from $18 per 1M tokens (all Opus) to roughly $6.15 per 1M tokens, a 66% reduction.
How to Decide Which Model to Use
Train a small classifier (or use simple rules) to route requests:
- Cheap model: Short queries (<50 words), FAQ-like questions, greetings
- Mid model: Code generation, data extraction, medium-length content
- Expensive model: Multi-step reasoning, creative writing, complex analysis
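Before investing in a trained classifier, simple rules get you most of the way. A sketch (the patterns and thresholds are illustrative, not tuned):

```python
import re

# Illustrative patterns; replace with signals from your own traffic.
FAQ_PATTERN = re.compile(r"\b(hi|hello|thanks|price|pricing|refund|hours|shipping)\b", re.I)
MID_KEYWORDS = ("code", "function", "json", "extract", "convert", "summarize")

def pick_tier(query: str) -> str:
    if len(query.split()) < 50 and FAQ_PATTERN.search(query):
        return "cheap"       # short, FAQ-like questions and greetings
    if any(word in query.lower() for word in MID_KEYWORDS):
        return "mid"         # code generation, data extraction, medium-length content
    return "expensive"       # multi-step reasoning, creative writing, complex analysis
```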
Strategy 2: Prompt Caching (Saves 20-40%)
If you’re sending the same context repeatedly (like system instructions or knowledge base articles), use prompt caching.
How Prompt Caching Works
Anthropic's Claude models (3.5 and later) and OpenAI's recent GPT models both support prompt caching. With Anthropic you explicitly mark parts of the prompt as cacheable, and follow-up requests read that cached prefix at roughly 90% off the normal input price; OpenAI caches long, repeated prefixes automatically at around a 50% discount.
Example: A RAG application with a 50K token knowledge base
- Without caching: 50K input tokens × $3/1M = $0.15 per request
- With caching: the first request writes the cache (slightly more than $0.15, since cache writes carry a small premium), and every request within the cache window after that reads it for about $0.015
Savings: if 100 more requests arrive within the 5-minute cache window, they cost about $1.50 instead of $15.00, roughly a 90% reduction on those requests.
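With Anthropic's API, for example, you attach `cache_control` to the long, stable prefix. A sketch, assuming the knowledge base lives in a local file:

```python
import anthropic

client = anthropic.Anthropic()
knowledge_base = open("kb.md").read()   # the ~50K-token prefix that never changes per request

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": knowledge_base,
            # Cached for ~5 minutes; later requests read this prefix at a fraction of the input price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What does the refund policy say about hardware?"}],
)
print(response.content[0].text)
```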
What to Cache
- System instructions
- Few-shot examples
- Knowledge base documents
- Code snippets (for code review/analysis)
- Long-lived conversation context
What NOT to Cache
- User-specific data (changes every request)
- Real-time information
- Short prompts (overhead not worth it)
Strategy 3: Context Compression (Saves 30-50%)
Most developers send way more context than necessary. Two techniques help:
3a. Intelligent Retrieval
Instead of dumping your entire knowledge base into context, retrieve only the most relevant 3-5 chunks.
Before: 50 document chunks, 100K tokens, $0.30 per request
After: 5 relevant chunks, 10K tokens, $0.03 per request
Use hybrid search (vector + keyword) + reranking to find the best chunks.
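The shape of the code is simple; here `vector_search`, `keyword_search`, and `rerank` are hypothetical stand-ins for whatever retrieval stack you run:

```python
def build_context(query: str, top_k: int = 5) -> str:
    # Pull candidates from both retrievers, then let a reranker pick the best few.
    candidates = vector_search(query, limit=20) + keyword_search(query, limit=20)
    best = rerank(query, candidates)[:top_k]   # dedupe candidates first in a real system
    # Only the handful of relevant chunks goes into the prompt, not the whole knowledge base.
    return "\n\n".join(chunk.text for chunk in best)
```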
3b. Summarize Old Context
In long conversations, summarize older messages instead of including them verbatim.
Example: A 20-turn conversation
- Naive approach: Send all 20 turns every time → grows to 50K tokens
- Smart approach: Summarize turns 1-15 into 2K tokens, keep last 5 turns verbatim → 12K tokens
Savings: 76% reduction in context size.
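One way to implement the rolling summary, using a cheap model for the compression step (the `call_model` helper, the model name, and the 300-word budget are illustrative):

```python
def compress_history(turns: list[dict], keep_last: int = 5) -> list[dict]:
    if len(turns) <= keep_last:
        return turns
    old, recent = turns[:-keep_last], turns[-keep_last:]
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in old)
    summary = call_model(               # hypothetical wrapper around a cheap model
        "claude-3-5-haiku-latest",
        f"Summarize this conversation in under 300 words, keeping key facts:\n{transcript}",
    )
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```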
Strategy 4: Output Token Management (Saves 15-30%)
Output tokens cost 3-5× more than input tokens. Control them strictly.
Set Maximum Tokens
Always set max_tokens to the smallest value that works:
- Classification: 10 tokens
- Short answer: 50 tokens
- Paragraph: 200 tokens
- Full article: 2000 tokens
The default cap is often 4,096 tokens or higher. max_tokens is a ceiling rather than a target, so you only pay for what the model actually generates, but verbose models will happily fill whatever room you give them; if 100 tokens is enough, a tight cap keeps you from paying for a 4,000-token ramble.
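In practice this can be a per-task lookup passed straight into the API call. A sketch with the OpenAI SDK (the `TASK_LIMITS` table mirrors the list above and is illustrative):

```python
from openai import OpenAI

client = OpenAI()

# Output budgets per task type; tune these to the smallest value that works.
TASK_LIMITS = {"classification": 10, "short_answer": 50, "paragraph": 200, "article": 2000}

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=TASK_LIMITS["classification"],  # cap the output, don't inherit the default
        messages=[{"role": "user", "content": f"Label the sentiment of: {text}"}],
    )
    return response.choices[0].message.content
```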
Use Structured Outputs
JSON responses are more token-efficient than natural language.
Bad (verbose):
"The sentiment is positive. The category is Product Feedback.
The priority is high. The customer seems satisfied but has suggestions."
Good (structured):
```json
{
  "sentiment": "positive",
  "category": "product_feedback",
  "priority": "high"
}
```
Same decision-relevant information in noticeably fewer tokens, and far easier to parse downstream; the savings grow with longer, chattier outputs.
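With the OpenAI SDK you can enforce this with JSON mode rather than hoping the model stays terse (a sketch; the ticket text is a placeholder):

```python
import json
from openai import OpenAI

client = OpenAI()
ticket_text = "Love the product, but the export button is hard to find."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # model must return valid JSON
    max_tokens=60,
    messages=[
        {"role": "system",
         "content": "Return only JSON with keys: sentiment, category, priority."},
        {"role": "user", "content": ticket_text},
    ],
)
result = json.loads(response.choices[0].message.content)
```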
Strategy 5: Batch Processing (Saves 20-40%)
If you have non-urgent requests, batch them and use OpenAI’s Batch API (50% cheaper) or similar offerings.
Good candidates for batching:
- Email summaries (overnight batch)
- Content moderation (delay okay)
- Data enrichment
- Report generation
Example: A company processes 1M customer reviews nightly for sentiment analysis.
- Real-time API: $3,000/day
- Batch API: $1,500/day
Savings: $1,500/day = $45K/month
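The OpenAI flow is: write one JSONL line per request, upload the file, and create a batch with a 24-hour completion window. A sketch, assuming `requests.jsonl` is already prepared:

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl: one line per review, e.g.
# {"custom_id": "review-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [...], "max_tokens": 10}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",   # results within 24 hours at half the real-time price
)
print(batch.id, batch.status)
```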
Strategy 6: Self-Hosting Open Source Models (Saves 50-90%)
For high-volume, predictable workloads, self-hosting can be dramatically cheaper.
Break-Even Analysis
| Metric | API (GPT-4 Turbo) | Self-Host (Llama 70B) |
|---|---|---|
| Cost per 1M tokens | $12.50 | ~$2-4 |
| Infrastructure cost | $0 | $2,000/month |
| Break-even volume | – | ~200M tokens/month |
If you process 200M+ tokens/month, self-hosting saves thousands.
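The break-even point in the table is just arithmetic you can rerun with your own numbers (the figures here are the illustrative ones above):

```python
def breakeven_tokens_per_month(api_price: float, selfhost_price: float, infra_monthly: float) -> float:
    """Monthly volume (in millions of tokens) at which self-hosting becomes cheaper."""
    return infra_monthly / (api_price - selfhost_price)

# $12.50 vs ~$3 per 1M tokens, $2,000/month of GPU infrastructure
print(breakeven_tokens_per_month(12.50, 3.0, 2000))  # ≈ 210, i.e. the ~200M tokens/month above
```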
When Self-Hosting Makes Sense
- High volume (>100M tokens/month)
- Predictable load
- Data privacy requirements
- Team has ML infrastructure expertise
When to Stick with APIs
- Low/variable volume
- Small team
- Need latest models
- Fast iteration important
Strategy 7: Aggressive Caching of Final Responses (Saves 30-70%)
Many user queries are repetitive. Cache the actual LLM responses.
Example: “What are your pricing plans?”
This question might be asked 1000 times/day. Generate the answer once, cache it for 24 hours, and serve from cache.
Implementation
- Hash the user query
- Check Redis for cached response
- If cache hit, return instantly (free)
- If cache miss, call LLM and cache result
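A minimal version with Redis; `generate_answer` is a hypothetical wrapper around your LLM call, and exact-match hashing only catches identical queries (normalize or embed queries if you want fuzzier hits):

```python
import hashlib
import redis

cache = redis.Redis()

def cached_answer(query: str, ttl_seconds: int = 86400) -> str:
    key = "llm:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()                  # cache hit: no API call, no cost
    answer = generate_answer(query)          # cache miss: hypothetical LLM call
    cache.set(key, answer, ex=ttl_seconds)   # cache for 24 hours
    return answer
```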
Real results: A documentation chatbot went from $8K/month to $2.4K/month with response caching (70% reduction).
Strategy 8: Reduce Unnecessary Requests
Sometimes the best optimization is not calling the API at all.
Techniques:
- Debouncing: Wait 500ms after user stops typing before calling API
- Client-side validation: Check input before sending to LLM
- Progressive loading: Show partial results from cache while LLM generates full response
- User throttling: Limit requests per user (e.g., 50/hour)
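The per-user throttle, for example, can be a fixed-window counter in Redis (the 50-requests-per-hour limit mirrors the example above):

```python
import redis

r = redis.Redis()

def allow_request(user_id: str, limit: int = 50, window_seconds: int = 3600) -> bool:
    key = f"llm_quota:{user_id}"
    count = r.incr(key)                  # atomically bump this user's counter
    if count == 1:
        r.expire(key, window_seconds)    # first request starts a fresh one-hour window
    return count <= limit
```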
Real-World Case Study: 80% Cost Reduction
Company: AI-powered email assistant
Original cost: $31K/month
Final cost: $6.2K/month
Savings: 80%
What they did:
- Model routing: Used Haiku for 70% of requests → Saved $12K
- Prompt caching: Cached email templates → Saved $5K
- Response caching: Cached common email types → Saved $4K
- Context compression: Reduced email thread context → Saved $2.5K
- Output limits: Tighter max_tokens → Saved $1.3K
No quality degradation. Users didn’t notice any difference.
Monitoring and Optimization Workflow
Weekly:
- Review top 10 most expensive features
- Check cache hit rates
- Analyze failed requests (may indicate wasteful retries)
Monthly:
- Re-evaluate model routing rules
- Test if cheaper models have improved
- Review user feedback for quality issues
Quarterly:
- Consider self-hosting for high-volume features
- Benchmark against new model releases
- Audit for new optimization opportunities
The Checklist
Here’s what to implement, in order of impact:
- ☐ Set max_tokens appropriately (5 mins, 15-30% savings)
- ☐ Enable prompt caching (30 mins, 20-40% savings)
- ☐ Implement response caching (2 hours, 30-70% savings)
- ☐ Add model routing (1 day, 40-60% savings)
- ☐ Optimize context size (2 days, 30-50% savings)
- ☐ Use structured outputs (1 day, 10-20% savings)
- ☐ Evaluate batch processing (1 day, 20-40% for async)
- ☐ Consider self-hosting (1 week, 50-90% at scale)
Start with the quick wins. Even just #1-3 can cut costs in half.
Don’t Optimize Quality Away
The goal isn’t to minimize costs at all costs (pun intended). It’s to maximize value per dollar.
Always measure:
- User satisfaction
- Task success rate
- Response quality metrics
If an optimization hurts quality significantly, roll it back. But you’ll find that 80% of optimizations have zero quality impact—they just eliminate waste.
Your AI bill doesn’t have to be scary. With these strategies, you can scale to millions of users while keeping costs under control.
Now go optimize.