You’ve built an amazing AI application locally. Now you need to deploy it. Simple, right?
Except your laptop has 64GB RAM, local model files, cached embeddings, and environment variables scattered across three different files. Production has… none of that.
Here’s how to bridge the gap between “works on my machine” and “running reliably in production.”
The Production Deployment Checklist
Before the specifics, here’s what you need:
| Component | Purpose | Example Tools |
| --- | --- | --- |
| Containerization | Reproducible environments | Docker, Podman |
| Orchestration | Manage multiple containers | Docker Compose, Kubernetes |
| Reverse Proxy | Handle HTTPS, routing | Caddy, Nginx, Traefik |
| CI/CD | Automated testing & deployment | GitHub Actions, GitLab CI |
| Secrets Management | Secure API keys, passwords | Vault, AWS Secrets Manager |
| Monitoring | Know when things break | Grafana, Datadog, Sentry |
| Logging | Debug production issues | Loki, CloudWatch, Better Stack |

## Step 1: Dockerize Your Application
Docker ensures your app runs the same everywhere. Here’s a production-ready Dockerfile for a Python AI application:
# Multi-stage build for smaller images
FROM python:3.11-slim AS builder
WORKDIR /app
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
    && rm -rf /var/lib/apt/lists/*
# Install Python dependencies (lands in /root/.local)
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Final stage
FROM python:3.11-slim
WORKDIR /app
# Create the non-root user first so copied files can be owned by it
RUN useradd -m appuser
# Copy only what we need from builder, into a directory appuser can read
# (/root is not accessible to a non-root user)
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
COPY --chown=appuser:appuser . .
# Make sure scripts are in PATH
ENV PATH=/home/appuser/.local/bin:$PATH
# Don't run as root
USER appuser
# Health check: urlopen raises on connection failures and HTTP error
# responses, so the check fails when /health is down (stdlib only)
HEALTHCHECK --interval=30s --timeout=3s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=2)"
# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Key Docker Best Practices
- Multi-stage builds: Keep compilers and build tools out of the final image; the reduction can be large (60-80% is often quoted, but it depends on your dependencies)
- Don’t run as root: Security best practice
- Health checks: Let orchestrators know if the container is healthy
- Specific base images: Use python:3.11-slim, not python:latest
- .dockerignore: Exclude unnecessary files (node_modules, .git, cache)
Step 2: Environment Configuration
Never hardcode API keys or secrets. Use environment variables:
# .env.example (check this into git)
ANTHROPIC_API_KEY=sk-ant-xxx
DATABASE_URL=postgresql://localhost/mydb
REDIS_URL=redis://localhost:6379
LOG_LEVEL=info
# .env (never commit this!)
ANTHROPIC_API_KEY=sk-ant-real-key-here
DATABASE_URL=postgresql://user:pass@prod-db.com/prod
Load them in your app:
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings v2 style; real environment variables override .env values
    model_config = SettingsConfigDict(env_file=".env")

    anthropic_api_key: str
    database_url: str
    redis_url: str = "redis://localhost:6379"  # default
    log_level: str = "info"

settings = Settings()
Step 3: Docker Compose for Local Development
Run your entire stack with one command:
version: '3.8'
services:
  app:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/mydb
      - REDIS_URL=redis://redis:6379
    env_file:
      - .env
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    volumes:
      - ./app:/app # hot reload in dev
    restart: unless-stopped
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=mydb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
  # Vector database for RAG
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage
volumes:
  postgres_data:
  redis_data:
  qdrant_data:
Run everything:
docker compose up -d
Your app, database, Redis, and vector database are now running.
Step 4: Caddy for HTTPS and Reverse Proxy
Caddy automatically provisions TLS certificates (from Let’s Encrypt by default). Configuration is beautifully simple:
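# Caddyfile
ai.yourdomain.com {
    # Automatic HTTPS!
    reverse_proxy app:8000
    # Rate limiting. Note: rate_limit is not part of stock Caddy; it needs
    # a custom build that includes a rate-limit plugin
    # (for example mholt/caddy-ratelimit).
    rate_limit {
        zone app_zone {
            key {remote_host}
            events 100
            window 1m
        }
    }
    # Logging
    log {
        output file /var/log/caddy/access.log
        format json
    }
}
# Separate domain for admin panel
admin.yourdomain.com {
    reverse_proxy app:8000
    # Basic auth (newer Caddy releases spell this directive basic_auth).
    # Generate the hash with: caddy hash-password
    basicauth {
        admin $2a$14$hashed_password_here
    }
}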
Add Caddy to docker-compose.yml:
  caddy:
    image: caddy:2-alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config
    restart: unless-stopped
  # Also declare caddy_data and caddy_config under the top-level volumes: key
Boom. HTTPS, rate limiting, and a reverse proxy in about 20 lines.
Step 5: CI/CD Pipeline
Automate testing and deployment with GitHub Actions:
# .github/workflows/deploy.yml
name: Deploy to Production
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Run tests
        run: pytest --cov=app tests/
      - name: Run LLM evals
        run: python scripts/run_evals.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  deploy:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      # Pushing requires an authenticated registry session.
      # The registry and secret names here are placeholders.
      - name: Log in to registry
        uses: docker/login-action@v3
        with:
          registry: your-registry
          username: ${{ secrets.REGISTRY_USERNAME }}
          password: ${{ secrets.REGISTRY_PASSWORD }}
      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          push: true
          tags: your-registry/ai-app:latest
      - name: Deploy to server
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: ${{ secrets.SERVER_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd /opt/ai-app
            docker compose pull
            docker compose up -d
            # -T: no TTY is available in a non-interactive SSH session
            docker compose exec -T app python scripts/migrate.py
Now every push to main:
- Runs unit tests
- Runs LLM evaluations
- Builds Docker image
- Deploys to production
- Runs database migrations
All automatically.
Step 6: Monitoring and Observability
You need to know when things break. Set up Grafana + Prometheus:
# docker-compose.yml additions
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password_here
Instrument your app:
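from anthropic import AsyncAnthropic
from prometheus_client import Counter, Histogram

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

llm_requests = Counter('llm_requests_total', 'Total LLM requests')
llm_errors = Counter('llm_errors_total', 'Failed LLM requests')
llm_latency = Histogram('llm_request_duration_seconds', 'LLM request latency')
llm_cost = Counter('llm_cost_dollars', 'Total LLM cost in dollars')

async def call_llm(prompt: str):
    llm_requests.inc()
    # Time the awaited call itself and count any exception as an error
    with llm_latency.time(), llm_errors.count_exceptions():
        response = await client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,  # required by the Messages API; adjust as needed
            messages=[{"role": "user", "content": prompt}],
        )
    # Track cost; calculate_cost is your own helper mapping token usage to dollars
    cost = calculate_cost(response.usage)
    llm_cost.inc(cost)
    return response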
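Expose these metrics on a /metrics endpoint, point Prometheus at it in prometheus.yml, and you can build Grafana dashboards showing:
- Request volume
- Latency (p50, p95, p99)
- Error rates
- Cost per hour
- Cache hit rates (once you add a counter for cache hits and misses)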
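The instrumentation above calls a calculate_cost helper that the article does not define. Here is a minimal sketch of what it might look like, assuming the usage object exposes input_tokens and output_tokens. The Usage class is a stand-in for the SDK's object, and the prices are parameters, not real rates: take them from your provider's current price list.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for the `usage` object on an LLM response (assumed shape)."""
    input_tokens: int
    output_tokens: int


def calculate_cost(
    usage: Usage,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Return the dollar cost of one request.

    Prices are in dollars per million tokens. Nothing is hardcoded here
    because rates differ by model and change over time.
    """
    input_cost = usage.input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = usage.output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices for illustration only, not real rates.
example_cost = calculate_cost(
    Usage(input_tokens=2_000, output_tokens=500),
    input_price_per_mtok=1.0,
    output_price_per_mtok=2.0,
)
print(f"${example_cost:.4f}")
```

To match the one-argument call used in the instrumentation, you could bind the two prices once with functools.partial, or look them up per model inside the helper.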
Step 7: Blue-Green Deployments
Deploy new versions without downtime:
# Deploy new version (green)
docker compose -f docker-compose.green.yml up -d
# Test it on separate port
curl http://localhost:8001/health
# If good, switch traffic (update Caddy)
# If bad, kill green and keep blue
Or use Kubernetes for automatic rolling updates.
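The "test it, then switch" step in the manual flow above is easy to script so that a deploy job can gate the traffic switch on it. This is a rough sketch using only the standard library. The function names, retry counts, and delays are assumptions, and the URL is the green port from the example above.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 10,
    delay: float = 2.0,
) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example gate for the green stack (not executed here):
# healthy = wait_until_healthy(lambda: http_ok("http://localhost:8001/health"))
# Switch Caddy to green only if `healthy` is True; otherwise stop green.
```

Passing the check in as a callable keeps the retry loop separate from the HTTP call, so the same loop can later gate on a deeper check, such as a test prompt sent through the new version.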
Common Production Issues (And Fixes)
Issue: Out of Memory
Symptom: Container keeps restarting
Fix: Set memory limits in docker-compose.yml so one container cannot starve the host, then size them from observed usage:
services:
  app:
    deploy:
      resources:
        limits:
          memory: 4G
        reservations:
          memory: 2G
Issue: Slow Performance
Symptom: Requests timing out
Fix: Add Redis caching, increase worker processes, use async I/O
Issue: Database Connection Exhaustion
Symptom: “Too many connections” errors
Fix: Use connection pooling (SQLAlchemy, asyncpg), increase DB max connections
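To show why pooling prevents “Too many connections”, here is a toy pool built on the standard library. It is not how SQLAlchemy or asyncpg implement pooling, and in production you would use their built-in pools. The class name and the sqlite3 stand-in database are assumptions for illustration.

```python
import queue
import sqlite3
from contextlib import contextmanager
from typing import Callable, Iterator


class ConnectionPool:
    """Fixed-size pool: at most `size` connections ever exist."""

    def __init__(self, connect: Callable[[], sqlite3.Connection], size: int = 5):
        self._idle: "queue.Queue[sqlite3.Connection]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout: float = 5.0) -> Iterator[sqlite3.Connection]:
        # Blocks until a connection is free; raises queue.Empty after `timeout`
        # instead of opening yet another connection to the database.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


pool = ConnectionPool(lambda: sqlite3.connect(":memory:"), size=2)
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone())
```

The point is the cap: callers wait for a free connection, so the database never sees more than `size` connections from this process, however many requests arrive.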
Issue: Secrets Leaked in Logs
Symptom: API keys visible in logs
Fix: Scrub logs, use structured logging with sensitive field redaction
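One way to scrub logs at the source is a logging filter that rewrites each record before it is written. This sketch assumes two secret shapes taken from the examples earlier in this article: keys starting with sk-ant- and passwords embedded in connection URLs. The patterns and class name are illustrative, so extend them for your own secrets.

```python
import logging
import re

# Assumed secret shapes; add patterns for any other credentials you handle.
SECRET_PATTERNS = [
    (re.compile(r"sk-ant-[A-Za-z0-9_\-]+"), "sk-ant-***"),
    # user:password@host inside a URL -> user:***@host
    (re.compile(r"(://[^:/\s]+:)[^@/\s]+(@)"), r"\1***\2"),
]


def redact(text: str) -> str:
    """Replace anything matching a known secret pattern."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class RedactSecretsFilter(logging.Filter):
    """Rewrite log records so secrets never reach the output."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Merge the args into the message first so secrets passed as
        # arguments are scrubbed too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, only its text changes


handler = logging.StreamHandler()
handler.addFilter(RedactSecretsFilter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Connecting to %s", "postgresql://user:pass@prod-db.com/prod")
```

Attaching the filter to the handler, not to a single logger, means it also covers records that propagate up from child loggers and third-party libraries.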
The Production Deployment Runbook
When deploying a major change:
- ☐ Test locally with docker compose up
- ☐ Run full test suite including LLM evals
- ☐ Deploy to staging environment first
- ☐ Run smoke tests on staging
- ☐ Deploy to 10% of prod traffic (canary)
- ☐ Monitor for 1 hour
- ☐ If metrics look good, deploy to 100%
- ☐ If anything breaks, roll back immediately
- ☐ Keep deployment window open for 24 hours
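"If metrics look good" is easier to act on when it is written down as an explicit rule. The sketch below is one possible way to encode the canary step. The thresholds (minimum request count, allowed error-rate increase, allowed latency ratio) are assumptions to tune for your service, and the statistics would come from your monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for one deployment over the monitoring window."""
    requests: int
    errors: int
    p95_latency_s: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,                 # assumed: below this, too little data
    max_error_rate_increase: float = 0.01,   # assumed: +1 percentage point
    max_latency_ratio: float = 1.2,          # assumed: p95 at most 20% slower
) -> str:
    """Return "wait", "rollback", or "promote" for the canary."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate > baseline.error_rate + max_error_rate_increase:
        return "rollback"
    if canary.p95_latency_s > baseline.p95_latency_s * max_latency_ratio:
        return "rollback"
    return "promote"


baseline = WindowStats(requests=10_000, errors=50, p95_latency_s=1.0)
canary = WindowStats(requests=1_000, errors=6, p95_latency_s=1.1)
print(canary_decision(baseline, canary))
```

Comparing the canary against the current production baseline, not against fixed numbers, keeps the rule meaningful when the upstream LLM API itself is having a slow day.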
Cost Optimization
Running AI in production can get expensive. Optimize:
- Use spot instances for batch jobs (save 60-80%)
- Auto-scale workers based on queue depth
- Cache LLM responses aggressively (Redis with 1-hour TTL)
- Use smaller models where quality difference is minimal
- Batch similar requests to save on API calls
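The response-caching item above can be sketched as follows. An in-memory dictionary stands in for Redis here so the example is self-contained. With Redis you would store the value under the same key with a one-hour expiry. The class and function names are assumptions, and `generate` is whatever function calls your model.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis cache with per-entry expiry."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def cache_key(model: str, prompt: str) -> str:
    """Hash model + prompt so different models never share an answer."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()


def cached_completion(cache: TTLCache, model: str, prompt: str,
                      generate: Callable[[str], str]) -> str:
    """Return a cached answer if present, otherwise generate and store it."""
    key = cache_key(model, prompt)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = generate(prompt)
    cache.set(key, result)
    return result
```

Exact-match caching like this only pays off when identical prompts repeat, and it is a poor fit for calls where you want varied output, so measure the hit rate before relying on it.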
Security Hardening
Essential security practices:
- Run containers as non-root user
- Use secrets management (Vault, AWS Secrets Manager)
- Enable rate limiting (prevent abuse)
- Scan Docker images for vulnerabilities (Trivy, Snyk)
- Use network policies (isolate services)
- Enable audit logging (track all API calls)
- Rotate API keys regularly
Backup and Disaster Recovery
What happens if your server dies?
- Database backups: Automated daily backups to S3
- Vector DB backups: Regular snapshots of Qdrant/Weaviate
- Configuration backups: Store in git (Infrastructure as Code)
- Recovery time objective: Can you restore in under 1 hour?
The Bottom Line
Deploying AI applications is more complex than traditional apps because:
- They depend on external APIs (LLMs)
- They have ML-specific failure modes
- They can be expensive to run at scale
- They require continuous monitoring and improvement
But with the right DevOps practices, you can run AI in production with confidence.
Start simple (Docker + Docker Compose), add complexity only when needed (Kubernetes), and always measure what matters (latency, cost, user satisfaction).
Now go deploy that AI app. The world is waiting.
Frequently Asked Questions
Do I really need Docker for AI apps?
Yes for production. AI apps depend on specific Python, CUDA, and library versions that conflict on shared servers. Docker isolates the dependency tree and gives you reproducible deploys.
What’s the smallest production stack I can use?
A single VPS with Docker Compose, Caddy as reverse proxy and TLS, plus your app and a Postgres or Redis container. That setup handles thousands of daily users on a $20/month box.
How do I deploy LLM apps with zero downtime?
Run two containers behind Caddy, swap traffic with a config reload, then drain the old one. Most teams use Docker Compose plus a deploy script; Kubernetes is overkill for under 5 services.
What about GPU workloads?
Use the NVIDIA Container Toolkit to expose GPUs to Docker. For inference at scale, hosted endpoints (Replicate, Fireworks, Together) are usually cheaper than running your own GPUs.
How do I handle secrets like API keys?
Store them in an .env file outside the image and mount at runtime. Never bake secrets into the image. For larger setups, use Doppler, AWS Secrets Manager, or 1Password CLI.