API vs. Self-Hosted LLMs: A Cost Analysis Framework
When does it make sense to run your own models? A decision framework for enterprise architects with real cost modeling and trade-off analysis.
The question I get asked most often after discussing LLM costs: "Should we just host our own models?"
The assumption is that self-hosting must be cheaper at scale. The reality: it depends on factors most teams don't consider upfront.
This isn't about which is "better"—it's about when each makes sense. I'll share actual cost models from enterprise deployments and give you a framework, not a prescription.
Understanding the Two Cost Models
The API Model: Variable Cost, Zero Infrastructure
The formula is straightforward (a code sketch follows the list below):
Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
Characteristics:
- Pay for exactly what you use
- No upfront investment
- Cost scales linearly with usage
- Provider handles infrastructure, scaling, updates
- Model improvements included automatically
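As a sanity check, here's that formula as a small Python helper — a sketch, not a billing tool; real invoices add caching discounts, batch pricing, and retry overhead, and the 30-day month is a simplifying assumption:

```python
def api_monthly_cost(
    input_tokens_per_day: float,
    output_tokens_per_day: float,
    input_price_per_1m: float,   # $ per 1M input tokens
    output_price_per_1m: float,  # $ per 1M output tokens
    days_per_month: int = 30,    # simplifying assumption
) -> float:
    """Monthly API spend for a steady daily token volume."""
    daily = (
        input_tokens_per_day / 1e6 * input_price_per_1m
        + output_tokens_per_day / 1e6 * output_price_per_1m
    )
    return daily * days_per_month
```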
The Self-Hosted Model: Fixed Cost, Full Control
The formula (also sketched in code after the list below):
Total Cost = Infrastructure + Compute + Storage + Personnel + Maintenance
Characteristics:
- Pay for capacity, not usage
- Significant upfront investment (or committed cloud spend)
- Cost is mostly fixed regardless of utilization
- You handle scaling, updates, security, reliability
- Model improvements require your effort
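The self-hosted side, again as a sketch — the components are the ones listed above; the utilization divisor anticipates the break-even analysis later in this piece:

```python
def self_hosted_monthly_cost(
    infrastructure: float,     # GPU instances, networking
    compute: float = 0.0,      # ancillary compute (gateways, queues)
    storage: float = 0.0,      # model weights, checkpoints, logs
    personnel: float = 0.0,    # prorated FTE cost (discussed later)
    maintenance: float = 0.0,
    utilization: float = 1.0,  # fraction of capacity actually used
) -> float:
    """Effective monthly cost of owned capacity.

    Dividing by utilization prices the tokens you actually serve:
    idle capacity bills the same as busy capacity.
    """
    fixed = infrastructure + compute + storage + personnel + maintenance
    return fixed / utilization
```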
The Mental Model
- API = Taxi (pay per ride, no car ownership)
- Self-Hosted = Owning a car (fixed costs whether you drive or not, near-zero marginal cost per ride)
The question: How many "rides" do you need before owning makes sense?
API Cost Reality Check
Current Pricing Landscape (January 2026)
Here's what the major providers charge per million tokens:
OpenAI:
- GPT-4o: Input $2.50 / Output $10.00 — Best for complex reasoning
- GPT-4o-mini: Input $0.15 / Output $0.60 — Best for high-volume tasks
Anthropic:
- Claude 3.5 Sonnet: Input $3.00 / Output $15.00 — Best for long context
- Claude 3.5 Haiku: Input $0.80 / Output $4.00 — Fast and cheap
Google:
- Gemini 1.5 Pro: Input $1.25 / Output $5.00 — Best for multimodal
- Gemini 1.5 Flash: Input $0.075 / Output $0.30 — Highest volume
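For modeling, the table above collapses into a lookup — prices as listed here (January 2026); verify against current provider pricing pages before depending on them:

```python
# $ per 1M tokens, from the table above -- subject to change.
PRICING = {
    "gpt-4o":            {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini":       {"input": 0.15,  "output": 0.60},
    "claude-3.5-sonnet": {"input": 3.00,  "output": 15.00},
    "claude-3.5-haiku":  {"input": 0.80,  "output": 4.00},
    "gemini-1.5-pro":    {"input": 1.25,  "output": 5.00},
    "gemini-1.5-flash":  {"input": 0.075, "output": 0.30},
}

def blended_price(model: str, input_share: float = 0.8) -> float:
    """Average $ per 1M tokens for a given input/output mix."""
    p = PRICING[model]
    return input_share * p["input"] + (1.0 - input_share) * p["output"]
```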
What Teams Underestimate
- Output tokens cost 4-5x more than input at every price tier above
- Retry logic and error handling add hidden token spend
- Development and testing consume real tokens
- Prompt engineering iterations are expensive
Realistic Enterprise Scenario
- 50M tokens/day (modest enterprise deployment)
- 80% input, 20% output
- Model: Claude 3.5 Haiku, a mid-priced option from the table above
Monthly calculation: 40M input × $0.80/1M + 10M output × $4.00/1M = $72/day, or roughly $2,160/month — verified in the snippet below.
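Running the numbers with the helpers sketched earlier (api_monthly_cost and PRICING are from this article's sketches, not a library):

```python
# 50M tokens/day at an 80/20 input/output split, Claude 3.5 Haiku rates
cost = api_monthly_cost(
    input_tokens_per_day=40e6,   # 80% of 50M
    output_tokens_per_day=10e6,  # 20% of 50M
    input_price_per_1m=PRICING["claude-3.5-haiku"]["input"],
    output_price_per_1m=PRICING["claude-3.5-haiku"]["output"],
)
print(f"${cost:,.0f}/month")  # -> $2,160/month
```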
Self-Hosted Cost Reality Check
Cloud GPU Instances
AWS Options:
- g5.xlarge with A10G (24GB): About $1.00/hour or $730/month — Runs 7B models
- g5.12xlarge with 4x A10G (96GB): About $5.67/hour or $4,140/month — Runs 70B quantized
- p4d.24xlarge with 8x A100 (320GB): About $32.77/hour or $23,900/month — Runs 70B+ full precision
GCP Options:
- a2-highgpu-1g with A100 (40GB): About $3.67/hour or $2,680/month — Runs 70B heavily quantized (a tight fit in 40GB)
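A useful way to compare these boxes against API pricing is effective cost per million tokens at a sustained throughput. The throughput figure below is a placeholder of mine, not a benchmark — it varies wildly with model size, quantization, batching, and serving stack, so measure your own workload:

```python
def self_hosted_price_per_1m(
    monthly_instance_cost: float,
    tokens_per_second: float,   # sustained, benchmarked throughput
    utilization: float = 0.8,
) -> float:
    """Effective $ per 1M tokens for a GPU instance at a given throughput."""
    tokens_per_month = tokens_per_second * 3600 * 24 * 30 * utilization
    return monthly_instance_cost / (tokens_per_month / 1e6)

# Hypothetical: g5.12xlarge ($4,140/month) sustaining 1,500 tokens/sec
print(f"${self_hosted_price_per_1m(4140, 1500):.2f} per 1M tokens")  # ~$1.33
```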
Hidden costs people forget:
- Reserved instances require 1-3 year commitments
- Spot instances have interruption risk
- Egress bandwidth for distributed inference
- Storage for model weights and checkpoints
The Personnel Factor
This is what most cost analyses miss:
- Someone has to manage the infrastructure
- Someone has to handle model updates and security patches
- Someone has to debug inference issues at 3 AM
- Someone has to optimize for your specific workloads
Estimate: 0.25-1.0 FTE depending on scale and complexity
At a $150K fully loaded cost per engineer, that's roughly $38K-150K per year in personnel alone.
The Break-Even Analysis
When Does Self-Hosting Win?
Variables:
- T = Daily token volume
- P = API price per token
- I = Monthly infrastructure cost (self-hosted)
- U = Utilization rate (what percent of capacity you actually use)
Break-even when: T × P × 30 = I / U
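Solving for T gives the daily volume where the two cost lines cross — a sketch whose answer is only as good as the blended price and utilization you feed it:

```python
def break_even_tokens_per_day(
    monthly_infra_cost: float,    # I
    blended_price_per_1m: float,  # P, expressed per 1M tokens
    utilization: float = 0.8,     # U
) -> float:
    """Daily volume where API spend equals effective self-hosted cost.

    From T * P * 30 = I / U  =>  T = I / (U * P * 30).
    """
    price_per_token = blended_price_per_1m / 1e6
    return monthly_infra_cost / (utilization * price_per_token * 30)

# Claude 3.5 Haiku at an 80/20 mix blends to $1.44 per 1M tokens:
print(f"{break_even_tokens_per_day(4140, 1.44):,.0f}")  # ~120M tokens/day
```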
Worked Example: 50M tokens/day
API cost (Claude 3.5 Haiku): approximately $2,160/month
Self-hosted (Llama 70B on 4x A10G):
- Infrastructure: $4,140/month
- Assumes 80% utilization
- Effective cost: $5,175/month ($4,140 / 0.80)
Result: API wins by roughly 2.4x
Worked Example: 500M tokens/day
API cost (Claude 3.5 Haiku): approximately $21,600/month
Self-hosted (same setup, now near capacity — assuming the cluster can actually sustain this throughput; benchmark before believing it):
- Infrastructure: $4,140/month
- Assumes 95% utilization
- Effective cost: $4,360/month ($4,140 / 0.95)
Result: Self-hosted wins by roughly 5x
Key insight: The crossover point typically lands around 100-200M tokens/day at mid-tier prices (Haiku-class), higher for the cheapest models (GPT-4o-mini, Gemini Flash), and much lower for expensive models like GPT-4 or Claude Opus.
Beyond Cost: The Hidden Trade-offs
What You Gain with Self-Hosting
Data privacy and sovereignty
- Data never leaves your infrastructure
- Critical for healthcare, finance, government
- Simplifies compliance (HIPAA, SOC 2, GDPR)
Latency control
- No network round-trip to external API
- P99 latency more predictable
- Can co-locate with your data
Customization
- Fine-tune on your domain data
- Custom tokenizers for specialized vocabularies
- Inference optimizations for your workload
No rate limits
- Burst capacity limited only by your hardware
- No vendor throttling during peak demand
What You Lose with Self-Hosting
Model quality gap
- GPT-4o and Claude 3.5 still outperform open models on many tasks
- Gap is closing but not closed
Automatic improvements
- API models improve without your effort
- Self-hosted requires manual upgrades
Operational burden
- You're now in the AI infrastructure business
- Security patches, scaling, monitoring all on you
Flexibility
- Switching models requires infrastructure changes
- API lets you switch with a config change
The 5-Question Framework
Before choosing, answer these (a rough triage sketch follows the list):
1. What's your token volume?
- Less than 10M/day: API (almost always)
- 10-100M/day: Depends on other factors
- More than 100M/day: Self-hosted worth serious analysis
2. What's your latency requirement?
- Under 100ms P99: Consider self-hosted or edge
- Under 500ms P99: API works fine
- Over 500ms acceptable: API definitely works
3. What are your data residency constraints?
- Strict (healthcare, gov): Self-hosted or private cloud API
- Moderate (enterprise): Either works with proper contracts
- Minimal: API simplest
4. Do you have ML infrastructure expertise?
- Strong team already: Self-hosted viable
- Limited expertise: API or managed service
- No expertise: Definitely API
5. What's your model quality requirement?
- State-of-the-art required: API (GPT-4, Claude)
- Good enough for task: Self-hosted viable
- Specialized domain: Fine-tuned self-hosted may be best
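If it helps to see the framework operationalized, here's a rough triage function — illustrative only, with thresholds taken from the questions above; real decisions weigh all five answers rather than short-circuiting on the first match:

```python
def hosting_recommendation(
    tokens_per_day: float,
    p99_latency_budget_ms: float,
    strict_data_residency: bool,
    has_ml_infra_team: bool,
    needs_frontier_quality: bool,
) -> str:
    """Rough triage over the five questions above; a starting point, not a verdict."""
    if strict_data_residency:
        return "self-hosted or private-cloud API"
    if needs_frontier_quality:
        return "API (frontier models)"
    if not has_ml_infra_team:
        return "API or managed service"
    if tokens_per_day > 100e6 or p99_latency_budget_ms < 100:
        return "self-hosted worth serious analysis"
    return "API (revisit as volume grows)"
```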
The Hybrid Reality
Most enterprises I work with end up with both (a per-request routing sketch follows the list):
- Self-hosted for: High-volume, privacy-sensitive, latency-critical
- API for: Complex reasoning, low-volume, experimental
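In practice that split often lives in a per-request router. A hypothetical sketch — the criteria and tier names are mine, not a standard:

```python
def route_request(
    contains_sensitive_data: bool,
    latency_critical: bool,
    needs_frontier_reasoning: bool,
) -> str:
    """Hypothetical per-request router for a hybrid deployment."""
    if contains_sensitive_data:
        return "self-hosted"   # privacy-sensitive: data stays in-house
    if latency_critical:
        return "self-hosted"   # no external network round-trip
    if needs_frontier_reasoning:
        return "frontier-api"  # complex reasoning goes to the API tier
    return "self-hosted"       # high-volume default runs on owned capacity
```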
Conclusion
The question isn't "API or self-hosted?" It's "what's the right mix for your specific constraints?"
Key takeaways:
- API is cheaper for most workloads under 100M tokens/day
- Self-hosting wins on cost only at very high volume
- Privacy, latency, and compliance often matter more than cost
- Personnel cost is the most underestimated factor
- Hybrid architectures are the enterprise norm
Our job as architects isn't to pick the cheapest option. It's to design systems that are sustainable at current scale, adaptable as requirements change, and aligned with organizational constraints.
The cost model should inform the architecture, not dictate it.