API vs. Self-Hosted LLMs: A Cost Analysis Framework
When does it make sense to run your own models? A decision framework for enterprise architects with real cost modeling and trade-off analysis.
The question I get asked most often after discussing LLM costs: "Should we just host our own models?"
The assumption is that self-hosting must be cheaper at scale. The reality: it depends on factors most teams don't consider upfront.
This isn't about which is "better"—it's about when each makes sense. I'll share actual cost models from enterprise deployments and give you a framework, not a prescription.
Understanding the Two Cost Models
The API Model: Variable Cost, Zero Infrastructure
The formula is straightforward (a code sketch follows the list below):
Total Cost = (Input Tokens × Input Price) + (Output Tokens × Output Price)
Characteristics:
- Pay for exactly what you use
- No upfront investment
- Cost scales linearly with usage
- Provider handles infrastructure, scaling, updates
- Model improvements included automatically
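As a sanity check, here's that formula as a small Python helper — a sketch, not a billing tool; real invoices add caching discounts, batch pricing, and retry overhead, and the 30-day month is a simplifying assumption:

```python
def api_monthly_cost(
    input_tokens_per_day: float,
    output_tokens_per_day: float,
    input_price_per_1m: float,   # $ per 1M input tokens
    output_price_per_1m: float,  # $ per 1M output tokens
    days_per_month: int = 30,    # simplifying assumption
) -> float:
    """Monthly API spend for a steady daily token volume."""
    daily = (
        input_tokens_per_day / 1e6 * input_price_per_1m
        + output_tokens_per_day / 1e6 * output_price_per_1m
    )
    return daily * days_per_month
```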
The Self-Hosted Model: Fixed Cost, Full Control
The formula (also sketched in code after the list below):
Total Cost = Infrastructure + Compute + Storage + Personnel + Maintenance
Characteristics:
- Pay for capacity, not usage
- Significant upfront investment (or committed cloud spend)
- Cost is mostly fixed regardless of utilization
- You handle scaling, updates, security, reliability
- Model improvements require your effort
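The self-hosted side, again as a sketch — the components are the ones listed above; the utilization divisor anticipates the break-even analysis later in this piece:

```python
def self_hosted_monthly_cost(
    infrastructure: float,     # GPU instances, networking
    compute: float = 0.0,      # ancillary compute (gateways, queues)
    storage: float = 0.0,      # model weights, checkpoints, logs
    personnel: float = 0.0,    # prorated FTE cost (discussed later)
    maintenance: float = 0.0,
    utilization: float = 1.0,  # fraction of capacity actually used
) -> float:
    """Effective monthly cost of owned capacity.

    Dividing by utilization prices the tokens you actually serve:
    idle capacity bills the same as busy capacity.
    """
    fixed = infrastructure + compute + storage + personnel + maintenance
    return fixed / utilization
```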
The Mental Model
- API = Taxi (pay per ride, no car ownership)
- Self-Hosted = Owning a car (fixed costs whether you drive or not, near-zero marginal cost per ride)
The question: How many "rides" do you need before owning makes sense?
API Cost Reality Check
Current Pricing Landscape (January 2026)
Here's what the major providers charge per million tokens:
OpenAI:
- GPT-4o: Input $2.50 / Output $10.00 — Best for complex reasoning
- GPT-4o-mini: Input $0.15 / Output $0.60 — Best for high-volume tasks
Anthropic:
- Claude 3.5 Sonnet: Input $3.00 / Output $15.00 — Best for long context
- Claude 3.5 Haiku: Input $0.80 / Output $4.00 — Fast and cheap
Google:
- Gemini 1.5 Pro: Input $1.25 / Output $5.00 — Best for multimodal
- Gemini 1.5 Flash: Input $0.075 / Output $0.30 — Highest volume
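For modeling, the table above collapses into a lookup — prices as listed here (January 2026); verify against current provider pricing pages before depending on them:

```python
# $ per 1M tokens, from the table above -- subject to change.
PRICING = {
    "gpt-4o":            {"input": 2.50,  "output": 10.00},
    "gpt-4o-mini":       {"input": 0.15,  "output": 0.60},
    "claude-3.5-sonnet": {"input": 3.00,  "output": 15.00},
    "claude-3.5-haiku":  {"input": 0.80,  "output": 4.00},
    "gemini-1.5-pro":    {"input": 1.25,  "output": 5.00},
    "gemini-1.5-flash":  {"input": 0.075, "output": 0.30},
}

def blended_price(model: str, input_share: float = 0.8) -> float:
    """Average $ per 1M tokens for a given input/output mix."""
    p = PRICING[model]
    return input_share * p["input"] + (1.0 - input_share) * p["output"]
```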
What Teams Underestimate
- Output tokens cost 4-5x more than input at every price tier above
- Retry logic and error handling add hidden token spend
- Development and testing consume real tokens
- Prompt engineering iterations are expensive
Realistic Enterprise Scenario
- 50M tokens/day (modest enterprise deployment)
- 80% input, 20% output
- Model: Claude 3.5 Haiku, a mid-priced option from the table above
Monthly calculation: 40M input × $0.80/1M + 10M output × $4.00/1M = $72/day, or roughly $2,160/month — verified in the snippet below.
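Running the numbers with the helpers sketched earlier (api_monthly_cost and PRICING are from this article's sketches, not a library):

```python
# 50M tokens/day at an 80/20 input/output split, Claude 3.5 Haiku rates
cost = api_monthly_cost(
    input_tokens_per_day=40e6,   # 80% of 50M
    output_tokens_per_day=10e6,  # 20% of 50M
    input_price_per_1m=PRICING["claude-3.5-haiku"]["input"],
    output_price_per_1m=PRICING["claude-3.5-haiku"]["output"],
)
print(f"${cost:,.0f}/month")  # -> $2,160/month
```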
Self-Hosted Cost Reality Check
Cloud GPU Instances
AWS Options:
- g5.xlarge with A10G (24GB): About $1.00/hour or $730/month — Runs 7B models
- g5.12xlarge with 4x A10G (96GB): About $5.67/hour or $4,140/month — Runs 70B quantized
- p4d.24xlarge with 8x A100 (320GB): About $32.77/hour or $23,900/month — Runs 70B+ full precision
GCP Options:
- a2-highgpu-1g with A100 (40GB): About $3.67/hour or $2,680/month — Runs 70B heavily quantized (a tight fit in 40GB)
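A useful way to compare these boxes against API pricing is effective cost per million tokens at a sustained throughput. The throughput figure below is a placeholder of mine, not a benchmark — it varies wildly with model size, quantization, batching, and serving stack, so measure your own workload:

```python
def self_hosted_price_per_1m(
    monthly_instance_cost: float,
    tokens_per_second: float,   # sustained, benchmarked throughput
    utilization: float = 0.8,
) -> float:
    """Effective $ per 1M tokens for a GPU instance at a given throughput."""
    tokens_per_month = tokens_per_second * 3600 * 24 * 30 * utilization
    return monthly_instance_cost / (tokens_per_month / 1e6)

# Hypothetical: g5.12xlarge ($4,140/month) sustaining 1,500 tokens/sec
print(f"${self_hosted_price_per_1m(4140, 1500):.2f} per 1M tokens")  # ~$1.33
```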
Hidden costs people forget:
- Reserved instances require 1-3 year commitments
- Spot instances have interruption risk
- Egress bandwidth for distributed inference
- Storage for model weights and checkpoints
The Personnel Factor
This is what most cost analyses miss:
- Someone has to manage the infrastructure
- Someone has to handle model updates and security patches
- Someone has to debug inference issues at 3 AM
- Someone has to optimize for your specific workloads
Estimate: 0.25-1.0 FTE depending on scale and complexity
At a $150K fully loaded cost per engineer, that's roughly $38K-150K per year in personnel alone.
The Break-Even Analysis
When Does Self-Hosting Win?
Variables:
- T = Daily token volume
- P = API price per token
- I = Monthly infrastructure cost (self-hosted)
- U = Utilization rate (what percent of capacity you actually use)
Break-even when: T × P × 30 = I / U
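Solving for T gives the daily volume where the two cost lines cross — a sketch whose answer is only as good as the blended price and utilization you feed it:

```python
def break_even_tokens_per_day(
    monthly_infra_cost: float,    # I
    blended_price_per_1m: float,  # P, expressed per 1M tokens
    utilization: float = 0.8,     # U
) -> float:
    """Daily volume where API spend equals effective self-hosted cost.

    From T * P * 30 = I / U  =>  T = I / (U * P * 30).
    """
    price_per_token = blended_price_per_1m / 1e6
    return monthly_infra_cost / (utilization * price_per_token * 30)

# Claude 3.5 Haiku at an 80/20 mix blends to $1.44 per 1M tokens:
print(f"{break_even_tokens_per_day(4140, 1.44):,.0f}")  # ~120M tokens/day
```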
Worked Example: 50M tokens/day
API cost (Claude 3.5 Haiku): approximately $2,160/month
Self-hosted (Llama 70B on 4x A10G):
- Infrastructure: $4,140/month
- Assumes 80% utilization
- Effective cost: $5,175/month ($4,140 / 0.80)
Result: API wins by roughly 2.4x
Worked Example: 500M tokens/day
API cost (Claude 3.5 Haiku): approximately $21,600/month
Self-hosted (same setup, now near capacity — assuming the cluster can actually sustain this throughput; benchmark before believing it):
- Infrastructure: $4,140/month
- Assumes 95% utilization
- Effective cost: $4,360/month ($4,140 / 0.95)
Result: Self-hosted wins by roughly 5x
Key insight: The crossover point typically lands around 100-200M tokens/day at mid-tier prices (Haiku-class), higher for the cheapest models (GPT-4o-mini, Gemini Flash), and much lower for expensive models like GPT-4 or Claude Opus.
Beyond Cost: The Hidden Trade-offs
What You Gain with Self-Hosting
Data privacy and sovereignty
- Data never leaves your infrastructure
- Critical for healthcare, finance, government
- Simplifies compliance (HIPAA, SOC 2, GDPR)
Latency control
- No network round-trip to external API
- P99 latency more predictable
- Can co-locate with your data
Customization
- Fine-tune on your domain data
- Custom tokenizers for specialized vocabularies
- Inference optimizations for your workload
No rate limits
- Burst capacity limited only by your hardware
- No vendor throttling during peak demand
What You Lose with Self-Hosting
Model quality gap
- GPT-4o and Claude 3.5 still outperform open models on many tasks
- Gap is closing but not closed
Automatic improvements
- API models improve without your effort
- Self-hosted requires manual upgrades
Operational burden
- You're now in the AI infrastructure business
- Security patches, scaling, monitoring all on you
Flexibility
- Switching models requires infrastructure changes
- API lets you switch with a config change
The 5-Question Framework
Before choosing, answer these (a rough triage sketch follows the list):
1. What's your token volume?
- Less than 10M/day: API (almost always)
- 10-100M/day: Depends on other factors
- More than 100M/day: Self-hosted worth serious analysis
2. What's your latency requirement?
- Under 100ms P99: Consider self-hosted or edge
- Under 500ms P99: API works fine
- Over 500ms acceptable: API definitely works
3. What are your data residency constraints?
- Strict (healthcare, gov): Self-hosted or private cloud API
- Moderate (enterprise): Either works with proper contracts
- Minimal: API simplest
4. Do you have ML infrastructure expertise?
- Strong team already: Self-hosted viable
- Limited expertise: API or managed service
- No expertise: Definitely API
5. What's your model quality requirement?
- State-of-the-art required: API (GPT-4, Claude)
- Good enough for task: Self-hosted viable
- Specialized domain: Fine-tuned self-hosted may be best
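If it helps to see the framework operationalized, here's a rough triage function — illustrative only, with thresholds taken from the questions above; real decisions weigh all five answers rather than short-circuiting on the first match:

```python
def hosting_recommendation(
    tokens_per_day: float,
    p99_latency_budget_ms: float,
    strict_data_residency: bool,
    has_ml_infra_team: bool,
    needs_frontier_quality: bool,
) -> str:
    """Rough triage over the five questions above; a starting point, not a verdict."""
    if strict_data_residency:
        return "self-hosted or private-cloud API"
    if needs_frontier_quality:
        return "API (frontier models)"
    if not has_ml_infra_team:
        return "API or managed service"
    if tokens_per_day > 100e6 or p99_latency_budget_ms < 100:
        return "self-hosted worth serious analysis"
    return "API (revisit as volume grows)"
```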
The Hybrid Reality
Most enterprises I work with end up with both (a per-request routing sketch follows the list):
- Self-hosted for: High-volume, privacy-sensitive, latency-critical
- API for: Complex reasoning, low-volume, experimental
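In practice that split often lives in a per-request router. A hypothetical sketch — the criteria and tier names are mine, not a standard:

```python
def route_request(
    contains_sensitive_data: bool,
    latency_critical: bool,
    needs_frontier_reasoning: bool,
) -> str:
    """Hypothetical per-request router for a hybrid deployment."""
    if contains_sensitive_data:
        return "self-hosted"   # privacy-sensitive: data stays in-house
    if latency_critical:
        return "self-hosted"   # no external network round-trip
    if needs_frontier_reasoning:
        return "frontier-api"  # complex reasoning goes to the API tier
    return "self-hosted"       # high-volume default runs on owned capacity
```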
Conclusion
The question isn't "API or self-hosted?" It's "what's the right mix for your specific constraints?"
Key takeaways:
- API is cheaper for most workloads under 100M tokens/day
- Self-hosting wins on cost only at very high volume
- Privacy, latency, and compliance often matter more than cost
- Personnel cost is the most underestimated factor
- Hybrid architectures are the enterprise norm
Our job as architects isn't to pick the cheapest option. It's to design systems that are sustainable at current scale, adaptable as requirements change, and aligned with organizational constraints.
The cost model should inform the architecture, not dictate it.