AI Model Selection: Balancing Cost, Speed, and Quality
The short answer: use the cheapest model that meets your quality bar. For most teams in mid-2026, that means routing simple tasks through GPT-5.4 Mini ($0.75/$4.50 per 1M tokens) or Claude Haiku 4.5 ($1/$5), and reserving frontier models like GPT-5.5 or Claude Opus 4.7 for complex reasoning, coding, and high-stakes decisions. One model cannot do everything well.
The difference between the cheapest and most expensive model is now 15-25x on output tokens alone. Ignoring that spread is how teams burn five-figure monthly API bills on tasks a mini model could handle.
The best AI teams don’t pick a model. They build a model selection system. They route easy tasks to cheap models, hard tasks to strong models, and keep humans in the loop for high-risk decisions.
The Pricing Reality: May 2026
All prices below are in USD per 1M tokens, sourced directly from vendor pricing pages as of late May 2026. Batch pricing reflects a 50% discount for asynchronous workloads.
Frontier Models
| Model | Input | Cached Input | Output | Batch Output | Context Window | Max Output |
|---|---|---|---|---|---|---|
| GPT-5.5 | $5.00 | $0.50 | $30.00 | $15.00 | 270K (short) / 1M (long) | 128K |
| GPT-5.4 | $2.50 | $0.25 | $15.00 | $7.50 | 270K (short) / 1M (long) | 128K |
| Claude Opus 4.7 | $5.00 | $0.50 | $25.00 | $12.50 | 1M | 128K |
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 | $7.50 | 1M | 64K |
| Mistral Large 3 | Contact | | Contact | | 256K | |
Mini / Cost-Optimized Models
| Model | Input | Cached Input | Output | Batch Output | Context Window |
|---|---|---|---|---|---|
| GPT-5.4 Mini | $0.75 | $0.075 | $4.50 | $2.25 | 270K |
| GPT-5.4 Nano | $0.20 | $0.02 | $1.25 | $0.625 | 270K |
| Claude Haiku 4.5 | $1.00 | $0.10 | $5.00 | $2.50 | 200K |
| Gemini 3.1 Flash-Lite | ~$0.075* | | ~$0.30* | | 1M |
| Gemini 3.5 Flash | ~$0.15* | | ~$0.60* | | 1M |
*Gemini Flash pricing from Google AI Studio (ai.google.dev) tends to be lower than Vertex AI enterprise pricing. Always check the live page for current rates.
Google’s pricing structure differs by platform: Google AI Studio (developer tier) and Vertex AI (enterprise) have separate pricing. Gemini models on Vertex AI also support Provisioned Throughput for predictable costs at scale. OpenAI and Anthropic offer comparable batch discounts (50%) and prompt caching (90% reduction on cache hits).
OpenAI also offers Flex processing at ~50% discount (slower, lower availability) and Priority processing at ~2.5x premium (guaranteed throughput). Anthropic offers Fast mode on Opus 4.7 at 6x standard rates ($30/$150) for dramatically faster output. Claude Managed Agents adds $0.08 per session-hour on top of token costs.
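To make the discounts above concrete, here is a minimal sketch of a per-request cost estimator. The prices are copied from the tables above. The dictionary keys are illustrative labels, not official API model IDs, and the way the batch discount stacks on top of cached input is a simplifying assumption that should be checked against each vendor's billing documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per 1M tokens, copied from the pricing tables above."""
    input: float
    cached_input: float
    output: float


# Keys are hypothetical labels for this sketch, not real API model IDs.
PRICES = {
    "gpt-5.5": Price(5.00, 0.50, 30.00),
    "gpt-5.4-mini": Price(0.75, 0.075, 4.50),
    "claude-opus-4.7": Price(5.00, 0.50, 25.00),
    "claude-haiku-4.5": Price(1.00, 0.10, 5.00),
}


def request_cost(model, input_tokens, output_tokens, cached_tokens=0, batch=False):
    """Estimate the USD cost of one request.

    cached_tokens is the portion of input_tokens served from the prompt cache.
    batch=True halves the total (assumption: the 50% batch discount applies
    to the whole request, including cached input).
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    price = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    usd = (
        fresh_tokens * price.input
        + cached_tokens * price.cached_input
        + output_tokens * price.output
    ) / 1_000_000
    return usd * 0.5 if batch else usd


# Same request shape on a frontier model and a mini model.
for name in ("gpt-5.5", "gpt-5.4-mini"):
    usd = request_cost(name, input_tokens=8_000, output_tokens=1_000, cached_tokens=6_000)
    print(f"{name}: ${usd:.4f} per request")
```

Running the same request shape through several entries is a quick way to see how much of the spread comes from output tokens versus input tokens.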
Real Benchmark Scores
Google DeepMind published comparison benchmarks alongside the Gemini 3.5 Flash launch. These numbers come from their official model page (deepmind.google), cross-checked against Artificial Analysis independent evaluations.
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro | Gemini 3.5 Flash | Gemini 3 Flash |
|---|---|---|---|---|---|
| SWE-Bench Pro (coding, pass@1) | 58.6% | 64.3% | 54.2% | 55.1% | 49.6% |
| Terminal-Bench 2.1 (agentic coding) | 78.2% | 66.1% | 70.3% | 76.2% | 58.0% |
| MCP Atlas (multi-step tool use) | 75.3% | 79.1% | 78.2% | 83.6% | 62.0% |
| GDPval-AA (real work tasks, Elo) | 1769 | 1753 | 1314 | 1656 | 1204 |
| Humanity’s Last Exam (academic) | 41.4% | 46.9% | 44.4% | 40.2% | 33.7% |
| MMMU-Pro (multimodal reasoning) | 81.2% | 75.2% | 80.5% | 83.6% | 81.2% |
| CharXiv (chart reasoning) | 84.1% | 82.1% | 83.3% | 84.2% | 80.3% |
| ARC-AGI-2 (abstract reasoning) | 84.6% | 75.8% | 77.1% | 72.1% | 33.6% |
| OSWorld-Verified (computer use) | 78.7% | 78.0% | 76.2% | 78.4% | 65.1% |
Key takeaways:
- Claude Opus 4.7 leads SWE-Bench Pro (64.3%) and academic reasoning (HLE: 46.9%).
- GPT-5.5 leads on abstract reasoning (ARC-AGI-2: 84.6%) and real-world work tasks (GDPval-AA: Elo 1769).
- Gemini 3.5 Flash punches above its weight class: near-frontier at Flash-tier pricing. Leads MCP Atlas (83.6%).
- Gemini 3 Flash is weaker on agentic and abstract tasks compared to its 3.5 successor.
Cost-to-Intelligence Ratio
Using the Artificial Analysis Intelligence Index and blended pricing:
| Model | AA Intelligence Rank | Cost to Run AA Index |
|---|---|---|
| GPT-5.5 | #1 | High |
| Claude Opus 4.7 | #2 | High |
| Gemini 3.5 Flash | #3-4 (Flash tier) | Low |
| Claude Sonnet 4.6 | #3-4 | Medium |
| GPT-5.4 | #5-6 | Medium |
| Gemini 3.1 Pro | #5-6 | Medium-High |
The standout is Gemini 3.5 Flash: frontier-level intelligence at Flash pricing, with Artificial Analysis noting it as “the new leader in intelligence versus speed.”
Latency: What Users Actually Experience
Speed means more than model inference time. It includes prompt construction, retrieval, tool calls, streaming behavior, and post-processing.
Approximate Output Speed Ranges (tokens/second, first-party API)
| Speed Tier | Models | Output t/s (typical) | Best For |
|---|---|---|---|
| Fastest | Haiku 4.5, GPT-5.4 Nano, Gemini Flash-Lite | 150-300+ | Classification, extraction, simple chat |
| Fast | Sonnet 4.6, GPT-5.4 Mini, Gemini 3.5 Flash | 80-150 | Drafting, summarization, most production work |
| Moderate | GPT-5.4, GPT-5.5, Opus 4.7, Gemini Pro | 40-100 | Complex reasoning, coding, multi-step analysis |
| Slow (reasoning) | GPT-5.5 with high reasoning effort | 15-50 | Deep research, math proofs, multi-file refactors |
Sources: Artificial Analysis independent speed benchmarks; vendor documentation.
Latency Targets by Use Case
- Under 2 seconds: feels responsive for UI copilots, autocomplete, and live chat. Use mini models.
- 2-8 seconds: acceptable for thoughtful chat, search summaries, and document Q&A. Streaming must start early.
- 10-60 seconds: fine for complex analysis when users understand deeper work is happening. Reasoning models belong here.
- Minutes to hours: batch processing. Save 50% with batch APIs; OpenAI and Anthropic both support this.
A faster model delivering a mediocre answer is worse than a slower model giving the right answer. Measure wall-clock time for the full pipeline, not just model inference.
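One way to measure wall-clock time for the full pipeline is to time each stage separately and then look at each stage's share of the total. The sketch below uses only the Python standard library. The stage names and the `time.sleep` calls are placeholders standing in for real retrieval, model, and validation steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PipelineTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.seconds[name] += time.perf_counter() - start

    def total(self):
        return sum(self.seconds.values())

    def shares(self):
        """Return {stage: fraction of total time}."""
        total = self.total()
        return {name: (secs / total if total else 0.0) for name, secs in self.seconds.items()}


timer = PipelineTimer()
with timer.stage("retrieval"):
    time.sleep(0.02)  # placeholder for a vector search
with timer.stage("model"):
    time.sleep(0.05)  # placeholder for the model call
with timer.stage("validation"):
    time.sleep(0.01)  # placeholder for output checks

for name, share in timer.shares().items():
    print(f"{name}: {timer.seconds[name]:.3f}s ({share:.0%} of total)")
```

If the model stage turns out to be a minority of the total, switching to a faster model will not move the user-visible latency much.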
The Cost-Per-Completed-Task Metric
Token price alone is misleading. Track this instead:
Cost per accepted, correct task = (total tokens x token price) + tool call costs + human review cost + retry cost
A real example: processing 10,000 support tickets with Claude Haiku 4.5 at $1/$5 per 1M tokens, averaging ~3,700 tokens per conversation. Total: ~$37 per 10,000 tickets (per Anthropic’s own pricing documentation example).
If you used Opus 4.7 for the same task: ~$185 (5x more). If Haiku 4.5 is accurate enough, that’s $148 in wasted spend per 10,000 tickets.
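The arithmetic behind these figures can be sketched as follows. Two assumptions are made here that are not in the original text: the ~$37 figure is reproduced by billing all ~3,700 tokens at the input rate (a real input/output split would change the total), and retry cost is modeled by dividing total spend by the number of accepted tasks. The 80% acceptance rate and the review cost are purely illustrative.

```python
def token_cost(tasks, input_tokens, output_tokens, input_price, output_price):
    """USD for `tasks` requests. Token counts are per task; prices are per 1M tokens."""
    return tasks * (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_per_accepted_task(token_usd, tasks, acceptance_rate, tool_usd=0.0, review_usd=0.0):
    """Total spend divided by tasks that passed review.

    Rejected attempts still count as spend, which is one simple way
    to fold retry cost into the metric.
    """
    accepted = tasks * acceptance_rate
    if accepted <= 0:
        raise ValueError("no accepted tasks")
    return (token_usd + tool_usd + review_usd) / accepted


TICKETS = 10_000

# Assumption: all ~3,700 tokens billed at the input rate.
haiku_usd = token_cost(TICKETS, 3_700, 0, input_price=1.00, output_price=5.00)
opus_usd = token_cost(TICKETS, 3_700, 0, input_price=5.00, output_price=25.00)
print(f"Haiku 4.5: ${haiku_usd:.2f}, Opus 4.7: ${opus_usd:.2f}")

# Illustrative only: a cheaper model with a lower acceptance rate and some review cost.
per_task = cost_per_accepted_task(haiku_usd, TICKETS, acceptance_rate=0.8, review_usd=200.0)
print(f"Haiku 4.5 per accepted ticket: ${per_task:.4f}")
```

The point of the second function is that a low token bill can still lose once rejected outputs and review effort are counted, so both models should be run through it with measured acceptance rates.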
What to measure before committing to a model:
- Average tokens per request (input + output, over 100+ real examples)
- Tool call count per request (web search: $10/1k calls on both OpenAI and Anthropic)
- Retry rate (how often does the first response fail quality review?)
- Human edit time (does a stronger model reduce editing enough to justify its cost?)
- Volume spikes (long documents or tool loops can 10x your token burn)
Model Routing: The Real Cost Saver
Model routing means using different models for different steps in the same workflow. It is the single highest-leverage practice for cutting costs without dropping quality.
Routing Pattern 1: Customer Support
- GPT-5.4 Nano or Haiku 4.5 classifies intent, urgency, and language.
- Retrieval system fetches relevant policy documents and past tickets.
- GPT-5.4 Mini or Sonnet 4.6 drafts the response.
- GPT-5.4 or Opus 4.7 reviews only escalated cases (refunds, legal, safety).
- Human final approval on all escalations.
Routing Pattern 2: Content Production
- Haiku 4.5 clusters research notes, extracts key quotes.
- Sonnet 4.6 drafts outlines and first-pass content.
- Opus 4.7 critiques gaps, checks factual consistency, suggests improvements.
- Human verifies facts, writes final judgment, publishes.
Routing Pattern 3: Engineering Workflows
- GPT-5.4 Nano labels and triages incoming bug reports.
- GPT-5.4 Mini explains likely root causes and suggests investigation paths.
- GPT-5.5 or Opus 4.7 generates code changes on verified bugs.
- CI tests + human review gate on acceptance.
Why this works: a large share of requests in a typical production system are simple enough for a mini model. Roughly 70% is a reasonable planning figure, but measure your own traffic. Sending every request to a frontier model burns money. Build a router.
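A router can start as a lookup table plus an escalation rule. The sketch below is a minimal illustration of the patterns above, not a production design. The task kinds, the tier names, and the rule that unknown task kinds default to the strongest tier are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    MINI = "mini"          # e.g. GPT-5.4 Nano, Haiku 4.5
    MID = "mid"            # e.g. GPT-5.4 Mini, Sonnet 4.6
    FRONTIER = "frontier"  # e.g. GPT-5.5, Opus 4.7


@dataclass
class Task:
    kind: str               # hypothetical labels such as "classify" or "draft"
    high_risk: bool = False  # refunds, legal, safety


# Hypothetical mapping from task kind to model tier.
ROUTES = {
    "classify": Tier.MINI,
    "extract": Tier.MINI,
    "draft": Tier.MID,
    "summarize": Tier.MID,
    "code_change": Tier.FRONTIER,
}


def route(task):
    """Return (tier, needs_human_approval) for a task."""
    # Unknown task kinds go to the strongest tier rather than the cheapest.
    tier = ROUTES.get(task.kind, Tier.FRONTIER)
    if task.high_risk:
        # High-risk work is escalated and always gets a human gate.
        tier = Tier.FRONTIER
    return tier, task.high_risk


print(route(Task("classify")))
print(route(Task("draft", high_risk=True)))
```

Defaulting unknown kinds to the frontier tier trades some cost for safety. A team that trusts its classifier more could default to the mid tier instead.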
Caching, Context, and Long Documents
Prompt caching is the most underused cost optimization. Both OpenAI and Anthropic charge ~10% of the base input price for cache reads.
Anthropic’s pricing structure:
- 5-minute cache write: 1.25x base input price
- 1-hour cache write: 2x base input price
- Cache hit (read): 0.1x base input price
A 5-minute cache pays for itself after one cache read. A 1-hour cache pays for itself after two reads.
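The break-even points follow directly from the multipliers listed above. This sketch works in units of the base input price and compares one cache write plus N cache reads against sending the same prompt N + 1 times uncached. The function names are illustrative.

```python
def cached_cost(reads, write_multiplier, read_multiplier=0.1):
    """One cache write plus `reads` cache hits, in units of base input price."""
    return write_multiplier + reads * read_multiplier


def uncached_cost(reads):
    """The same prompt sent reads + 1 times at full input price."""
    return 1.0 + reads


def break_even_reads(write_multiplier, read_multiplier=0.1, max_reads=1000):
    """Smallest number of cache reads at which caching is strictly cheaper."""
    for reads in range(1, max_reads + 1):
        if cached_cost(reads, write_multiplier, read_multiplier) < uncached_cost(reads):
            return reads
    return None  # caching never wins within max_reads


print("5-minute cache breaks even after", break_even_reads(1.25), "read(s)")
print("1-hour cache breaks even after", break_even_reads(2.0), "read(s)")
```

With a 2x write, one read costs 2.1 units against 2.0 uncached, so the 1-hour cache needs a second read before it comes out ahead.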
OpenAI uses a simpler model: cached input at 10% of standard input price, with no separate write charge.
When to cache:
- System prompts reused across thousands of requests.
- Knowledge base documents loaded into every prompt.
- Multi-turn conversations where context is shared.
When not to cache:
- Every request has a unique prompt.
- The cached content changes frequently.
- Your requests are already cheap enough that overhead outweighs savings.
Long context strategy: Both Opus 4.7 and Sonnet 4.6 support 1M-token context windows at standard pricing. But filling that window costs money. A 500K-token prompt on GPT-5.5 (long context: $10/$45 I/O) costs $5 in input alone. Use retrieval to pull only relevant chunks, then apply caching on the stable retrieval context.
The Selection Framework
- Define the workflow. Write down inputs, outputs, users, and failure tolerance.
- Split tasks by difficulty. Simple (classification, extraction, rewriting), moderate (drafting, summarization, explanation), complex (coding, multi-step analysis, legal review).
- Set latency requirements per step. Real-time? Near-real-time? Batch overnight?
- Estimate volume. Requests/day, average tokens, worst-case spikes.
- Identify compliance constraints. Data residency, model training opt-out, audit logging, regional processing. Both OpenAI and Anthropic charge ~10% uplift for data residency.
- Pick 2-3 candidate models per step. Always test at least two vendors.
- Build an evaluation set. Use real examples, not demo prompts. Include easy, normal, edge, adversarial, and “I don’t know” cases.
- Score blind. Remove model names. Rate on correctness, completeness, format compliance, tone, safety, hallucination rate.
- Calculate cost per accepted task. Not cost per token.
- Test latency in your full pipeline: retrieval, prompts, model, tools, validation, and logging.
- Design fallbacks. What happens when the primary model is slow, unavailable, or returns garbage? The secondary model should be from a different vendor.
- Monitor and re-evaluate quarterly. Prices drop. New models launch. Your routing needs change.
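The fallback step in the list above can be sketched as an ordered chain of providers. Everything here is hypothetical: the provider callables stand in for real vendor clients, and `is_valid` stands in for whatever output check a pipeline uses. The sketch assumes a slow or unavailable provider surfaces as an exception, for example through a client-side request timeout.

```python
from typing import Callable, Sequence, Tuple


class AllModelsFailed(Exception):
    """Raised when every provider in the chain errors or returns invalid output."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    is_valid: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order and return (name, response)."""
    errors = []
    for name, call in providers:
        try:
            response = call(prompt)
        except Exception as exc:  # timeout, outage, rate limit, and so on
            errors.append(f"{name}: {exc!r}")
            continue
        if not is_valid(response):
            errors.append(f"{name}: failed validation")
            continue
        return name, response
    raise AllModelsFailed("; ".join(errors))


# Stand-in providers for illustration; real ones would wrap vendor SDK calls.
def primary(prompt: str) -> str:
    raise TimeoutError("primary vendor timed out")


def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


used, answer = call_with_fallback(
    "Summarize this ticket",
    providers=[("vendor_a", primary), ("vendor_b", secondary)],
    is_valid=lambda text: bool(text.strip()),
)
print(used, "->", answer)
```

Logging which provider actually served each request makes it possible to see how often the fallback path is taken and what it costs.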
Common Mistakes
- Using GPT-5.5 for intent classification that GPT-5.4 Nano does perfectly at 1/25th the cost.
- Comparing models by input price only, ignoring output tokens (typically 3-6x more expensive).
- Forgetting tool call costs. Web search adds $10/1k calls on both OpenAI and Anthropic.
- Overloading prompts with 200K tokens when retrieval could fetch the 5K that actually matter.
- Not using batch APIs for overnight jobs. That’s a free 50% saving.
- Using one model for everything because “it’s the best.” Best is relative to the task.
- Not tracking cost by workflow. If you can’t see which pipeline is burning budget, you can’t fix it.
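Tracking cost by workflow can begin with a small in-process ledger before moving to a proper metrics system. The sketch below is one possible shape. The class name, workflow labels, and dollar amounts are made up for illustration.

```python
from collections import defaultdict


class CostLedger:
    """Accumulates spend and request counts per workflow."""

    def __init__(self):
        self._usd = defaultdict(float)
        self._requests = defaultdict(int)

    def record(self, workflow, usd):
        """Add the cost of one request to a workflow's running total."""
        self._usd[workflow] += usd
        self._requests[workflow] += 1

    def by_workflow(self):
        """Return (workflow, total_usd, requests) rows, highest spend first."""
        rows = [(name, self._usd[name], self._requests[name]) for name in self._usd]
        return sorted(rows, key=lambda row: row[1], reverse=True)


ledger = CostLedger()
ledger.record("support_triage", 0.002)   # illustrative amounts
ledger.record("support_triage", 0.003)
ledger.record("code_review", 0.40)

for name, usd, requests in ledger.by_workflow():
    print(f"{name}: ${usd:.3f} over {requests} request(s)")
```

Sorting by spend puts the pipeline that deserves a routing or caching review at the top of the report.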
FAQ
Q: Which model is best for coding in mid-2026? Claude Opus 4.7 leads SWE-Bench Pro at 64.3% pass@1. GPT-5.5 leads Terminal-Bench 2.1 at 78.2%. For cost-conscious coding, Claude Sonnet 4.6 and GPT-5.4 offer strong performance at lower prices. Gemini 3.5 Flash is the dark horse: Flash pricing with near-Pro coding scores.
Q: What is the cheapest model that can actually do useful work? GPT-5.4 Nano at $0.20/$1.25 per 1M tokens. Excellent for classification, extraction, routing, and simple rewriting. Gemini 3.1 Flash-Lite is even cheaper at roughly $0.075/$0.30.
Q: Should I use batch processing? Yes, for any workload that can wait hours. Both OpenAI and Anthropic offer a 50% discount on batch jobs. Ideal for overnight summarization, data enrichment, report generation, and backfill jobs.
Q: How often should I re-evaluate models? Every quarter. The model landscape shifts fast. A model that was the right choice in March may be overpriced or outperformed by June.
Q: Is prompt caching worth the complexity? Yes, if you reuse system prompts, knowledge bases, or long documents across requests. A single cache read recovers the write cost for 5-minute TTL caching on Anthropic.
Q: What about open-weight models like Llama, DeepSeek, or Mistral? For teams with the infrastructure to self-host, open-weight models eliminate per-token pricing entirely. DeepSeek-V3.2 and Llama 4 Maverick are competitive on benchmarks. The tradeoff is engineering overhead: hosting, scaling, and maintaining your own inference. Mistral offers both API and enterprise deployment options, with European data residency as a key differentiator.
Reference Sources
All pricing and benchmark data verified May 27-28, 2026. OpenAI and Anthropic pricing confirmed through direct page fetches. Gemini Flash-tier pricing referenced from Google AI Studio public documentation. Benchmark data sourced from Google DeepMind’s official model comparison table.
- OpenAI API Pricing: GPT-5.5, GPT-5.4, GPT-5.4 Mini, GPT-5.4 Nano, batch/flex/priority tiers
- OpenAI Platform Pricing Documentation: detailed token pricing, tool costs, containers
- Anthropic API Pricing: plans overview
- Anthropic API Pricing Documentation: per-model rates, prompt caching, batch, fast mode, managed agents
- Anthropic Models Overview: context windows, max output, latency tiers, knowledge cutoffs
- Google Cloud Vertex AI Generative AI Pricing: enterprise Gemini model pricing, batch, grounding
- Google AI Studio Pricing: developer-tier Gemini Flash and Pro pricing
- Google Cloud Vertex AI Models Documentation: available models, partner models (Anthropic, Mistral, Meta, DeepSeek)
- Google DeepMind Gemini 3.5: benchmark comparison table, model capabilities
- Artificial Analysis: independent intelligence index, speed benchmarks, pricing analysis, coding agent benchmarks
- Mistral AI Pricing: subscription plans and API pricing