OpenCode Multi-Provider Model Routing Strategy That Cuts Your Monthly AI Coding Agent Bill by 40%+
Have you ever broken into a cold sweat looking at your end-of-month bill after using AI coding tools? When I started using Claude Sonnet as my primary model, I threw the same frontier model at everything — architecture design, boilerplate generation, you name it — and ended up with a charge nearly twice what I expected. In practice, working solo, after applying this strategy, I've kept my cloud API costs under $10 per month. This is based on moderate individual developer usage; the absolute numbers will differ for team environments, but the savings ratio should be roughly similar.
Honestly, there's no reason to process complex design-phase reasoning and generating a single line of test code with the same model. There's a way to maintain the same quality for half your current spend, and that's the multi-provider tiering strategy that automatically assigns models based on task complexity. An open-source terminal-based AI coding agent called OpenCode lets you implement this with a single JSON file.
This post covers patterns you can apply immediately in practice: three-tier model layering configuration, simplifying setup with a LiteLLM gateway, and privacy strategies for sensitive codebases. The structure lets you grasp the concepts first, then pick the example that fits your situation and follow along directly.
Core Concepts
OpenCode's Provider-Agnostic Architecture
OpenCode is a MIT-licensed open-source AI coding agent written in Go. It's a terminal-native tool, and unlike SaaS tools such as Claude Code or Cursor that are locked to specific cloud vendors, it's designed to freely mix any provider within the same agent loop. The key is that you can connect 75+ LLMs through a single opencode.json file.
Provider-Agnostic: An architectural design approach that avoids lock-in to a specific LLM vendor's API, instead enabling any model to be swapped in through a standard interface (typically an OpenAI-compatible API — an interface callable the same way as ChatGPT)
There are two key fields to focus on in the config file:
model: The primary model used for main taskssmall_model: A lightweight model automatically assigned to repetitive, simple tasks
How small_model gets triggered is probably the thing you're most curious about — I was confused about this at first too, wondering "what criteria decides when to use the smaller model?" OpenCode internally delegates auxiliary subtasks like file summarization, title generation, and simple completions to small_model. These are "peripheral tasks" that run alongside the main agent loop, not the primary tasks. The trigger is based on task type, not token count. So complex architecture design always goes to model unless you switch manually with /model.
{
"$schema": "https://opencode.ai/config.json",
"model": "anthropic/claude-sonnet-4-6",
"small_model": "ollama/qwen3:30b-a3b",
"provider": {
"ollama": {
"name": "Ollama",
"baseURL": "http://localhost:11434/v1",
"models": {
"qwen3:30b-a3b": { "name": "Qwen3 30B MoE" },
"devstral": { "name": "Devstral Small 24B" }
}
}
}
}Save this file as opencode.json in your project root to apply it only to that project, or place it at ~/.config/opencode/config.json to apply it globally across all projects.
The Cost Escalation Tiering Principle
Honestly, at first I thought "can't I just use one good model?" But when you actually analyze your task types, most time is spent on boilerplate writing, test generation, and simple refactoring — genuinely complex design decisions account for only 10–20% of total work. The key is using expensive models only for that 10–20%.
| Cost Tier | Model Type | Suitable Tasks |
|---|---|---|
| Free (Tier 1) | Ollama local models | Code editing, boilerplate, simple implementations |
| Low-cost (Tier 2) | Gemini Flash, Claude Haiku | Test generation, iterative processing, documentation |
| High-cost (Tier 3) | Claude Opus, Sonnet | Architecture design, complex reasoning, critical decisions |
Cost Escalation: A staged cost investment strategy that starts with the cheapest option and only escalates to a higher-tier model when task requirements demand it
The Maturity of Local Models in 2025–2026
Just one or two years ago, local models were at the "you can use them, but you'll end up going back to cloud anyway" stage — but that's changed now. On SWE-bench, Qwen3 30B-A3B scores 73.4% and Devstral Small 24B scores 68%. Ollama is the tool that lets you run these models locally; it has a built-in OpenAI-compatible API server accessible at http://localhost:11434/v1 — you can install it at ollama.com.
The "7x cost efficiency of Devstral vs. Claude Sonnet" figure is a comparison based on API pricing. It refers to the difference in per-token costs between using Devstral via cloud API versus using Sonnet, and running locally with Ollama makes the API cost itself zero. Hardware costs (electricity, GPU depreciation) are separate, of course. With an RTX 4090, the math works out to recouping the savings within 3–6 months, but if you already have a high-performance Mac or GPU, you can benefit immediately with no additional cost.
| Local Model | SWE-bench | Minimum Hardware | Characteristics |
|---|---|---|---|
| Qwen3 30B-A3B (MoE) | 73.4% | 24GB VRAM | General-purpose coding, balanced reasoning |
| Devstral Small 24B | 68% | 32GB RAM (Mac) / RTX 4090 | Coding-specialized, Mistral-based |
| Gemma 4 27B | - | 24GB VRAM | Google, verified OpenCode compatible |
Practical Application
Before looking at the examples, it's worth figuring out which configuration fits your situation first.
- Individual developer, setting up for the first time → Example 1 (3-tier layering, simplest)
- Multiple team members sharing the same config, or frequently switching providers → Example 2 (LiteLLM gateway)
- Want complex automated workflows → Example 3 (agent separation)
- Fintech, healthcare, or other environments where code must not leave your infrastructure → Example 4 (privacy-first)
Example 1: Three-Tier Model Layering Configuration
This is the simplest configuration to start with. Assign an Ollama local model to small_model, and switch to cloud only when complex tasks arise by using the /model command within the session.
{
"$schema": "https://opencode.ai/config.json",
"model": "anthropic/claude-sonnet-4-6",
"small_model": "ollama/qwen3:30b-a3b",
"provider": {
"anthropic": {
"apiKey": "$ANTHROPIC_API_KEY"
},
"ollama": {
"name": "Ollama",
"baseURL": "http://localhost:11434/v1",
"models": {
"qwen3:30b-a3b": { "name": "Qwen3 30B MoE" },
"devstral": { "name": "Devstral Small 24B" }
}
}
}
}The "apiKey": "$ANTHROPIC_API_KEY" part is how OpenCode reads shell environment variables at runtime. If you copy this value as-is, you must have export ANTHROPIC_API_KEY=sk-ant-... set in your shell, or have a .env file in the project root. Without it, the literal string $ANTHROPIC_API_KEY gets sent as the API key, causing an authentication error. I missed this myself initially and spent a while confused.
The baseURL of http://localhost:11434/v1 for ollama/qwen3:30b-a3b is the default address that opens when you run ollama serve after installing Ollama. If you're in a Docker environment or changed the port, you'll need to update this address accordingly.
| Task Type | Model Applied | Reason |
|---|---|---|
| Architecture planning, complex design | Claude Opus 4.7 (manual switch) | Requires advanced reasoning |
| General code editing, implementation | Qwen3 30B (Ollama, small_model) | Free, sufficient quality |
| Test generation, documentation | Claude Haiku (small_model alternative) | Low-cost iterative processing |
When you want to switch models during a session, enter the /model command in the TUI or use the variant_cycle keybinding for real-time switching.
Now that you have the concepts down, let's look more deeply at the configuration for each scenario.
Example 2: Simplifying Configuration with a LiteLLM Gateway
The most tedious part of multi-provider setup is managing API keys and URL configuration for each provider. By running a LiteLLM proxy locally, OpenCode only needs to point at a single endpoint, and LiteLLM handles the actual routing.
{
"$schema": "https://opencode.ai/config.json",
"model": "gateway/claude-sonnet-4-6",
"small_model": "gateway/ollama/qwen3:30b-a3b",
"provider": {
"gateway": {
"name": "LiteLLM Gateway",
"baseURL": "http://localhost:4000/v1",
"apiKey": "sk-local",
"models": {
"claude-sonnet-4-6": {},
"ollama/qwen3:30b-a3b": {},
"ollama/devstral": {}
}
}
}
}The "apiKey": "sk-local" might look odd, but that's because it's a local proxy. When you run LiteLLM locally, you can configure it to pass through any arbitrary value without actual API key validation. Since nothing is going external, there's no security risk — any string will work.
LiteLLM Proxy: An open-source gateway that bundles 100+ LLMs — Anthropic, OpenAI, Ollama, Gemini, and more — behind a single OpenAI-compatible endpoint. Install with
pip install litellmand run locally withlitellm --config config.yaml
The real advantage of this pattern is that you can swap providers or add new models by editing only the LiteLLM config file, without touching the OpenCode configuration at all. Especially useful for configs shared across a team.
Example 3: Role-Specialized Agent Configuration (Oh My OpenCode)
Going further, you can set up a "virtual team" structure where orchestrator, planner, executor, and researcher roles are each assigned to different models. This agents field is part of the Oh My OpenCode configuration schema (an agent workflow framework that sits on top of OpenCode). This is distinct from adding directly to the base OpenCode opencode.json — you need to install Oh My OpenCode separately for this configuration to work.
{
"$schema": "https://opencode.ai/config.json",
"agents": {
"orchestrator": {
"model": "anthropic/claude-haiku-4-5",
"description": "작업 분배 및 계획 수립"
},
"fixer": {
"model": "ollama/devstral",
"description": "버그 수정 및 코드 편집"
},
"oracle": {
"model": "ollama/qwen3:30b-a3b",
"description": "코드 분석 및 리뷰"
},
"architect": {
"model": "anthropic/claude-opus-4-7",
"description": "아키텍처 결정 및 설계"
}
}
}The description field isn't just a comment — it's metadata that Oh My OpenCode references when deciding which agent to route a task to. The high-frequency orchestrator is handled by inexpensive Haiku, while the actually heavy computation is handled by free local models.
Example 4: Privacy-First Configuration for Sensitive Codebases
In environments like fintech or healthcare where code must not be sent to external servers, it's safest to start with a 100% local Ollama configuration.
{
"$schema": "https://opencode.ai/config.json",
"model": "ollama/qwen3:30b-a3b",
"small_model": "ollama/devstral",
"provider": {
"ollama": {
"name": "Ollama (Local Only)",
"baseURL": "http://localhost:11434/v1",
"models": {
"qwen3:30b-a3b": { "name": "Qwen3 30B MoE" },
"devstral": { "name": "Devstral Small 24B" }
}
}
}
}The http://localhost:11434/v1 in baseURL is Ollama's default address. If you've installed it locally on the standard port, you can use this address as-is. If you're in a Docker Compose environment or changed the port, update it accordingly.
Removing the cloud provider section entirely eliminates any chance of sensitive code accidentally leaking to external services.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Cost reduction | 40–60% savings possible vs. uniform frontier model deployment; immediate benefit if you already own high-performance hardware |
| Privacy guarantee | With local models, code is never sent to external servers — suitable for confidential enterprise codebases |
| Offline usage | Work is possible with local models alone, without an internet connection |
| Provider neutrality | Freely choose and swap the most advantageous model without lock-in to any specific vendor |
| Flexible switching | Real-time model switching during a session via the /model command in TUI or keybindings |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Reduced token generation speed | GitHub issue #4182: reports of <0.5 tokens/sec via OpenCode vs. 12 tokens/sec running Ollama directly — an ongoing bug | Await official patch; try routing via LiteLLM gateway as a temporary workaround |
| Tool call quality variance | Some local models become confused on basic tool calls like file operations | Choose models based on instruction-following quality rather than SWE-bench scores |
| Hardware requirements | Qwen3 30B MoE needs 24GB VRAM; Devstral 24B needs 32GB RAM or RTX 4090; CPU inference is 10–50x slower | Use smaller cloud models (Haiku, Gemini Flash) as alternatives if hardware falls short |
| Context window limits | Models with fewer than 64K tokens are unsuitable for multi-file work | Check context size first when selecting a model |
| Initial setup complexity | Multi-provider configuration requires trial and error to understand each model's characteristics | Start with just the two fields model + small_model and expand gradually |
instruction-following: A model's ability to accurately follow given instructions (e.g., "only edit this file", "respond in JSON only"). In coding agents, this is as practically important a metric as SWE-bench scores
Common Pitfalls in Practice
-
Mistaking a speed issue for a model problem — There is currently an ongoing speed degradation bug with OpenCode's Ollama routing (#4182). If a local model seems unusually slow, check this issue before swapping models. Routing through a LiteLLM gateway has worked as a temporary workaround in some cases.
-
Choosing a model without checking the context window — I did this myself at first, picking based on benchmark scores alone, and ended up with the context getting truncated mid-way through a multi-file task, causing the agent to behave erratically. It's recommended to first confirm whether a model supports 64K tokens or more.
-
Attempting a multi-agent configuration from the start — The agent separation setup in Example 3 is powerful, but it's far more stable to start with just the two fields
model+small_model, measure actual cost and quality, and then expand incrementally. The more complex the configuration, the harder it is to trace where problems arise.
Closing Thoughts
More important than the model strategy itself is the habit of measuring your own work patterns with data. Real optimization begins only when you track what you're spending on which tasks.
Three steps you can take right now:
-
You can install Ollama and pull the Qwen3 30B-A3B model. Run
ollama pull qwen3:30b-a3bto download the model andollama serveto start the local server — it'll be accessible as an OpenAI-compatible API athttp://localhost:11434/v1. If your hardware doesn't reach 24GB VRAM, trydevstralor a smaller model first. -
You can add a single line
"small_model": "ollama/qwen3:30b-a3b"toopencode.jsonin your project root. Leave your existing cloud main model in place and only switchsmall_modelto Ollama — local processing will start handling simple repetitive tasks. Observe which tasks get offloaded locally and patterns will emerge. -
After about a week of use, check how your cloud API costs have changed. Anthropic lets you view daily and per-model token usage and costs in console.anthropic.com → Usage tab. If you feel the savings, consider gradually expanding to a LiteLLM gateway or agent separation configuration.
References
Official Documentation
- OpenCode Official Docs - Providers
- OpenCode Official Docs - Models
- OpenCode Official Docs - Config
- Ollama Official OpenCode Integration Guide
- LiteLLM + OpenCode Integration Quickstart | LiteLLM Official Docs
Community Guides
- OpenCode + Ollama Homelab Setup Guide | Virtualization Howto
- Building a Local AI Coding Environment with Ollama and OpenCode | DevelopersIO
- OpenCode + Ollama Privacy Guide | Lushbinary
- OpenCode Go + Oh My OpenAgent Model Routing | Medium
- Oh My OpenCode Specialized Agents Deep Dive | Medium
- Using OpenCode with Vercel AI Gateway | Vercel
- groxaxo/opencode-local-setup | GitHub
- OpenCode Model-Neutral AI Coding Assistant | Red Hat Developer
Benchmarks and Model Analysis