Running Qwen3-Coder Locally: Setting Up an SWE-bench 70% AI Coding Agent with a Single RTX 3090
After watching my cloud AI bills double two months in a row, I started seriously looking for alternatives. My hesitation wasn't a blanket "how good could open source be?" bias; I had simply been burned by a few open-source coding models before. On legacy codebases in particular, whenever I tried multi-file refactoring, the model would quickly lose context, redefine functions it had already created, or touch code it shouldn't have. That's when I discovered Qwen3-Coder.
After setting it up myself, my opinion changed. This article covers Qwen3-Coder's architectural characteristics, how to choose a quantization level, and local setup using Ollama/llama.cpp/vLLM — all distilled from my own trial and error. The SWE-bench Verified 69.6% figure turned out not to be marketing hype, and I'll share concrete experiences of how the 256K context actually changes the way you work.
The prerequisites are simpler than you'd think. A single RTX 3090/4090-class GPU is enough to run the 30B model comfortably, and on M1/M2 Macs the GGUF quantized version delivers usable performance. If you're not sure how much VRAM your GPU has, run nvidia-smi (NVIDIA) or system_profiler SPDisplaysDataType (Mac) in your terminal to check immediately.
Core Concepts
This section briefly examines what makes Qwen3-Coder different from an architectural perspective. If you don't need the conceptual background right now, feel free to jump straight to the Practical Application section.
Why Is Qwen3-Coder "Coding-Specialized"?
Qwen3-Coder, built by Alibaba's Qwen team, is an open-weight LLM specialized for code generation, debugging, refactoring, and agentic software development. What sets it apart isn't just that it was trained on a lot of code data — the architecture itself was designed for agentic development scenarios.
Two model variants are currently available officially:
| Model | Total Parameters | Active Parameters | Context Window |
|---|---|---|---|
| Qwen3-Coder-480B-A35B-Instruct | 480B | 35B | 256K (up to 1M) |
| Qwen3-Coder-30B-A3B-Instruct (Flash) | 30B | 3B | 256K |
The "480B total but only 35B active parameters" part was confusing to me at first — this is the essence of the MoE architecture.
MoE (Mixture-of-Experts): A structure that divides the full parameters into multiple "expert" sub-networks and selectively activates only the relevant experts for each token. This means that despite the 480B scale, only about 35B parameters' worth of computation occurs per token during inference. All 480B parameters still have to be held in memory, though: MoE saves compute, not storage.
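To make the "selectively activates only the relevant experts" idea concrete, here is a minimal toy sketch of top-k routing. It is not Qwen3-Coder's actual router: the expert functions, the gate scores, and the names top_k_route and moe_forward are all hypothetical and exist only for illustration.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs; the rest are skipped."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_logits, k))

# Toy setup (hypothetical): 8 "experts" that each scale the input differently.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# In a real model these scores come from a learned gating layer, per token.
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

routes = top_k_route(gate_logits, k=2)
output = moe_forward(3.0, experts, gate_logits, k=2)
print("active experts:", [i for i, _ in routes], "output:", output)
```

Only two of the eight expert functions are ever called here, which is the compute saving. All eight still have to exist in memory, which is why the total parameter count drives disk and VRAM requirements.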
What 256K Context Actually Means
I also thought at first "it's just a long context," but using it in practice makes a real difference. When refactoring a legacy codebase, you can dump 10–20 related files wholesale into the context and ask "change this pattern throughout." The manual work of splitting files, feeding them in pieces, and stitching results back together — which you had to do with older models — largely disappears.
Technically, it combines hybrid attention and the YaRN technique to achieve native 256K. Hybrid attention mixes local and global attention layer by layer instead of processing the full sequence uniformly, reducing the cost of handling long sequences. YaRN extends positional encoding to longer contexts than seen during training. Together, these two techniques allow far longer contexts to be processed within realistic memory budgets compared to conventional attention architectures.
However, there's one important trap. "Supporting 256K" and "actually being able to run 256K on your GPU" are completely different things.
KV Cache: The memory space that stores previous token information in attention layers. As the context grows longer, this cache grows linearly, so on a 24GB VRAM GPU with Q5_K_M quantization, around 32K tokens is the practical upper limit. I ran into OOM errors twice trying to push past 128K context before accepting this constraint.
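The linear growth can be estimated with simple arithmetic. The sketch below assumes an architecture of 48 layers, 4 key/value heads, a head dimension of 128, and a 16-bit cache for the 30B model. These numbers are assumptions for illustration, so check the config.json of the checkpoint you actually run.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    The default architecture values are assumptions, not read from a model file.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

for ctx in (4096, 32768, 131072, 262144):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / GIB:.1f} GiB")
```

Under these assumptions the cache alone needs about 3 GiB at 32K tokens and about 12 GiB at 128K, on top of the model weights. That lines up with 32K fitting next to a roughly 20GB quantized model on a 24GB card while 128K does not.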
Agentic Design: How It Differs from Simple Autocomplete
Qwen3-Coder's biggest differentiator is its agentic coding capability. It was trained via reinforcement learning (Agent RL) across 20,000 parallel environments simulating real software development workflows — in short, it learned to iterate through the "test fails → identify cause → fix → retest" loop.
On the 480B-Instruct model, it achieves SWE-bench Verified 69.6% (at 500 turns). This isn't just about generating good code snippets — the number comes from scenarios resolving real GitHub issues, which is why the difference is noticeable.
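The "test fails, identify cause, fix, retest" loop can be sketched as a small driver function. This is only an illustration of the loop's shape, not Qwen's training harness or any agent framework's API: run_tests, propose_fix, and apply_fix are hypothetical callables you would supply (for example, a test runner, a call to the model, and a patch applier).

```python
def run_fix_loop(run_tests, propose_fix, apply_fix, max_turns=5):
    """Return the number of fixes applied before the tests passed, or None on give-up."""
    for fixes_applied in range(max_turns + 1):
        passed, log = run_tests()
        if passed:
            return fixes_applied
        if fixes_applied == max_turns:
            break  # turn budget exhausted, stop instead of looping forever
        # Feed the failure log back to the model and apply whatever it proposes.
        apply_fix(propose_fix(log))
    return None

# Toy stand-ins (hypothetical): a "project" with two bugs, one fixed per patch.
state = {"bugs": 2}

def fake_run_tests():
    return state["bugs"] == 0, f"{state['bugs']} test(s) failing"

def fake_propose_fix(log):
    return f"patch for: {log}"

def fake_apply_fix(patch):
    state["bugs"] -= 1

fixes_needed = run_fix_loop(fake_run_tests, fake_propose_fix, fake_apply_fix)
print("fixes applied:", fixes_needed)
```

The turn budget matters in practice: the 69.6% figure above is reported at 500 turns, so a much smaller budget on your own setup should be expected to resolve fewer issues.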
Practical Application
This covers the fastest way to get started for each environment. It's organized as Ollama (quick start), llama.cpp (consumer GPU/Mac), and vLLM (team server) — so feel free to read only the section that fits your setup.
Example 1: Quick Start with Ollama
This is the fastest way to get started. I used this when I first tested it, and there's one trap worth knowing in advance.
⚠️ Important: Ollama's default context size is 4096 tokens. Even though Qwen3-Coder supports 256K, running it with this default truncates the prompt at 4K tokens. If the model runs fine but output quality seems noticeably poor, this is almost always the reason. Be sure to raise the context size. Note that ollama run has no --context-size flag: set the num_ctx parameter inside the session instead, or set the OLLAMA_CONTEXT_LENGTH environment variable before starting the Ollama server.
# Model tags may vary, so check the qwen3-coder page in the Ollama model library first
# Download and run
ollama pull qwen3-coder
ollama run qwen3-coder
# Inside the interactive session, extend the context window (essential):
# /set parameter num_ctx 32768
Download time can be significant depending on model size. The 30B Q4_K_M version is about 20GB, so depending on your network it can take anywhere from 30 minutes to several hours. Setup itself takes 5 minutes, but factor in the download time.
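If you call Ollama from scripts rather than the interactive prompt, the context window can also be set per request through the num_ctx option of its HTTP API. The sketch below assumes the default local address (port 11434) and the qwen3-coder tag used above. The helper names build_chat_request and chat are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumes Ollama's default port

def build_chat_request(prompt, model="qwen3-coder", num_ctx=32768):
    """Build a non-streaming chat payload with an explicit context window."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the small default context
    }

def chat(prompt, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]

# Usage (requires a running Ollama server with the model pulled):
# print(chat("Explain what this regex does: ^\\d{3}-\\d{4}$"))
```

Setting num_ctx in the request keeps the choice next to the code that depends on it, so a teammate with a default Ollama install does not silently fall back to 4K.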
If you prefer a GUI, LM Studio is another option. Install the app from lmstudio.ai, type qwen3-coder in the search bar, and you can download the Q4_K_M quantized version directly. Server mode automatically exposes an OpenAI-compatible API (http://localhost:1234/v1), which made this approach much easier when deploying to teammates less familiar with the code.
Example 2: llama.cpp + GGUF (Consumer GPU / Mac)
If you're low on GPU memory or working on a Mac, the GGUF quantized version is the practical choice. Unsloth is a third-party open-source team known for memory-efficient quantization and fine-tuning — they've published optimized GGUF versions on Hugging Face that you can download and use immediately.
If you don't have llama.cpp installed, you'll need to do that first.
# For Mac
brew install llama.cpp
# For Linux / manual build
# See https://github.com/ggerganov/llama.cpp#build
After installation, you can download the model and start the server as follows:
# Download Unsloth GGUF
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
--include "Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf" \
--local-dir ./models
# Start llama-server
llama-server \
-m ./models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--port 8080
--n-gpu-layers 99 asks llama.cpp to offload up to 99 layers to the GPU, which in practice means all of them, since the model has fewer layers than that. If VRAM is insufficient, lower this number to split processing between GPU and CPU. That is slower, but it still works well on M1/M2 Macs with unified memory.
Here's a summary of quantization level selection criteria:
| Quantization | VRAM (30B) | Quality | Recommended For |
|---|---|---|---|
| Q4_K_M | 16GB | Good | RTX 3090, general tasks |
| Q5_K_M | 20GB | Excellent | RTX 4090, code generation focus |
| Q8_0 | 32GB+ | Near full precision | A6000, accuracy-first |
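Note: the K in Q4_K_M and Q5_K_M refers to llama.cpp's "k-quant" family, not K-means clustering, and the M suffix stands for the medium variant (S and L also exist). K-quants group weights into small blocks that each carry their own scale, and the mixed variants keep some of the more sensitive tensors at higher precision, so quality is better than naive rounding at a similar bit width. You can apply this same criterion when choosing other GGUF models.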
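As a rough illustration of why storing a separate scale per small block of weights loses less precision than one scale for a whole tensor, here is a simplified sketch. It is not the actual GGUF Q4_K layout (which uses super-blocks and quantized scales); the block size, helper names, and synthetic weights are assumptions chosen for readability.

```python
import random

def quantize_block(weights, bits=4):
    """Map a block of floats to integer codes plus a per-block scale and offset."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant blocks
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the codes."""
    return [lo + c * scale for c in codes]

def roundtrip(weights, block_size, bits=4):
    """Quantize and dequantize block by block."""
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored.extend(dequantize_block(*quantize_block(block, bits)))
    return restored

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]  # synthetic "tensor"
weights[7] = 0.5  # one outlier stretches the range of whatever block holds it

err_per_block = mean_abs_error(weights, roundtrip(weights, block_size=32))
err_whole = mean_abs_error(weights, roundtrip(weights, block_size=256))
print(f"32-weight blocks: {err_per_block:.5f}  single block: {err_whole:.5f}")
```

With per-block scales, the outlier only degrades the one block that contains it instead of forcing a coarse step size on every weight.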
Example 3: Building a Shared Team API Server with vLLM
If multiple people will share the model, or you need agentic tool calling in your workflow, vLLM is a great choice. The best part when our team switched to this approach was that we could change just the endpoint without touching any of our existing OpenAI-based code.
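The --tool-call-parser qwen3_coder option is only available in recent vLLM releases. The 0.6.x series predates Qwen3-Coder, so upgrade to the latest version first and confirm that qwen3_coder appears among the parser choices in the server's --help output.
pip install -U vllm
# Single GPU setup for the 30B model (the unquantized BF16 weights are roughly 60GB, so a 24GB card such as the RTX 4090 needs a quantized checkpoint instead)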
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-Coder-30B-A3B-Instruct \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--port 8000
# Multi-GPU setup (8x A100 etc., for 480B model serving)
# python -m vllm.entrypoints.openai.api_server \
# --model Qwen/Qwen3-Coder-480B-A35B-Instruct \
# --tensor-parallel-size 8 \
# --max-model-len 131072 \
# ...
Once the server is up, just change base_url in your existing OpenAI SDK code and you're good to go.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="local" # Any value works for a local server
)
response = client.chat.completions.create(
model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
messages=[{"role": "user", "content": "Implement binary search in Python"}]
)
print(response.choices[0].message.content)
IDE Integration
Once your inference server is ready, the next step is connecting it to your editor. If you're at the point of "I've picked my inference server and just need to hook up my editor," this section is all you need.
Connecting the Continue Extension
Continue is an open-source AI coding assistant that supports both VS Code and JetBrains. After installing it, add the following configuration to .continue/config.json:
{
"models": [{
"title": "Qwen3-Coder (Local)",
"provider": "ollama",
"model": "qwen3-coder",
"contextLength": 32768
}]
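}
If you're using vLLM instead of Ollama as the backend, set "provider" to "openai", add "apiBase": "http://localhost:8000/v1", and set "model" to the name the vLLM server was started with (Qwen/Qwen3-Coder-30B-A3B-Instruct in the example above).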
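If you maintain both backends, it may be easier to generate the config than to hand-edit it. The sketch below prints the vLLM variant using only the fields already mentioned in this section. The function name continue_model_entry and the "Qwen3-Coder (vLLM)" title are my own, and you should check the output against the config schema of your Continue version before using it.

```python
import json

def continue_model_entry(backend):
    """Return a Continue model entry for the given local backend."""
    if backend == "ollama":
        return {
            "title": "Qwen3-Coder (Local)",
            "provider": "ollama",
            "model": "qwen3-coder",
            "contextLength": 32768,
        }
    if backend == "vllm":
        return {
            "title": "Qwen3-Coder (vLLM)",  # hypothetical display name
            "provider": "openai",  # vLLM speaks the OpenAI-compatible protocol
            "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",  # must match the served model
            "apiBase": "http://localhost:8000/v1",
            "contextLength": 32768,
        }
    raise ValueError(f"unknown backend: {backend}")

config = {"models": [continue_model_entry("vllm")]}
print(json.dumps(config, indent=2))
```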
CLINE + vLLM (Agentic Workflow)
If you want to use agentic workflows inside your IDE — like editing files, running terminal commands, or opening a browser — the CLINE extension combined with vLLM was the most stable option I found. In CLINE settings, select "OpenAI Compatible" and enter your vLLM server address. If the agent misbehaves on the same model, it's usually due to unstable tool calling support in older versions of Ollama — switching to vLLM resolves it in most cases.
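Before wiring a backend to an agent, it can help to check by hand whether tool calls come back in the structured form the agent expects. The sketch below builds an OpenAI-style request with one tool and parses tool calls out of a response dictionary. The read_file tool is hypothetical, and sample_response is a hand-written stand-in for what a server would return, not captured output.

```python
import json

# One hypothetical tool, described in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def build_request(prompt, model="Qwen/Qwen3-Coder-30B-A3B-Instruct"):
    """Chat completion payload that lets the model decide whether to call a tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",
    }

def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat completion response dict."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON string and must be decoded before use.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls

# Hand-written example of a well-formed response (not real server output).
sample_response = {"choices": [{"message": {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "read_file", "arguments": "{\"path\": \"README.md\"}"},
    }],
}}]}

print(extract_tool_calls(sample_response))
```

If a backend returns the tool call as plain text inside content instead of a tool_calls list, extract_tool_calls comes back empty, which is the kind of mismatch that makes an agent appear to misbehave.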
Pros and Cons
Here's what I observed after using it for a few weeks. The KV cache memory issue was my biggest pain point — I've elaborated on it below the table.
Pros
| Item | Details |
|---|---|
| Cost | Qwen API is roughly 17× cheaper per input token compared to Claude Opus 4.x; no ongoing cost after initial hardware investment for local runs |
| Performance | SWE-bench Verified 69.6% on the 480B model; competitive numbers for agentic multi-step reasoning |
| Context | 256K native context enables processing entire large codebases at once |
| Flexibility | Fully open-weight; fine-tunable and commercially usable; supported by all major inference frameworks |
| Language Support | Broad coverage: Python, JavaScript, Java, C++, Go, Rust, and more |
| Data Security | Sensitive code never leaves your machine to an external API server |
Cons and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Gap vs. top-tier models | Still trails Claude Opus 4.x (SWE-bench 80%+) and GPT-4.1 | Fine for everyday coding; use cloud for high-difficulty tasks |
| Hardware requirements | Minimum 16GB VRAM for 30B Q4_K_M | Adjust quantization level or use CPU hybrid inference |
| Ollama tool calling | Tool calling support for Qwen is unstable in older Ollama versions | Use LM Studio or vLLM for agentic workflows |
| KV cache memory | Memory grows proportionally with context length | 32K tokens is the practical ceiling on a 24GB GPU with Q5_K_M |
The KV cache entry in the cons table was my biggest headache. Running the Q5_K_M model on a 24GB GPU, I hit OOM twice trying to use contexts over 128K — and each time the in-progress inference was completely lost. After capping at 32K tokens and routing only tasks that truly needed a long context to the cloud API, things became much more stable.
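The routing rule I settled on can be sketched in a few lines. The 4-characters-per-token estimate is a rough heuristic rather than the model's real tokenizer, and the constants and the names estimate_tokens and choose_backend are assumptions for illustration.

```python
LOCAL_CTX_LIMIT = 32768      # the cap that proved stable on a 24GB GPU
RESERVED_FOR_OUTPUT = 4096   # assumed headroom for the model's reply
CHARS_PER_TOKEN = 4          # crude heuristic; a real tokenizer is more accurate

def estimate_tokens(text):
    """Very rough token count based on character length."""
    return len(text) // CHARS_PER_TOKEN + 1

def choose_backend(prompt, limit=LOCAL_CTX_LIMIT, reserve=RESERVED_FOR_OUTPUT):
    """Return "local" if prompt plus reply should fit the local context, else "cloud"."""
    if estimate_tokens(prompt) + reserve <= limit:
        return "local"
    return "cloud"

print(choose_backend("Refactor this function: def add(a, b): return a + b"))
print(choose_backend("x" * 200_000))
```

Because the estimate is crude, it is safer to leave a generous reserve than to run right up to the limit and lose an in-progress generation to an OOM.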
Most Common Mistakes in Practice
- Running with Ollama's default context: Qwen3-Coder running within a 4096-token limit can't deliver its true performance. If the model launches fine but performance seems off, check the context size first. That solves it most of the time.
- Always choosing the 480B model: The 30B-A3B Flash model is also plenty capable for general coding tasks. Without checking your hardware first, many people get blocked by disk space or VRAM when trying to download the full model. The 480B Q4 quantized version alone requires over 200GB of disk space.
- Using Ollama tool calling for agentic workflows: If you're using tool-call-based agentic workflows like CLINE or Qwen Code CLI, LM Studio or vLLM is far more stable than older Ollama versions. If the agent misbehaves on the same model, check this first. It resolves most issues.
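A short preflight check before starting a large download avoids the disk and VRAM surprises described above. This sketch uses nvidia-smi's CSV query output for VRAM and the standard library for disk space. The 5GB headroom and the function names are assumptions, and the VRAM part only applies to NVIDIA GPUs.

```python
import shutil
import subprocess

def free_disk_gb(path="."):
    """Free space on the filesystem containing path, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9

def parse_vram_mib(nvidia_smi_output):
    """Parse one total-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]

def gpu_vram_mib():
    """Return total VRAM per NVIDIA GPU, or an empty list if nvidia-smi is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_vram_mib(result.stdout)

def can_download(model_size_gb, path=".", headroom_gb=5):
    """True if the model plus some headroom fits in the free disk space."""
    return free_disk_gb(path) >= model_size_gb + headroom_gb

print("GPUs (MiB):", gpu_vram_mib())
print("Room for a 20GB model:", can_download(20))
```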
Closing Thoughts
After three weeks of team use, drafting PR review comments, generating unit tests, and writing boilerplate code all felt indistinguishable from the cloud API. There's still a real gap compared to Claude Opus 4.x, and limitations show up on complex architectural design problems. But the ability to reduce cloud API dependency while keeping sensitive code local was a clear win for the team.
Here are three steps you can take right now:
- Try it with Ollama: Run ollama pull qwen3-coder, then ollama run qwen3-coder, and raise the context window inside the session with /set parameter num_ctx 32768 (ollama run has no --context-size flag). Setup itself takes 5 minutes, but downloading the 30B model can take 30 minutes to several hours depending on your network. The default tag is typically the Q4_K_M build, and Ollama does not pick a quantization based on your VRAM, so make sure you have 16GB or more.
- Connect it to your IDE and use it for real coding tasks: Install the Continue extension, add the .continue/config.json config above, and you'll have inline AI assistance inside VS Code or JetBrains.
- Switch to vLLM if you need a shared team server: Deploy a vLLM API server on a GPU machine, and your whole team can use the local model by changing just the endpoint in existing OpenAI SDK code.
References
- QwenLM/Qwen3-Coder | GitHub Official Repository
- Qwen3-Coder: Agentic Coding in the World | Qwen Official Blog
- Qwen/Qwen3-Coder-480B-A35B-Instruct | Hugging Face
- Qwen3-Coder: How to Run Locally | Unsloth Official Docs
- Qwen3-Coder: The Most Capable Agentic Coding Model | Together AI
- Qwen3-Coder, An Open-Weight Agentic Coding Model | DigitalOcean
- Qwen3-Coder-Next Technical Report | arXiv
- Seeing What's Possible with OpenCode + Ollama + Qwen3-Coder | KDnuggets
- Qwen Official Docs — vLLM Deployment