Running Qwen3-Coder Locally: Setting Up an SWE-bench 70% AI Coding Agent with a Single RTX 3090

After watching my cloud AI bills double two months in a row, I started seriously looking for alternatives. Honestly, it wasn't so much a bias of "how good could open source be?" — it was more that I'd been burned by a few open-source coding models before. Especially on legacy codebases, whenever I tried multi-file refactoring, the model would quickly lose context, re-define functions it had already created, or touch parts it shouldn't have. That's when I discovered Qwen3-Coder.

After setting it up myself, my opinion changed. This article covers Qwen3-Coder's architectural characteristics, how to choose a quantization level, and local setup using Ollama/llama.cpp/vLLM — all distilled from my own trial and error. The SWE-bench Verified 69.6% figure turned out not to be marketing hype, and I'll share concrete experiences of how the 256K context actually changes the way you work.

The prerequisites are simpler than you'd think. A single RTX 3090/4090-class GPU is enough to run the 30B model comfortably, and on M1/M2 Macs the GGUF quantized version delivers usable performance. If you're not sure how much VRAM your GPU has, run nvidia-smi (NVIDIA) or system_profiler SPDisplaysDataType (Mac) in your terminal to check immediately.
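
If you would rather check VRAM from a script than read the nvidia-smi table by eye, here is a minimal sketch. It assumes an NVIDIA driver with nvidia-smi on the PATH; the helper names parse_vram_mib and query_vram_mib are my own for illustration.

```python
import subprocess


def parse_vram_mib(smi_output):
    """Parse one integer (MiB) per GPU from nvidia-smi's CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_vram_mib():
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)


# Usage (requires an NVIDIA GPU):
#   for index, mib in enumerate(query_vram_mib()):
#       print(f"GPU {index}: {mib / 1024:.1f} GiB")
```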

Core Concepts

This section briefly examines what makes Qwen3-Coder different from an architectural perspective. If you don't need the conceptual background right now, feel free to jump straight to the Practical Application section.

Why Is Qwen3-Coder "Coding-Specialized"?

Qwen3-Coder, built by Alibaba's Qwen team, is an open-weight LLM specialized for code generation, debugging, refactoring, and agentic software development. What sets it apart isn't just that it was trained on a lot of code data — the architecture itself was designed for agentic development scenarios.

Two model variants are currently available officially:

Model	Total Parameters	Active Parameters	Context Window
Qwen3-Coder-480B-A35B-Instruct	480B	35B	256K (up to 1M)
Qwen3-Coder-30B-A3B-Instruct (Flash)	30B	3B	256K

The "480B total but only 35B active parameters" part was confusing to me at first — this is the essence of the MoE architecture.

MoE (Mixture-of-Experts): A structure that divides the full parameters into multiple "expert" sub-networks and selectively activates only the relevant experts for each token. This means that despite the 480B scale, only 35B worth of computation actually occurs during inference.
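
To make the "only some experts run" idea concrete, here is a toy sketch of top-k routing in plain Python. It illustrates the general MoE pattern only and is not Qwen3-Coder's actual router; the expert functions, the logits, and k=2 are made up for the example.

```python
import math


def route_token(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(weight * experts[i](x) for i, weight in route_token(router_logits, k))


# Four toy "experts" that just scale their input by 1, 2, 3, and 4.
experts = [lambda x, s=s: s * x for s in (1, 2, 3, 4)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

print(route_token(logits))
print(moe_forward(10, experts, logits))
```

Only two of the four experts are evaluated for this token, which is the same reason a 480B-parameter model can run with roughly 35B parameters' worth of compute per token.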

What 256K Context Actually Means

I also thought at first "it's just a long context," but using it in practice makes a real difference. When refactoring a legacy codebase, you can dump 10–20 related files wholesale into the context and ask "change this pattern throughout." The manual work of splitting files, feeding them in pieces, and stitching results back together — which you had to do with older models — largely disappears.

Technically, 256K (262,144 tokens) is the model's native context length, and the "up to 1M" figure in the table comes from applying YaRN on top of it. YaRN rescales the rotary positional encoding so the model can handle positions beyond those seen during training. The attention layers use grouped-query attention (GQA), in which several query heads share one key/value head, so the per-token cache is much smaller than with standard multi-head attention. That is what keeps long contexts within a realistic memory budget, although "realistic" still depends heavily on your GPU.

However, there's one important trap. "Supporting 256K" and "actually being able to run 256K on your GPU" are completely different things.

KV Cache: The memory space that stores previous token information in attention layers. As the context grows longer, this cache grows linearly, so on a 24GB VRAM GPU with Q5_K_M quantization, around 32K tokens is the practical upper limit. I ran into OOM errors twice trying to push past 128K context before accepting this constraint.
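
A rough back-of-the-envelope calculation shows why the cache becomes the bottleneck. The sketch below assumes an FP16 cache and the 30B model's attention shape as I understand it from the published config (48 layers, 4 key/value heads, head dimension 128); treat those values as assumptions and check them against the model's config.json.

```python
def kv_cache_bytes(n_tokens, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer, KV head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2x = K and V
    return per_token * n_tokens


for ctx in (4096, 32768, 131072, 262144):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of KV cache")
```

Under these assumptions 32K tokens costs about 3 GiB on top of the weights and 128K costs about 12 GiB, which matches where my OOM errors started. llama.cpp can also quantize the cache to shrink these numbers, at some quality cost.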

Agentic Design: How It Differs from Simple Autocomplete

Qwen3-Coder's biggest differentiator is its agentic coding capability. It was trained via reinforcement learning (Agent RL) across 20,000 parallel environments simulating real software development workflows — in short, it learned to iterate through the "test fails → identify cause → fix → retest" loop.

On the 480B-Instruct model, it achieves SWE-bench Verified 69.6% (at 500 turns). This isn't just about generating good code snippets — the number comes from scenarios resolving real GitHub issues, which is why the difference is noticeable.
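
The "test fails → identify cause → fix → retest" loop can be sketched as a small driver. This shows only the shape of the loop an agent harness runs, not Qwen's training setup; run_tests, propose_fix, and apply_fix are hypothetical callables you would wire to your test runner and to the model.

```python
def fix_until_green(run_tests, propose_fix, apply_fix, max_fixes=5):
    """Run tests, ask for a fix on failure, apply it, and retest.

    Returns the number of fixes applied once tests pass,
    or None if tests still fail after max_fixes attempts.
    """
    fixes = 0
    while True:
        passed, log = run_tests()
        if passed:
            return fixes
        if fixes == max_fixes:
            return None
        apply_fix(propose_fix(log))  # the model reads the failure log here
        fixes += 1


# Simulated project with two bugs; each "fix" removes one.
state = {"bugs": 2}


def fake_run_tests():
    return state["bugs"] == 0, f"{state['bugs']} tests failing"


def fake_propose_fix(log):
    return f"patch for: {log}"


def fake_apply_fix(patch):
    state["bugs"] -= 1


print(fix_until_green(fake_run_tests, fake_propose_fix, fake_apply_fix))
```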

Practical Application

This covers the fastest way to get started for each environment. It's organized as Ollama (quick start), llama.cpp (consumer GPU/Mac), and vLLM (team server) — so feel free to read only the section that fits your setup.

Example 1: Quick Start with Ollama

This is the fastest way to get started. I used this when I first tested it, and there's one trap worth knowing in advance.

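⚠️ Important: Ollama's default context size is 4096 tokens. Even though Qwen3-Coder supports 256K, running it with this default truncates the prompt at about 4K tokens. If the model runs fine but output quality seems noticeably poor, this is almost always the reason. Note that ollama run has no --context-size flag. Set the num_ctx parameter instead, either per session with /set parameter num_ctx or server-wide with the OLLAMA_CONTEXT_LENGTH environment variable.

bash

# Model tags may vary, so check the model library on ollama.com first
ollama pull qwen3-coder
 
# Option A: raise the default context for every model the server loads.
# Run this in its own terminal; if Ollama already runs as a background
# service, set the variable in the service environment instead.
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
 
# Option B: raise it for a single interactive session
ollama run qwen3-coder
# then, at the >>> prompt, type:
#   /set parameter num_ctx 32768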

Download time can be significant depending on model size. The 30B Q4_K_M version is about 20GB, so depending on your network it can take anywhere from 30 minutes to several hours. Setup itself takes 5 minutes, but factor in the download time.
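
When calling Ollama from code rather than the CLI, the context size can be set per request through the options field of its REST API. Here is a minimal sketch using only the standard library; it assumes Ollama is listening on its default port 11434 and that the model tag is qwen3-coder, and the helper names are mine.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3-coder", num_ctx=32768,
                       host="http://localhost:11434"):
    """Build a non-streaming /api/chat request with an explicit context size."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the server default
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["message"]["content"]


# Usage (requires a running Ollama server with the model pulled):
#   print(chat("Explain what this regex does: ^\\d{3}-\\d{4}$"))
```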

If you prefer a GUI, LM Studio is another option. Install the app from lmstudio.ai, type qwen3-coder in the search bar, and you can download the Q4_K_M quantized version directly. Server mode automatically exposes an OpenAI-compatible API (http://localhost:1234/v1), which made this approach much easier when deploying to teammates less familiar with the code.

Example 2: llama.cpp + GGUF (Consumer GPU / Mac)

If you're low on GPU memory or working on a Mac, the GGUF quantized version is the practical choice. Unsloth is a third-party open-source team known for memory-efficient quantization and fine-tuning — they've published optimized GGUF versions on Hugging Face that you can download and use immediately.

If you don't have llama.cpp installed, you'll need to do that first.

bash

# For Mac
brew install llama.cpp
 
# For Linux / manual build
# See https://github.com/ggerganov/llama.cpp#build

After installation, you can download the model and start the server as follows:

bash

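# Download Unsloth GGUF
# (repo and file names follow the official Instruct naming; confirm both on the Hugging Face page)
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  --include "Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf" \
  --local-dir ./models
 
# Start llama-server (serves an OpenAI-compatible API on the given port)
llama-server \
  -m ./models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --port 8080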

--n-gpu-layers 99 asks llama.cpp to offload up to 99 layers to the GPU. The model has fewer layers than that, so in effect it means "offload everything." If VRAM is insufficient, lower this number to split the layers between GPU and CPU. That is slower, but it works well even on M1/M2 Macs with unified memory.
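
For picking a starting value when the model does not fit, a crude heuristic is to assume every layer is the same size and see how many fit after reserving room for the KV cache and runtime overhead. The sketch below is only that heuristic, not how llama.cpp decides anything; the 48-layer count and the 3 GiB reserve are assumptions, and the 20 GiB model size is the rough download size mentioned earlier.

```python
def layers_that_fit(model_gib, n_layers, free_vram_gib, reserve_gib=3.0):
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gib = model_gib / n_layers
    usable_gib = max(free_vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical inputs: a ~20 GiB GGUF with 48 layers.
for vram in (24.0, 16.0, 8.0):
    n = layers_that_fit(model_gib=20.0, n_layers=48, free_vram_gib=vram)
    print(f"{vram:.0f} GiB VRAM -> try --n-gpu-layers {n}")
```

Treat the result as a first guess and adjust downward if llama-server still runs out of memory at your chosen context size.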

Here's a summary of quantization level selection criteria:

Quantization	VRAM (30B)	Quality	Recommended For
Q4_K_M	16GB	Good	RTX 3090, general tasks
Q5_K_M	20GB	Excellent	RTX 4090, code generation focus
Q8_0	32GB+	Near full precision	A6000, accuracy-first

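Note: the K in Q4_K_M and Q5_K_M marks llama.cpp's "k-quant" family, and the trailing M is the medium size variant (S, M, and L trade file size against quality). It does not stand for K-means. Rather than simple bit truncation, k-quants store weights in small blocks with per-block scales and keep more bits for the tensors that are most sensitive, so quality is better at a similar bit width. You can apply this same criterion when choosing other GGUF models.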
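
To see why block-wise quantization beats plain truncation, here is a toy sketch of quantizing one block of weights to 4 bits with its own scale and minimum. It only illustrates the general idea; real GGUF quants group blocks into larger super-blocks and also quantize the scales, and this is not llama.cpp's code.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to small integer codes plus a scale and a minimum."""
    levels = (1 << bits) - 1  # 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return scale, lo, codes


def dequantize_block(scale, lo, codes):
    """Reconstruct approximate floats from the stored codes."""
    return [lo + scale * c for c in codes]


block = [0.013, -0.242, 0.118, 0.301, -0.095, 0.007, 0.220, -0.188]  # made-up weights
scale, lo, codes = quantize_block(block)
restored = dequantize_block(scale, lo, codes)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"worst-case error: {worst:.4f} (bounded by scale/2 = {scale / 2:.4f})")
```

Because each block carries its own scale, the rounding error is bounded by that block's range rather than by the range of the whole tensor.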

Example 3: Building a Shared Team API Server with vLLM

If multiple people will share the model, or you need agentic tool calling in your workflow, vLLM is a great choice. The best part when our team switched to this approach was that we could change just the endpoint without touching any of our existing OpenAI-based code.

The --tool-call-parser qwen3_coder option only exists in recent vLLM releases. Qwen3-Coder shipped in July 2025, so the 0.6.x line predates it; upgrade to a current vLLM before starting.

bash

pip install -U vllm
 
# Single GPU setup for the 30B model
# BF16 weights alone are roughly 60GB, so this needs an 80GB-class GPU;
# a 24GB card such as the RTX 4090 needs a quantized checkpoint instead
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --port 8000
 
# Multi-GPU setup (8x A100 etc., for 480B model serving)
# BF16 weights are roughly 960GB, so 8x 80GB GPUs need a quantized checkpoint
# python -m vllm.entrypoints.openai.api_server \
#   --model Qwen/Qwen3-Coder-480B-A35B-Instruct \
#   --tensor-parallel-size 8 \
#   --max-model-len 131072 \
#   ...

Once the server is up, just change base_url in your existing OpenAI SDK code and you're good to go.

python

from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="local"  # Any value works for a local server
)
 
response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Implement binary search in Python"}]
)
print(response.choices[0].message.content)
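
Since the server above is started with tool calling enabled, here is a minimal sketch of one tool-call round trip through the same OpenAI SDK interface. The read_file tool, the LOCAL_TOOLS registry, and the helper names are hypothetical, and real agents loop until the model stops requesting tools.

```python
import json

# Tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file in the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]


def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()


LOCAL_TOOLS = {"read_file": read_file}


def run_tool_call(name, arguments_json):
    """Execute one requested tool; the model sends arguments as a JSON string."""
    if name not in LOCAL_TOOLS:
        return f"unknown tool: {name}"
    return LOCAL_TOOLS[name](**json.loads(arguments_json))


def ask_with_tools(client, model, prompt):
    """Send one prompt, then run whatever tools the model asked for."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    calls = response.choices[0].message.tool_calls or []
    return [(c.function.name, run_tool_call(c.function.name, c.function.arguments))
            for c in calls]


# Usage, with the client from the example above:
#   ask_with_tools(client, "Qwen/Qwen3-Coder-30B-A3B-Instruct", "Summarize README.md")
```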

IDE Integration

Once your inference server is ready, the next step is connecting it to your editor. If you're at the point of "I've picked my inference server and just need to hook up my editor," this section is all you need.

Connecting the Continue Extension

Continue is an open-source AI coding assistant that supports both VS Code and JetBrains. After installing it, add the following configuration to ~/.continue/config.json (recent Continue releases have moved to config.yaml, so check which file your version reads):

json

{
  "models": [{
    "title": "Qwen3-Coder (Local)",
    "provider": "ollama",
    "model": "qwen3-coder",
    "contextLength": 32768
  }]
}

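If you're using vLLM instead of Ollama as the backend, change "provider" to "openai", add "apiBase": "http://localhost:8000/v1", and set "model" to the name the server was launched with (Qwen/Qwen3-Coder-30B-A3B-Instruct in the vLLM example).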

CLINE + vLLM (Agentic Workflow)

If you want to use agentic workflows inside your IDE — like editing files, running terminal commands, or opening a browser — the CLINE extension combined with vLLM was the most stable option I found. In CLINE settings, select "OpenAI Compatible" and enter your vLLM server address. If the agent misbehaves on the same model, it's usually due to unstable tool calling support in older versions of Ollama — switching to vLLM resolves it in most cases.

Pros and Cons

Here's what I observed after using it for a few weeks. The KV cache memory issue was my biggest pain point — I've elaborated on it below the table.

Pros

Item	Details
Cost	Qwen API is roughly 17× cheaper per input token compared to Claude Opus 4.x; no ongoing cost after initial hardware investment for local runs
Performance	SWE-bench Verified 69.6% on the 480B model; competitive numbers for agentic multi-step reasoning
Context	256K native context enables processing entire large codebases at once
Flexibility	Fully open-weight; fine-tunable and commercially usable; supported by all major inference frameworks
Language Support	Broad coverage: Python, JavaScript, Java, C++, Go, Rust, and more
Data Security	Sensitive code never leaves your machine to an external API server

Cons and Caveats

Item	Details	Mitigation
Gap vs. top-tier models	Still trails Claude Opus 4.x (SWE-bench 80%+) and GPT-4.1	Fine for everyday coding; use cloud for high-difficulty tasks
Hardware requirements	Minimum 16GB VRAM for 30B Q4_K_M	Adjust quantization level or use CPU hybrid inference
Ollama tool calling	Tool calling support for Qwen is unstable in older Ollama versions	Use LM Studio or vLLM for agentic workflows
KV cache memory	Memory grows proportionally with context length	32K tokens is the practical ceiling on a 24GB GPU with Q5_K_M

The KV cache entry in the cons table was my biggest headache. Running the Q5_K_M model on a 24GB GPU, I hit OOM twice trying to use contexts over 128K — and each time the in-progress inference was completely lost. After capping at 32K tokens and routing only tasks that truly needed a long context to the cloud API, things became much more stable.
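
The routing rule I settled on can be sketched in a few lines. The 4-characters-per-token estimate is a rough assumption (a real tokenizer gives exact counts), and the backend names and the 4096-token output reserve are placeholders of mine.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; swap in a real tokenizer for accuracy."""
    return len(text) // chars_per_token + 1


def choose_backend(prompt, local_ctx=32768, reserved_for_output=4096):
    """Route to the local model unless the prompt would crowd out the reply."""
    budget = local_ctx - reserved_for_output
    return "local" if estimate_tokens(prompt) <= budget else "cloud"


print(choose_backend("Refactor this function: ..."))
print(choose_backend("x" * 400_000))
```

Checking before sending is cheaper than discovering the limit through an OOM halfway into a generation.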

Most Common Mistakes in Practice

Running with Ollama's default context: Qwen3-Coder running within a 4096-token limit can't deliver its true performance. If the model launches fine but performance seems off, check the context size first — that solves it most of the time.
Always choosing the 480B model: The 30B-A3B Flash model is also plenty capable for general coding tasks. Without checking your hardware first, many people get blocked by disk space or VRAM when trying to download the full model. The 480B Q4 quantized version alone requires over 200GB of disk space.
Using Ollama tool calling for agentic workflows: If you're using tool-call-based agentic workflows like CLINE or Qwen Code CLI, LM Studio or vLLM is far more stable than older Ollama versions. If the agent misbehaves on the same model, check this first — it resolves most issues.

Closing Thoughts

After three weeks of team use, drafting PR review comments, generating unit tests, and writing boilerplate code all felt indistinguishable from the cloud API. There's still a real gap compared to Claude Opus 4.x, and limitations show up on complex architectural design problems. But the ability to reduce cloud API dependency while keeping sensitive code local was a clear win for the team.

Here are three steps you can take right now:

Try it with Ollama: Run ollama pull qwen3-coder, then ollama run qwen3-coder, and raise the context inside the session with /set parameter num_ctx 32768. Setup itself takes 5 minutes, but downloading the 30B model can take 30 minutes to several hours depending on your network. The default tag typically resolves to a Q4_K_M build regardless of your VRAM, so pull a specific tag if you want a different quantization.
Connect it to your IDE and use it for real coding tasks: Install the Continue extension, add the .continue/config.json config above, and you'll have inline AI assistance inside VS Code or JetBrains.
Switch to vLLM if you need a shared team server: Deploy a vLLM API server on a GPU machine, and your whole team can use the local model by changing just the endpoint in existing OpenAI SDK code.
