How to Fine-Tune a Domain-Specific SLM with QLoRA on a Single Consumer GPU

The moment GPT-4o API costs start piling up, you naturally wonder: "Could we train this ourselves for our domain?" I had the same thought. It started with a shocking bill after using GPT-4 in a medical record summarization pipeline. I began taking the idea seriously after seeing benchmark results in which a single fine-tuned Qwen3-4B outperformed much larger general-purpose models on specific domain tasks.

A well-fine-tuned SLM can outperform a general-purpose LLM tens of times its size on specific domain tasks, with inference costs at 1/10 to 1/100 of the price. As of 2026, a single RTX 4070 Ti and half a day are enough to transform a 7B model into a domain expert. This post walks through the core concepts, actually runnable code, and lessons learned from hard-won experience. By the end, you'll have the foundation to train your first domain-specific SLM using the Unsloth + QLoRA combination.

Core Concepts

What Is an SLM, and Why Now

SLM (Small Language Model) generally refers to small language models in the 1B–13B parameter range. The definition itself is simple, but what matters is that "small" no longer means "worse."

There are three reasons why the SLM fine-tuning ecosystem is exciting in 2026: the hardware barrier has dropped dramatically (a single RTX 4070 Ti is enough), base model quality has improved significantly, and framework maturity has reached a level sufficient for production use. GlobalData projects the SLM market will reach $20.7 billion by 2030, growing at a 15.1% CAGR.

SLM vs LLM: This is a difference in usage strategy, not just scale. LLMs excel at general-purpose reasoning, while fine-tuned SLMs can beat them on narrow, well-defined domain tasks. It's about choosing the right tool, not ranking one above the other.

3 Fine-Tuning Methodologies: What Should You Choose

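When I first looked into fine-tuning, SFT, LoRA, and QLoRA all appeared at the same time, and I was confused for a while. The key is to choose based on memory constraints and acceptable performance trade-offs. One note on terminology: SFT strictly names the supervised training objective, which LoRA and QLoRA also use. In the comparison below, "SFT" means full fine-tuning, where every weight is updated.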

Method	How It Works	VRAM Requirement	Catastrophic Forgetting Risk
Full fine-tuning (full SFT)	Updates all weights	Very high (40GB+)	High
LoRA	Inserts small trainable matrices into key layers, freezes originals	Medium (16–24GB)	Low
QLoRA	LoRA on top of a 4-bit quantized base model	Low (8–12GB)	Low

Quantization here means compressing model weights from 32-bit or 16-bit floating point into a 4-bit representation for storage. QLoRA uses the 4-bit NormalFloat (NF4) data type rather than plain integers. Some information is lost, but VRAM usage drops sharply, which makes it possible to load a 7B model on a consumer GPU. QLoRA keeps the quantized base weights frozen and trains LoRA adapters on top of them, making it the best starting point for realistic on-premises environments.
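To make the savings concrete, the weight-only footprint can be estimated as parameters × bits ÷ 8. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it counts model weights only and ignores activations, adapter weights, optimizer state, and quantization metadata, so real VRAM usage will be higher. The function name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Assumption: a 7B-parameter model, weights only.
for bits in (32, 16, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions, the same 7B weights that need roughly 14 GB at 16-bit fit in about 3.5 GB at 4-bit, which is what leaves headroom for training on a 12 GB card.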

Catastrophic Forgetting: The phenomenon where a model loses the general-purpose reasoning capabilities it previously had when fine-tuned on domain-specific data. Because LoRA freezes the original weights and only trains small matrices, this phenomenon occurs far less than with full SFT.

Mathematical Intuition Behind LoRA

No need to overthink it. Instead of directly updating the original weight matrix W, the change is approximated by the product of two much smaller matrices A and B. Here, r is the rank.

W_new = W_original + (B × A) × α/r

The formula may look intimidating, but the meaning is simple: "Don't touch the original weights; just train two small matrices." A smaller rank means fewer trainable parameters and lower VRAM use, but also less expressiveness. In practice, lowering the rank to 4 caused unstable convergence, and raising it to 32 ran out of VRAM on an RTX 4070 Ti. r=16 is a practical starting point for domain adaptation.

lora_alpha is also worth noting. The adapter update is scaled by alpha/r, so setting alpha=16 with r=16 fixes the scaling factor at 1. Keeping alpha equal to r leaves the learning rate as the main knob controlling how strongly the adapter update is applied, which keeps the starting point simple during hyperparameter search.
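The formula and the alpha/r scaling can be checked with a few lines of plain Python. This is a minimal sketch, not how PEFT or Unsloth implement LoRA internally: the helper names are illustrative, and the 4096 × 4096 projection size is an assumed example rather than a figure taken from a specific model.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


def matmul(x, y):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def apply_lora(w, b, a, alpha: float, r: int):
    """Return W + (B x A) * alpha / r, matching the formula above."""
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Assumed example: one 4096 x 4096 projection matrix with rank 16.
full_params = 4096 * 4096
lora_params = lora_trainable_params(4096, 4096, r=16)
print(f"full: {full_params:,}  lora: {lora_params:,}  "
      f"ratio: {lora_params / full_params:.2%}")

# Tiny numeric check of the update rule with r=1 and alpha=2.
w_new = apply_lora(w=[[1, 0], [0, 1]], b=[[1], [2]], a=[[3, 4]], alpha=2, r=1)
print(w_new)
```

For this assumed layer size, the two adapter matrices hold under 1% of the parameters of the frozen weight they modify, which is where the memory savings in the optimizer state come from.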

Practical Application

Example 1: Training a Financial Domain SLM with QLoRA

This is a basic setup for fine-tuning Qwen3-4B for a financial QA task. According to Unsloth's published figures, it makes the same work about 2x faster and cuts VRAM use by up to 70% compared to using HuggingFace Transformers alone.

There's one pitfall worth noting. When fine-tuning models that ship with a chat template, such as Qwen3-Instruct, the data format has a major impact on results. A simple text format like {"text": "question\nanswer"} differs from the input structure the model expects, which significantly reduces fine-tuning effectiveness. Use the tokenizer's apply_chat_template to convert the data into the format the model expects.
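To see why the format matters, compare the plain text with a ChatML-style rendering, which is the family of template Qwen models use. The `render_chatml` function below is a simplified, hypothetical stand-in written for illustration only. The real template comes from the tokenizer and may add a system prompt or other tokens, so real training data should always go through `tokenizer.apply_chat_template`. The sample question and answer are made up.

```python
def render_chatml(messages: list[dict]) -> str:
    """Simplified ChatML-style rendering; the real template lives in the tokenizer."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


# Made-up sample record in the {"question", "answer"} shape used in this post.
example = {
    "question": "What is duration risk?",
    "answer": "The sensitivity of a bond's price to interest rate changes.",
}

plain_text = f"{example['question']}\n{example['answer']}"
chat_text = render_chatml(
    [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
)

print("--- plain ---")
print(plain_text)
print("--- chat template ---")
print(chat_text)
```

The plain version carries no role markers at all, so the model never sees where the user turn ends and the assistant turn begins.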

python

from unsloth import FastLanguageModel, is_bfloat16_supported  # import unsloth before trl/transformers
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# 1. Load model (4-bit quantization)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B-Instruct-2507",  # instruct checkpoint; confirm the exact repo ID on the Hub
    max_seq_length=2048,
    dtype=None,         # auto-detect (bf16 on GPUs that support it, otherwise fp16)
    load_in_4bit=True,  # enable 4-bit quantization (QLoRA)
)

# 2. Configure LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,           # rank: balance between expressiveness and VRAM savings
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections only
    lora_alpha=16,  # neutralize scaling factor with alpha/r = 1
    lora_dropout=0.05,  # Unsloth's fastest path assumes 0; nonzero works but trains slower
    bias="none",
    use_gradient_checkpointing="unsloth",  # trades extra compute for lower activation memory
)

# 3. Dataset: convert using the model's chat template
raw_dataset = load_dataset("json", data_files="finance_qa.jsonl")["train"]

def format_with_chat_template(example):
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = raw_dataset.map(format_with_chat_template)

# 4. Training
# Note: these SFTTrainer argument names follow older TRL releases; newer TRL
# moves dataset_text_field and the sequence-length setting into SFTConfig.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 2 × 4 = 8 without extra VRAM
        warmup_steps=50,                # gradually increase learning rate early to prevent unstable convergence
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        fp16=not is_bfloat16_supported(),  # mixed precision must match the dtype the model was loaded in
        bf16=is_bfloat16_supported(),      # fp16=True alone conflicts with a bf16-loaded model
        logging_steps=10,
        output_dir="./finance-qwen3-4b",
        report_to="wandb",  # requires a logged-in wandb account; use "none" to disable
    ),
)
 
trainer.train()
 
# 5. Save LoRA adapter and tokenizer
model.save_pretrained("./finance-qwen3-4b-lora")
tokenizer.save_pretrained("./finance-qwen3-4b-lora")

Setting	Reason
`r=16`	Standard starting point for domain adaptation, balances expressiveness and VRAM savings
`lora_alpha=16`	Neutralizes scaling with `alpha/r = 1`, enabling intuitive control of learning rate effect
`gradient_accumulation_steps=4`	Quadruples the effective batch size (2 × 4 = 8) without increasing peak VRAM
`learning_rate=2e-4`	Common starting point for LoRA training, can be set higher than in full fine-tuning
`cosine` scheduler	Improves convergence stability in later stages of training
`warmup_steps=50`	Gradually increases initial learning rate to prevent early divergence
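The interaction between batch size, accumulation, warmup, and the cosine schedule is easier to see with numbers. The sketch below assumes a dataset of 1,000 samples and reimplements linear warmup followed by cosine decay in plain Python. It approximates what the trainer does rather than calling it, and the exact step count at epoch boundaries can differ slightly between trainer versions. Function names are illustrative.

```python
import math


def total_optimizer_steps(num_samples: int, per_device_batch: int,
                          grad_accum: int, epochs: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = max(num_samples // effective_batch, 1)
    return steps_per_epoch * epochs


def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: 1,000 training samples with the settings from the table above.
total = total_optimizer_steps(1000, per_device_batch=2, grad_accum=4, epochs=3)
print("total optimizer steps:", total)
for step in (0, 25, 50, 200, total):
    print(f"step {step:>3}: lr = {lr_at_step(step, 2e-4, 50, total):.2e}")
```

With these assumed numbers, warmup covers 50 of 375 updates, so the peak learning rate is reached a little over a tenth of the way into training. On a much smaller dataset the same `warmup_steps=50` would take up a far larger share of the run.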

After training, here is how to load the saved LoRA adapter and use it for inference.

python

from unsloth import FastLanguageModel
 
# Load saved LoRA adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./finance-qwen3-4b-lora",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
 
# Apply chat template and run inference
messages = [{"role": "user", "content": "What is the interest rate outlook for Q4 2025?"}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to("cuda")
 
outputs = model.generate(input_ids=inputs, max_new_tokens=256, do_sample=True, temperature=0.1)  # temperature only applies when sampling
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Example 2: RAG + SLM Hybrid Pipeline

Fine-tuning alone has its limits. The model can't know about recent information not in the training data or vast internal documents. That's why our team chose a dual strategy: inject format and specialized terminology through fine-tuning, and supplement with recent information and internal data via RAG. We've seen quite a few cases fail by trying to cram both roles into fine-tuning alone.

Below is an example written using the LCEL approach based on LangChain v0.2+. The older langchain.llms and langchain.vectorstores paths have moved to langchain_community, and RetrievalQA is deprecated, so the approach below is recommended.

python

from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from transformers import pipeline
import torch
 
# Load fine-tuned SLM (the directory holds only the LoRA adapter, so this relies on PEFT being installed)
slm_pipeline = pipeline(
    "text-generation",
    model="./finance-qwen3-4b-lora",
    torch_dtype=torch.float16,
    device_map="auto",
    max_new_tokens=512,
)
llm = HuggingFacePipeline(pipeline=slm_pipeline)
 
# Internal document vector DB (Chroma)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
vectorstore = Chroma(
    persist_directory="./company_docs_db",
    embedding_function=embeddings,
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
 
# LCEL-based RAG chain
prompt = ChatPromptTemplate.from_template(
    """Answer the question using the following context.
 
Context:
{context}
 
Question: {question}"""
)
 
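def format_docs(docs):
    # Join retrieved Document objects into plain text instead of passing their repr
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)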
 
result = chain.invoke("What are the risk management standards for Q4 2025?")
print(result)

The key to this pattern is separation of responsibilities. The SLM handles "how to answer" (format, terminology, tone), while RAG handles "what to reference" (recent information, internal data).
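The split can be illustrated without any framework. The sketch below is a toy, dependency-free stand-in for the retrieval half: it ranks documents by word overlap instead of embeddings and assembles the same prompt shape as the chain above. The documents, the question, and the function names are made up for illustration, and a real system would use a vector store as in the LangChain example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for embedding similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    """RAG side: decide WHAT to reference by ranking on word overlap."""
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_docs: list[str]) -> str:
    """Assemble the prompt; HOW to answer is left to the fine-tuned model."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using the following context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up internal documents for illustration only.
docs = [
    "Market risk limits are approved by the risk committee.",
    "The cafeteria menu changes on Mondays.",
    "Credit exposure limits are reviewed every quarter.",
]
question = "Who approves market risk limits?"
top_docs = retrieve(question, docs, k=2)
prompt = build_prompt(question, top_docs)
print(prompt)
```

Swapping the document set changes what the model can cite without retraining, while changing tone or output format is a job for the fine-tuning data.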

Pros and Cons Analysis

Advantages

Item	Description
Cost efficiency	Inference costs at 1/10 to 1/100 of GPT-4o, savings amplify with higher traffic
Response latency	No API call overhead with on-premises deployment, advantageous for real-time services
Data privacy	Fully contained within internal infrastructure, essential for regulated industries like finance, healthcare, and legal
Domain accuracy	Can outperform general-purpose models tens of times larger on specific domain tasks
Low data threshold	Meaningful specialization achievable with 1,000–5,000 high-quality samples (varies by domain and task complexity)

Disadvantages and Caveats

Item	Description
Catastrophic forgetting	General-purpose reasoning and general QA capability may degrade after domain training
Data quality dependency	Biased or unrepresentative data leads to overfitting and poor generalization
Evaluation pipeline construction	Domain-specific benchmarks are difficult to establish
Hyperparameter search	Optimal combinations of rank, target layers, and learning rate vary by domain

Catastrophic forgetting is greatly mitigated with LoRA compared to full SFT, since the original weights are frozen. If you want to address it more proactively, there is also the SA-SFT technique, where the model itself generates general conversation data before fine-tuning to use as preservation data. It's an approach that can be applied without additional external data—if you're interested, you can find more details in the arXiv 2025 paper.

The Most Common Mistakes in Practice

Fixating on data volume alone: Diminishing returns set in beyond 5,000 samples. Carefully curating 1,000 samples is often far more effective than indiscriminately collecting 10,000. There have been cases where a curated dataset of 800 samples produced better results than a noisy dataset of 3,000.
Defining evaluation criteria after the fact: If you don't define "what counts as success" before training, there is no way to tell whether the model got better or worse. Design the evaluation before training begins, combining automatic metrics like ROUGE or BERTScore with a human read of actual outputs.
Choosing a base model arbitrarily: For coding and math tasks, Phi-4 Mini has the edge; for multilingual and general NLP, Qwen3; for community resources, Llama 3.x. Benchmarks consistently show that base model selection has a greater impact on fine-tuning results than hyperparameter tuning.
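For the first mistake, even a mechanical cleaning pass removes a lot of noise before any human review. The sketch below is one possible starting point, assuming records in the {"question": "...", "answer": "..."} shape used in this post. The `curate` function and the 20-character minimum are arbitrary illustrative choices, the sample records are made up, and judging whether an answer is actually correct still needs a human.

```python
import json


def curate(records: list[dict], min_answer_chars: int = 20) -> list[dict]:
    """Drop empty questions, very short answers, and duplicate questions."""
    seen_questions = set()
    kept = []
    for record in records:
        question = str(record.get("question", "")).strip()
        answer = str(record.get("answer", "")).strip()
        if not question or len(answer) < min_answer_chars:
            continue
        # Normalize case and whitespace so near-identical questions collapse.
        key = " ".join(question.lower().split())
        if key in seen_questions:
            continue
        seen_questions.add(key)
        kept.append({"question": question, "answer": answer})
    return kept


# Made-up JSONL lines for illustration.
raw_lines = [
    '{"question": "What is a coupon?", "answer": "The periodic interest payment made by a bond issuer."}',
    '{"question": "what is a  coupon?", "answer": "Periodic interest paid to bondholders by the issuer."}',
    '{"question": "What is yield?", "answer": "N/A"}',
    '{"question": "", "answer": "An answer with no question attached to it."}',
]
records = [json.loads(line) for line in raw_lines]
clean = curate(records)
print(f"{len(records)} records -> {len(clean)} after curation")
```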

Closing Thoughts

SLM fine-tuning is no longer the domain of ML researchers. It is an accessible option for any development team with domain data and a single GPU. At first, the many concepts can feel overwhelming, but ultimately it comes down to three pillars: QLoRA + good data + a pre-defined evaluation pipeline. The fastest path to real-world success was letting go of two misconceptions: "collecting more data will solve it" and "we can define evaluation criteria after training."

Three steps you can start right now:

Prepare a base model and data: Choose a base model suited to your purpose (Qwen3-4B-Instruct or microsoft/Phi-4-mini-instruct), and curate 1,000 domain QA pairs into a JSONL file in {"question": "...", "answer": "..."} format. Since data quality matters more than quantity, remove ambiguous or incorrect answers in advance.
Run your first training with Unsloth + QLoRA: After pip install unsloth trl, try running the example code above. With an RTX 4070 Ti or better, a batch size of 2 and gradient accumulation of 4 should complete 3 epochs of training in about 3–6 hours. Connecting wandb at the same time lets you monitor the learning curve in real time.
Compare performance with a domain benchmark: Evaluate both the base model before training and the fine-tuned model after on the same set of 20–50 test examples. Combining automatic metrics like ROUGE-L or BERTScore with direct comparison of actual outputs lets you see firsthand what fine-tuning brings to the table.
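The before/after comparison in the last step can start as small as the sketch below. It is a simplified ROUGE-L F1 based on the longest common subsequence, with whitespace tokenization and no stemming, so its scores will not match an established ROUGE package exactly. The reference and the two model outputs are made-up placeholder strings, not real results, and the function names are illustrative.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_rouge_l(references: list[str], candidates: list[str]) -> float:
    """Average score over a test set of paired references and outputs."""
    scores = [rouge_l_f1(r, c) for r, c in zip(references, candidates)]
    return sum(scores) / len(scores)


# Placeholder strings standing in for a held-out test set and model outputs.
references = ["duration measures a bond's price sensitivity to interest rate changes"]
base_outputs = ["duration is how long a bond lasts"]
tuned_outputs = ["duration measures the price sensitivity of a bond to interest rate changes"]

print(f"base : {mean_rouge_l(references, base_outputs):.3f}")
print(f"tuned: {mean_rouge_l(references, tuned_outputs):.3f}")
```

Running both models over the same fixed test set and keeping the script unchanged between runs is what makes the two numbers comparable. The metric is only a first filter before reading the outputs side by side.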
