When Pre-Trained Models Aren’t Enough

A pre-trained 7B model can write Python, answer questions, and follow instructions. But ask it to generate API responses matching your internal format, write documentation in your team’s style, or classify support tickets using your category taxonomy, and it stumbles. The model’s general knowledge doesn’t cover your specific patterns.

Fine-tuning teaches a pre-trained model your specific domain. You provide examples of the input-output pairs you care about, and the model adjusts its weights to produce better results for those patterns. The general capabilities remain. The domain-specific accuracy improves.

This post covers the practical landscape of fine-tuning as of late 2025: which methods exist, what hardware they need, which tools work, and what the actual cost looks like in time and compute.


Fine-Tuning Methods: From Heavy to Lightweight

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 7B model, that’s 7 billion floating-point numbers being adjusted during training. This requires loading the model weights, the optimizer states, and the gradients all in memory simultaneously.

Memory requirement: Roughly 4x the model’s inference memory. A 7B model at FP16 needs 14 GB for the weights alone; gradients and optimizer states add several times that again. Total: approximately 56 GB of VRAM, before counting activations.
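
The 4x rule comes from bytes-per-parameter accounting. A back-of-envelope sketch, assuming 16-bit weights, 16-bit gradients, and 16-bit Adam moments (full-precision optimizer states would roughly double the total):

```python
def full_finetune_vram_gb(params_billion: float,
                          bytes_weights: int = 2,  # FP16 weights
                          bytes_grads: int = 2,    # FP16 gradients
                          bytes_optim: int = 4):   # Adam m + v, 2 bytes each
    """Rough VRAM estimate for full fine-tuning, ignoring activations."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optim
    return params_billion * 1e9 * bytes_per_param / 1e9

print(full_finetune_vram_gb(7))  # -> 56.0, matching the estimate above
```

Activation memory scales with batch size and sequence length on top of this, which is why real runs often need headroom beyond the formula.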

When to use: When you have a large, high-quality dataset (50,000+ examples) and hardware to match. Full fine-tuning produces the highest quality results but is impractical on consumer hardware for anything above 3B parameters.

Practical for: models up to roughly Phi-3 3.8B on an RTX 4090 (24 GB), and then only with memory savers like an 8-bit optimizer and gradient checkpointing. Anything larger needs multi-GPU setups or cloud instances.

LoRA (Low-Rank Adaptation)

LoRA freezes the original model weights and injects small trainable matrices (called adapters) into specific layers. Instead of updating 7 billion parameters, you train 1-10 million adapter parameters. The original model stays unchanged. The adapter stores only the delta.

Memory requirement: Roughly 1.1-1.3x the memory of the frozen base model. With the base kept in 16-bit, a 7B model needs about 16-18 GB to train. Quantizing the base drops this much further, but that combination is QLoRA, covered next; it’s also what makes the 70B range feasible on a single 80 GB GPU.

When to use: Most fine-tuning scenarios. LoRA achieves 90-95% of full fine-tuning quality with 10-50x less compute. The adapter file is small (10-100 MB) and can be swapped in/out at runtime.

Key parameters:

  • r (rank): Typical range 8-64. Higher rank = more capacity = more memory. Start with 16.
  • alpha: Scaling factor; common practice is 1-2x the rank.
  • target_modules: Which layers get adapters. For most models: q_proj, v_proj, k_proj, o_proj.
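
The parameter savings fall straight out of the matrix shapes. A minimal numpy sketch of the low-rank update W' = W + (alpha/r)·B·A for a single projection layer (the 4096 hidden size is illustrative, not tied to any particular model):

```python
import numpy as np

d = 4096           # hidden size of one q_proj-style weight (illustrative)
r, alpha = 16, 32  # rank and scaling, per the guidance above

W = np.random.randn(d, d).astype(np.float32)          # frozen base weight
A = np.random.randn(r, d).astype(np.float32) * 0.01   # trainable, r x d
B = np.zeros((d, r), dtype=np.float32)                # trainable, d x r, zero init

# Effective weight at inference: base plus scaled low-rank delta
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size           # 16,777,216
lora_params = A.size + B.size  # 131,072
print(f"trainable fraction: {lora_params / full_params:.4f}")  # -> 0.0078
```

Because B is initialized to zero, the delta starts at zero and training begins exactly at the base model’s behavior; only A and B receive gradients.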

QLoRA (Quantized LoRA)

QLoRA combines LoRA with 4-bit quantization of the base model. The frozen weights are quantized to 4-bit NormalFloat (NF4), while the LoRA adapters train in 16-bit. This dramatically reduces memory.

Memory requirement: A 7B model needs about 5-6 GB. A 13B model fits in 10-12 GB. A 70B model fits in 40-48 GB. QLoRA made fine-tuning 70B models possible on a single A100 80 GB.
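
Those figures follow from NF4’s roughly 0.5 bytes per frozen parameter plus 16-bit adapter training state. A rough estimator (the adapter size and overhead constant are assumptions, not measured values):

```python
def qlora_vram_gb(params_billion: float,
                  adapter_params_million: float = 20,  # assumed adapter size
                  overhead_gb: float = 1.5):           # assumed CUDA/activation overhead
    """Rough QLoRA footprint: NF4 base (~0.5 bytes/param) plus 16-bit
    adapters with gradients and Adam moments (~8 bytes/param)."""
    base = params_billion * 1e9 * 0.5 / 1e9
    adapters = adapter_params_million * 1e6 * 8 / 1e9
    return base + adapters + overhead_gb

print(round(qlora_vram_gb(7), 1))   # -> 5.2
print(round(qlora_vram_gb(70), 1))  # -> 36.7
```

Activation memory at longer sequence lengths pushes the 70B figure toward the 40-48 GB quoted above.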

When to use: When your GPU memory is the constraint. QLoRA is the most memory-efficient method that produces genuinely good results. The quality gap vs standard LoRA is small (typically 1-3% on benchmarks).

Paper: QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

Comparison: Methods at a Glance

| Method           | Trainable Params | VRAM (7B) | VRAM (70B) | Quality vs Full | Training Speed |
|------------------|------------------|-----------|------------|-----------------|----------------|
| Full Fine-Tuning | 100% (7B)        | 56 GB     | 560 GB     | Baseline        | Slowest        |
| LoRA (16-bit base) | 0.1-1% (7-70M) | 16 GB     | 160 GB     | 90-95%          | 3-5x faster    |
| QLoRA            | 0.1-1% (7-70M)   | 5 GB      | 48 GB      | 87-93%          | 4-6x faster    |

Tools That Work

Unsloth

Unsloth is the fastest path to fine-tuning on consumer hardware. It provides optimized training kernels that are 2-5x faster than standard implementations while using 60-70% less memory.

from unsloth import FastLanguageModel

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",  # 60% memory reduction
)

Supported models include Llama 3.x, Qwen 2.5, Mistral, Phi-3/4, Gemma 2, and DeepSeek. Free and open-source. Runs on a single GPU.

Hugging Face Transformers + PEFT + TRL

The standard open-source stack. More control, more boilerplate, wider community support.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Load training data (JSONL, one example per line)
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
    ),
)
trainer.train()

Axolotl

Axolotl wraps the Hugging Face stack with YAML-based configuration. Define your training run in a config file instead of writing Python.

base_model: Qwen/Qwen2.5-7B
model_type: AutoModelForCausalLM
load_in_4bit: true

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

datasets:
  - path: my_training_data.jsonl
    type: alpaca

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_bnb_8bit
bf16: true

Run with accelerate launch -m axolotl.cli.train config.yaml. Axolotl handles dataset loading, formatting, and distributed training setup.

LLaMA-Factory

LLaMA-Factory provides both a CLI and a web UI for fine-tuning. Good for iterating quickly without writing code.

# Install
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

# Launch web UI
llamafactory-cli webui

The web interface lets you select base model, dataset, LoRA parameters, and training hyperparameters through dropdown menus. Practical for experimentation when you want to test different configurations rapidly.

Cloud Options (When Local Isn’t Enough)

| Platform         | Cost (approx.)  | GPU                      | Best For                          |
|------------------|-----------------|--------------------------|-----------------------------------|
| RunPod           | $0.40-$2.00/hr  | A100 40/80 GB            | Short training runs, pay-per-hour |
| Lambda Labs      | $1.10/hr        | A100 80 GB               | Multi-GPU training                |
| Vast.ai          | $0.30-$1.50/hr  | Mixed (RTX 3090 to A100) | Budget training, spot pricing     |
| Google Colab Pro | $10/month       | T4/A100                  | Prototyping, small datasets       |
| Together AI      | $0.50-$5.00/hr  | A100 cluster             | Managed fine-tuning API           |

For a single QLoRA training run on a 7B model with 10,000 examples (3 epochs), expect 1-3 hours on an A100. Total cloud cost: roughly $2-$10.
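
That estimate is simple arithmetic over token throughput and the hourly rate. A sketch with assumed averages (tokens per example and throughput vary widely with sequence length and hardware):

```python
def training_cost(examples: int, epochs: int = 3, avg_tokens: int = 512,
                  tokens_per_sec: float = 2000, usd_per_hour: float = 1.50):
    """Estimate wall-clock hours and dollar cost for one supervised run."""
    total_tokens = examples * avg_tokens * epochs
    hours = total_tokens / tokens_per_sec / 3600
    return hours, hours * usd_per_hour

hours, usd = training_cost(10_000)
print(f"{hours:.1f} h, ${usd:.2f}")  # -> 2.1 h, $3.20 under these assumptions
```

The assumed 2,000 tokens/sec is a placeholder; measure your own throughput on a short run before budgeting a long one.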


Preparing Training Data

Dataset Format

Most fine-tuning tools expect one of two formats:

Instruction format (Alpaca-style):

{
  "instruction": "Write a SQL query to find users who signed up in the last 30 days",
  "input": "",
  "output": "SELECT * FROM users WHERE created_at >= NOW() - INTERVAL '30 days' ORDER BY created_at DESC;"
}

Chat format (OpenAI-style):

{
  "messages": [
    {"role": "system", "content": "You are a database expert."},
    {"role": "user", "content": "Write a SQL query to find users who signed up in the last 30 days"},
    {"role": "assistant", "content": "SELECT * FROM users WHERE created_at >= NOW() - INTERVAL '30 days' ORDER BY created_at DESC;"}
  ]
}

Store as JSONL (one JSON object per line). Most tools accept both formats.
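
Before training, it’s worth verifying that every line parses and carries the expected keys. A stdlib-only checker covering both formats above:

```python
import json

def validate_jsonl(path: str) -> int:
    """Check each line is valid JSON in Alpaca or chat format; return count."""
    count = 0
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            rec = json.loads(line)  # raises on malformed JSON
            if "messages" in rec:  # chat format
                assert all("role" in m and "content" in m
                           for m in rec["messages"]), f"bad message on line {lineno}"
            else:                  # instruction format
                assert {"instruction", "output"} <= rec.keys(), f"missing keys on line {lineno}"
            count += 1
    return count
```

Running this once before a multi-hour training job is far cheaper than discovering a malformed line 80% of the way through.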

Dataset Size Guidelines

Dataset Size Quality Impact Training Time (7B, QLoRA, A100) Use Case
100-500 Minimal style shift 5-15 minutes Quick experiments
1,000-5,000 Noticeable domain adaptation 30-90 minutes Specific task tuning
5,000-20,000 Strong domain specialization 2-6 hours Production fine-tuning
20,000-100,000 Deep domain expertise 6-24 hours Enterprise models
100,000+ Diminishing returns per example 1-5 days Only with high-quality, diverse data

Quality matters more than quantity. 1,000 carefully curated examples often outperform 50,000 noisy examples. Each example should demonstrate the exact behavior you want from the model.
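
A minimal curation pass in that spirit: drop exact duplicates and trivially short outputs before anything fancier (the length threshold is an arbitrary placeholder):

```python
def curate(examples, min_output_chars: int = 40):
    """Deduplicate on (instruction, input) and drop near-empty outputs."""
    seen, kept = set(), []
    for ex in examples:
        key = (ex["instruction"].strip(), ex.get("input", "").strip())
        if key in seen:
            continue  # exact duplicate prompt
        if len(ex["output"].strip()) < min_output_chars:
            continue  # output too short to teach anything
        seen.add(key)
        kept.append(ex)
    return kept

data = [
    {"instruction": "Explain LoRA", "input": "", "output": "LoRA freezes base weights and trains small low-rank adapter matrices instead."},
    {"instruction": "Explain LoRA", "input": "", "output": "duplicate prompt, dropped"},
    {"instruction": "Explain QLoRA", "input": "", "output": "short"},
]
print(len(curate(data)))  # -> 1
```

Real pipelines add near-duplicate detection and quality scoring on top, but even this filter catches the most common dataset defects.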

Data Sources

Your own data: Internal documentation, code repositories, support transcripts, API logs. Clean it, format it, use it. This is the highest-value training data because it captures your specific domain.

Open datasets: Hugging Face hosts thousands of instruction-tuning datasets; well-known examples include Alpaca, databricks-dolly-15k, and OpenAssistant (oasst1). Mixing a small slice of general instruction data into your domain set can help the model retain broad instruction-following.

Synthetic data generation: Use a stronger model (GPT-4, Claude) to generate training examples for a weaker model. This is a common and effective pattern. Prompt the strong model with your domain context and have it generate input-output pairs in your desired format.
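
The generation loop is model-agnostic. In this sketch, `call_strong_model` is a hypothetical stand-in for whatever API you use (OpenAI, Anthropic, or otherwise), and the prompt requests one JSON pair per call:

```python
import json

PROMPT = """You are generating fine-tuning data for a domain-specific model.
Produce ONE training example as a JSON object with keys "instruction", "input", "output".
Domain context: {context}
Respond with JSON only."""

def generate_examples(call_strong_model, context: str, n: int = 100):
    """call_strong_model: callable(str) -> str, a hypothetical stand-in for
    a GPT-4/Claude API call. Malformed responses are skipped, not retried."""
    out = []
    for _ in range(n):
        raw = call_strong_model(PROMPT.format(context=context))
        try:
            ex = json.loads(raw)
            if {"instruction", "input", "output"} <= ex.keys():
                out.append(ex)
        except json.JSONDecodeError:
            continue
    return out
```

Deduplicate and spot-check the synthetic output with the same rigor as hand-written data; strong models produce repetitive examples when prompted identically in a loop.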


Hardware vs Training Outcome

Here’s what different hardware setups can realistically train, how long it takes, and what quality to expect.

Consumer Hardware

| GPU               | VRAM         | Largest Model (QLoRA) | Training Speed | Example Run                          |
|-------------------|--------------|-----------------------|----------------|--------------------------------------|
| RTX 3060          | 12 GB        | 7B                    | ~3 tokens/sec  | 5,000 examples, 3 epochs: 3-4 hours  |
| RTX 3090          | 24 GB        | 13B                   | ~8 tokens/sec  | 5,000 examples, 3 epochs: 1.5-2 hours |
| RTX 4090          | 24 GB        | 13B                   | ~15 tokens/sec | 5,000 examples, 3 epochs: 45-60 min  |
| M3 Max (48 GB)    | 48 GB shared | 32B                   | ~5 tokens/sec  | 5,000 examples, 3 epochs: 4-6 hours  |
| M3 Ultra (96 GB)  | 96 GB shared | 70B                   | ~3 tokens/sec  | 5,000 examples, 3 epochs: 10-15 hours |

Cloud Hardware

| GPU          | VRAM   | Largest Model (QLoRA) | Training Speed | Cost per Run (5K examples) |
|--------------|--------|-----------------------|----------------|----------------------------|
| A100 40 GB   | 40 GB  | 32B                   | ~20 tokens/sec | $2-$5                      |
| A100 80 GB   | 80 GB  | 70B                   | ~15 tokens/sec | $5-$15                     |
| H100 80 GB   | 80 GB  | 70B                   | ~30 tokens/sec | $8-$20                     |
| 4x A100 80 GB | 320 GB | 70B (full/LoRA)      | ~50 tokens/sec | $15-$40                    |

Quality After Fine-Tuning

The quality improvement from fine-tuning depends on the gap between what the base model knows and what you need. General observations:

Format compliance: Fine-tuning is extremely effective at teaching models to produce output in a specific format (JSON schemas, API responses, structured reports). Expect 40-60% improvement in format adherence with 1,000-2,000 examples.

Domain terminology: Models learn specialized vocabulary quickly. Medical terms, legal language, internal product names. 500-1,000 examples usually suffice.

Reasoning patterns: Harder to learn. Teaching a 7B model to reason about your domain like a 70B model isn’t realistic. Fine-tuning helps the model apply its existing reasoning to your domain, but doesn’t fundamentally improve reasoning capacity.

Factual knowledge: Fine-tuning can inject specific facts, but the model may hallucinate or confuse them. For factual accuracy, retrieval-augmented generation (RAG) is more reliable than fine-tuning.


A Complete Training Example

Here’s an end-to-end example: fine-tuning Qwen 2.5 7B to generate API documentation from function signatures.

1. Prepare Data

# generate_training_data.py
import json

examples = [
    {
        "instruction": "Generate API documentation for this function",
        "input": "def get_user(user_id: str) -> User:",
        "output": "## GET /users/{user_id}\n\nRetrieves a user by their unique identifier.\n\n**Parameters:**\n- `user_id` (string, required): The unique identifier of the user.\n\n**Returns:** User object\n\n**Status Codes:**\n- 200: User found\n- 404: User not found"
    },
    # ... 2000+ similar examples
]

with open("api_docs_training.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

2. Train with Unsloth

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-7B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("json", data_files="api_docs_training.jsonl", split="train")

# Collapse instruction/input/output into the single text field SFTTrainer trains on
def to_text(ex):
    return {"text": f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Input:\n{ex['input']}\n\n"
                    f"### Response:\n{ex['output']}"}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="outputs",
        bf16=True,
        logging_steps=10,
    ),
)

trainer.train()
model.save_pretrained("api-docs-qwen-7b-lora")

3. Export for Ollama

# Save as GGUF for Ollama
model.save_pretrained_gguf("api-docs-qwen-7b", tokenizer, quantization_method="q4_k_m")

Create a Modelfile:

FROM ./api-docs-qwen-7b-Q4_K_M.gguf

PARAMETER temperature 0.3
PARAMETER top_p 0.9

SYSTEM "You are an API documentation generator. Given a function signature, produce comprehensive API documentation in markdown format."

Then build and run it:

ollama create api-docs-model -f Modelfile
ollama run api-docs-model

Your fine-tuned model is now running locally through Ollama, ready for integration into your development workflow.
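
From here you can call the model programmatically through Ollama’s HTTP API (the /api/generate endpoint on port 11434). A stdlib-only sketch; the model name matches the ollama create command above:

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "api-docs-model",
                           base_url: str = "http://localhost:11434"):
    """Build the POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str) -> str:
    """Send the request and return the model's text (requires Ollama running)."""
    with urllib.request.urlopen(build_generate_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example call (needs the Ollama server up):
# print(generate("def delete_user(user_id: str) -> None:"))
```

Setting "stream": False returns the full response in one JSON object; omit it to receive newline-delimited streaming chunks instead.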


When Not to Fine-Tune

Fine-tuning isn’t always the answer. Consider alternatives first:

Prompt engineering: If 10 well-crafted examples in the system prompt get you 80% of the way, the remaining 20% might not justify a training run. Few-shot prompting is free, instant, and doesn’t lock you to a specific model version.
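
Assembling those in-prompt examples is mechanical. A sketch that builds an OpenAI-style message list from a handful of demonstrations (the SQL examples are illustrative):

```python
def few_shot_messages(system: str, demos, query: str):
    """demos: list of (user, assistant) pairs embedded as prior turns."""
    messages = [{"role": "system", "content": system}]
    for user_msg, assistant_msg in demos:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": query})
    return messages

msgs = few_shot_messages(
    "You answer with a single SQL statement.",
    [("Users created today?",
      "SELECT * FROM users WHERE created_at::date = CURRENT_DATE;")],
    "Users created this week?",
)
print(len(msgs))  # -> 4
```

If this pattern gets you the behavior you need, you’ve saved a training run; if the demos stop fitting in context or the model drifts between them, that’s the signal fine-tuning may pay off.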

RAG (Retrieval-Augmented Generation): If the issue is factual accuracy or access to specific documents, retrieval is more reliable than baking knowledge into model weights. Fine-tuning teaches patterns. RAG provides facts.

Larger base model: Before fine-tuning a 7B model, try the 32B or 70B variant of the same family. The larger model’s general capabilities often already cover what you would otherwise have to teach the smaller one.

Fine-tuning makes sense when you need consistent stylistic or format compliance, domain-specific terminology, or task-specific behavior patterns that prompting alone can’t achieve.


The tooling for local fine-tuning matured rapidly through 2025. QLoRA on a consumer GPU, Unsloth for speed, Axolotl for configuration management. The barrier is no longer hardware or software. It’s having clean, representative training data that captures exactly what you want the model to learn.