
Fine-Tuning and LoRA

Intermediate

Fine-tuning adapts a pre-trained model to specific tasks, domains, or output styles. LoRA and other PEFT (parameter-efficient fine-tuning) methods make this practical on consumer hardware by training only a tiny fraction of the parameters.

Key Facts

  • RAG for adding knowledge, fine-tuning for changing behavior/style - often combined in production
  • Before fine-tuning, establish baselines: zero-shot, few-shot, RAG performance
  • Around 100 high-quality examples often beat 10,000 noisy ones (quality matters more than quantity)
  • LoRA trains 0.1-1% of total parameters, reducing GPU memory by 4-8x
  • QLoRA combines LoRA with 4-bit quantization: 7B model fine-tuning on ~6GB VRAM

When to Fine-Tune vs RAG

Approach | Best For | Not For
RAG | Domain knowledge, frequently updated data | Changing model behavior/style
Fine-tuning | Behavior, output format, domain adaptation | Real-time knowledge updates
Both | Complex production systems needing both | -

OpenAI Fine-Tuning

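from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Prepare JSONL training data
# Each line: {"messages": [{"role": "system",...}, {"role": "user",...}, {"role": "assistant",...}]}

# 2. Upload training file (opened in binary mode)
file = client.files.create(file=open("training.jsonl", "rb"), purpose="fine-tune")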

# 3. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

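# 4. Use the fine-tuned model once the job has succeeded
# (the real ID is reported in the job's fine_tuned_model field; this one is a placeholder)
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org::abc123",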
    messages=[...]
)

Data requirements: minimum 10 examples (50-100+ recommended), diverse, consistent format.
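A quick sanity check before uploading can catch format problems early. The sketch below is a minimal, standard-library validator for the chat format described above; the function name `validate_chat_jsonl` and the specific checks are illustrative assumptions, not part of the OpenAI SDK, and it only covers plain chat examples (no tool calls).

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}
MIN_EXAMPLES = 10  # minimum stated above


def validate_chat_jsonl(lines):
    """Return a list of human-readable problems found in JSONL lines."""
    problems = []
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            example = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        count += 1
        messages = example.get("messages") if isinstance(example, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {lineno}: missing or empty 'messages' list")
            continue
        malformed = [
            m for m in messages
            if not isinstance(m, dict)
            or m.get("role") not in ALLOWED_ROLES
            or not isinstance(m.get("content"), str)
        ]
        if malformed:
            problems.append(f"line {lineno}: message with bad role or content")
        elif messages[-1]["role"] != "assistant":
            problems.append(f"line {lineno}: last message is not from the assistant")
    if count < MIN_EXAMPLES:
        problems.append(f"only {count} examples; at least {MIN_EXAMPLES} are required")
    return problems


# Usage: in practice, pass an open file handle, e.g. open("training.jsonl").
sample = json.dumps({"messages": [
    {"role": "user", "content": "Price for item X?"},
    {"role": "assistant", "content": "$42.99"},
]})
print(validate_chat_jsonl([sample] * 10))  # toy in-memory input
```

An empty list means no problems were found by these particular checks; it does not guarantee the file will be accepted.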

LoRA (Low-Rank Adaptation)

Full fine-tuning updates ALL parameters. For a 7B model, that's 7 billion weights requiring massive GPU memory. LoRA decomposes weight updates into two small matrices:

W_new = W_original + A * B

W_original: frozen (e.g., 4096 x 4096)
A: trainable (e.g., 4096 x 16) - rank=16
B: trainable (e.g., 16 x 4096)

Result: ~131K trainable parameters per adapted weight matrix (4096 x 16 + 16 x 4096) instead of ~16.8M (4096 x 4096). That is over 99% fewer parameters.

Rank (r): controls expressiveness. Typical: 8, 16, 32, 64. Higher = more capacity, more memory.
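The parameter arithmetic for a single adapted matrix can be checked directly. This is a minimal sketch using the 4096 x 4096 example shape from above; the helper name `lora_param_counts` is made up for illustration.

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in          # every entry of W is trainable
    lora = d_out * r + r * d_in  # A is (d_out x r), B is (r x d_in)
    return full, lora


for r in (8, 16, 32, 64):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {lora:>7,} LoRA params vs {full:,} full ({lora / full:.2%})")
```

Doubling the rank doubles the adapter size, which is why memory grows with r while staying far below the full matrix.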

LoRA with HuggingFace PEFT

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints roughly (exact figures depend on the model and PEFT version):
# trainable params: 6,815,744 || all params: 8,037,076,992 || trainable%: 0.0848

Target Modules

  • q_proj, v_proj (attention queries/values) - most common, good default
  • k_proj (attention keys) - added for more expressiveness
  • o_proj (attention output)
  • gate_proj, up_proj, down_proj (FFN) - for deeper adaptation
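To see how the choice of target modules changes the adapter size, the sketch below adds up LoRA parameters per projection. The layer shapes and layer count are assumptions based on the published Llama-3.1-8B architecture (hidden size 4096, grouped-query attention with 1024-wide key/value projections, FFN width 14336, 32 layers); check them against your model's config before relying on the numbers.

```python
# (in_features, out_features) per projection; ASSUMED Llama-3.1-8B shapes.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def lora_trainable_params(target_modules, r):
    """Total LoRA parameters when adapting the given modules in every layer."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * NUM_LAYERS


attention = ["q_proj", "k_proj", "v_proj", "o_proj"]
ffn = ["gate_proj", "up_proj", "down_proj"]
for label, modules in [
    ("q_proj + v_proj", ["q_proj", "v_proj"]),
    ("all attention", attention),
    ("attention + FFN", attention + ffn),
]:
    print(f"{label:<16} r=16 -> {lora_trainable_params(modules, 16):,} trainable params")
```

Under these assumptions, adding the FFN projections increases the adapter size about sixfold compared with q_proj and v_proj alone.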

Training Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)
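The batch size the optimizer actually sees is the product of the per-device batch size, the accumulation steps, and the number of devices. The sketch below works this out for the values above; the helper names and the 1,000-example dataset are hypothetical, and the step count is an approximation since the Trainer's exact rounding may differ.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Examples contributing to each optimizer update."""
    return per_device * grad_accum * num_devices


def approx_optimizer_steps(num_examples, per_device, grad_accum, epochs, num_devices=1):
    """Rough number of optimizer updates over the whole run."""
    batch = effective_batch_size(per_device, grad_accum, num_devices)
    return math.ceil(num_examples / batch) * epochs


print(effective_batch_size(4, 4))               # config above, single GPU
print(approx_optimizer_steps(1000, 4, 4, 3))    # hypothetical 1,000-example dataset
```

Knowing the approximate step count helps when choosing logging_steps and interpreting the learning rate schedule.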

QLoRA (Quantized LoRA)

Combines LoRA with quantization:

  1. Quantize the base model to 4-bit (NF4 format)
  2. Add LoRA adapters in 16-bit precision (BF16 or FP16)
  3. Train only the LoRA parameters; the quantized base weights stay frozen

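Memory savings: a 7B model needs ~28GB just to hold its weights in FP32 (full fine-tuning needs more again for gradients and optimizer state), while QLoRA fits in ~6GB. This enables fine-tuning on consumer GPUs.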
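The weight-memory part of these figures is simple bytes-per-parameter arithmetic. The sketch below covers the weights only; adapters, optimizer state, activations, and framework overhead are deliberately not modelled, which is why a practical QLoRA run needs more than the bare 4-bit figure. The function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("FP32", 32), ("BF16/FP16", 16), ("8-bit", 8), ("4-bit NF4", 4)]:
    print(f"{label:>9}: {weight_memory_gb(7e9, bits):.1f} GB")
```

For a 7B model this gives 28 GB at FP32 and 3.5 GB at 4-bit, which lines up with the ~28GB and ~6GB figures once training overhead is added to the 4-bit case.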

QLoRA Training with SFTTrainer

from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
import torch

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model_name = "meta-llama/Llama-3.1-8B"  # example base model; substitute your own
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")

sft_config = SFTConfig(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
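    bf16=True,  # matches bnb_4bit_compute_dtype; on GPUs without bfloat16 (e.g. T4) use fp16=True and torch.float16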
    logging_steps=10,
    report_to="wandb",  # Weights & Biases integration
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,  # a datasets.Dataset in chat ("messages") format, loaded beforehand
    peft_config=lora_config,
    processing_class=tokenizer
)
trainer.train()

The TRL library (Transformer Reinforcement Learning) provides SFTTrainer, which wraps the HuggingFace Trainer with fine-tuning-specific features: chat template handling, dataset formatting, and LoRA integration.

GPU Selection for QLoRA

GPU | VRAM | QLoRA 7B | QLoRA 13B | Notes
T4 | 16GB | Comfortable | Tight | Free tier on Colab
A100 40GB | 40GB | Fast | Comfortable | ~$1/hr on Colab Pro
RTX 4090 | 24GB | Good | Possible with small batch | Consumer option

For T4: reduce per_device_train_batch_size to 1-2 and increase gradient_accumulation_steps.
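When shrinking the per-device batch to fit in memory, the accumulation steps can be raised so that the effective batch size stays the same. A minimal sketch, assuming a single GPU and the 4 x 4 configuration used earlier; the helper name is hypothetical.

```python
import math


def accumulation_steps(target_effective_batch, per_device_batch, num_devices=1):
    """Smallest gradient_accumulation_steps that reaches the target effective batch."""
    return math.ceil(target_effective_batch / (per_device_batch * num_devices))


target = 4 * 4  # effective batch size of the earlier configuration
for per_device in (4, 2, 1):
    steps = accumulation_steps(target, per_device)
    print(f"per_device_train_batch_size={per_device}, gradient_accumulation_steps={steps}")
```

Training takes more forward and backward passes per update this way, but the optimizer sees the same number of examples per step, so the learning rate usually does not need retuning.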

Monitoring with Weights & Biases

import wandb
wandb.login()  # or set WANDB_API_KEY environment variable

# In TrainingArguments / SFTConfig:
# report_to="wandb"
# run_name="qlora-llama3-experiment-1"

W&B automatically tracks: loss curves, learning rate schedule, GPU utilization, gradient norms. For OpenAI fine-tuning: link your W&B API key in the OpenAI dashboard Settings -> Integrations section to visualize training progress.

PEFT Methods Comparison

Method | Approach | Trainable Params
LoRA | Low-rank weight update decomposition | 0.1-1%
QLoRA | LoRA + 4-bit quantization | 0.1-1%
Prefix Tuning | Trainable prefix tokens per layer | Very small
Prompt Tuning | Trainable soft prompt vectors | Tiny
Adapter Layers | Small trainable layers between frozen layers | 1-5%

LoRA Hyperparameter Tuning

Rank (r)

Controls the capacity of LoRA adaptation. Higher rank = more expressiveness but more memory and risk of overfitting.

Rank | Use Case | Notes
4-8 | Simple task adaptation (classification, formatting) | Start here
16 | General-purpose fine-tuning | Good default
32-64 | Complex domain adaptation, multi-task | When lower ranks underperform
128+ | Rarely needed | Diminishing returns

Alpha (lora_alpha)

Scaling factor for LoRA updates. Effective scaling = alpha / r. Common practice: set alpha = 2 * r (e.g., r=16, alpha=32).
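The scaling rule is easy to check numerically. This sketch assumes the standard alpha / r scaling described above (not rank-stabilized variants) and shows why raising the rank without raising alpha weakens the update.

```python
def lora_scaling(lora_alpha, r):
    """Multiplier applied to the low-rank update under standard LoRA scaling."""
    return lora_alpha / r


for alpha, r in [(32, 16), (64, 32), (32, 32), (16, 16)]:
    print(f"alpha={alpha:>2}, r={r:>2} -> scaling {lora_scaling(alpha, r):.1f}")
```

Keeping alpha = 2 * r holds the scaling at 2.0 as the rank changes, so a rank sweep does not silently change the update magnitude at the same time.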

Dropout (lora_dropout)

Regularization for LoRA layers. Typical: 0.05-0.1. Helps prevent overfitting on small datasets.

Target Module Selection Strategy

  • Minimum viable: q_proj, v_proj - attention queries and values only. Good starting point
  • More capacity: add k_proj, o_proj - full attention adaptation
  • Maximum: add gate_proj, up_proj, down_proj - adapts FFN layers too. Use when attention-only is insufficient

Quantization Impact on Fine-Tuning Quality

Quantization reduces model precision to save memory, but impacts performance differently across tasks:

  • Unquantized (FP32, or BF16 half precision): baseline quality, highest memory
  • 8-bit quantization: typically ~1-2% quality drop, 2x less weight memory than 16-bit
  • 4-bit (NF4/GPTQ): typically ~3-5% quality drop, 4x less weight memory than 16-bit
  • Impact varies by task: simple classification barely affected, complex reasoning shows more degradation

Run your evaluation suite across quantization levels to find the sweet spot:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # nested quantization saves extra memory
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config
)
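Comparing precision levels on your own task only needs a small harness that scores each variant on the same evaluation set. The sketch below uses exact-match accuracy and toy stand-in predictors; in real use each predictor would wrap generation from a model loaded at that precision. The names, the evaluation set, and the predictors are all hypothetical and do not represent measured results.

```python
def accuracy(predict, eval_set):
    """Fraction of prompts whose prediction exactly matches the expected answer."""
    correct = sum(predict(prompt).strip() == expected for prompt, expected in eval_set)
    return correct / len(eval_set)


def compare_variants(variants, eval_set):
    """Score every named model variant on the same evaluation set."""
    return {name: accuracy(predict, eval_set) for name, predict in variants.items()}


# Toy evaluation set and stand-in predictors (NOT real model outputs).
eval_set = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]
answers = dict(eval_set)
variants = {
    "bf16": lambda prompt: answers[prompt],                              # stand-in: always right
    "nf4": lambda prompt: "?" if prompt == "8/2" else answers[prompt],  # stand-in: one miss
}
print(compare_variants(variants, eval_set))
```

Swap exact match for whatever metric fits the task (F1, BLEU, a rubric score) and keep the evaluation set fixed across variants so the comparison is fair.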

Dataset Preparation for Fine-Tuning

Finding and Crafting Datasets

  • Existing datasets: HuggingFace Hub, Kaggle. Filter for task-specific quality
  • Synthetic generation: use a stronger model (GPT-4) to generate training data for a smaller model
  • Manual curation: highest quality, most expensive. Even 100 expert-curated examples can be transformative
  • Format consistency: every example must follow the exact format expected during inference

Conversation Format

{"messages": [
  {"role": "system", "content": "You are a pricing assistant..."},
  {"role": "user", "content": "Price for item X?"},
  {"role": "assistant", "content": "$42.99"}
]}

The system prompt should be identical across all training examples and match the system prompt used at inference time.
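This consistency requirement can be checked mechanically before training. A minimal sketch over examples in the conversation format above; the function names and the sample prompts are illustrative assumptions.

```python
def distinct_system_prompts(examples):
    """Collect the system prompts in use; None marks examples without one."""
    found = set()
    for example in examples:
        prompts = [m["content"] for m in example["messages"] if m["role"] == "system"]
        found.add(prompts[0] if prompts else None)
    return found


def system_prompt_is_consistent(examples, inference_prompt):
    """True only if every example uses exactly the inference-time system prompt."""
    return distinct_system_prompts(examples) == {inference_prompt}


PROMPT = "You are a pricing assistant..."
examples = [
    {"messages": [
        {"role": "system", "content": PROMPT},
        {"role": "user", "content": "Price for item X?"},
        {"role": "assistant", "content": "$42.99"},
    ]},
    {"messages": [  # no system message: this one breaks consistency
        {"role": "user", "content": "Price for item Y?"},
        {"role": "assistant", "content": "$10.00"},
    ]},
]
print(system_prompt_is_consistent(examples, PROMPT))
print(distinct_system_prompts(examples))
```

Printing the distinct set makes it easy to spot stray variants such as trailing whitespace or a missing system message.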

Data Quality Guidelines

  • Each example should demonstrate the exact behavior you want
  • Remove duplicates, contradictions, low-quality samples
  • Hold out 10-20% as test set
  • Measure task-specific metrics (accuracy, BLEU, F1)
  • Compare against baseline to verify improvement
  • Check for overfitting (training metric improves but test doesn't)
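Two of these guidelines, removing exact duplicates and holding out a test set, can be sketched with the standard library alone. The function names, the 15% split, and the fixed seed are illustrative choices; near-duplicates and contradictions need task-specific checks that are not shown here.

```python
import json
import random


def dedupe(examples):
    """Drop exact duplicates while preserving the original order."""
    seen, unique = set(), []
    for example in examples:
        key = json.dumps(example, sort_keys=True)  # canonical form for comparison
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def train_test_split(examples, test_fraction=0.15, seed=0):
    """Shuffle reproducibly, then hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


raw = [{"messages": [{"role": "user", "content": f"q{i % 100}"},
                     {"role": "assistant", "content": f"a{i % 100}"}]}
       for i in range(120)]  # synthetic data: 100 unique examples plus 20 repeats
clean = dedupe(raw)
train, test = train_test_split(clean)
print(len(raw), len(clean), len(train), len(test))
```

Deduplicating before splitting matters: otherwise the same example can land in both sets and make test metrics look better than they are.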

Gotchas

  • Fine-tuning on small datasets risks overfitting - always validate on held-out set
  • Fine-tuned models inherit the base model's limitations (hallucination, reasoning failures)
  • Multiple LoRA adapters can be merged or composed, but quality may degrade
  • Hyperparameter tuning (rank, learning rate, epochs) significantly affects results
  • Fine-tuned model quality degrades if training data format doesn't match inference format
  • Always measure: sometimes prompt engineering + RAG outperforms fine-tuning
  • Quantization affects tasks unevenly - always benchmark YOUR specific task at each precision level, not just overall perplexity
  • lora_alpha/r ratio matters more than absolute values - alpha=32/r=16 and alpha=64/r=32 behave similarly
  • System prompt mismatch - if training uses a system prompt but inference doesn't (or vice versa), quality drops significantly
  • Double quantization (bnb_4bit_use_double_quant=True) provides additional memory savings with minimal quality impact

See Also