White Paper

Fine-Tuning with Claude and Unsloth: QLoRA for AI Engineers

Jake McCluskey

Source topic: "Fine-tuning" called out as a core 2026 skill in the AI Engineer Roadmap.

Stack: Unsloth + QLoRA + HuggingFace, with Claude comparison guidance.

Start here: do you actually need to fine-tune?

Most of the time, no. Try these first, in order:

  1. Better prompting: examples, role, output format
  2. Prompt caching: if you're reusing long context
  3. Retrieval (RAG): if you need domain knowledge
  4. Tool use: if you need specific actions
  5. Fine-tuning: only if the above fail and you have 500+ high-quality examples

When fine-tuning actually wins:

  • Consistent output format or style that prompting can't reliably enforce
  • Domain-specific language the base model struggles with (legal, medical, rare code)
  • Cost: a small fine-tuned open model is cheaper at scale than Claude per call
  • Latency: a self-hosted 7B beats any API on p50 latency

When Claude (no fine-tune) wins:

  • Reasoning-heavy tasks
  • Broad-domain knowledge
  • You can't gather 500+ training examples
  • You need state-of-the-art quality

The fast path: Unsloth + QLoRA on Llama 3.1 8B

Unsloth is a wrapper around the HuggingFace training stack that makes fine-tuning 2-5x faster with less memory. QLoRA quantizes the frozen base model to 4-bit and trains small LoRA adapters on top: you update a tiny fraction of the weights instead of the whole model.
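
To see why that's cheap, here's a back-of-the-envelope sketch of one layer's LoRA update (illustrative dimensions; the zero-init of B and the alpha/r scaling follow the standard LoRA/PEFT convention):

import torch

d_out, d_in, r = 4096, 4096, 16   # e.g. q_proj in Llama 3.1 8B, rank 16
W = torch.randn(d_out, d_in)      # frozen base weight — never updated
A = torch.randn(r, d_in)          # LoRA A — trained
B = torch.zeros(d_out, r)         # LoRA B — trained, starts at zero
# Effective weight at inference: W + (lora_alpha / r) * (B @ A)
trainable = A.numel() + B.numel()
print(f"{trainable:,} trainable vs {W.numel():,} frozen ({trainable / W.numel():.2%})")
# → 131,072 trainable vs 16,777,216 frozen (0.78%)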

1. Install (Colab free tier works, T4 GPU)

pip install unsloth
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

2. Load model in 4-bit

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters — only these weights train
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                              # rank
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
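
A quick sanity check that only the adapters will train; print_trainable_parameters comes from PEFT, which Unsloth builds on, so it should be available here:

model.print_trainable_parameters()
# roughly: trainable params: ~42M || all params: ~8B || trainable%: ~0.5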

3. Prepare data: instruction format

Your dataset needs to be (instruction, input, output) triples. For a support-ticket classifier:

from datasets import Dataset

raw = [
    {"instruction": "Classify this support ticket into: billing, technical, general.",
     "input": "My card was charged twice for the same order.",
     "output": "billing"},
    # ... 500+ more
]

prompt_template = """Below is an instruction describing a task. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS = tokenizer.eos_token

def format_example(ex):
    return {"text": prompt_template.format(**ex) + EOS}

ds = Dataset.from_list(raw).map(format_example)
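
While you're here, carve out the held-out eval split that the evaluation section below relies on (train_test_split is built into datasets):

split = ds.train_test_split(test_size=0.1, seed=3407)   # hold out 10% for eval
train_ds, test_ds = split["train"], split["test"]

If you use the split, pass train_ds as train_dataset in the next step.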

4. Train

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,     # effective batch = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="./lora_out",
    ),
)

trainer.train()

On a T4 with 500 examples, this finishes in about 15 minutes.
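
To confirm the run actually fit in the T4's ~16 GB, torch tracks peak usage:

import torch
print(f"Peak reserved VRAM: {torch.cuda.max_memory_reserved() / 1024**3:.1f} GB")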

5. Inference

FastLanguageModel.for_inference(model)   # 2x faster inference mode

prompt = prompt_template.format(
    instruction="Classify this support ticket into: billing, technical, general.",
    input="The app crashes when I try to export my data as CSV.",
    output="",
)

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)   # greedy decoding
# Decode only the new tokens, not the echoed prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# → "technical"

6. Save and share: to HuggingFace or local

# Local (just the LoRA adapter — ~40MB)
model.save_pretrained("my_ticket_classifier_lora")
tokenizer.save_pretrained("my_ticket_classifier_lora")

# To HF Hub
model.push_to_hub("yourname/llama-ticket-classifier", token="hf_...")

# Or: merge LoRA into base model and export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("my_ticket_classifier", tokenizer, quantization_method="q4_k_m")
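
To reload the adapter later, point from_pretrained at the saved directory (a sketch; Unsloth resolves the base model from the adapter config):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="my_ticket_classifier_lora",   # local adapter dir saved above
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)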

The GGUF file drops directly into Ollama:

ollama create ticket-classifier -f Modelfile
# Modelfile contains: FROM ./my_ticket_classifier.Q4_K_M.gguf
ollama run ticket-classifier
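
The one-line Modelfile above is the minimum. A slightly fuller sketch bakes in the system prompt and the training template (TEMPLATE and SYSTEM are standard Modelfile directives; the field mapping here is an assumption, not Ollama's default):

FROM ./my_ticket_classifier.Q4_K_M.gguf
PARAMETER temperature 0
SYSTEM Classify this support ticket into: billing, technical, general.
TEMPLATE """Below is an instruction describing a task. Write a response.

### Instruction:
{{ .System }}

### Input:
{{ .Prompt }}

### Response:
"""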

Data quality matters more than data quantity

500 excellent examples beat 50,000 noisy ones. The model learns every bad label it sees.

Rules:

  • Each output must be what you'd accept if a human wrote it
  • Diversity of inputs beats volume of near-duplicates
  • Hold out 10% for eval, never train on it
  • Deduplicate: exact and near-duplicates both poison generalization (see the sketch after this list)
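
A minimal dedup pass using only the standard library (difflib compares all pairs, which is fine at 500-example scale but slow beyond a few thousand):

import difflib

def dedup(examples, threshold=0.9):
    """Drop exact and near-duplicate inputs, keeping the first occurrence."""
    kept, seen = [], []
    for ex in examples:
        text = ex["input"].lower().strip()
        if any(difflib.SequenceMatcher(None, text, s).ratio() > threshold for s in seen):
            continue
        kept.append(ex)
        seen.append(text)
    return kept

raw = dedup(raw)   # run before formatting and splitting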

Pro tip: use Claude to generate or clean your training set:

import json
import anthropic

claude = anthropic.Anthropic()

def generate_training_pair(domain: str, label: str):
    resp = claude.messages.create(
        model="claude-opus-4-7", max_tokens=500,
        messages=[{"role": "user", "content": (
            f"Generate a realistic {domain} example that should be classified as '{label}'. "
            f"Return JSON: {{\"input\": \"...\"}}"
        )}],
    )
    return json.loads(resp.content[0].text)

# Generate 100 examples per class, then human-review in a spreadsheet

Evaluation: the part beginners skip

Fine-tuning always looks like it worked: the loss went down. That doesn't mean your model is good.

test_examples = [ ... ]   # 10% held out, never trained on

def generate(prompt: str) -> str:
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

correct = 0
for ex in test_examples:
    # Don't pass **ex directly: it already contains "output", which must stay blank here
    prompt = prompt_template.format(
        instruction=ex["instruction"], input=ex["input"], output=""
    )
    pred = generate(prompt)
    if pred.strip() == ex["output"].strip():
        correct += 1

print(f"Accuracy: {correct / len(test_examples):.2%}")

Compare against:

  • Base Llama 3.1 8B (no fine-tune) with a good prompt
  • Claude Sonnet 4.6 with the same prompt (see the sketch below)
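
A sketch of the Claude baseline over the same held-out set (reuses the anthropic client from above; the model ID is a placeholder for whatever Sonnet is current):

def claude_predict(ex) -> str:
    resp = claude.messages.create(
        model="claude-sonnet-4-6",   # placeholder ID
        max_tokens=8,
        messages=[{"role": "user", "content": (
            f"{ex['instruction']}\n\n{ex['input']}\n\nReply with only the label."
        )}],
    )
    return resp.content[0].text

baseline = sum(
    claude_predict(ex).strip().lower() == ex["output"].strip().lower()
    for ex in test_examples
)
print(f"Claude zero-shot: {baseline / len(test_examples):.2%}")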

If Claude wins with zero examples, fine-tuning wasn't worth it for THIS task. Put your energy into prompt engineering instead.

Cost comparison at scale (rough 2026 numbers)

Approach                                    Per 1K calls          Latency   Quality
Claude Opus 4.7 API                         ~$15                  2-5s      Highest
Claude Sonnet 4.6 API                       ~$3                   1-2s      Very high
Claude Haiku 4.5 API                        ~$0.25                <1s       High for simple tasks
Fine-tuned Llama 8B (self-host)             ~$0.05 (GPU amort.)   <500ms    Task-specific, brittle
Fine-tuned Llama 8B (Replicate/Together)    ~$0.20                1s        Same as self-host

Fine-tuning wins on cost plus latency for narrow, high-volume tasks. Claude wins on everything else.

Resume angle

"Shipped a fine-tuned Llama 3.1 8B for ticket classification using Unsloth + QLoRA: 500 curated examples, LoRA adapters merged to GGUF, deployed via Ollama. Matched Claude Sonnet accuracy on this task at 1/15th the per-call cost, with an honest eval showing where the fine-tune lost (open-ended reasoning) vs. won (consistent-format classification)."