Fine-Tuning with Claude and Unsloth: QLoRA for AI Engineers

Source topic: "Fine-tuning" called out as a core 2026 skill in the AI Engineer Roadmap.
Stack: Unsloth + QLoRA + HuggingFace, with Claude comparison guidance.
Start here: do you actually need to fine-tune?
Most of the time, no. Try these first, in order:
- Better prompting: examples, role, output format
- Prompt caching: if you're reusing long context
- Retrieval (RAG): if you need domain knowledge
- Tool use: if you need specific actions
- Fine-tuning: only if the above fail and you have 500+ high-quality examples
When fine-tuning actually wins:
- Consistent output format or style that prompting can't reliably enforce
- Domain-specific language the base model struggles with (legal, medical, rare code)
- Cost: a small fine-tuned open model is cheaper at scale than Claude per call
- Latency: a self-hosted 7-8B model typically beats API round-trips on p50 latency
When Claude (no fine-tune) wins:
- Reasoning-heavy tasks
- Broad-domain knowledge
- You can't gather 500+ training examples
- You need state-of-the-art quality
The fast path: Unsloth + QLoRA on Llama 3.1 8B
Unsloth is a library that wraps the HuggingFace training stack and makes fine-tuning 2-5x faster with less memory. QLoRA is LoRA applied to a 4-bit-quantized base model: instead of updating all 8B weights, you train a small low-rank adapter.
1. Install (Colab free tier works, T4 GPU)
pip install unsloth
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes
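Before loading the model, it's worth confirming the runtime actually sees a GPU (on Colab: Runtime > Change runtime type > T4 GPU). A two-line check:
import torch
assert torch.cuda.is_available(), "No GPU visible - switch the runtime to a GPU instance"
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"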
2. Load model in 4-bit
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,        # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters: only these weights train
model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
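To see what "only the adapter trains" means in numbers, count trainable parameters right after attaching the adapters. This is plain PyTorch, nothing Unsloth-specific; the total is approximate because the 4-bit base weights are stored packed:
# With r=16 on these modules, well under 1% of the model should be trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")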
3. Prepare data: instruction format
Your dataset needs to be (instruction, input, output) triples. For a support-ticket classifier:
from datasets import Dataset

raw = [
    {"instruction": "Classify this support ticket into: billing, technical, general.",
     "input": "My card was charged twice for the same order.",
     "output": "billing"},
    # ... 500+ more
]

prompt_template = """Below is an instruction describing a task. Write a response.
### Instruction:
{instruction}
### Input:
{input}
### Response:
{output}"""

EOS = tokenizer.eos_token  # appended so the model learns where a response ends

def format_example(ex):
    return {"text": prompt_template.format(**ex) + EOS}

ds = Dataset.from_list(raw).map(format_example)
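The 10% hold-out mentioned in the data-quality section below is easiest to carve out right here, before training ever sees it. A minimal sketch using the standard datasets train_test_split; if you do this, pass split["train"] to SFTTrainer in the next step instead of ds:
split = ds.train_test_split(test_size=0.1, seed=3407)
train_ds = split["train"]   # what the trainer sees
eval_ds = split["test"]     # untouched until the evaluation section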
4. Train
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="./lora_out",
    ),
)
trainer.train()
On a T4 with 500 examples, this finishes in about 15 minutes.
5. Inference
FastLanguageModel.for_inference(model)  # 2x faster inference mode

prompt = prompt_template.format(
    instruction="Classify this support ticket into: billing, technical, general.",
    input="The app crashes when I try to export my data as CSV.",
    output="",  # leave the response slot empty for the model to fill
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)  # greedy decoding
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# → "technical"
6. Save and share: to HuggingFace or local
# Local (just the LoRA adapter — ~40MB)
model.save_pretrained("my_ticket_classifier_lora")
tokenizer.save_pretrained("my_ticket_classifier_lora")
# To HF Hub
model.push_to_hub("yourname/llama-ticket-classifier", token="hf_...")
# Or: merge LoRA into base model and export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("my_ticket_classifier", tokenizer, quantization_method="q4_k_m")
The GGUF file drops directly into Ollama:
ollama create ticket-classifier -f Modelfile
# Modelfile contains: FROM ./my_ticket_classifier.Q4_K_M.gguf
ollama run ticket-classifier
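Once the model is running under Ollama, it's reachable over Ollama's local HTTP API (default port 11434). A minimal sketch with the requests library, sending the same prompt template from step 3 since the GGUF was trained on that format:
import requests

prompt = prompt_template.format(  # the step-3 template; the GGUF expects this format
    instruction="Classify this support ticket into: billing, technical, general.",
    input="I was billed for a plan I cancelled last month.",
    output="",
)
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "ticket-classifier", "prompt": prompt, "stream": False},
)
print(resp.json()["response"].strip())  # should be "billing" if the fine-tune held up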
Data quality matters more than data quantity
500 excellent examples beat 50,000 noisy ones: the model learns every bad label it sees.
Rules:
- Each output must be what you'd accept if a human wrote it
- Diversity of inputs beats volume of near-duplicates
- Hold out 10% for eval, never train on it
- Deduplicate: exact and near-duplicates both poison generalization (see the sketch below)
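The sketch below is one cheap way to do that dedup pass: normalize the input text, drop exact repeats, and flag near-duplicates with difflib. The 0.9 similarity threshold is an arbitrary starting point, and the quadratic loop is fine at 500 examples but not at 50,000 (use embedding similarity there):
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

seen, clean = [], []
for ex in raw:
    norm = normalize(ex["input"])
    # near-duplicate if it's >90% similar to anything we've already kept
    if not any(SequenceMatcher(None, norm, prev).ratio() > 0.9 for prev in seen):
        seen.append(norm)
        clean.append(ex)
print(f"Kept {len(clean)} of {len(raw)} examples")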
Pro tip: use Claude to generate or clean your training set:
import anthropic
import json

claude = anthropic.Anthropic()

def generate_training_pair(domain: str, label: str):
    resp = claude.messages.create(
        model="claude-opus-4-7",
        max_tokens=500,
        messages=[{"role": "user", "content": (
            f"Generate a realistic {domain} example that should be classified as '{label}'. "
            f"Return JSON: {{\"input\": \"...\"}}"
        )}],
    )
    return json.loads(resp.content[0].text)
# Generate 100 examples per class, then human-review in a spreadsheet
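A possible driver loop for that review step, writing candidates to a CSV you can open in a spreadsheet. The class labels and the "customer support ticket" domain string are just this guide's example; a production run would also want retries and a guard for malformed JSON:
import csv

labels = ["billing", "technical", "general"]
with open("candidates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["input", "proposed_label"])
    for label in labels:
        for _ in range(100):
            pair = generate_training_pair("customer support ticket", label)
            writer.writerow([pair["input"], label])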
Evaluation: the part beginners skip
Fine-tuning looks like it worked because loss went down. That doesn't mean your model is good.
# generate() wraps the step-5 path: tokenize, greedy model.generate, decode new tokens only
def generate(prompt: str) -> str:
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

test_examples = [ ... ]  # the 10% held out earlier (e.g. list(eval_ds) from the step-3 split)
correct = 0
for ex in test_examples:
    # Don't unpack **ex here: it already has an "output" key, which must stay empty in the prompt
    prompt = prompt_template.format(instruction=ex["instruction"], input=ex["input"], output="")
    pred = generate(prompt)
    if pred.strip() == ex["output"].strip():
        correct += 1
print(f"Accuracy: {correct / len(test_examples):.2%}")
Compare against:
- Base Llama 3.1 8B (no fine-tune) with a good prompt
- Claude Sonnet 4.6 with the same prompt
If Claude wins with zero examples, fine-tuning wasn't worth it for THIS task. Put your energy into prompt engineering instead.
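To make that comparison concrete, you can score the same held-out set with a zero-shot Claude prompt and the same exact-match check. The sketch below reuses the anthropic client from the data-generation section; the model ID string is a guess at the naming convention, so use whatever your API console lists:
def claude_classify(ticket: str) -> str:
    resp = claude.messages.create(
        model="claude-sonnet-4-6",  # assumed model ID; substitute the real one
        max_tokens=10,
        messages=[{"role": "user", "content": (
            "Classify this support ticket into exactly one of: billing, technical, general. "
            "Reply with the label only.\n\n" + ticket
        )}],
    )
    return resp.content[0].text.strip().lower()

claude_correct = sum(claude_classify(ex["input"]) == ex["output"].strip() for ex in test_examples)
print(f"Claude zero-shot accuracy: {claude_correct / len(test_examples):.2%}")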
Cost comparison at scale (rough 2026 numbers)
| Approach | Per 1K calls | Latency | Quality |
|---|---|---|---|
| Claude Opus 4.7 API | ~$15 | 2-5s | Highest |
| Claude Sonnet 4.6 API | ~$3 | 1-2s | Very high |
| Claude Haiku 4.5 API | ~$0.25 | <1s | High for simple tasks |
| Fine-tuned Llama 8B (self-host) | ~$0.05 (GPU amort.) | <500ms | Task-specific, brittle |
| Fine-tuned Llama 8B (Replicate/Together) | ~$0.20 | 1s | Same as self-hosted |
Fine-tuning wins on cost plus latency for narrow, high-volume tasks. Claude wins on everything else.
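A back-of-envelope way to read the table: a fixed GPU bill only pays for itself above some monthly volume. The $300/month figure below is an assumption, not from the table; plug in your real hosting cost and the API tier you'd otherwise use:
gpu_monthly = 300.0            # assumed all-in monthly cost of one small GPU box
sonnet_per_call = 3.00 / 1000  # ~$3 per 1K calls, from the table

breakeven = gpu_monthly / sonnet_per_call
print(f"Self-hosting beats Sonnet above ~{breakeven:,.0f} calls/month")  # ~100,000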
Resume angle
"Shipped a fine-tuned Llama 3.1 8B for ticket classification using Unsloth + QLoRA: 500 curated examples, LoRA adapters merged to GGUF, deployed via Ollama. Matched Claude Sonnet accuracy on this task at 1/15th the per-call cost, with an honest eval showing where the fine-tune lost (open-ended reasoning) vs. won (consistent-format classification)."