Fine-Tuning with Claude and Unsloth: QLoRA for AI Engineers

Source topic: "Fine-tuning" called out as a core 2026 skill in the AI Engineer Roadmap.

Stack: Unsloth + QLoRA + HuggingFace, with Claude comparison guidance.

Start here: do you actually need to fine-tune?

Most of the time, no. Try these first, in order:

Better prompting: examples, role, output format
Prompt caching: if you're reusing long context
Retrieval (RAG): if you need domain knowledge
Tool use: if you need specific actions
Fine-tuning: only if the above fail and you have 500+ high-quality examples
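
The first item on that list is often enough for a narrow classifier like the one used later in this guide. Here is a minimal sketch of a few-shot prompt that sets a role, shows examples, and pins the output format. The helper name `build_fewshot_prompt` and the sample tickets are illustrative assumptions, not part of any library.

```python
LABELS = ["billing", "technical", "general"]

# Hypothetical hand-written examples; swap in real tickets from your own data.
FEW_SHOT = [
    ("My card was charged twice for the same order.", "billing"),
    ("The app crashes when I export to CSV.", "technical"),
    ("What are your support hours?", "general"),
]

def build_fewshot_prompt(ticket: str) -> str:
    """Assemble role + output format + examples + the new ticket into one prompt."""
    lines = [
        "You are a support-ticket triage assistant.",
        f"Classify the ticket into exactly one of: {', '.join(LABELS)}.",
        "Answer with the label only, in lowercase.",
        "",
    ]
    for text, label in FEW_SHOT:
        lines += [f"Ticket: {text}", f"Label: {label}", ""]
    # End on an open "Label:" so the model completes with just the label.
    lines += [f"Ticket: {ticket}", "Label:"]
    return "\n".join(lines)

print(build_fewshot_prompt("I can't reset my password."))
```

If a prompt like this already reaches the accuracy you need, the later steps may not be necessary.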

When fine-tuning actually wins:

Consistent output format or style that prompting can't reliably enforce
Domain-specific language the base model struggles with (legal, medical, rare code)
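Cost: at high volume, a small fine-tuned open model can cost less per call than a Claude API call
Latency: a self-hosted 7-8B model can beat a hosted API on median (p50) latency, since there is no network round trip or shared queue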

When Claude (no fine-tune) wins:

Reasoning-heavy tasks
Broad-domain knowledge
You can't gather 500+ training examples
You need state-of-the-art quality

The fast path: Unsloth + QLoRA on Llama 3.1 8B

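Unsloth is a library that patches the HuggingFace training stack to make fine-tuning roughly 2-5x faster with less memory. QLoRA keeps the base model frozen in 4-bit precision and trains small LoRA adapter matrices on top of it, so you update a few tens of millions of weights instead of all 8 billion.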
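
To make the adapter idea concrete, here is a toy sketch of the LoRA forward pass in plain Python. It assumes the standard LoRA formulation, y = Wx + (alpha / r) · B(Ax), with B initialised to zero; the tiny dimensions and helper names are illustrative only and have nothing to do with Unsloth's internals.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 8, 2, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # r x d_in, trainable
B = [[0.0] * r for _ in range(d_out)]                                   # d_out x r, starts at zero
x = [1.0] * d_in

# With B = 0 the adapter is a no-op, so training starts from the base model's behaviour.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full = d_in * d_out        # weights in the frozen matrix
adapter = r * (d_in + d_out)  # weights in A and B combined
print(f"full weights: {full}, adapter weights: {adapter}")
```

At real model sizes the gap is far larger, because the adapter grows with r · (d_in + d_out) while the full matrix grows with d_in · d_out.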

1. Install (Colab free tier works, T4 GPU)

pip install unsloth
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

2. Load model in 4-bit

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters — only these weights train
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                              # rank
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
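
As a sanity check on what r=16 across those seven projections means in practice, the trainable-parameter count can be worked out by hand. The sketch below assumes the published Llama 3.1 8B shapes (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers); the function name is illustrative.

```python
# Assumed Llama 3.1 8B shapes, taken from the published model config.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP)
KV_DIM = 1024          # 8 KV heads x 128 head_dim (grouped-query attention)
LAYERS = 32

# (in_features, out_features) of each projection targeted by the adapter.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(r: int) -> int:
    """Each adapted layer adds A (r x d_in) and B (d_out x r): r * (d_in + d_out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in SHAPES.values())
    return per_layer * LAYERS

n = lora_params(16)
print(f"{n:,} trainable parameters")
print(f"~{n * 2 / 1e6:.0f} MB in 16-bit, ~{n * 4 / 1e6:.0f} MB in 32-bit")
```

Under these assumptions the adapter is about 42 million weights, roughly half a percent of the 8B base model.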

3. Prepare data: instruction format

Your dataset needs to be (instruction, input, output) triples. For a support-ticket classifier:

from datasets import Dataset
import json

raw = [
    {"instruction": "Classify this support ticket into: billing, technical, general.",
     "input": "My card was charged twice for the same order.",
     "output": "billing"},
    # ... 500+ more
]

prompt_template = """Below is an instruction describing a task. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS = tokenizer.eos_token

def format_example(ex):
    return {"text": prompt_template.format(**ex) + EOS}

ds = Dataset.from_list(raw).map(format_example)

4. Train

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,     # effective batch = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="./lora_out",
    ),
)

trainer.train()

On a free Colab T4 with 500 short examples, expect a run like this to take on the order of 15 minutes; the exact time depends mostly on sequence length.
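
It helps to know roughly how many optimizer steps the settings above produce, since `warmup_steps` and `logging_steps` are both expressed in steps. This is a back-of-the-envelope sketch; the function name is illustrative, and the Trainer may round a partial final accumulation step slightly differently.

```python
import math

def approx_optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective, total = approx_optimizer_steps(
    n_examples=500, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"effective batch = {effective}, about {total} optimizer steps")
```

With 500 examples this comes to roughly 189 steps, so 5 warmup steps is a short warmup and `logging_steps=10` yields under twenty loss readings for the whole run.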

5. Inference

FastLanguageModel.for_inference(model)   # 2x faster inference mode

prompt = prompt_template.format(
    instruction="Classify this support ticket into: billing, technical, general.",
    input="The app crashes when I try to export my data as CSV.",
    output="",
)

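inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
# Greedy decoding: do_sample=False is deterministic (temperature=0 is not a valid sampling setting)
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# generate() returns prompt + completion, so decode only the newly generated tokens
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
# expected for this ticket: "technical"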
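
A small model will sometimes emit extra text around the label, so it is worth normalising the decoded string before using it. The helper below is a hypothetical sketch (the name `extract_label` is not from any library); it works whether or not the prompt is still attached to the decoded text.

```python
LABELS = ("billing", "technical", "general")
RESPONSE_MARKER = "### Response:"

def extract_label(decoded: str):
    """Return the predicted label, or None if the model produced something else."""
    # Keep only what follows the last response marker, if the prompt is still attached.
    completion = decoded.rsplit(RESPONSE_MARKER, 1)[-1]
    words = completion.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;\"'")
    return first if first in LABELS else None

print(extract_label("### Input:\nThe app crashes.\n\n### Response:\ntechnical"))
print(extract_label("Billing."))
print(extract_label("I am not sure"))
```

Returning `None` for anything outside the label set lets the caller route those tickets to a fallback instead of silently accepting a malformed answer.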

6. Save and share: to HuggingFace or local

# Local (just the LoRA adapter: ~42M weights at r=16, roughly 80-170 MB depending on the saved dtype)
model.save_pretrained("my_ticket_classifier_lora")
tokenizer.save_pretrained("my_ticket_classifier_lora")

# To HF Hub
model.push_to_hub("yourname/llama-ticket-classifier", token="hf_...")

# Or: merge LoRA into base model and export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("my_ticket_classifier", tokenizer, quantization_method="q4_k_m")

The GGUF file drops directly into Ollama:

ollama create ticket-classifier -f Modelfile
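# Modelfile contains a FROM line pointing at the exported file, e.g. FROM ./my_ticket_classifier.Q4_K_M.gguf
# (the exact .gguf filename and folder depend on your Unsloth version, so check what was written)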
ollama run ticket-classifier
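
Once the model is registered, an application would normally call it over Ollama's local HTTP API rather than the interactive CLI. The sketch below assumes Ollama's default endpoint (`http://localhost:11434/api/generate`) and the model name from the `ollama create` command above; `build_request` and `classify` are illustrative names. It sets `raw` so that Ollama applies no template of its own and the model sees the same prompt format it was trained on, which is an assumption about how you want to serve it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(ticket: str, model: str = "ticket-classifier") -> dict:
    """Build the JSON body, reusing the training prompt format verbatim."""
    prompt = (
        "Below is an instruction describing a task. Write a response.\n\n"
        "### Instruction:\n"
        "Classify this support ticket into: billing, technical, general.\n\n"
        f"### Input:\n{ticket}\n\n"
        "### Response:\n"
    )
    return {
        "model": model,
        "prompt": prompt,
        "raw": True,        # send the prompt as-is, without Ollama's own template
        "stream": False,    # one JSON reply instead of a token stream
        "options": {"temperature": 0, "num_predict": 8},
    }

def classify(ticket: str) -> str:
    """POST the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_request(ticket)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"].strip()

# Usage, with `ollama serve` running locally:
#   classify("My card was charged twice for the same order.")
```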

Data quality matters more than data quantity

500 excellent examples beat 50,000 noisy ones. The model learns every bad label it sees.

Rules:

Each output must be what you'd accept if a human wrote it
Diversity of inputs beats volume of near-duplicates
Hold out 10% for eval, never train on it
Deduplicate. Exact AND near-duplicates poison generalization
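
The last two rules can be sketched in a few lines of plain Python. The normalisation here is deliberately crude (lowercase, strip punctuation, collapse whitespace), so it only catches trivial near-duplicates; catching paraphrases would need something stronger, such as embedding similarity. All function names are illustrative.

```python
import random
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace: a cheap near-duplicate key."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())

def dedupe(examples):
    """Keep the first example for each normalised input."""
    seen, kept = set(), []
    for ex in examples:
        key = normalize(ex["input"])
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def train_eval_split(examples, eval_fraction=0.1, seed=3407):
    """Shuffle once with a fixed seed, then carve off the eval set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]

raw = [
    {"input": "My card was charged twice.", "output": "billing"},
    {"input": "my card was charged TWICE!!", "output": "billing"},
    {"input": "The app crashes on export.", "output": "technical"},
]
clean = dedupe(raw)
train, held_out = train_eval_split(clean)
print(len(raw), len(clean), len(train), len(held_out))
```

Deduplicating before splitting matters: otherwise a near-copy of a training example can land in the held-out set and inflate the score.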

Pro tip: use Claude to generate or clean your training set:

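import json
import anthropic

claude = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def generate_training_pair(domain: str, label: str) -> dict:
    resp = claude.messages.create(
        model="claude-opus-4-7", max_tokens=500,
        messages=[{"role": "user", "content": (
            f"Generate a realistic {domain} example that should be classified as '{label}'. "
            f"Return only JSON, with no prose or code fences: {{\"input\": \"...\"}}"
        )}],
    )
    pair = json.loads(resp.content[0].text)   # raises if the reply is not pure JSON
    pair["output"] = label                    # attach the label the example was generated for
    return pair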

# Generate 100 examples per class, then human-review in a spreadsheet
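
Model replies are not always bare JSON; they may arrive wrapped in a code fence or with a sentence of preamble. A tolerant parser avoids losing otherwise good examples. This is a hypothetical helper (the name `parse_json_reply` is an assumption) that pulls out the first JSON object and checks for the expected `input` field.

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Extract a JSON object from a model reply, tolerating code fences and chatter."""
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    if not isinstance(data.get("input"), str) or not data["input"].strip():
        raise ValueError("reply is missing a non-empty 'input' field")
    return data

reply = 'Sure, here it is:\n{"input": "I was billed twice this month."}\nHope that helps.'
print(parse_json_reply(reply)["input"])
```

Rejecting replies that fail validation, rather than patching them up, keeps malformed rows out of the spreadsheet you hand to human reviewers.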

Evaluation: the part beginners skip

Fine-tuning looks like it worked because loss went down. That doesn't mean your model is good.

test_examples = [ ... ]   # 10% held out

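def generate(prompt: str) -> str:
    """Greedy-decode a completion and return only the newly generated text."""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

correct = 0
for ex in test_examples:
    # ex already has an "output" key, so override it instead of passing output= a second time
    prompt = prompt_template.format(**{**ex, "output": ""})
    pred = generate(prompt)
    if pred.strip() == ex["output"].strip():
        correct += 1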

print(f"Accuracy: {correct / len(test_examples):.2%}")
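
A single accuracy number can hide a class the model never gets right. A per-class breakdown with the most common confusions is a cheap addition; the sketch below assumes you have collected gold labels and predictions as two parallel lists, and the function name is illustrative.

```python
from collections import Counter, defaultdict

def per_class_report(golds, preds):
    """Per-label count, accuracy, and which wrong labels were predicted instead."""
    totals = Counter(golds)
    hits = Counter(g for g, p in zip(golds, preds) if g == p)
    confusions = defaultdict(Counter)
    for g, p in zip(golds, preds):
        if g != p:
            confusions[g][p] += 1
    return {
        label: {
            "n": totals[label],
            "accuracy": hits[label] / totals[label],
            "confused_with": dict(confusions[label]),
        }
        for label in totals
    }

golds = ["billing", "billing", "technical", "general"]
preds = ["billing", "general", "technical", "general"]
for label, row in per_class_report(golds, preds).items():
    print(label, row)
```

Running the same report on every system you compare makes it easier to see where each one fails, not just which one scores higher overall.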

Compare against:

Base Llama 3.1 8B (no fine-tune) with a good prompt
Claude Sonnet 4.6 with the same prompt

If Claude wins with zero examples, fine-tuning wasn't worth it for THIS task. Put your energy into prompt engineering instead.

Cost comparison at scale (rough 2026 numbers)

Approach	Per 1K calls	Latency	Quality
Claude Opus 4.7 API	~$15	2-5s	Highest
Claude Sonnet 4.6 API	~$3	1-2s	Very high
Claude Haiku 4.5 API	~$0.25	<1s	High for simple tasks
Fine-tuned Llama 8B (self-host)	~$0.05 (GPU amort.)	<500ms	Task-specific, brittle
Fine-tuned Llama 8B (Replicate/Together)	~$0.20	~1s	Same as self-hosted

Fine-tuning wins on cost plus latency for narrow, high-volume tasks. Claude wins on everything else.
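
To see what those per-1K figures mean for a budget, multiply them out at a given volume. The sketch below copies the table's rough estimates as-is; the volume of one million calls per month is a hypothetical chosen only for illustration.

```python
# Per-1K-call figures copied from the table above (the article's rough estimates).
COST_PER_1K = {
    "Claude Opus 4.7 API": 15.00,
    "Claude Sonnet 4.6 API": 3.00,
    "Claude Haiku 4.5 API": 0.25,
    "Fine-tuned Llama 8B (self-host)": 0.05,
    "Fine-tuned Llama 8B (hosted)": 0.20,
}

def monthly_cost(calls_per_month: int, per_1k: float) -> float:
    """Cost in dollars for a month at the given call volume."""
    return calls_per_month / 1000 * per_1k

CALLS = 1_000_000  # hypothetical monthly volume
for name, per_1k in COST_PER_1K.items():
    print(f"{name:34s} ${monthly_cost(CALLS, per_1k):>10,.2f}")
```

At low volumes the absolute difference is small enough that engineering time spent on fine-tuning and hosting will usually outweigh it; the gap only dominates once call counts are high.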

Resume angle

"Shipped a fine-tuned Llama 3.1 8B for ticket classification using Unsloth + QLoRA: 500 curated examples, LoRA adapters merged to GGUF, deployed via Ollama. Matched Claude Sonnet accuracy on this task at 1/15th the per-call cost, with an honest eval showing where the fine-tune lost (open-ended reasoning) vs. won (consistent-format classification)."