Fine-Tuning with Claude and Unsloth: QLoRA for AI Engineers
White Paper

Jake McCluskey

Source topic: "Fine-tuning" called out as a core 2026 skill in the AI Engineer Roadmap.

Stack: Unsloth + QLoRA + HuggingFace, with Claude comparison guidance.

Start here: do you actually need to fine-tune?

Most of the time, no. Try these first, in order:

  1. Better prompting: examples, role, output format
  2. Prompt caching: if you're reusing long context
  3. Retrieval (RAG): if you need domain knowledge
  4. Tool use: if you need specific actions
  5. Fine-tuning: only if the above fail and you have 500+ high-quality examples

When fine-tuning actually wins:

  • Consistent output format or style that prompting can't reliably enforce
  • Domain-specific language the base model struggles with (legal, medical, rare code)
  • Cost: a small fine-tuned open model is cheaper at scale than Claude per call
  • Latency: a self-hosted 7-8B model typically beats an API round trip at p50

When Claude (no fine-tune) wins:

  • Reasoning-heavy tasks
  • Broad-domain knowledge
  • You can't gather 500+ training examples
  • You need state-of-the-art quality

The fast path: Unsloth + QLoRA on Llama 3.1 8B

Unsloth is a library that makes fine-tuning 2-5x faster with less memory. QLoRA is LoRA applied on top of a 4-bit quantized base model: the base weights stay frozen and you train a small adapter instead of the whole model.

1. Install (Colab free tier works, T4 GPU)

pip install unsloth
pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes

2. Load model in 4-bit

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detect
    load_in_4bit=True,
)

# Attach LoRA adapters, only these weights train
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                              # rank
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
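To make "a small adapter" concrete, the sketch below counts the trainable weights that the LoRA config above would add. The layer dimensions are assumptions about Llama 3.1 8B (they are not stated in this article), and `lora_param_count` is a hypothetical helper, not part of Unsloth or PEFT.

```python
# Rough count of trainable LoRA parameters for the config above.
# Assumed Llama 3.1 8B dimensions (not taken from this article):
HIDDEN = 4096          # model width
INTERMEDIATE = 14336   # MLP inner width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32

# (input_features, output_features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """LoRA adds two matrices per layer: A (rank x in) and B (out x rank)."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_block * NUM_LAYERS


trainable = lora_param_count(rank=16)
print(f"{trainable:,} trainable LoRA parameters")
```

Under these assumed shapes, rank 16 works out to 41,943,040 trainable parameters, roughly half a percent of the 8B base model. Doubling `r` doubles that count, which is the main lever on adapter size.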

3. Prepare data: instruction format

Your dataset needs to be (instruction, input, output) triples. For a support-ticket classifier:

from datasets import Dataset
import json

raw = [
    {"instruction": "Classify this support ticket into: billing, technical, general.",
     "input": "My card was charged twice for the same order.",
     "output": "billing"},
    # ... 500+ more
]

prompt_template = """Below is an instruction describing a task. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

EOS = tokenizer.eos_token

def format_example(ex):
    return {"text": prompt_template.format(**ex) + EOS}

ds = Dataset.from_list(raw).map(format_example)
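It can help to look at one fully formatted training string before training on hundreds of them. The sketch below repeats the template from this step so it runs without a GPU or tokenizer; the `EOS` value is a stand-in chosen for illustration, and in the real pipeline it should come from `tokenizer.eos_token`.

```python
# Standalone preview of a single formatted training example.
# EOS is a placeholder here; use tokenizer.eos_token in the real pipeline.
EOS = "<|end_of_text|>"

prompt_template = """Below is an instruction describing a task. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

example = {
    "instruction": "Classify this support ticket into: billing, technical, general.",
    "input": "My card was charged twice for the same order.",
    "output": "billing",
}


def format_example(ex: dict) -> dict:
    """Fill the template and append EOS so the model learns where to stop."""
    return {"text": prompt_template.format(**ex) + EOS}


text = format_example(example)["text"]
print(text)
```

The appended EOS token is what teaches the model to stop after the label instead of continuing to generate.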

4. Train

import torch   # needed for the bf16 capability check below
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,     # effective batch = 8
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="./lora_out",
    ),
)

trainer.train()

On a T4 with 500 examples, this finishes in about 15 minutes.
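The batch settings above determine how many optimizer updates the run actually performs. As a rough sketch of that arithmetic (`training_schedule` is a hypothetical helper, and the Trainer's own rounding may differ by a step per epoch):

```python
import math


def training_schedule(num_examples: int, per_device_batch: int,
                      grad_accum: int, epochs: int, num_devices: int = 1):
    """Return (effective batch size, approx. steps per epoch, approx. total steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Values from the TrainingArguments above, with the 500-example dataset
effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=500, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"effective batch {effective_batch}, "
      f"~{steps_per_epoch} steps/epoch, ~{total_steps} steps total")
```

With fewer than 200 updates in total, `warmup_steps=5` and `logging_steps=10` are sensible scales; a much larger dataset would likely call for revisiting both.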

5. Inference

FastLanguageModel.for_inference(model)   # 2x faster inference mode

prompt = prompt_template.format(
    instruction="Classify this support ticket into: billing, technical, general.",
    input="The app crashes when I try to export my data as CSV.",
    output="",
)

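inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
# Greedy decoding: do_sample=False gives deterministic output
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
# → "technical"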
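Generated text is not guaranteed to be exactly one of the three labels; it may carry extra whitespace, different casing, or trailing tokens. One possible way to handle that, sketched with a hypothetical `parse_label` helper, is to map the raw string onto the allowed set and flag anything that does not fit:

```python
from typing import Optional

LABELS = ("billing", "technical", "general")


def parse_label(generated: str, labels=LABELS) -> Optional[str]:
    """Map raw model text to one allowed label, or None if nothing matches."""
    cleaned = generated.strip().lower()
    for label in labels:
        if cleaned.startswith(label):
            return label
    return None   # caller decides: retry, fall back, or route to a human


print(parse_label("  Technical\n### Instruction"))
print(parse_label("refund request"))
```

Returning `None` rather than a default label keeps off-format outputs visible, which matters when measuring how reliable the fine-tune really is.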

6. Save and share: to HuggingFace or local

# Local (just the LoRA adapter, a small fraction of the base model's size)
model.save_pretrained("my_ticket_classifier_lora")
tokenizer.save_pretrained("my_ticket_classifier_lora")

# To HF Hub
model.push_to_hub("yourname/llama-ticket-classifier", token="hf_...")

# Or: merge LoRA into base model and export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf("my_ticket_classifier", tokenizer, quantization_method="q4_k_m")

The GGUF file drops directly into Ollama:

ollama create ticket-classifier -f Modelfile
# Modelfile contains: FROM ./my_ticket_classifier.Q4_K_M.gguf
ollama run ticket-classifier
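For a classifier, the Modelfile can also pin generation settings so the served model behaves like the inference call earlier. The sketch below builds such a file as a string; `build_modelfile` is a hypothetical helper, and the two `PARAMETER` values are assumptions chosen to mirror greedy decoding with a short output.

```python
def build_modelfile(gguf_path: str, max_tokens: int = 8) -> str:
    """Return Modelfile text that points Ollama at a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        "PARAMETER temperature 0",              # deterministic output
        f"PARAMETER num_predict {max_tokens}",  # a label needs only a few tokens
    ]
    return "\n".join(lines) + "\n"


modelfile_text = build_modelfile("./my_ticket_classifier.Q4_K_M.gguf")
print(modelfile_text)
```

Saving the returned string to a file named `Modelfile` next to the GGUF is enough for the `ollama create` command above to pick it up.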

Data quality matters more than data quantity

500 excellent examples beat 50,000 noisy ones. The model learns from every bad label it sees.

Rules:

  • Each output must be what you'd accept if a human wrote it
  • Diversity of inputs beats volume of near-duplicates
  • Hold out 10% for eval, never train on it
  • Deduplicate. Exact AND near-duplicates poison generalization
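The last two rules can be sketched in plain Python. This is one possible approach, not the only one: `dedupe` and `split_holdout` are hypothetical helpers, the 0.9 similarity threshold is an arbitrary starting point, and the pairwise comparison is quadratic, which is fine for a few hundred examples but not for very large sets.

```python
import random
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe(examples: list, near_threshold: float = 0.9) -> list:
    """Drop exact and near-duplicate inputs, keeping the first occurrence."""
    kept = []
    for ex in examples:
        norm = normalize(ex["input"])
        is_dup = any(
            SequenceMatcher(None, norm, normalize(k["input"])).ratio() >= near_threshold
            for k in kept
        )
        if not is_dup:
            kept.append(ex)
    return kept


def split_holdout(examples: list, eval_fraction: float = 0.1, seed: int = 3407):
    """Shuffle once with a fixed seed, then carve off the eval slice."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]   # (train, eval)


raw = [
    {"input": "My card was charged twice.", "output": "billing"},
    {"input": "my card was  charged twice.", "output": "billing"},   # exact after normalizing
    {"input": "My card was charged twice!", "output": "billing"},    # near-duplicate
    {"input": "The app crashes on export.", "output": "technical"},
]
clean = dedupe(raw)
print(len(raw), "->", len(clean), "examples after dedup")
```

Deduplicating before splitting matters: otherwise a near-duplicate of a training example can land in the eval set and inflate the score.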

Pro tip: use Claude to generate or clean your training set:

import json
import anthropic

claude = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def generate_training_pair(domain: str, label: str) -> dict:
    resp = claude.messages.create(
        model="claude-opus-4-7", max_tokens=500,
        messages=[{"role": "user", "content": (
            f"Generate a realistic {domain} example that should be classified as '{label}'. "
            f"Return only JSON: {{\"input\": \"...\"}}"
        )}],
    )
    # Attach the requested label so the result is a complete (input, output) pair
    return {**json.loads(resp.content[0].text), "output": label}

# Generate 100 examples per class, then human-review in a spreadsheet
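Model replies sometimes wrap the JSON in a sentence or two, which makes a bare `json.loads` call fail. A minimal sketch of a more forgiving parser follows; `extract_json_object` is a hypothetical helper, and it assumes the reply contains a single JSON object with an `input` string as requested in the prompt above.

```python
import json


def extract_json_object(reply: str) -> dict:
    """Parse the outermost {...} span in a reply and validate its 'input' field."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    obj = json.loads(reply[start:end + 1])
    if not isinstance(obj.get("input"), str) or not obj["input"].strip():
        raise ValueError("reply is missing a non-empty 'input' string")
    return obj


reply = 'Sure! {"input": "I was billed twice this month."} Hope that helps.'
print(extract_json_object(reply))
```

Rejecting malformed replies at generation time is cheaper than discovering them during human review.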

Evaluation: the part beginners skip

Fine-tuning looks like it worked because loss went down. That doesn't mean your model is good.

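test_examples = [ ... ]   # 10% held out, same dict format as the training data

def generate(prompt: str) -> str:
    """Greedy-decode a short completion and return only the new text."""
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

correct = 0
for ex in test_examples:
    # Blank the gold answer so the model has to produce it
    prompt = prompt_template.format(**{**ex, "output": ""})
    pred = generate(prompt)
    if pred.strip() == ex["output"].strip():
        correct += 1

print(f"Accuracy: {correct / len(test_examples):.2%}")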
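A single accuracy number can hide a class the model rarely gets right. As a sketch of a slightly richer report, the hypothetical `per_class_report` helper below computes precision and recall per label from two parallel lists; the sample labels are made up for illustration.

```python
from collections import Counter


def per_class_report(golds: list, preds: list) -> dict:
    """Precision, recall, and support for each gold label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(golds, preds):
        if gold == pred:
            tp[gold] += 1
        else:
            fn[gold] += 1   # missed an instance of the gold class
            fp[pred] += 1   # wrongly predicted this class
    report = {}
    for label in sorted(set(golds)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
            "support": actual,
        }
    return report


# Illustrative labels only, not real results
golds = ["billing", "billing", "technical", "general"]
preds = ["billing", "technical", "technical", "general"]
for label, stats in per_class_report(golds, preds).items():
    print(label, stats)
```

The same function can score the fine-tune and any baseline, as long as each produces one predicted label per held-out example.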

Compare against:

  • Base Llama 3.1 8B (no fine-tune) with a good prompt
  • Claude Sonnet 4.6 with the same prompt

If Claude wins with zero examples, fine-tuning wasn't worth it for THIS task. Put your energy into prompt engineering instead.

Cost comparison at scale (rough 2026 numbers)

Approach                                   Per 1K calls         Latency   Quality
Claude Opus 4.7 API                        ~$15                 2-5s      Highest
Claude Sonnet 4.6 API                      ~$3                  1-2s      Very high
Claude Haiku 4.5 API                       ~$0.25               <1s       High for simple tasks
Fine-tuned Llama 8B (self-host)            ~$0.05 (GPU amort.)  <500ms    Task-specific, brittle
Fine-tuned Llama 8B (Replicate/Together)   ~$0.20               1s        Same

Fine-tuning wins on cost plus latency for narrow, high-volume tasks. Claude wins on everything else.
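To see what "at scale" means in dollars, the rough per-1K figures from this section can be projected onto a monthly volume. The sketch below uses those figures as-is; the one-million-calls volume is an assumption for illustration, and `monthly_cost` is a hypothetical helper.

```python
# Rough per-1K-call costs quoted in this section (USD)
COST_PER_1K_CALLS = {
    "Claude Opus 4.7 API": 15.00,
    "Claude Sonnet 4.6 API": 3.00,
    "Claude Haiku 4.5 API": 0.25,
    "Fine-tuned Llama 8B (self-host)": 0.05,
    "Fine-tuned Llama 8B (Replicate/Together)": 0.20,
}


def monthly_cost(calls_per_month: int, cost_per_1k: float) -> float:
    """Scale a per-1K-call price to a monthly call volume."""
    return calls_per_month / 1000 * cost_per_1k


CALLS_PER_MONTH = 1_000_000   # assumed volume, for illustration only
for approach, per_1k in COST_PER_1K_CALLS.items():
    print(f"{approach:<42} ${monthly_cost(CALLS_PER_MONTH, per_1k):>10,.2f}")
```

At low volumes the absolute gap is small, which is part of why fine-tuning only pays off for narrow, high-volume tasks.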

Resume angle

"Shipped a fine-tuned Llama 3.1 8B for ticket classification using Unsloth + QLoRA: 500 curated examples, LoRA adapters merged to GGUF, deployed via Ollama. Matched Claude Sonnet accuracy on this task at 1/15th the per-call cost, with an honest eval showing where the fine-tune lost (open-ended reasoning) vs. won (consistent-format classification)."

Common questions

When should I use fine-tuning instead of Claude or prompt engineering?

Fine-tuning makes sense only after you've tried better prompting, prompt caching, RAG, and tool use first, and you have 500+ high-quality examples. It wins when you need consistent output format or style that prompting can't reliably enforce, domain-specific language the base model struggles with, lower cost at scale, or lower latency with self-hosted models. Claude without fine-tuning wins for reasoning-heavy tasks, broad-domain knowledge, when you can't gather 500+ examples, or when you need state-of-the-art quality.

How long does it take to fine-tune a Llama 3.1 8B model with Unsloth on a T4 GPU?

Training finishes in about 15 minutes on a T4 GPU with 500 examples using the Unsloth and QLoRA setup described in the article. Unsloth makes fine-tuning 2 to 5 times faster with less memory, and QLoRA keeps the base model frozen in 4-bit so that you train a small adapter instead of the whole model.

What is the cost difference between fine-tuned Llama 8B and Claude API for 1,000 calls?

A self-hosted fine-tuned Llama 8B costs roughly $0.05 per 1,000 calls (GPU amortized) with under 500ms latency, compared to Claude Opus 4.7 at around $15 per 1,000 calls, Claude Sonnet 4.6 at around $3, or Claude Haiku 4.5 at around $0.25. Fine-tuning wins on cost and latency for narrow, high-volume tasks, while Claude wins on everything else.

How do I evaluate whether my fine-tuned model actually performs better than the base model or Claude?

Hold out 10 percent of your data for evaluation and never train on it. Calculate accuracy by running inference on these test examples and comparing predictions to expected outputs. Then compare your fine-tuned model against the base Llama 3.1 8B with a good prompt and Claude Sonnet 4.6 with the same prompt. If Claude wins with zero examples, fine-tuning was not worth it for that task.

How do I export a fine-tuned Llama model to run locally with Ollama?

Use Unsloth's save_pretrained_gguf method to merge the LoRA adapter into the base model and export it as a GGUF file with quantization such as q4_k_m. The GGUF file can then be loaded directly into Ollama by creating a Modelfile that points to the GGUF path and running ollama create followed by ollama run with your model name.
