Fine-tuning in Practice

Fine-tuning in Practice — CRIN

The difference between fine-tuning that works and fine-tuning that fails is almost always in the dataset. 1,000 high-quality examples can beat 100,000 mediocre ones. LoRA cuts the number of trainable parameters by 99% or more, which sharply reduces memory use and training cost. Careful evaluation catches catastrophic forgetting before deployment. And knowing when NOT to fine-tune saves weeks.

Course: Advanced.

This lesson covers 5 concepts: Dataset Construction, LoRA Hyperparameters, The Training Run, Evaluating the Result, Common Failures.

Dataset Construction

Dataset quality is the primary determinant of fine-tuned model quality. 1,000 carefully curated examples can outperform 100,000 mediocre ones. Diversity across task variants and edge cases matters as much as volume.

Most fine-tuning failures are data failures. Garbage in, garbage out — at a scale where the garbage is subtle and hard to spot until the model is deployed.

Imagine teaching a person by example. 1,000 perfect examples with clear, correct responses teach better than 100,000 sloppy examples with inconsistent answers.

LIMA result: LLaMA-65B fine-tuned on 1,000 hand-curated SFT examples was preferred in human evaluations over Alpaca 65B, which was trained on 52,000 lower-quality examples. A roughly 50× data advantage was erased by quality.
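A curation pass of the kind described above can be sketched in a few lines. This is a minimal illustration, not a complete pipeline: the record schema (`prompt`, `response`, `task`), the function names, and the 20-character threshold are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def curate(examples, min_response_chars=20):
    """Drop duplicates, near-empty responses, and prompts with conflicting answers.

    `examples` is a list of {"prompt", "response", "task"} dicts
    (a hypothetical schema chosen for this sketch).
    """
    # Collect every distinct response seen for each normalised prompt.
    by_prompt = {}
    for ex in examples:
        by_prompt.setdefault(normalise(ex["prompt"]), set()).add(
            normalise(ex["response"])
        )

    kept, seen = [], set()
    for ex in examples:
        key = normalise(ex["prompt"])
        if len(ex["response"].strip()) < min_response_chars:
            continue  # response too short to teach anything
        if len(by_prompt[key]) > 1:
            continue  # inconsistent answers: set aside for manual review
        digest = hashlib.sha256(key.encode()).hexdigest()
        if digest in seen:
            continue  # duplicate prompt
        seen.add(digest)
        kept.append(ex)
    return kept


def task_coverage(examples):
    """Count examples per task to expose gaps in diversity."""
    return Counter(ex["task"] for ex in examples)
```

Running `task_coverage` on the curated set gives a quick view of which task variants are under-represented before any training starts.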

LoRA Hyperparameters

Starting LoRA defaults: rank=16, alpha=32, lr=1e-4, epochs=2. Rank controls adapter capacity. Alpha scales the update magnitude. Learning rate controls update step size. Epochs controls how many passes over the training data.
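To make rank and alpha concrete: in the standard LoRA formulation, a frozen weight matrix of shape d_out × d_in gets two small trainable matrices of shapes d_out × r and r × d_in, and their product is scaled by alpha / r. The arithmetic below is a sketch; the hidden size of 4096 is an assumed value for illustration, not a property of any particular model.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return alpha / rank


hidden = 4096  # assumed hidden size, for illustration only
full = hidden * hidden                               # 16,777,216 weights in one square matrix
adapter = lora_trainable_params(hidden, hidden, 16)  # 131,072 trainable weights at rank 16
fraction = adapter / full

print(f"trainable fraction per adapted matrix: {fraction:.2%}")  # 0.78%
print(f"update scaling at alpha=32, rank=16: {lora_scaling(32, 16)}")  # 2.0
```

Doubling the rank doubles the adapter size, and keeping alpha at twice the rank keeps the scaling factor fixed at 2.0.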

Rank is usually the most impactful LoRA hyperparameter. Rank 4–8 for style/format fine-tuning. Rank 16–64 for domain knowledge. Rank above 64 rarely adds further gains; if the task still needs more capacity at that point, full fine-tuning is worth considering.

Start with rank=16, alpha=32, learning rate=0.0001, 2 epochs. If the model is not learning enough, increase rank. If it is forgetting general capability, reduce learning rate and epochs.
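The adjustment rule above can be written down as a small helper. This is only a sketch of the heuristic: the `LoraSettings` class and `adjust` function are hypothetical names, and the step sizes (doubling rank, halving the learning rate, dropping one epoch) are assumptions rather than recommendations from the lesson. Alpha is doubled together with rank so the alpha/rank ratio stays at 2, which is also an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LoraSettings:
    # Starting defaults taken from the text above.
    rank: int = 16
    alpha: int = 32
    lr: float = 1e-4
    epochs: int = 2


def adjust(settings: LoraSettings, underfitting: bool, forgetting: bool) -> LoraSettings:
    """Return new settings for the next run based on the observed symptom."""
    if underfitting:
        # Not learning enough: add adapter capacity, keep alpha/rank constant.
        settings = replace(settings, rank=settings.rank * 2, alpha=settings.alpha * 2)
    if forgetting:
        # Losing general capability: take smaller and fewer update steps.
        settings = replace(
            settings,
            lr=settings.lr / 2,
            epochs=max(1, settings.epochs - 1),
        )
    return settings


next_run = adjust(LoraSettings(), underfitting=True, forgetting=False)
print(next_run)  # rank=32, alpha=64, lr and epochs unchanged
```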

Customer support fine-tune: rank=8, alpha=16, lr=2e-4, 2 epochs. Code generation fine-tune: rank=64, alpha=128, lr=1e-4, 3 epochs. Different tasks need different capacity.

The Training Run

A disciplined training run: start with 10% of data for a smoke test, monitor train/val loss curves for overfitting, checkpoint every 500 steps, and select the checkpoint with best held-out task accuracy — not the one with lowest training loss.

Training instability can destroy hours of compute in seconds. Frequent checkpointing is cheap insurance — the extra storage cost is trivial compared to restarting from epoch 0.

Training loss going down is a good sign. Validation loss going up while training loss goes down is a bad sign — the model is memorising rather than learning. Stop and fix the data.
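The divergence pattern described above is easy to detect automatically. The sketch below assumes loss values are logged at the same evaluation points for both splits; the function name and the `patience` parameter are illustrative choices, not part of any training framework.

```python
def overfitting_onset(train_loss, val_loss, patience=2):
    """Return the index of the first evaluation in a run of `patience`
    consecutive evaluations where validation loss rose while training
    loss fell. Return None if no such run exists.
    """
    streak = 0
    for i in range(1, len(val_loss)):
        val_up = val_loss[i] > val_loss[i - 1]
        train_down = train_loss[i] < train_loss[i - 1]
        if val_up and train_down:
            streak += 1
            if streak >= patience:
                return i - patience + 1
        else:
            streak = 0
    return None


train = [2.0, 1.5, 1.2, 1.0, 0.8, 0.6]
val = [2.1, 1.7, 1.5, 1.6, 1.7, 1.9]
print(overfitting_onset(train, val))  # 3: validation loss starts rising at the fourth evaluation
```

Requiring more than one consecutive rise avoids stopping on a single noisy evaluation.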

Common mistake: run for 10 epochs, take the final checkpoint. Correct: checkpoint every 500 steps, eval each checkpoint, take the one with highest held-out accuracy. Often epoch 2 beats epoch 10 due to overfitting.
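Checkpoint selection along these lines might look like the following. The record fields (`step`, `train_loss`, `heldout_accuracy`) and the numbers are made up for illustration; in practice they would come from evaluating each saved checkpoint on the held-out set.

```python
def select_checkpoint(evals):
    """Pick the checkpoint with the highest held-out accuracy.

    Ties are broken in favour of the earlier step, which has had
    less opportunity to overfit.
    """
    return max(evals, key=lambda e: (e["heldout_accuracy"], -e["step"]))


# Illustrative numbers only.
evals = [
    {"step": 500, "train_loss": 1.40, "heldout_accuracy": 0.84},
    {"step": 1000, "train_loss": 0.95, "heldout_accuracy": 0.91},
    {"step": 1500, "train_loss": 0.60, "heldout_accuracy": 0.90},
    {"step": 2000, "train_loss": 0.35, "heldout_accuracy": 0.88},
    {"step": 2500, "train_loss": 0.20, "heldout_accuracy": 0.86},
]

best = select_checkpoint(evals)
lowest_loss = min(evals, key=lambda e: e["train_loss"])
print(best["step"], lowest_loss["step"])  # 1000 2500
```

In this made-up run the final checkpoint has the lowest training loss but is five points worse on held-out accuracy than the one saved at step 1000.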

Evaluating the Result

Fine-tuned model eval: task accuracy 0.94 (excellent), format correct 0.98 (near-perfect), general QA 0.81 (slight degradation from base), instruction follow 0.96 (strong). Catastrophic forgetting score 0.19 — 19% of general capability tests show degradation. This needs investigation.

Evaluating the fine-tuned model against the base model is mandatory. Improvements on the target task but degradation on general capability may be acceptable or unacceptable depending on the application.

Fine-tuning is a tradeoff. The model becomes more specialised. Some general capability loss is expected and acceptable. Catastrophic loss of general capability is not.

Medical fine-tune: task accuracy 0.97, general QA 0.68 (significant degradation). The model specialised so heavily it lost general reasoning ability. Solution: reduce training epochs or add diverse general-purpose examples to the fine-tuning set.
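A base-versus-fine-tuned comparison can be reduced to a per-metric delta with a tolerance. In the sketch below the fine-tuned scores are the medical example from the text; the base-model scores and the allowed drops are hypothetical values invented for the illustration, since the lesson does not state them.

```python
def compare_to_base(base, tuned, max_drop):
    """Report the change on each metric and whether it stays within tolerance.

    `max_drop` maps a metric name to the largest acceptable decrease;
    metrics not listed tolerate no decrease at all.
    """
    report = {}
    for metric, base_score in base.items():
        delta = round(tuned[metric] - base_score, 4)
        allowed = max_drop.get(metric, 0.0)
        report[metric] = {"delta": delta, "ok": delta >= -allowed}
    return report


base = {"task_accuracy": 0.71, "general_qa": 0.84}   # hypothetical base-model scores
tuned = {"task_accuracy": 0.97, "general_qa": 0.68}  # medical fine-tune from the text
tolerance = {"general_qa": 0.05}                     # assumed acceptable drop

for metric, result in compare_to_base(base, tuned, tolerance).items():
    print(metric, result)
```

How large a drop is acceptable depends on the application, so the tolerance belongs in configuration rather than in the comparison logic.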

Common Failures

Four common fine-tuning failures: (1) catastrophic forgetting, from too many epochs or too high a learning rate; (2) overfitting, from insufficient data diversity; (3) format drift, where the model starts in the right format then abandons it; (4) contamination, where the eval set overlaps the training set and inflates scores.

Recognising the failure mode determines the fix. Treating overfitting as a lack of capacity and raising rank or epochs makes it worse. Correct diagnosis is the first step.

Most fine-tuning failures are predictable and fixable. The failure mode tells you exactly what to change — if you know what to look for.
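Contamination is the easiest of these failures to check mechanically. The sketch below only catches exact matches after case and whitespace normalisation; paraphrased or partially overlapping items need fuzzier matching, which is out of scope here. The function names are illustrative.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_contamination(train_prompts, eval_prompts):
    """Return the eval prompts that also appear in the training set."""
    train_keys = {normalise(p) for p in train_prompts}
    return [p for p in eval_prompts if normalise(p) in train_keys]


train_prompts = ["How do I reset my password?", "What is your refund policy?"]
eval_prompts = ["how do I reset  my password?", "Can I change my email address?"]

leaked = find_contamination(train_prompts, eval_prompts)
print(leaked)  # ['how do I reset  my password?']
```

Any leaked items should be removed from one of the two sets before eval scores are trusted.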

Format drift pattern: first 10 tokens perfect JSON, then switches to prose. Root cause: training examples that start with format but then deviate. Fix: audit training examples for format consistency throughout, not just at the start.
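An audit of this kind can be approximated by parsing each training response in full rather than inspecting only its opening characters. The sketch assumes every response is supposed to be a single JSON value; the bucket names and the "starts with a brace or bracket" heuristic are choices made for this example.

```python
import json


def audit_json_responses(responses):
    """Sort response indices into valid JSON, drifted JSON, and non-JSON.

    A response counts as "drift" when it opens like JSON but does not
    parse as a whole, which is the pattern described above.
    """
    report = {"valid": [], "drift": [], "not_json": []}
    for index, response in enumerate(responses):
        text = response.strip()
        try:
            json.loads(text)  # fails on trailing prose as well as broken syntax
            report["valid"].append(index)
        except json.JSONDecodeError:
            bucket = "drift" if text[:1] in ("{", "[") else "not_json"
            report[bucket].append(index)
    return report


responses = [
    '{"status": "ok", "count": 3}',
    '{"status": "ok", "note": The order shipped yesterday and should arrive soon.',
    "The order shipped yesterday.",
]
print(audit_json_responses(responses))
# {'valid': [0], 'drift': [1], 'not_json': [2]}
```

The same check can be run on model outputs after training to measure how often drift still occurs.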