---
name: Fine-Tuning Assistant
slug: fine-tuning-assistant
description: Guide model fine-tuning processes for customized AI performance
category: ai-ml
complexity: advanced
version: "1.0.0"
author: "ID8Labs"
triggers:
- "fine-tune model"
- "fine-tuning"
- "customize LLM"
- "train custom model"
- "adapt model"
tags:
- fine-tuning
- training
- customization
- LLM
- machine-learning
---
# Fine-Tuning Assistant
The Fine-Tuning Assistant skill guides you through the process of adapting pre-trained models to your specific use case. Fine-tuning can dramatically improve model performance on specialized tasks, teach models your preferred style, and add capabilities that prompting alone cannot achieve.

This skill covers when to fine-tune versus prompt engineer, preparing training data, selecting base models, configuring training parameters, evaluating results, and deploying fine-tuned models. It applies modern techniques including LoRA, QLoRA, and instruction tuning to make fine-tuning practical and cost-effective.

Whether you are fine-tuning GPT models via API, running local training with open-source models, or using platforms like Hugging Face, this skill ensures you approach fine-tuning strategically and effectively.
## Core Workflows
### Workflow 1: Decide Whether to Fine-Tune
1. **Assess** the problem:
   - Can prompting achieve the goal?
   - Is the task format or style consistent?
   - Do you have quality training data?
   - Is this worth the investment?
2. **Compare** approaches:

   | Approach | When to Use | Investment |
   |----------|-------------|------------|
   | Better prompts | First attempt, variable tasks | Low |
   | Few-shot examples | Consistent format, limited data | Low |
   | RAG | Knowledge-intensive, dynamic data | Medium |
   | Fine-tuning | Consistent style, specialized task | High |

3. **Evaluate** requirements:
   - At least 100-1,000 high-quality examples, depending on task complexity
   - Clear evaluation criteria
   - Budget for training and hosting
4. **Decision**: fine-tune only if prompting and RAG prove insufficient
### Workflow 2: Prepare Fine-Tuning Dataset
1. **Collect** training examples:
   - Representative of the target use case
   - High quality (no errors in the outputs)
   - Diverse coverage of task variations
2. **Format** for training (in JSONL, each record must be a single line of JSON):

   ```jsonl
   {"messages": [{"role": "system", "content": "You are a helpful assistant..."}, {"role": "user", "content": "User input here"}, {"role": "assistant", "content": "Ideal response here"}]}
   ```

3. **Quality assurance**:
   - Review a sample of examples manually
   - Check for consistency in style and format
   - Remove duplicates and low-quality entries
4. **Split** into train/validation/test sets
5. **Validate** the dataset format (see the sketch below)
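A minimal sketch of steps 4-5, assuming the dataset is a list of `{"messages": [...]}` records loaded from a JSONL file (the 80/10/10 split and the `dataset.jsonl` filename are illustrative choices):

```python
import json
import random

def validate_record(record):
    """Check that one record has the expected chat structure."""
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    has_content = all(m.get("content") for m in messages)
    return "user" in roles and "assistant" in roles and has_content

def split_dataset(records, seed=42):
    """Shuffle valid records and split 80/10/10 into train/validation/test."""
    records = [r for r in records if validate_record(r)]
    random.Random(seed).shuffle(records)
    n = len(records)
    train_end, val_end = int(0.8 * n), int(0.9 * n)
    return records[:train_end], records[train_end:val_end], records[val_end:]

with open("dataset.jsonl") as f:
    records = [json.loads(line) for line in f]
train_set, val_set, test_set = split_dataset(records)
```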
### Workflow 3: Execute Fine-Tuning
1. **Select** a base model:
   - Consider the size vs. capability tradeoff
   - Match the model to the task complexity
   - Check licensing for your use case
2. **Configure** training:

   ```python
   # OpenAI fine-tuning
   training_config = {
       "model": "gpt-4o-mini-2024-07-18",
       "training_file": "file-xxx",
       "hyperparameters": {
           "n_epochs": 3,
           "batch_size": "auto",
           "learning_rate_multiplier": "auto"
       }
   }

   # LoRA fine-tuning (local)
   lora_config = {
       "r": 16,  # Rank
       "lora_alpha": 32,
       "lora_dropout": 0.05,
       "target_modules": ["q_proj", "v_proj"]
   }
   ```

3. **Monitor** training (see the end-to-end sketch after this list):
   - Watch the loss curves
   - Check for overfitting
   - Validate on the held-out set
4. **Evaluate** results:
   - Compare to the baseline model
   - Test on diverse inputs
   - Check for regressions
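A minimal end-to-end sketch of steps 2-3 against the OpenAI fine-tuning API, assuming the `openai` Python SDK (the `train.jsonl` filename and the polling interval are illustrative; events may repeat across polls):

```python
import time
from openai import OpenAI

client = OpenAI()

# Upload the prepared JSONL training file
training_file = client.files.create(
    file=open("train.jsonl", "rb"), purpose="fine-tune"
)

# Launch the fine-tuning job
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    hyperparameters={"n_epochs": 3},
)

# Poll until the job finishes, surfacing recent training events
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    for event in client.fine_tuning.jobs.list_events(job.id, limit=5).data:
        print(event.message)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status, job.fine_tuned_model)
```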
## Quick Reference
| Action | Command/Trigger |
|--------|-----------------|
| Decide approach | "Should I fine-tune for [task]" |
| Prepare data | "Format data for fine-tuning" |
| Choose model | "Which model to fine-tune for [task]" |
| Configure training | "Fine-tuning parameters for [goal]" |
| Evaluate results | "Evaluate fine-tuned model" |
| Debug training | "Fine-tuning loss not decreasing" |
## Best Practices
- **Start with Prompting**: Fine-tuning is expensive; exhaust cheaper options first
  - Can better prompts achieve 80% of the goal?
  - Try few-shot examples in the prompt
  - Consider RAG for knowledge tasks
- **Quality Over Quantity**: 100 excellent examples beat 10,000 mediocre ones
  - Each example should be a gold standard
  - Have humans verify examples wherever possible
  - Remove anything you wouldn't want the model to learn
- **Match Format to Use Case**: Training examples should mirror real usage
  - Same prompt structure as production
  - Realistic input variations
  - Cover edge cases explicitly
- **Don't Over-Train**: More epochs are not always better
  - Watch validation loss for overfitting
  - Start with 1-3 epochs
  - Stop early when validation loss plateaus (see the sketch below)
- **Evaluate Properly**: Training loss isn't the goal
  - Use a held-out test set
  - Compare to the baseline on the same tests
  - Check for capability regressions
  - Test edge cases explicitly
- **Version Everything**: Fine-tuning is iterative
  - Version your training data
  - Track experiment configurations
  - Document what worked and what didn't
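For local training runs, the early-stopping practice above can be expressed with Hugging Face `transformers` (a minimal sketch: `model`, `train_dataset`, and `val_dataset` are assumed to exist, and the patience value is an illustrative choice):

```python
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,
    eval_strategy="epoch",        # "evaluation_strategy" on older releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,      # lower validation loss is better
)

trainer = Trainer(
    model=model,                  # assumed: a model prepared for training
    args=args,
    train_dataset=train_dataset,  # assumed: tokenized train split
    eval_dataset=val_dataset,     # assumed: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```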
## Advanced Techniques
### LoRA (Low-Rank Adaptation)
Efficient fine-tuning for large models:
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                 # Rank of the update matrices
    lora_alpha=32,        # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to an already-loaded base model; only the adapters train
model = get_peft_model(base_model, lora_config)

# Typically well under 1% of parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}")
```
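As a rule of thumb, the rank `r` controls adapter capacity: small ranks (8-16) are usually enough for style and format adaptation, while larger behavioral changes may warrant 32-64. Because adapters are small, it is cheap to train a few ranks and compare them on the validation set.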
### QLoRA (Quantized LoRA)
Fine-tune large models on consumer hardware:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
    bnb_4bit_use_double_quant=True          # quantize the quantization constants
)

# Load the base model in 4-bit precision
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config
)

# Apply LoRA on top (lora_config as defined in the LoRA section above)
model = get_peft_model(model, lora_config)
```
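The design tradeoff: the frozen base weights sit in 4-bit NF4 (with double quantization shaving additional memory), while the LoRA adapters and activations compute in bfloat16, so the memory footprint drops dramatically at only a modest quality cost.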
### Instruction Tuning Dataset Creation
Convert raw data to instruction format:
```python
def create_instruction_example(raw_data):
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a customer service agent for TechCorp..."
            },
            {
                "role": "user",
                "content": f"Customer inquiry: {raw_data['inquiry']}"
            },
            {
                "role": "assistant",
                "content": raw_data['ideal_response']
            }
        ]
    }

# Apply to the whole dataset
instruction_dataset = [create_instruction_example(d) for d in raw_dataset]
```
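To persist this in the JSONL format from Workflow 2, serialize each record as a single line (a minimal sketch using only the standard library; the filename is illustrative):

```python
import json

with open("train.jsonl", "w") as f:
    for example in instruction_dataset:
        f.write(json.dumps(example) + "\n")
```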
### Evaluation Framework
Comprehensive assessment of fine-tuned models:
```python
import numpy as np

# check_correctness, matches_expected_format, style_similarity, and
# compare_general_capability are task-specific metrics you supply
def evaluate_fine_tuned_model(model, test_set, baseline_model=None):
    results = {
        "task_accuracy": [],
        "format_compliance": [],
        "style_match": [],
        "regression_check": []
    }
    for example in test_set:
        output = model.generate(example.input)
        # Task-specific accuracy
        results["task_accuracy"].append(
            check_correctness(output, example.expected)
        )
        # Format compliance
        results["format_compliance"].append(
            matches_expected_format(output)
        )
        # Style matching (for style transfer tasks)
        results["style_match"].append(
            style_similarity(output, example.expected)
        )
        # Regression on general capabilities
        if baseline_model:
            results["regression_check"].append(
                compare_general_capability(model, baseline_model, example)
            )
    # Skip empty metric lists (e.g. when no baseline model was provided)
    return {k: np.mean(v) for k, v in results.items() if v}
```
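A hypothetical call, assuming the `test_set` from the split earlier and model wrappers that expose `.generate`:

```python
scores = evaluate_fine_tuned_model(fine_tuned_model, test_set, baseline_model=base_model)
print(scores)  # e.g. {'task_accuracy': 0.91, 'format_compliance': 0.98, ...}
```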
### Curriculum Learning
Order training data by difficulty:
```python
def create_curriculum(dataset):
    # Score examples by complexity; score_complexity is a
    # task-specific heuristic you supply (e.g. input length)
    scored = [(score_complexity(ex), ex) for ex in dataset]
    scored.sort(key=lambda x: x[0])

    # Create epochs with increasing difficulty
    n = len(scored)
    curriculum = {
        "epoch_1": [ex for _, ex in scored[:n//3]],    # Easy
        "epoch_2": [ex for _, ex in scored[:2*n//3]],  # Easy + Medium
        "epoch_3": [ex for _, ex in scored],           # All
    }
    return curriculum
```
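Note that managed fine-tuning APIs generally control data ordering themselves; a curriculum like this mainly applies to local training loops where you feed each epoch's subset explicitly.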
## Common Pitfalls to Avoid
- Fine-tuning when better prompting would suffice
- Using low-quality or inconsistent training examples
- Not holding out a proper test set
- Training for too many epochs (overfitting)
- Ignoring capability regressions from fine-tuning
- Not versioning training data and configurations
- Expecting fine-tuning to add factual knowledge (use RAG instead)
- Fine-tuning on data that doesn't match production use