Fine-tuning a model — training it on your domain-specific data — promises a model that speaks your language, understands your formats, and generates outputs that fit your standards without lengthy prompts. The reality is more nuanced. Fine-tuning is genuinely the right choice in specific circumstances and the wrong choice in many more. This decision is worth making carefully, because getting it wrong is expensive.
What Prompt Engineering and Fine-Tuning Actually Do
Prompt engineering shapes model behaviour through the input: instructions, examples, context, and output format specifications. It requires no training infrastructure, costs nothing beyond inference, and can be iterated in minutes. The model's underlying capability and knowledge remain unchanged — you are steering a ship that is already built.
Fine-tuning modifies the model itself: you train an existing base model on examples specific to your domain, adjusting its weights to encode your patterns, style, and domain knowledge. The resulting model has the desired behaviour baked in — it costs less per call (shorter prompts), responds faster (no lengthy instructions), and is more consistent on the trained patterns.
When Prompt Engineering Is the Right Answer
Prompt engineering covers the majority of business AI use cases. Start here, always, and only move to fine-tuning when you have exhausted what prompting can achieve on your evaluation criteria.
- Your task is novel, experimental, or still evolving — fine-tuning a moving target is expensive
- You do not have 500+ high-quality labelled examples of desired output
- Response format or instructions change frequently
- Latency and cost constraints are manageable with prompt-based approaches
- The task requires general world knowledge that fine-tuning would dilute
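The prompt-based approach amounts to assembling instructions, few-shot examples, and a format specification into a single input. A minimal sketch of that assembly is below; the function and field names are illustrative, not any particular library's API.

```python
def build_prompt(instructions, examples, output_format, user_input):
    """Assemble a prompt from instructions, few-shot examples, and a format spec."""
    parts = [instructions, f"Output format: {output_format}"]
    # Few-shot examples demonstrate the desired behaviour without any training.
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    instructions="Classify the support ticket as 'billing', 'technical', or 'other'.",
    examples=[("I was charged twice this month.", "billing")],
    output_format="a single lowercase category name",
    user_input="The app crashes when I open settings.",
)
```

Every component here can be changed and re-tested in minutes, which is exactly why prompting suits tasks that are still evolving.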
When Fine-Tuning Genuinely Pays Off
Fine-tuning makes sense when three conditions are met simultaneously: you have a stable, high-volume task; you have high-quality training data; and you have the infrastructure to evaluate, maintain, and retrain the model as needed.
- High-volume inference where per-call cost reduction justifies training cost (millions of calls per month)
- Strict output format requirements that prompts struggle to enforce consistently
- Highly domain-specific language or knowledge not well covered by base model training data
- Latency is critical and shorter prompts meaningfully improve response time
- Consistency is paramount — fine-tuned models show less variance than prompted models on trained patterns
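The cost argument in the first bullet is simple arithmetic: a one-off training cost is repaid by per-call savings from the shorter prompt. A back-of-envelope sketch, with purely illustrative numbers (the prices and token counts below are assumptions, not quotes):

```python
def breakeven_calls(training_cost, prompt_tokens_saved, price_per_1k_tokens):
    """Number of calls at which per-call prompt savings repay the one-off training cost."""
    savings_per_call = prompt_tokens_saved / 1000 * price_per_1k_tokens
    return training_cost / savings_per_call

# Illustrative assumptions: a $2,000 training run, 1,200 fewer prompt
# tokens per call, at $0.002 per 1K input tokens.
calls = breakeven_calls(2000, 1200, 0.002)
```

Under these assumptions the break-even point is roughly 833,000 calls — comfortably reached within a month at the "millions of calls per month" volume the bullet describes, and never reached at a few thousand calls.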
“Fine-tuning is not an upgrade. It is a specialisation trade-off: you gain consistency on trained patterns and lose flexibility on everything else.”
The Data Requirement Most Teams Underestimate
The most common reason fine-tuning fails to deliver: insufficient or low-quality training data. Modern fine-tuning (LoRA, QLoRA) can work with as few as 100 examples for format adaptation, but producing meaningfully better outputs on complex tasks typically requires 500-5,000 high-quality labelled examples.
High-quality means three things: inputs that cover the full distribution of real-world inputs; outputs that a domain expert would judge correct, not merely plausible; and coverage of edge cases and difficult examples, not just the easy representative ones. Collecting this data is usually the most time-consuming part of a fine-tuning project.
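Basic sanity checks on a candidate dataset catch the most common failures (too few examples, duplicated inputs, missing categories) before any training cost is incurred. A minimal sketch, assuming examples are (input, output, category) tuples — the function name and thresholds are illustrative:

```python
def check_training_data(examples, min_count=500, required_categories=()):
    """Run basic sanity checks on (input, output, category) training examples.

    Returns a list of human-readable problems; an empty list means checks passed.
    """
    problems = []
    if len(examples) < min_count:
        problems.append(f"only {len(examples)} examples; need at least {min_count}")
    # Duplicate inputs inflate apparent dataset size without adding coverage.
    inputs = [inp for inp, _out, _cat in examples]
    if len(set(inputs)) < len(inputs):
        problems.append("duplicate inputs found")
    # Every category the model must handle needs at least one example.
    seen = {cat for _inp, _out, cat in examples}
    missing = set(required_categories) - seen
    if missing:
        problems.append(f"no examples for categories: {sorted(missing)}")
    return problems

issues = check_training_data(
    [("I was charged twice.", "billing", "billing")],
    min_count=500,
    required_categories=("billing", "technical", "other"),
)
```

Checks like these are cheap to run and catch only the obvious problems; the expert judgement of output correctness described above still has to be done by hand.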