Fine-Tuning vs. Prompting: When to Use Each

When building AI-powered applications, one of the first decisions is whether to fine-tune a base model or rely on prompting (including few-shot examples and system instructions) to get the desired behavior. The answer depends on what you are actually trying to achieve.

What each approach does

Prompting shapes the model's behavior at inference time by including instructions, context, and examples in the input. The model's weights — what it learned during training — do not change. What changes is the context it uses to generate a response.

Fine-tuning adjusts the model's weights by continuing to train it on a dataset of examples relevant to your task. After fine-tuning, the model's behavior is changed at a fundamental level, not just through context.

When prompting is sufficient

Prompting is the right starting point in almost every case. It requires no training infrastructure, no labeled dataset, and no time waiting for a training run to complete. It can be iterated on in minutes.

Prompting works well when:

The desired behavior is expressible in natural language instructions
A few examples in the prompt are sufficient to demonstrate the expected format or style
You need to change behavior frequently (fine-tuned models are static)
The task is within the model's existing capabilities and just needs direction

The practical message: try prompting first. Many teams spend weeks on fine-tuning projects that would have been better addressed with improved prompts.

When fine-tuning genuinely helps

Fine-tuning adds value when prompting has genuine limitations:

Consistent style and format: If you need the model to consistently produce output in a very specific format — a particular JSON schema, a specific writing style, a domain-specific response pattern — fine-tuning can bake this in more reliably than prompting. With prompting, the model may occasionally deviate from the format.

Token efficiency: If you need to include extensive examples or instructions in every prompt, fine-tuning can internalize those examples, reducing input length and therefore cost and latency.

Domain-specific knowledge: If your task involves terminology, conventions, or patterns that are underrepresented in the base model's training — specialized medical documentation, internal company jargon, a niche technical domain — fine-tuning on domain-specific examples can improve performance.

Behavior change: Some behaviors are difficult to reliably elicit through prompting. Fine-tuning can change how the model responds to a class of inputs at a more fundamental level.

What fine-tuning does not fix

A common misconception: fine-tuning can teach a model new facts. It does not do this reliably. Fine-tuning changes the model's behavior patterns, not its factual knowledge. If you want the model to accurately answer questions about your company's product catalog, fine-tuning on QA examples is less reliable than RAG (providing the catalog as context at inference time).

Fine-tuning also does not fix:

Reasoning limitations — if the model cannot solve a class of problems with prompting, fine-tuning usually does not change this
Knowledge cutoff limitations
Hallucination tendencies (though it can be reduced somewhat)

PEFT and LoRA: practical fine-tuning

Full fine-tuning — adjusting all of a model's billions of parameters — is computationally expensive and requires significant GPU resources. Parameter-efficient fine-tuning (PEFT) methods, particularly LoRA (Low-Rank Adaptation), have made fine-tuning much more accessible.

LoRA trains a small number of additional parameters that modify the model's behavior, rather than adjusting all existing parameters. The resulting LoRA adapter is small (often a few hundred MB), can be swapped on top of a base model, and achieves comparable results to full fine-tuning for many tasks at a fraction of the compute cost.

Building a fine-tuning dataset

If you decide to fine-tune, the quality of your training data matters far more than the quantity. A few hundred high-quality, diverse examples often outperform thousands of mediocre ones.

Good fine-tuning data:

Shows the model exactly the input/output pairs you want it to learn
Is diverse enough to cover the range of inputs it will see in production
Has consistent formatting and quality throughout
Does not include examples of behavior you want to avoid

Generating fine-tuning data from a stronger model (using GPT-4 or Claude to generate training data for a smaller model) is a common and often effective approach.

The practical path

Start with prompting and iterate
If prompting does not achieve the required quality, identify specifically what is failing
If the failure is behavioral/stylistic and addressable with examples, consider fine-tuning
If the failure is factual/knowledge-based, consider RAG
Use LoRA or similar PEFT methods to reduce compute requirements
Evaluate fine-tuned models carefully on a held-out test set

Summary

Prompting is almost always the right starting point — it is fast, flexible, and requires no training infrastructure. Fine-tuning adds value when you need consistent behavior at scale, have genuine domain adaptation needs, or want to internalize extensive example patterns. It does not reliably inject new factual knowledge. LoRA makes fine-tuning more accessible by training only a small number of adapter parameters rather than all model weights.

Fine-Tuning vs. Prompting: When to Use Each

What each approach does

When prompting is sufficient

When fine-tuning genuinely helps

What fine-tuning does not fix

PEFT and LoRA: practical fine-tuning

Building a fine-tuning dataset

The practical path

Summary

More Intelligence

Prompt Engineering: A Practical Guide

Building AI-Powered Applications: A Practical Framework

How to Evaluate Language Model Outputs