October 27, 2025
Low-KL divergence prompt mutations: better performance at a fraction of the cost.
Authors: Harry Partridge (Parsed) and Charles O'Neill (Parsed)
TLDR: Low KL-divergence prompt mutations lead to better performance at a fraction of the cost.
In production use cases, LLMs are often required to perform the same task numerous times. This usually involves a prompt template - a set of predefined instructions and requirements used repeatedly in conjunction with a small set of variable inputs. For example, an LLM might be used to transform documents from one format into another, adhering to variable instructions specifying the desired structure, length and style.
In this kind of high-volume, repeatable use case, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) can be employed to optimise a small model to perform the specific task efficiently. However, two interlinked issues often arise:
Prompt brittleness: Since the same set of predefined instructions is present in the input for every example, the fine-tuned model can become brittle, and significant performance degradation may be observed from small changes to these instructions.
Cost:
For vanilla SFT, the amount of data required to achieve a desired performance threshold may be prohibitively expensive to collect.
For RL, producing and grading rollouts can become very expensive, particularly when expensive LLM-as-a-judge rewards are used to grade each individual rollout.
On-policy distillation is more cost-efficient than RL since a denser reward signal may be collected from each example, but performing numerous rollouts and grading them with a large teacher model is still expensive. In addition, the teacher model must be from the same family (or at least use the same tokeniser) as the student model.
Presupposing that you start with an SFT dataset consisting of inputs and gold standard outputs (constructed using human labels, distillation from a teacher model, context distillation or iterative refinement), a simple yet surprisingly effective strategy to address these issues - sketched in code below - is to:
Isolate the static prompt template instructions that are shared across examples.
Perform precise mutations that alter only the phrasing, structure and syntax of the prompt template, without affecting the meaning or intent behind the prompt.
Re-insert the variable components of the task into these mutated instructions and combine with the same gold standard outputs.
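To make the recipe concrete, here is a minimal sketch of the pipeline. The template text, field names and the `mutate_template` helper are illustrative assumptions rather than our production implementation; in practice the mutation step is an LLM call that rephrases the static instructions while leaving every placeholder intact.

```python
from dataclasses import dataclass

# 1. The static instructions, with placeholders for the variable inputs V_i.
#    (The template text itself is made up for illustration.)
STATIC_TEMPLATE = (
    "Rewrite the document below as a {style} summary of at most {max_words} words.\n"
    "Preserve all factual claims and quote figures exactly as written.\n\n"
    "Document:\n{document}"
)


@dataclass
class Example:
    variables: dict    # the variable inputs V_i (here: style, max_words, document)
    gold_output: str   # the gold standard output Y_i, collected once under the original template


def mutate_template(template: str) -> str:
    """2. Return a paraphrase of the static instructions that changes phrasing,
    structure and syntax but not meaning. In practice this is an LLM call along
    the lines of 'rewrite these instructions without changing their intent, and
    keep every {placeholder} intact'."""
    raise NotImplementedError


def expand_dataset(examples: list[Example], n_mutations: int) -> list[dict]:
    """3. Pair every gold output with the original and each mutated template."""
    templates = [STATIC_TEMPLATE] + [mutate_template(STATIC_TEMPLATE) for _ in range(n_mutations)]
    rows = []
    for template in templates:
        for ex in examples:
            rows.append(
                {
                    "prompt": template.format(**ex.variables),  # re-insert the variable components
                    "completion": ex.gold_output,               # reuse the same gold standard output
                }
            )
    return rows
```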
This process multiplies the size of the dataset severalfold without having to collect any more gold standard outputs. It also ensures that the model does not over-index on the idiosyncratic specifics of the initial prompt template. Indeed, repeatedly using the same prompt template means that the training process exposes only a particular surface of the weight manifold. Exposing the model to different variations of the same latent intent forces it to generalise, and it therefore extrapolates better to unseen inputs.
In practice, when someone writes down the desired behaviour of their model in a prompt template, they are searching for a conditional context C in which the model will respond appropriately to the variable inputs V_i; they care about the specific details of their prompt only insofar as that prompt induces the desired behaviour. However, if the optimal policy is going to be trained into the model by using gold standard outputs Y_i, then why is it even necessary to precondition with C at all? Why not simply train on pairs (Y_i, V_i)? The issue with this approach is that it is severely off-policy: the base model is extremely unlikely to produce Y_i from V_i without the context C.
It has been demonstrated that KL divergence between a model’s policy distribution and the training dataset is the key predictor of catastrophic forgetting. Since the KL divergence of the base model with the raw input-output dataset (Y_i, V_i) is extremely high, this will result in a model that performs very poorly when shifted even slightly out of distribution. This is why it is necessary to train on the combined (V_i, C) dataset, even though the desired behaviour is embedded in the outputs themselves.
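Concretely, with $p_{\text{data}}(\cdot \mid V_i)$ as our shorthand for the distribution that produced the gold standard outputs $Y_i$, the claim above amounts to

$$
D_{\mathrm{KL}}\big(p_{\text{data}}(\cdot \mid V_i) \,\|\, \pi_{\text{base}}(\cdot \mid V_i)\big) \;\gg\; D_{\mathrm{KL}}\big(p_{\text{data}}(\cdot \mid V_i) \,\|\, \pi_{\text{base}}(\cdot \mid C, V_i)\big),
$$

i.e. prepending C is what keeps the training data close to the base model's own distribution and protects against catastrophic forgetting.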
However, we can do even better than just (V_i, C). In fact, any context C' which induces a low KL divergence between the base model's conditional distribution $\pi_{\text{base}}(\cdot \mid C', V_i)$ and the training distribution over the gold outputs $Y_i$ will result in a robustly trained model. We can measure this KL divergence for a given candidate prompt mutation and select only the lowest-divergence mutations. This is something we're actively looking into.
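As a rough sketch of how that selection could work: treating the data distribution as a point mass on the gold outputs, the KL term reduces (up to a constant) to the negative log-likelihood of Y_i under the base model conditioned on (C', V_i), which is cheap to measure. The checkpoint name and helper functions below are illustrative, not our production code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-3-1b-it"  # illustrative stand-in for the actual base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()


@torch.no_grad()
def mean_output_nll(prompt: str, gold_output: str) -> float:
    """Average per-token NLL of the gold output Y_i conditioned on the prompt (C', V_i)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + gold_output, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the output tokens
    out = model(full_ids.to(model.device), labels=labels.to(model.device))
    return out.loss.item()


def rank_mutations(candidate_templates: list[str], dataset: list[tuple[dict, str]]) -> list[tuple[str, float]]:
    """Rank candidate contexts C' by how surprising they make the gold outputs; keep the lowest."""
    scored = []
    for template in candidate_templates:
        nlls = [mean_output_nll(template.format(**variables), gold) for variables, gold in dataset]
        scored.append((template, sum(nlls) / len(nlls)))
    return sorted(scored, key=lambda pair: pair[1])
```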
Parsed uses iterative refinement to construct optimal outputs for a given input example. This process leverages a significant amount of computation in running evaluations and refinements, and may therefore cost thousands of dollars for a large dataset. For only a handful of dollars, prompt mutations can be used to double or triple the size of this dataset with no performance reduction, saving thousands of dollars relative to further refinement.
Furthermore, for customers with limited data, prompt mutations give us an easy way to increase the size of the training dataset, resulting in much stronger performance than would otherwise have been possible.

This customer started with a limited dataset of 5,000 samples. Performing only a single prompt mutation allowed us to double the size of the dataset to 10k datapoints and resulted in above-trend performance virtually for free. All models were trained for 2 epochs from a Gemma-3-27B-it baseline. The Likert score represents the sum of scores from an ensemble of binary PASS/FAIL LLM-as-a-judge evaluators.
Prompt mutations can also be used in conjunction with on-policy training strategies such as on-policy distillation or RL. In this setting, it is no longer necessary to select low KL-divergence mutations, since the outputs are generated online. Instead, prompt mutations provide a simple way to broaden the training task and ensure that trained models are less fragile to small changes in the input task. Again, this is something we’re experimenting with.
Instead of merely sampling for low-KL-divergence prompts, it is possible to directly optimise a KV cache C' that minimises the KL divergence between the base model's conditional distribution $\pi_{\text{base}}(\cdot \mid C', V_i)$ and the training distribution over the gold outputs $Y_i$. We hypothesise that training with such a context will result in even more robustly trained models.
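A speculative sketch of what that optimisation could look like, using a learned soft prefix of input embeddings as a stand-in for the KV cache C' and, under the same point-mass assumption as above, minimising the negative log-likelihood of the gold outputs. The prefix length, learning rate and checkpoint are all illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-3-1b-it"  # illustrative stand-in
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16).to(DEVICE)
for p in model.parameters():
    p.requires_grad_(False)  # the base model stays frozen; only the prefix is optimised

embed = model.get_input_embeddings()
N_PREFIX = 64  # number of virtual "instruction" tokens standing in for C'
soft_prefix = torch.nn.Parameter(
    0.02 * torch.randn(1, N_PREFIX, embed.embedding_dim, device=DEVICE, dtype=embed.weight.dtype)
)
optimiser = torch.optim.Adam([soft_prefix], lr=1e-3)


def prefix_step(variable_input: str, gold_output: str) -> float:
    """One gradient step pushing the model to produce Y_i given only (soft prefix, V_i)."""
    v_ids = tokenizer(variable_input, return_tensors="pt").input_ids.to(DEVICE)
    y_ids = tokenizer(gold_output, add_special_tokens=False, return_tensors="pt").input_ids.to(DEVICE)
    ids = torch.cat([v_ids, y_ids], dim=1)

    inputs_embeds = torch.cat([soft_prefix, embed(ids)], dim=1)
    labels = torch.cat(
        [torch.full((1, N_PREFIX + v_ids.shape[1]), -100, device=DEVICE), y_ids], dim=1
    )  # loss only on the output tokens

    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```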
We are also interested in exploring the scaling laws of prompt mutations; we would like to have a better characterisation of the marginal benefit of the nth prompt mutation.