September 8, 2025
Your MLEs are brilliant, but you’re giving them the wrong job.
Authors: Charles O'Neill (Parsed), Mudith Jayasekara (Parsed)
If you’re hiring an engineering team at an AI-native company in domains like healthcare, insurance, or legaltech, you’ve probably fallen into a trap we see everywhere: assuming your MLEs can double as LLM engineers.
The problem isn’t that they aren’t smart enough. It’s that you’re getting them to do the wrong job. Their job isn’t to reinvent frontier model training inside your startup. Their job is to help get you to product-market fit.
This doesn’t mean MLEs don’t play a crucial role. In fact, they’re often the best people to help you prototype in the early days.
They can wire up APIs and run small-scale experiments to get you to your first demos. This phase is all about speed, iteration, and exploring whether there’s a real problem-solution fit. MLEs shine here because they’re adaptable, quantitative, and can hack together a proof of concept quickly.
But once you’ve validated the problem and see early signs of product-market fit, the game changes. At that point, you need to shift from hacking together demos to running scalable, reliable post-training loops. And that’s not something you can, or should, expect your MLEs to own. That’s where Parsed comes in.
It's a fallacy to equate the skills that make someone a good traditional MLE with those required to run an LLM-based engineering team. A PhD in neural tangent kernels or a string of Kaggle competition wins is orthogonal to what you actually need.
Even if you're just calling closed-source models through APIs (before we even touch training and inference), LLMs demand a fundamentally different kind of intelligence. Understanding how to scaffold and prompt your way to performance isn't about gradient-boosted models. It's about something fuzzier, more qualitative.
Writing LLM-as-a-judge evaluation harnesses requires specific intuition that doesn't map neatly onto traditional metrics of quantitative intelligence. (This is why it feels misguided to me when companies like Cognition boast about hiring IMO winners.) Solving International Math Olympiad problems is impressive, but it's a completely different skillset from understanding, on an intuitive level, how LLMs process tokens, and how the subtleties of English as a conditioning mechanism lead to those subtle, pernicious behaviours we all know and love so well.
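For concreteness, here is roughly what the skeleton of such a harness looks like. The rubric, the gpt-4o judge, and the clinical-summary framing are illustrative assumptions rather than a recommended setup; the hard part is everything the rubric leaves unsaid.

```python
# Minimal sketch of an LLM-as-a-judge harness. The rubric, judge model, and
# task framing are illustrative assumptions, not a prescription.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a clinical-note summary.
Score the summary from 1 to 5 against this rubric:
- Faithfulness: no facts that are absent from the source note.
- Coverage: all clinically relevant findings are mentioned.
- Format: concise prose with no hedging filler.
Return JSON: {{"score": <1-5>, "rationale": "<one sentence>"}}

Source note:
{source}

Summary:
{summary}
"""

def judge(source: str, summary: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; swap in whatever you actually trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    # A real harness needs robust parsing / structured outputs; json.loads will
    # fail the moment the judge wraps its answer in markdown fences.
    return json.loads(resp.choices[0].message.content)
```

Even this toy version hides the real work: deciding what "faithfulness" means for your documents, and checking that the judge agrees with the people who actually know.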
"But surely," you might think, "training and ML Ops at scale—that's where my traditional MLE shines?"
Not quite.
Training LLMs isn't the neat train-test-split and hyperparameter search that Kaggle gurus cut their teeth on. A huge part of successfully training or finetuning a language model comes down to data curation. And this, again, requires that qualitative intelligence: the ability to deeply understand a problem and the requirements imposed by defining it in natural language rather than deterministic code.
When you're finetuning on outputs from a bigger model and the distillation gap leaves you disappointed after training qwen-32b, where do you go? There's a massive creative component to generating better data. And reinforcement learning is a whole different beast. How do you write evaluations that go beyond regexes and deterministic Python checks to actually serve as useful reward functions? How do you even measure whether your LLM-as-a-judge is measuring what you want, whether it's aligned with domain experts, whether it's well-defined enough for RL post-training?
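One crude but concrete way to start answering that last question is to collect a small set of expert-rated outputs and check how closely the judge tracks them. A sketch, with made-up ratings standing in for real expert labels:

```python
# Illustrative check of judge/expert alignment on a small labelled set.
# `expert_scores` and `judge_scores` are assumed to be parallel 1-5 ratings of
# the same outputs; the numbers below are placeholders, not real data.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

expert_scores = [5, 4, 2, 5, 3, 1, 4, 4, 2, 5]   # blinded domain-expert ratings
judge_scores  = [5, 4, 3, 5, 3, 2, 4, 5, 2, 5]   # LLM-as-a-judge ratings of the same outputs

exact_agreement = sum(e == j for e, j in zip(expert_scores, judge_scores)) / len(expert_scores)
kappa = cohen_kappa_score(expert_scores, judge_scores, weights="quadratic")
rho, _ = spearmanr(expert_scores, judge_scores)

print(f"exact agreement: {exact_agreement:.2f}, quadratic kappa: {kappa:.2f}, spearman rho: {rho:.2f}")

# A judge you would trust as a reward signal should clear whatever bar you set
# here before it goes anywhere near an RL run; if it can't, it's the problem
# definition, not the training code, that needs work.
```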
LLM training is fundamentally harder than spinning up a model in Sklearn. Models are many orders of magnitude bigger. Training requires knowledge of parallelisation, sharding, and distributed systems in ways it never did before. Expecting your MLE to skill up on all this while simultaneously learning a fundamentally different paradigm of how to interact with and control LLMs is a massive ask. You can't just lump an MLE into the category of "prompt engineer + hardware guru + distributed systems expert + deep transformer guru" when their last job involved training CNNs on OCT scans.
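For a flavour of what "knowledge of parallelisation and sharding" means in practice, here is a minimal sketch using PyTorch's FullyShardedDataParallel; the toy module and dummy objective are stand-ins, and it assumes a multi-GPU node launched with torchrun.

```python
# Minimal FSDP sketch; assumes a launch like `torchrun --nproc_per_node=8 train.py`.
# The module and objective are toy stand-ins for a multi-billion-parameter
# transformer and a real loss.
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=24,
    ).cuda()

    # FSDP shards parameters, gradients, and optimiser state across ranks so no
    # single GPU holds the whole model; real runs also need an auto-wrap policy,
    # mixed precision, activation checkpointing, and sharded checkpointing.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for _ in range(10):
        x = torch.randn(2, 128, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective, just to drive the sharded backward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

And this is the easy, well-documented end of the problem, before any of the task-specific questions above even come into play.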
Your team’s scarce cycles should go into defining the problem — not solving the engineering puzzle of training frontier-scale models. Parsed’s role is to absorb that engineering burden and return production-ready models that reflect your definitions.
Perhaps most critically, LLM engineering demands social and personable intelligence in ways traditional ML never did.
Tasks don't exist in isolated boxes anymore. They're not completely defined by the features and labels you train on. Most LLM tasks are icebergs, with massive underwater components of domain knowledge and implicit context only available to teams expert in your specific vertical.
LLM engineers need to be socially intelligent enough not just to grasp this domain context, but to engage productively with domain experts and end users. They need to hold complex, nuanced discussions with stakeholders and distill all that messy, implicit information into evaluations, generation prompts, and training data.
Parsed takes those definitions and runs the heavy machinery: post-training, evaluation harnessing, RL, infrastructure.
Perhaps the most salient issue is that the frontier of LLM research moves so quickly that it's impossible for engineers hired for specific vertical use cases to also position themselves on the rapidly expanding frontier.
No offence, but you're not getting Big Lab-level engineering quality at your app layer company. There's a reason the best talent goes to the big labs: They're building AGI, which is sexy. They get to work on fundamental language model problems. They pay better. They don't have to optimise a system for medical record data summarisation and JSON formatting.
To expect these MLEs to perform at a level where they can post-train the way big labs do, managing these systems at scale, is crazy. Everything is so new that the best commodity to hire for is raw intelligence and horsepower. Experience means less as things move faster. A PhD in convex optimisation doesn’t really give you any relevant experience in spinning up a GRPO run for the first time.
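For a sense of what "spinning up a GRPO run" looks like at the toy end, here is a sketch built on Hugging Face TRL's GRPOTrainer. The model, dataset, and length-based reward are placeholders; a real run would plug in a validated, task-specific reward and far more infrastructure.

```python
# Toy GRPO run using TRL's GRPOTrainer. Model, dataset, and reward are
# placeholders; a production run would use a task-specific dataset and a
# carefully validated reward (e.g. a judge like the one sketched earlier).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Deliberately dumb reward: prefer completions near 200 characters.
    # Writing a reward that actually reflects task quality is the hard part.
    return [-abs(200 - len(c)) / 200 for c in completions]

args = GRPOConfig(
    output_dir="grpo-demo",
    per_device_train_batch_size=4,
    num_generations=4,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

The dozen lines above are the easy part; choosing the reward, the data, and the evaluation that tells you whether the run actually helped is where the work lives.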
All of this strengthens the argument for specialisation and buying versus building.
We knew this when we started Parsed, and every interaction with teams in these verticals has only strengthened our conviction. The most self-aware companies understand this: They want Big Lab expertise packaged into models and tools their engineers can actually use productively. Sure, some are happy calling closed-source APIs forever. But anyone who wants a cheaper, faster, and most importantly, better model optimised for their specific task will eventually have to bite the bullet on post-training.
That's not to say tools and frameworks won't emerge that allow in-house engineering teams to actively participate in this process. Indeed, that's literally what we're building Parsed to be. We want to commoditise Big Lab talent so anyone can access it as a nicely packaged product.
But just putting a UI around a finetuning service isn't the answer. Yes, writing finetuning code is difficult, and automating it solves that problem. But it's only a tiny piece. The hard part about LLMs is that optimising them requires understanding everything end-to-end: from your generation prompt to your LLM-as-a-judge, through the complex mechanics of RL and SFT, and how to properly integrate all the feedback signals into a system that actually improves at what you want it to improve at.
If after all this, you still trust your MLEs to achieve what you want, then good for you. But you wouldn't build your own server rack with a team of software engineers who only know Java and API development. So why would LLMs be any different?