August 26, 2025

A letter to the C-suite: think carefully before hiring MLEs

MLE ≠ LLM engineer: different skills, different instincts, different game.

Authors: Charles O'Neill (Parsed), Mudith Jayasekara (Parsed)

If you're hiring an engineering team at an AI-native company, particularly in a specific vertical, I'm willing to bet you've fallen into a trap we've seen everywhere: You hire a machine learning engineer (MLE) and assume they're equivalent to an LLM engineer.

The intelligence mismatch

It's a fallacy to equate the skills that make someone a good traditional MLE with those required to run an LLM-based engineering team. A PhD in neural tangent kernels and those Kaggle competition wins are orthogonal to what you actually need.

Even if you're just calling closed-source models through APIs (before we even touch training and inference), LLMs demand a fundamentally different kind of intelligence. Understanding how to scaffold and prompt your way to performance isn't about gradient-boosted models. It's about something fuzzier, more qualitative.

Writing LLM-as-a-judge evaluation harnesses requires a specific intuition that doesn't map neatly onto traditional metrics of quantitative intelligence. (This is why it feels misguided to me when companies like Cognition boast about hiring IMO winners.) Solving International Math Olympiad problems is impressive, but it's a completely different skillset from understanding, on an intuitive level, how LLMs process tokens, and how the subtleties of English as a conditioning mechanism lead to those subtle, pernicious behaviours we all know and love so well.
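
To make the skill concrete: the skeleton of an LLM-as-a-judge harness is almost trivially small. Here's a minimal sketch, assuming the OpenAI Python client; the judge model, rubric wording, and 1-5 scale are placeholder choices, not a recipe.

```python
# Minimal LLM-as-a-judge sketch. Assumes the `openai` Python client; the judge
# model, rubric wording, and 1-5 scale are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading a clinical summary for faithfulness to the source note.
Score 1-5, where 5 means every claim is supported by the note and nothing important is missing.
Return JSON: {"score": <int>, "reason": "<one sentence>"}"""

def judge(source_note: str, summary: str) -> dict:
    """Return {"score": int, "reason": str} for a single (note, summary) pair."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"NOTE:\n{source_note}\n\nSUMMARY:\n{summary}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```

The plumbing is the easy part. The few lines of rubric, and the work of checking that the judge's scores track what your domain experts actually care about, is where the fuzzier intelligence goes.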

The training trap

"But surely," you might think, "training and ML Ops at scale—that's where my traditional MLE shines?"

Not quite.

Training LLMs isn't the neat train/test split and hyperparameter search that Kaggle gurus cut their teeth on. A huge part of successfully training or finetuning a language model comes down to data curation. And this, again, requires that qualitative intelligence: the ability to deeply understand a problem and the requirements imposed by defining it in natural language rather than deterministic code.

When you're finetuning on outputs from a bigger model and the distillation gap leaves you disappointed after training qwen-32b, where do you go? There's a massive creative component to generating better data. And reinforcement learning is a whole different beast. How do you write evaluations that go beyond regexes and deterministic Python checks to actually serve as useful reward functions? How do you even measure whether your LLM-as-a-judge is measuring what you want, whether it's aligned with domain experts, whether it's well-defined enough for RL post-training?
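
On that last question, the unglamorous answer is that you measure the judge itself before you let it hand out rewards. A minimal sketch, assuming you've collected a small set of expert pass/fail labels on the same outputs the judge scored; the 1-5 to pass/fail cut-off and the kappa rule of thumb are illustrative assumptions.

```python
# Sketch: check an LLM judge against expert labels before trusting it as a
# reward signal. The parallel lists, the 1-5 -> pass/fail cut at 4, and the
# kappa rule of thumb are all illustrative assumptions.

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

expert_labels = [1, 1, 0, 1, 0, 0, 1, 1]   # 1 = domain expert says acceptable
judge_scores  = [5, 4, 2, 5, 3, 1, 4, 2]   # raw 1-5 scores from the judge
judge_labels  = [int(s >= 4) for s in judge_scores]

print(f"judge-expert kappa: {cohens_kappa(expert_labels, judge_labels):.2f}")
# rule of thumb: if kappa is low, fix the rubric or the task definition
# before using the judge as a reward function
```

If agreement is poor, the fix is usually rewriting the rubric or sharpening the task definition, not swapping in a bigger judge model.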

LLM training is fundamentally harder than spinning up a model in Sklearn. Models are many orders of magnitude bigger. Training requires knowledge of parallelisation, sharding, and distributed systems in ways it never did before. Expecting your MLE to skill up on all this while simultaneously learning a fundamentally different paradigm of how to interact with and control LLMs is a massive ask. You can't just lump an MLE into the category of "prompt engineer + hardware guru + distributed systems expert + deep transformer guru" when their last job involved training CNNs on OCT scans.
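
For a sense of the gap, here's roughly the entry fee for multi-GPU training: process groups, device placement, and sharded parameters before a single useful gradient step. A toy sketch with PyTorch FSDP and a tiny stand-in model; a real pretraining or finetuning stack layers mixed precision, checkpointing, and data pipelines on top.

```python
# Toy sketch of the table stakes for multi-GPU LLM training: process groups and
# sharded parameters via PyTorch FSDP. The tiny stand-in model and dummy loss
# are illustrative only.
# Launch with: torchrun --nproc_per_node=8 train_sketch.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # stand-in for a real LLM: a small transformer encoder
    layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
    model = nn.TransformerEncoder(layer, num_layers=12).cuda()

    # shard parameters, gradients, and optimiser state across ranks
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        x = torch.randn(4, 512, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective, stands in for next-token loss
        loss.backward()
        optim.step()
        optim.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```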

The social intelligence factor

Perhaps most critically, LLM engineering demands social and personable intelligence in ways traditional ML never did.

Tasks don't exist in isolated boxes anymore. They're not completely defined by the features and labels you train on. Most LLM tasks are icebergs, with massive underwater components of domain knowledge and implicit context only available to teams expert in your specific vertical.

LLM engineers need to be socially intelligent enough not just to grasp this domain context, but to engage productively with domain experts and end users. They need to hold complex, nuanced discussions with stakeholders and distill all that messy, implicit information into evaluations, generation prompts, and training data. Your computer vision PhD who's spent the last five years optimising convolutions might be brilliant, but this doesn’t mean they can navigate these human complexities (arguably, some of the brilliance is anti-correlated with this skillset).

The frontier problem

Perhaps the most salient issue is that the frontier of LLM research moves so quickly that it's impossible for engineers hired for specific vertical use cases to also position themselves on the rapidly expanding frontier.

No offence, but you're not getting Big Lab-level engineering quality at your AI scribe startup. There's a reason the best talent goes to the big labs: They're building AGI, which is sexy. They get to work on fundamental language model problems. They pay better. They don't have to optimise your system for EMR data summarisation and JSON formatting.

To expect these MLEs to perform at a level where they can post-train the way big labs do, managing these systems at scale, is ludicrous. Everything is so new that the best commodity to hire for is raw intelligence and horsepower. Experience means less as things move faster. A PhD in convex optimisation doesn’t really give you any relevant experience in spinning up a GRPO run for the first time.
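
To be fair, spinning up the run itself isn't the hard part. A sketch of a first GRPO run, assuming Hugging Face's TRL library (exact API varies by version), with a placeholder dataset and base model and a deliberately naive length-based reward; everything difficult lives in replacing that reward with something your domain experts would sign off on.

```python
# Sketch of a first GRPO run using TRL's GRPOTrainer (API details may differ
# across versions). Dataset and base model are placeholders, and the length
# reward is deliberately naive: in practice this is where the judge-backed,
# expert-aligned reward work goes.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

def reward_brevity(completions, **kwargs):
    # toy reward: prefer completions near 200 characters
    return [-abs(200 - len(c)) / 200 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder base model
    reward_funcs=reward_brevity,
    args=GRPOConfig(output_dir="grpo-sketch", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```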

The case for specialisation

All of this strengthens the argument for specialisation and buying versus building.

We knew this when we started Parsed, and every interaction with teams in these verticals has only strengthened our conviction. The most self-aware companies understand this: They want Big Lab expertise packaged into models and tools their engineers can actually use productively. Sure, some are happy calling closed-source APIs forever. But anyone who wants a cheaper, faster, and most importantly, better model optimised for their specific task will eventually have to bite the bullet on post-training.

That's not to say tools and frameworks won't emerge that allow in-house engineering teams to actively participate in this process. Indeed, that's literally what we're building Parsed to be. We want to commoditise Big Lab talent so anyone can access it as a nicely packaged product.

But just putting a UI around a finetuning service isn't the answer. Yes, writing finetuning code is difficult, and automating it solves that problem. But it's only a tiny piece. The hard part about LLMs is that optimising them requires understanding everything end-to-end: from your generation prompt to your LLM-as-a-judge, through the complex mechanics of RL and SFT, and how to properly integrate all the feedback signals into a system that actually improves at what you want it to improve at.

Bottom line

If after all this you still think your MLEs can get you there, good luck. But you wouldn't build your own server rack with a team of software engineers who only know Java and API development. So why would LLMs be any different?

Other research.

Amnesiac generalist behemoths are not the future of language models

You don’t need a generic genius. You need a specialist learner.

Jul 28, 2025

The Bitter Lesson of LLM Evals

Turning expert judgment into a compounding moat. Because in LLM evals, scaling care beats scaling compute.

Jul 13, 2025

Do transformers notice their own mistakes? Finding a linear hallucination detector inside LLMs

A linear signal in LLMs reveals hallucinations, is detected by a frozen observer, and steered with a single vector.

May 8, 2025

Resurrecting the salmon: seeing clearer inside LLMs with domain-specific SAEs

A powerful, efficient, and domain-robust strategy for safeguarding medical-text generation

Feb 15, 2025

Why mechanistic interpretability needs a paradigm inversion

The conventional scaling paradigm for language models themselves may be fundamentally misaligned with interp.

Jan 13, 2025
