MLE ≠ LLM engineer: different skills, different instincts, different game.
August 26, 2025
Authors: Charles O'Neill (Parsed), Mudith Jayasekara (Parsed)
If you're hiring an engineering team at an AI-native company, particularly in a specific vertical, I'm willing to bet you've fallen into a trap we've seen everywhere: You hire a machine learning engineer (MLE) and assume they're equivalent to an LLM engineer.
It's a fallacy to equate the skills that make someone a good traditional MLE with those required to run an LLM-based engineering team. A PhD in neural tangent kernels and those Kaggle competition wins are orthogonal to what you actually need.
Even if you're just calling closed-source models through APIs (before we even touch training and inference), LLMs demand a fundamentally different kind of intelligence. Understanding how to scaffold and prompt your way to performance isn't about gradient-boosted models. It's about something fuzzier, more qualitative.
Writing LLM-as-a-judge evaluation harnesses requires a specific intuition that doesn't map neatly onto traditional measures of quantitative intelligence. (This is why it feels misguided to me when companies like Cognition boast about hiring IMO winners.) Solving International Math Olympiad problems is impressive, but it's a completely different skillset from understanding, on an intuitive level, how LLMs process tokens, and how the subtleties of English as a conditioning mechanism produce those pernicious behaviours we all know and love so well.
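To make "evaluation harness" concrete, here is a minimal sketch of the kind of thing in question, written against the OpenAI Python client. The clinical-summary framing, the rubric, the judge model, and the 1-5 scale are all illustrative assumptions, not a prescription.

```python
# A minimal LLM-as-a-judge harness (illustrative; the rubric, judge model,
# and 1-5 scale are assumptions, not a recommendation).
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a clinical summary against a rubric.
Rubric: every medication mentioned in the source must appear in the summary,
with the correct dosage. Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}

Source:
{source}

Summary:
{summary}"""

def judge(source: str, summary: str) -> dict:
    """Score one model output; returns {"score": int, "reason": str}."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model choice is an assumption
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

The code itself is trivial. The hard part is deciding what the rubric should say, and whether the judge's reading of "correct dosage" matches what a clinician means by it, which is exactly the fuzzy, qualitative work described above.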
"But surely," you might think, "training and ML Ops at scale—that's where my traditional MLE shines?"
Not quite.
Training LLMs isn't the neat train-test-split and hyperparameter search that Kaggle gurus cut their teeth on. A huge part of successfully training or finetuning a language model comes down to data curation. And this, again, requires that qualitative intelligence: the ability to deeply understand a problem and the requirements imposed by defining it in natural language rather than deterministic code.
When you're finetuning on outputs from a bigger model and the distillation gap leaves you disappointed after training qwen-32b, where do you go? There's a massive creative component to generating better data. And reinforcement learning is a whole different beast. How do you write evaluations that go beyond regexes and deterministic Python checks to actually serve as useful reward functions? How do you even measure whether your LLM-as-a-judge is measuring what you want, whether it's aligned with domain experts, whether it's well-defined enough for RL post-training?
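The judge-alignment question at least has a mechanical starting point, sketched below with made-up numbers: have domain experts rate a held-out set of outputs, score the same items with your judge, and check chance-corrected agreement before trusting the judge as a reward signal.

```python
# Sanity-checking an LLM judge against domain experts (illustrative values only).
from sklearn.metrics import cohen_kappa_score

expert_scores = [5, 2, 4, 1, 3, 5, 2]  # clinician ratings of sampled outputs (hypothetical)
judge_scores = [5, 3, 4, 1, 3, 4, 2]   # ratings from a judge harness like the one above (hypothetical)

# Quadratically weighted kappa treats a 5-vs-4 disagreement as milder than 5-vs-1,
# which suits ordinal rubric scores.
kappa = cohen_kappa_score(expert_scores, judge_scores, weights="quadratic")

if kappa < 0.6:  # the acceptance threshold is a judgment call, not a standard
    print(f"Judge and experts disagree too much (kappa={kappa:.2f}); fix the rubric before using it as a reward.")
else:
    print(f"Judge is roughly aligned with the experts (kappa={kappa:.2f}).")
```

Getting that expert-labelled set, and deciding what counts as "aligned enough", is where the domain and social intelligence comes in; the kappa calculation is the easy ten lines.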
LLM training is fundamentally harder than spinning up a model in scikit-learn. Models are many orders of magnitude bigger. Training requires knowledge of parallelisation, sharding, and distributed systems in ways it never did before. Expecting your MLE to skill up on all this while simultaneously learning an entirely different paradigm for interacting with and controlling LLMs is a massive ask. You can't just lump an MLE into the category of "prompt engineer + hardware guru + distributed systems expert + transformer internals specialist" when their last job involved training CNNs on OCT scans.
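For a sense of that gap, here is roughly the floor of a multi-GPU finetuning script using PyTorch FSDP. The checkpoint name and single-node layout are placeholder assumptions, and everything elided in the comments is its own rabbit hole.

```python
# The floor, not the ceiling, of distributed finetuning: initialise a process
# group, shard the model across GPUs, and launch with torchrun rather than python.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

def main():
    dist.init_process_group(backend="nccl")  # one process per GPU
    local_rank = dist.get_rank() % torch.cuda.device_count()  # single-node assumption
    torch.cuda.set_device(local_rank)

    # In practice, naively loading a 32B checkpoint on every rank like this would
    # exhaust host memory; you'd need meta-device init or sharded loading, which is the point.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")  # placeholder checkpoint
    model = FSDP(model, device_id=torch.cuda.current_device())  # parameters sharded across ranks

    # ...optimizer, DistributedSampler-backed dataloader, mixed precision,
    # gradient checkpointing, and sharded checkpoint saving all still to come...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launched as: torchrun --nproc_per_node=8 finetune.py
```

None of this existed in the scikit-learn workflow, and it is still only single-node sharded data parallelism.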
Perhaps most critically, LLM engineering demands social and personable intelligence in ways traditional ML never did.
Tasks don't exist in isolated boxes anymore. They're not completely defined by the features and labels you train on. Most LLM tasks are icebergs, with massive underwater components of domain knowledge and implicit context only available to teams expert in your specific vertical.
LLM engineers need to be socially intelligent enough not just to grasp this domain context, but to engage productively with domain experts and end users. They need to hold complex, nuanced discussions with stakeholders and distill all that messy, implicit information into evaluations, generation prompts, and training data. Your computer vision PhD who's spent the last five years optimising convolutions might be brilliant, but this doesn’t mean they can navigate these human complexities (arguably, some of the brilliance is anti-correlated with this skillset).
The most salient issue, though, is that LLM research moves so quickly that engineers hired for a specific vertical use case cannot realistically keep themselves on the rapidly expanding frontier as well.
No offence, but you're not getting Big Lab-level engineering quality at your AI scribe startup. There's a reason the best talent goes to the big labs: They're building AGI, which is sexy. They get to work on fundamental language model problems. They pay better. They don't have to optimise your system for EMR (electronic medical record) data summarisation and JSON formatting.
To expect these MLEs to perform at a level where they can post-train the way the big labs do, managing these systems at scale, is ludicrous. Everything is so new that the best commodity to hire for is raw intelligence and horsepower. Experience means less as things move faster. A PhD in convex optimisation doesn't really give you any relevant experience in spinning up a GRPO (Group Relative Policy Optimization) run for the first time.
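To put that in perspective, here is roughly what a first GRPO run looks like with an off-the-shelf library such as Hugging Face TRL (API as in recent TRL releases; the model, dataset, and toy reward function are placeholder assumptions). The trainer call is the easy part; the reward function is where all the evaluation questions above come back in.

```python
# A first GRPO run with TRL (placeholder model, dataset, and reward function).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

def reward_concise(completions, **kwargs):
    """Toy deterministic reward: prefer completions near 200 characters.
    A real vertical task would swap in an LLM-as-a-judge reward here,
    dragging the alignment questions above straight into the training loop."""
    return [-abs(200 - len(completion)) / 200 for completion in completions]

training_args = GRPOConfig(output_dir="qwen-grpo-demo", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small placeholder model
    reward_funcs=reward_concise,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

Running this is not the skill. Knowing what reward to give, how to tell when the judge is being gamed, and what to do when the curves look fine but the outputs don't is the skill.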
All of this strengthens the argument for specialisation and buying versus building.
We knew this when we started Parsed, and every interaction with teams in these verticals has only deepened our conviction. The most self-aware companies understand this: They want Big Lab expertise packaged into models and tools their engineers can actually use productively. Sure, some are happy calling closed-source APIs forever. But anyone who wants a cheaper, faster, and, most importantly, better model optimised for their specific task will eventually have to bite the bullet on post-training.
That's not to say tools and frameworks won't emerge that allow in-house engineering teams to actively participate in this process. Indeed, that's literally what we're building Parsed to be. We want to commoditise Big Lab talent so anyone can access it as a nicely packaged product.
But just putting a UI around a finetuning service isn't the answer. Yes, writing finetuning code is difficult, and automating it solves that problem. But it's only a tiny piece. The hard part about LLMs is that optimising them requires understanding everything end-to-end: from your generation prompt to your LLM-as-a-judge, through the mechanics of RL and SFT, to how all the feedback signals get integrated into a system that actually improves at the thing you want it to improve at.
If, after all this, you still think your MLEs can get you there, good luck. But you wouldn't build your own server rack with a team of software engineers who only know Java and API development. So why would LLMs be any different?