September 8, 2025
Your MLEs are brilliant, but you’re giving them the wrong job.
Authors: Charles O'Neill (Parsed), Mudith Jayasekara (Parsed)
If you’re hiring an engineering team at an AI-native company in domains like healthcare, insurance, or legaltech, you’ve probably fallen into a trap we see everywhere: assuming your MLEs can double as LLM engineers.
The problem isn’t that they aren’t smart enough. It’s that you’re getting them to do the wrong job. Their job isn’t to reinvent frontier model training inside your startup. Their job is to help get you to product-market fit.
This doesn’t mean MLEs don’t play a crucial role. In fact, they’re often the best people to help you prototype in the early days.
They can wire up APIs and run small-scale experiments to get you to your first demos. This phase is all about speed, iteration, and exploring whether there’s a real problem-solution fit. MLEs shine here because they’re adaptable, quantitative, and can hack together a proof of concept quickly.
But once you’ve validated the problem and see early signs of product-market fit, the game changes. At that point, you need to shift from hacking together demos to running scalable, reliable post-training loops. And that’s not something you can, or should, expect your MLEs to own. That’s where Parsed comes in.
It's a fallacy to equate the skills that make someone a good traditional MLE with those required to run an LLM-based engineering team. A PhD in neural tangent kernels or a string of Kaggle competition wins is orthogonal to what you actually need.
Even if you're just calling closed-source models through APIs (before we even touch training and inference), LLMs demand a fundamentally different kind of intelligence. Understanding how to scaffold and prompt your way to performance isn't about gradient-boosted models. It's about something fuzzier, more qualitative.
Writing LLM-as-a-judge evaluation harnesses requires specific intuition that doesn't map neatly onto traditional metrics of quantitative intelligence. (This is why it feels misguided to me when companies like Cognition boast about hiring IMO winners.) Solving International Math Olympiad problems is impressive, but it's a completely different skillset from understanding, on an intuitive level, how LLMs process tokens, and how the subtleties of English as a conditioning mechanism lead to those subtle, pernicious behaviours we all know and love so well.
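For concreteness, here is roughly what the skeleton of such a harness looks like. The rubric, the gpt-4o judge, and the clinical-summary framing are illustrative assumptions rather than a recommended setup; the hard part is everything the rubric leaves unsaid.

```python
# Minimal sketch of an LLM-as-a-judge harness. The rubric, judge model, and
# task framing are illustrative assumptions, not a prescription.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a clinical-note summary.
Score the summary from 1 to 5 against this rubric:
- Faithfulness: no facts that are absent from the source note.
- Coverage: all clinically relevant findings are mentioned.
- Format: concise prose with no hedging filler.
Return JSON: {{"score": <1-5>, "rationale": "<one sentence>"}}

Source note:
{source}

Summary:
{summary}
"""

def judge(source: str, summary: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; swap in whatever you actually trust
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    # A real harness needs robust parsing / structured outputs; json.loads will
    # fail the moment the judge wraps its answer in markdown fences.
    return json.loads(resp.choices[0].message.content)
```

Even this toy version hides the real work: deciding what "faithfulness" means for your documents, and checking that the judge agrees with the people who actually know.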
"But surely," you might think, "training and ML Ops at scale—that's where my traditional MLE shines?"
Not quite.
Training LLMs isn't the neat train-test-split and hyperparameter search that Kaggle gurus cut their teeth on. A huge part of successfully training or finetuning a language model comes down to data curation. And this, again, requires that qualitative intelligence: the ability to deeply understand a problem and the requirements imposed by defining it in natural language rather than deterministic code.
When you're finetuning on outputs from a bigger model and the distillation gap leaves you disappointed after training qwen-32b, where do you go? There's a massive creative component to generating better data. And reinforcement learning is a whole different beast. How do you write evaluations that go beyond regexes and deterministic Python checks to actually serve as useful reward functions? How do you even measure whether your LLM-as-a-judge is measuring what you want, whether it's aligned with domain experts, whether it's well-defined enough for RL post-training?
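One crude but concrete way to start answering that last question is to collect a small set of expert-rated outputs and check how closely the judge tracks them. A sketch, with made-up ratings standing in for real expert labels:

```python
# Illustrative check of judge/expert alignment on a small labelled set.
# `expert_scores` and `judge_scores` are assumed to be parallel 1-5 ratings of
# the same outputs; the numbers below are placeholders, not real data.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

expert_scores = [5, 4, 2, 5, 3, 1, 4, 4, 2, 5]   # blinded domain-expert ratings
judge_scores  = [5, 4, 3, 5, 3, 2, 4, 5, 2, 5]   # LLM-as-a-judge ratings of the same outputs

exact_agreement = sum(e == j for e, j in zip(expert_scores, judge_scores)) / len(expert_scores)
kappa = cohen_kappa_score(expert_scores, judge_scores, weights="quadratic")
rho, _ = spearmanr(expert_scores, judge_scores)

print(f"exact agreement: {exact_agreement:.2f}, quadratic kappa: {kappa:.2f}, spearman rho: {rho:.2f}")

# A judge you would trust as a reward signal should clear whatever bar you set
# here before it goes anywhere near an RL run; if it can't, it's the problem
# definition, not the training code, that needs work.
```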
LLM training is fundamentally harder than spinning up a model in Sklearn. Models are many orders of magnitude bigger. Training requires knowledge of parallelisation, sharding, and distributed systems in ways it never did before. Expecting your MLE to skill up on all this while simultaneously learning a fundamentally different paradigm of how to interact with and control LLMs is a massive ask. You can't just lump an MLE into the category of "prompt engineer + hardware guru + distributed systems expert + deep transformer guru" when their last job involved training CNNs on OCT scans.
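For a flavour of what "knowledge of parallelisation and sharding" means in practice, here is a minimal sketch using PyTorch's FullyShardedDataParallel; the toy module and dummy objective are stand-ins, and it assumes a multi-GPU node launched with torchrun.

```python
# Minimal FSDP sketch; assumes a launch like `torchrun --nproc_per_node=8 train.py`.
# The module and objective are toy stand-ins for a multi-billion-parameter
# transformer and a real loss.
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=24,
    ).cuda()

    # FSDP shards parameters, gradients, and optimiser state across ranks so no
    # single GPU holds the whole model; real runs also need an auto-wrap policy,
    # mixed precision, activation checkpointing, and sharded checkpointing.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for _ in range(10):
        x = torch.randn(2, 128, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective, just to drive the sharded backward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

And this is the easy, well-documented end of the problem, before any of the task-specific questions above even come into play.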
Your team’s scarce cycles should go into defining the problem — not solving the engineering puzzle of training frontier-scale models. Parsed’s role is to absorb that engineering burden and return production-ready models that reflect your definitions.
Perhaps most critically, LLM engineering demands social and personable intelligence in ways traditional ML never did.
Tasks don't exist in isolated boxes anymore. They're not completely defined by the features and labels you train on. Most LLM tasks are icebergs, with massive underwater components of domain knowledge and implicit context only available to teams expert in your specific vertical.
LLM engineers need to be socially intelligent enough not just to grasp this domain context, but to engage productively with domain experts and end users. They need to hold complex, nuanced discussions with stakeholders and distill all that messy, implicit information into evaluations, generation prompts, and training data.
Parsed takes those definitions and runs the heavy machinery: post-training, evaluation harnessing, RL, infrastructure.
Perhaps the most salient issue is that the frontier of LLM research moves so quickly that it's impossible for engineers hired for specific vertical use cases to also position themselves on the rapidly expanding frontier.
No offence, but you're not getting Big Lab-level engineering quality at your app layer company. There's a reason the best talent goes to the big labs: They're building AGI, which is sexy. They get to work on fundamental language model problems. They pay better. They don't have to optimise a system for medical record data summarisation and JSON formatting.
To expect these MLEs to perform at a level where they can post-train the way big labs do, managing these systems at scale, is crazy. Everything is so new that the best commodity to hire for is raw intelligence and horsepower. Experience means less as things move faster. A PhD in convex optimisation doesn’t really give you any relevant experience in spinning up a GRPO run for the first time.
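For a sense of what "spinning up a GRPO run" looks like at the toy end, here is a sketch built on Hugging Face TRL's GRPOTrainer. The model, dataset, and length-based reward are placeholders; a real run would plug in a validated, task-specific reward and far more infrastructure.

```python
# Toy GRPO run using TRL's GRPOTrainer. Model, dataset, and reward are
# placeholders; a production run would use a task-specific dataset and a
# carefully validated reward (e.g. a judge like the one sketched earlier).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Deliberately dumb reward: prefer completions near 200 characters.
    # Writing a reward that actually reflects task quality is the hard part.
    return [-abs(200 - len(c)) / 200 for c in completions]

args = GRPOConfig(
    output_dir="grpo-demo",
    per_device_train_batch_size=4,
    num_generations=4,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```

The dozen lines above are the easy part; choosing the reward, the data, and the evaluation that tells you whether the run actually helped is where the work lives.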
All of this strengthens the argument for specialisation and buying versus building.
We knew this when we started Parsed, and every interaction with teams in these verticals has only strengthened our conviction. The most self-aware companies understand this: They want Big Lab expertise packaged into models and tools their engineers can actually use productively. Sure, some are happy calling closed-source APIs forever. But anyone who wants a cheaper, faster, and most importantly, better model optimised for their specific task will eventually have to bite the bullet on post-training.
That's not to say tools and frameworks won't emerge that allow in-house engineering teams to actively participate in this process. Indeed, that's literally what we're building Parsed to be. We want to commoditise Big Lab talent so anyone can access it as a nicely packaged product.
But just putting a UI around a finetuning service isn't the answer. Yes, writing finetuning code is difficult, and automating it solves that problem. But it's only a tiny piece. The hard part about LLMs is that optimising them requires understanding everything end-to-end: from your generation prompt to your LLM-as-a-judge, through the complex mechanics of RL and SFT, and how to properly integrate all the feedback signals into a system that actually improves at what you want it to improve at.
If after all this, you still trust your MLEs to achieve what you want, then good for you. But you wouldn't build your own server rack with a team of software engineers who only know Java and API development. So why would LLMs be any different?