The Bitter Lesson of LLM Evals
Turning expert judgment into a compounding moat. Because in LLM evals, scaling care beats scaling compute.
Jul 13, 2025
You don’t need a generic genius. You need a specialist learner.
Authors: Charles O'Neill (Parsed) and Mudith Jayasekara (Parsed)
We got an intern at Parsed the other day. They were brilliant. They knew the definition of every computing concept in my vocabulary. They read faster than anyone I've ever met. They shipped a feature within their first hour (not just a PR, actually deployed to production). They even corrected my understanding of Raft consensus (politely, but still). I thought I'd found a 10x engineer.
However, about a day in, I realised they were also an idiot. A task they did perfectly on Monday morning, first try, I had to re-explain in the exact same level of detail on Monday afternoon. They remembered everything perfectly, but after an hour of chatting they had literally forgotten all of it. For some tasks I could toss out a poorly defined vibe of what I wanted and they got it exactly right. For others, no matter how much I talked and explained, they refused to do it the way I wanted.
Most importantly, the intern wasn’t learning.
Of course, our intern isn't a person but a language model from Big Token (you know which companies I mean): a behemoth with probably half a trillion parameters, obviously the most intelligence and compression we've ever fit on a single hard drive, yet with all the shortcomings I've listed above.
OpenAI, Anthropic, and Google have sold us a story: stick with their big, expensive, generalist models for everything. Don't look for alternatives. Don't question why your clinical scribe needs the same model that writes poetry. Just trust that next month's update will finally solve hallucinations, that your 320 versioned prompts in Prompteams are a normal part of doing business, that intelligence alone will eventually overcome the fact that these models never actually learn from your specific use case.
The reality is you're paying $50k/month for a model that confidently tells your customers your company was founded in 1742 (it wasn't). You've written a 2,000-word prompt with seventeen different "IMPORTANT: NEVER DO X" clauses that breaks the moment someone asks a slightly different question. You watch helplessly as each model update (system prompt tweaks, hidden finetuning runs behind the scenes) makes your carefully tuned system perform worse, not better, at your specific task. Anthropic quantises the model one day and all your work goes down the drain; OpenAI goes down and all you can do is twiddle your thumbs. Not your weights, not your brain.
Here's what they don't tell you: GPT-4o can write Shakespeare, solve differential equations, and code in 50 languages. Your insurance claim classifier needs exactly none of those abilities. Those are billions of parameters dedicated to haiku generation sitting idle while your model struggles to remember that claims over $10,000 require manager approval. You're essentially hiring a Nobel laureate to work exclusively as a filing clerk.
"But don't I need flexibility?" you might ask. "What if I want to use the same model for different tasks?" Let's be honest; 80% of companies using LLMs in production use them for 1-3 core tasks. You're not building AGI. You're automating customer support, or extracting data from documents, or generating product descriptions. You need a specialist, not a generalist who forgets everything between conversations.
I don't think this has to be axiomatic. I don't think we need to live in this world. I believe in a world where models don't have amnesia but get better with every attempt at the tasks they're doing. Where your model learns that when customer Sarah from Toledo mentions "the usual," she means product SKU-4829. Where error rates drop by 73% after three months of continuous learning. Where you can run your model on even a single GPU (though of course you can scale up if you like). And, crucially, where you can choose exactly which point on the cost/latency-versus-performance Pareto frontier you want to sit at, and rest assured that you are actually on that frontier.
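To make the Pareto-frontier point concrete, here's a minimal Python sketch of how you'd check which deployment options are Pareto-optimal on cost versus task accuracy. The candidate names and numbers are purely illustrative assumptions, not real benchmarks or Parsed's actual offerings:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost_per_1k: float   # dollars per 1k requests (illustrative)
    accuracy: float      # task-specific eval score in [0, 1]

def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates not dominated by any other option
    (dominated = something else is at least as cheap AND at least
    as accurate, and strictly better on one of the two)."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.cost_per_1k <= c.cost_per_1k and o.accuracy >= c.accuracy
            and (o.cost_per_1k < c.cost_per_1k or o.accuracy > c.accuracy)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost_per_1k)

# Hypothetical ways to serve the same task at different sizes/hosts
options = [
    Candidate("generalist-api", cost_per_1k=50.0, accuracy=0.81),
    Candidate("specialist-7b",  cost_per_1k=4.0,  accuracy=0.86),
    Candidate("specialist-1b",  cost_per_1k=0.8,  accuracy=0.78),
    Candidate("specialist-13b", cost_per_1k=9.0,  accuracy=0.84),
]

for c in pareto_frontier(options):
    print(f"{c.name}: ${c.cost_per_1k}/1k requests, {c.accuracy:.0%} accuracy")
```

In this toy example the generalist API never even makes the frontier: a specialist that's cheaper and more accurate dominates it outright, which is exactly the point.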
The big generalist models will always thrive in the chatbot world; no argument there. But we're now seeing companies doing real, specific tasks. Clinical scribes that need to understand medical terminology, not ancient Greek. Insurance policy selectors that need to master state regulations, not constitutional law. Customer service agents that need to know your product inside out, not every product ever made.
That's what we build at Parsed. We eval your task, we optimise your very own model for it, and we host it with continual learning. One of our clients reduced their monthly AI spend by over 50% while improving accuracy by 64%. Another finally solved their hallucination problem after nine months of prompt engineering. It turns out when you train a model exclusively on insurance policies, it stops making up coverage that doesn't exist.
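Mechanically, "hosting with continual learning" boils down to a feedback loop: log every production interaction, capture corrections from human review or downstream signals, and periodically fine-tune on them. Here's a minimal Python sketch of that loop under generic assumptions; the log format, file name, and helpers are illustrative, not Parsed's actual pipeline:

```python
import json
from pathlib import Path
from typing import Optional

LOG = Path("interactions.jsonl")  # hypothetical production log

def log_interaction(prompt: str, response: str,
                    correction: Optional[str] = None) -> None:
    """Append every production call to the log. Corrections come from
    human review or downstream signals (e.g. a claim got re-routed)."""
    with LOG.open("a") as f:
        f.write(json.dumps({"prompt": prompt, "response": response,
                            "correction": correction}) + "\n")

def build_training_set() -> list[dict]:
    """Turn reviewed interactions into supervised fine-tuning pairs,
    using the human correction as the target wherever one exists."""
    examples = []
    for line in LOG.read_text().splitlines():
        row = json.loads(line)
        if row["correction"]:  # only train on interactions someone fixed
            examples.append({"prompt": row["prompt"],
                             "completion": row["correction"]})
    return examples

# Periodically (say, nightly): fine-tune the specialist on these pairs,
# re-run the task eval, and promote the new checkpoint only if it wins.
```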
There's only so much differentiation from your competitors you can buy by constructing a better prompt. They can copy your prompt in an afternoon. But a model that's been learning from your specific use case for six months, and that's been optimised on your actual customer interactions, well, that's a real moat.
The future isn't one massive intern that forgets everything between conversations. It's a team of specialists who remember, who learn, who get better at their jobs every single day. While everyone else is still explaining the same task to their amnesiac genius for the thousandth time, your model will have already mastered it, improved on it, and moved on to solving problems you haven't even thought of yet. That’s an actual competitive advantage.
Stop paying for Shakespeare when you need spreadsheets. Let's build something that actually learns.