Case study

November 18, 2025

Specialist models for emergency departments: outperforming frontier LLMs on clinical documentation

Purpose-built to capture emergency department nuance

Authors: Harry Partridge (Parsed), Charles O'Neill (Parsed)

Emergency physicians spend 2-3 hours per shift on documentation. A company that builds ambient scribes, which convert ED conversations into structured clinical charts, came to us with a specific challenge: build a model that could handle the complexity of emergency medicine documentation while being fast enough for real-time use.

The task involves two stages. First, convert (often messy) ED transcripts into structured JSON with 16 distinct sections. Second, merge that JSON with physical exam templates to produce the final chart. Each stage has dozens of interlocking rules that must be followed precisely. This is high-stakes work: getting things wrong doesn't just produce a bad note, it introduces liability.

We built a model that achieves 84.8% accuracy on chart generation and 67.2% on summarization, compared to gemini-2.5-pro's 63.6% and 47.2%. It also runs 6-8x faster.

The problem: emergency documentation is different

Emergency department notes aren't like other medical documentation. Information comes from multiple sources simultaneously: the patient, EMS, family members, prior records. The clinician must document not just what happened but also what didn't happen, in particular which dangerous diagnoses were explicitly ruled out.

The system needed to handle:

  • Complex routing logic where information starts in one section then redistributes to others based on priority

  • Conditional inclusion rules (include transport mode only if not self-transport, unless configured otherwise)

  • Template modification that preserves non-contradicted normal findings while adding new observations

  • High-risk diagnosis detection that matches chief complaints to potential emergencies and tracks rule-out criteria

General-purpose LLMs consistently failed at these tasks. They'd hallucinate provider names, place physical exam findings in the history section, or, worse, miss critical diagnoses that hadn't been ruled out.

Why frontier models fail at emergency documentation

Frontier models genuinely struggle with emergency documentation. The task requires maintaining multiple concurrent constraints while preserving information semantics, which is, in some sense, out of distribution for them. When gemini-2.5-pro sees “regular rate, regular rhythm, no murmurs” and needs to incorporate tachycardia, it faces a constraint satisfaction problem: preserve non-contradicted facts while surgically removing contradicted ones. This requires understanding medical relationships (rate ≠ murmurs) plus precise text manipulation. Most models do one or the other; few do both.

Building evaluators for emergency medicine

We built comprehensive evaluation frameworks using Lumina for both stages of the pipeline. (You can read more about our LLM-as-judge evaluation construction with Lumina here.) Our client provided 120 micro-checks they used for quality assurance, i.e. specific failure modes they'd identified over months of production use. We semantically partitioned these into our evaluator framework, ensuring complete coverage while organizing them into coherent evaluation dimensions, and supplemented them with additional error and holistic quality checks of our own.

For the summarization stage (transcript → JSON), we created five evaluators:

  1. Groundedness: No hallucinated information, all facts traceable to source

  2. Semantic categorization: Information placed in correct sections

  3. Conditional logic: Exclusion/inclusion rules correctly applied

  4. Formatting compliance: Age formats, temporal expressions, structural requirements

  5. Clinical quality: Non-redundant, uses proper terminology, applies high-risk diagnosis logic

For the chart generation stage (JSON + template → chart), we created five different evaluators:

  1. Factual grounding: Demographics and findings match exactly

  2. Template modification: Preserves non-contradicted normals, removes contradicted clauses

  3. Structural integrity: Sections included/omitted based on content rules

  4. Information completeness: All required data present, mandatory defaults applied

  5. Formatting and style: Length limits, prefixes, capitalization, data transformations

Each evaluator uses binary pass/fail scoring on specific criteria. A chart passes only if it satisfies all requirements.
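
As a rough illustration of how these binary scores aggregate (the data structures and names below are ours for this sketch, not the production implementation), a chart passes only when every evaluator passes, and the pass rates reported later average over evaluators and notes:

```python
from dataclasses import dataclass

@dataclass
class EvaluatorResult:
    name: str     # e.g. "groundedness", "template_modification"
    passed: bool  # binary pass/fail on this evaluator's criteria

def chart_passes(results: list[EvaluatorResult]) -> bool:
    """A chart passes only if it satisfies every evaluator."""
    return all(r.passed for r in results)

def pass_rate(notes: list[list[EvaluatorResult]]) -> float:
    """Percentage of checks passed, averaged across all evaluators and all notes."""
    total = sum(len(results) for results in notes)
    passed = sum(r.passed for results in notes for r in results)
    return 100.0 * passed / total if total else 0.0
```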

We also needed to validate the evaluators themselves. We ran a meta-evaluation process where we generated “perfect” outputs according to our evaluators, and had our customer’s clinical team review these outputs. When they found issues our evaluators missed, we refined the evaluation prompts. When evaluators flagged clinically acceptable outputs, we traced misalignment back to task specifications. We also provided questions about ambiguities surfaced by Lumina for the customer to answer. This iteration continued until our evaluators aligned with expert clinical judgment.

Simplifying the pipeline

The original system used multiple prompts across different stages: separate prompts for each section of the summarization, another prompt for PE template selection, and more prompts for the chart generation. This created compounding errors and latency.

We consolidated this into two stages:

  • Stage 1: One prompt handles all 16 sections of summarization

  • Stage 2: One prompt handles complete chart generation

For PE template selection, we noticed the LLM was essentially pattern-matching against rules. We replaced this with deterministic Python code, eliminating an unnecessary LLM call while improving accuracy to 100%.
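
As a sketch of what this looks like (the template names and matching rules below are hypothetical, purely to illustrate replacing an LLM call with deterministic code):

```python
# Hypothetical template names and rules, for illustration only.
PE_TEMPLATE_RULES = [
    # (predicate over the structured summary, template name)
    (lambda s: s.get("age_years", 999) < 18, "Normal Pediatric"),
    (lambda s: "pregnan" in s.get("hpi", "").lower(), "Normal Adult - Obstetric"),
]

def select_pe_template(summary: dict) -> str:
    """Deterministically map the stage-1 summary to a physical exam template."""
    for predicate, template in PE_TEMPLATE_RULES:
        if predicate(summary):
            return template
    return "Normal Adult"  # default template when no special rule fires
```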

This simplification was only possible because we made the first stage robust enough to produce consistent, well-structured outputs that the second stage could reliably process.

The training approach

We used iterative SFT to train our model, which in this case is qwen3-32b, a dense model (we are also currently in the process of training an MoE model for the same task, qwen3-next-80b-a3b, which speeds up inference significantly at the cost of memory).

Iterative SFT is a simple idea; a minimal sketch of the loop follows the list below. For each training example, we:

  1. Generate initial output from base model

  2. Run all evaluators to identify failures

  3. Use gemini-2.5-pro to repair the output based on specific failure feedback

  4. Repeat until all evaluators pass

  5. Train on these perfect outputs
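
A minimal sketch of this loop, where generate, run_evaluators, and repair_with_feedback are hypothetical stand-ins for our actual tooling around the base model and the Lumina evaluators:

```python
# Hypothetical helpers: generate(), run_evaluators(), repair_with_feedback()
# stand in for the real inference and evaluation tooling.
def build_training_example(base_model, prompt: str, max_rounds: int = 5):
    output = generate(base_model, prompt)           # 1. initial output from base model
    for _ in range(max_rounds):
        failures = run_evaluators(prompt, output)   # 2. which evaluators fail?
        if not failures:
            return {"prompt": prompt, "completion": output}  # 5. keep the perfect output
        # 3. repair the output using the specific failure feedback, then 4. repeat
        output = repair_with_feedback("gemini-2.5-pro", prompt, output, failures)
    return None  # discard examples that never converge
```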

This process is more sample-efficient than standard SFT because we're training on outputs that score higher than what even gemini-2.5-pro produces naturally. Each training example provides dense supervisory signal about what went wrong and how to fix it.

We also enhanced our training data through prompt mutation, which is systematically varying the input format while preserving semantic content. This prevented overfitting to specific phrasings.
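
For illustration, a prompt mutation can be as simple as rewriting the prompt's surface form while leaving the clinical content untouched (the specific mutations below are made up for this example):

```python
import random

# Illustrative mutations only: each rewrites formatting (headers, labels,
# whitespace) without touching the clinical content of the prompt.
MUTATIONS = [
    lambda p: p.replace("Transcript:", "TRANSCRIPT\n----------"),
    lambda p: p.replace("Sections:", "Required sections:"),
    lambda p: "\n".join(line.rstrip() for line in p.splitlines()),  # whitespace jitter
]

def mutate_prompt(prompt: str, n: int = 2) -> str:
    """Apply a random subset of semantics-preserving format mutations."""
    for mutation in random.sample(MUTATIONS, k=min(n, len(MUTATIONS))):
        prompt = mutation(prompt)
    return prompt
```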

For the chart generation stage, we added robustness by training on both perfect JSON outputs and unrefined outputs from the summarization stage. This ensures the model handles imperfect inputs gracefully in production.

Below are two examples of how iSFT lets us bake specific behaviors into the model that are difficult to prompt for.

Handling the high-risk diagnosis module

The most important component is the high-risk diagnosis detector. For each chief complaint, the model must identify dangerous conditions that haven't been ruled out.

For example, subarachnoid hemorrhage is ruled out only if thunderclap headache is absent AND CT within 6 hours is negative. Another example is that pulmonary embolism requires either low-risk Wells score plus negative PERC, or negative imaging.

We trained the model to apply these rule-out criteria exactly. If a dangerous diagnosis isn't explicitly ruled out, it appears in the output with specific next steps: what history to clarify, what exam findings to check, what tests to order.
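
One way to picture the rule-out logic, using only the two examples above (the production criteria set is defined with the client's clinical team and is considerably larger):

```python
# Illustrative encoding of the two rule-out examples above.
RULE_OUT_CRITERIA = {
    "subarachnoid hemorrhage": lambda f: (
        not f.get("thunderclap_headache", False)
        and f.get("ct_head_within_6h_negative", False)
    ),
    "pulmonary embolism": lambda f: (
        (f.get("wells_low_risk", False) and f.get("perc_negative", False))
        or f.get("pe_imaging_negative", False)
    ),
}

def high_risk_not_ruled_out(findings: dict) -> list[str]:
    """Diagnoses that must appear in the high-risk section with specific next steps."""
    return [dx for dx, ruled_out in RULE_OUT_CRITERIA.items() if not ruled_out(findings)]
```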

Template modification

Physical exam documentation presented a unique challenge. The system starts with standard templates like "Normal Adult" containing default findings. The model must surgically edit these templates based on new findings.

If the template says "regular rate, regular rhythm, no murmurs" and the exam found tachycardia, the model must delete only "regular rate, regular rhythm" while preserving "no murmurs". It then appends "tachycardic" to the remaining text.
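
This behavior is easy to state as a micro-check, in the spirit of how the template modification evaluator scores it (the snippet below is an illustrative check for this single example, not the production evaluator):

```python
# Starting template: "Cardiovascular: regular rate, regular rhythm, no murmurs."
# New exam finding: tachycardia.
def template_edit_ok(output: str) -> bool:
    """Micro-check for this one example: contradicted clauses removed,
    non-contradicted normals preserved, new finding appended."""
    return (
        "no murmurs" in output              # preserved (not contradicted)
        and "regular rate" not in output    # removed (contradicted by tachycardia)
        and "regular rhythm" not in output  # removed (contradicted by tachycardia)
        and "tachycard" in output.lower()   # new finding appended
    )

# e.g. template_edit_ok("Cardiovascular: no murmurs. Tachycardic.") -> True
```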

Most models either replace everything or keep everything. Through iSFT with targeted feedback on template modification failures, we taught our model to make precise edits. Our model achieves 94% accuracy on template modification compared to gemini-2.5-pro's 58%.

Results

The headline result is that we outperform gemini-2.5-pro (the SOTA for this task prior to us training our model) on both summarization and chart generation, by a significant amount. We also do so about 5-7 times faster than gemini-2.5-pro, and just under 25 times faster than o4-mini, which surprisingly was the second best model we tested.

Summarization

Our model scored 42% better than gemini-2.5-pro on the summarization task (which was the best of the closed-source models). Here, pass rate is defined as the percentage of evaluators that passed (averaged across all types of evaluators and all notes).

We also include the score on individual evaluators for the sake of completeness.

Chart

Similarly, we outperform gemini-2.5-pro by 33% on chart generation, and outperform o4-mini (the next closest model) by 18%.

Importantly, we achieve a score of 100% on the structural integrity evaluator, and significantly outperform other models on information completeness.

Latency vs performance

However, performance is not the only axis a customer must optimize along; another key axis is latency. Thankfully, our model is not only the best, but also the fastest (and by a long way). We use Baseten’s inference stack along with optimized speculative decoding setups to ensure we get blazing inference speeds, minimizing the time the doctor has to wait for a note to be written, which in emergency settings can be very important.

Another nice way to view this is “percentage points achieved per second of thinking time” on the overall aggregated evaluations. Parsed comes out well ahead.

Of course, the true latency improvement over the previous system is actually much larger, because the gains come not just from model speed but also from pipeline simplification (note, though, that all models above use the condensed pipeline for the sake of comparison). By consolidating multiple LLM calls into single stages and replacing pattern-matching tasks with code, we reduced the total number of inference calls from 5+ to 2. We were able to do this because, when we can change a model's weights, we don't have to break the task up into individual steps to prompt for it effectively; we can simply teach it the mapping in one step.

Scale

Our model currently processes tens of thousands of ED notes weekly across our client’s customer base. This volume is expected to double by year-end as more emergency departments adopt the system.

Example of what our model gets right: One case we evaluated stood out. A patient arrives by EMS with chest pain. The transcript mentions the patient has diabetes and takes metformin, but a family member states the patient stopped taking it last month. EMS noted hypotension en route. gemini-2.5-pro failed badly here: it placed “stopped metformin” in current medications instead of past medications, included the family member as a healthcare provider, put EMS vital signs in the physical exam section, and missed that chest pain requires ACS rule-out documentation. Our model, in contrast, correctly routed the discontinued medication to the appropriate section; identified the family member as a historian, not a provider; excluded EMS vitals from the PE findings while preserving them in the HPI; and generated a high-risk diagnosis section noting that ACS was not ruled out.

Technical insights

Three factors drove our success:

  1. Comprehensive evaluation before training. We spent weeks with our client's clinical team cataloging every possible failure mode. This upfront investment in evaluation quality set the ceiling for model performance.

  2. Dense feedback through iSFT. Rather than training on whatever gemini-2.5-pro generates, we train on outputs that have been iteratively refined to pass all evaluators. This produces training data of higher quality than any model naturally generates.

  3. Multi-stage robustness. We deliberately trained the chart generation model on both perfect and imperfect JSON inputs from the summarization stage. This ensures graceful handling of upstream errors and eliminates the brittleness of the original multi-prompt pipeline.

Conclusion

Emergency medicine documentation requires precise application of clinical rules while maintaining semantic understanding. By building calibrated evaluators and using iterative refinement to hillclimb these evaluators, we've created a model that outperforms general-purpose systems on both accuracy and speed.

The approach generalizes: build evaluators that capture domain expertise, use iterative refinement to create perfect training examples, then train a specialist model. For regulated industries where errors have consequences, this methodology delivers models that don't just approximate the task under fuzzy prompt instructions but execute it correctly.
