WER Isn’t Enough: Rethinking ASR Safety in Clinical Care

Kari Jaaskelainen

21 Nov 2025

Voice tech is entering the clinic, but we still judge automatic speech recognition (ASR) mainly by Word Error Rate (WER). That score counts how many words were wrong; it can't tell you whether a given mistake is harmless or could change a care plan.
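A minimal sketch of why that happens, assuming the standard definition of WER as word-level edit distance divided by the number of reference words. The sentences are invented for illustration and are not from the paper's data, and `wer` is a hypothetical helper rather than a library function.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) transcripts, not taken from the paper.
reference = "patient takes 50 mg of metoprolol daily"
harmless = "patient take 50 mg of metoprolol daily"   # grammar slip
risky = "patient takes 15 mg of metoprolol daily"     # wrong dose

print(f"harmless error WER: {wer(reference, harmless):.3f}")
print(f"risky error WER:    {wer(reference, risky):.3f}")
```

Both hypotheses differ from the reference by one substituted word, so they get the same WER, even though only the second changes the dose.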

What they did: Clinicians compared ground-truth and ASR transcripts from real doctor–patient conversations, labeling the clinical impact of each discrepancy as No, Minimal, or Significant.
What they found: WER and other popular metrics correlated poorly with true clinical risk.
What’s new: An LLM-as-a-Judge, optimized with GEPA, that evaluates the safety impact of ASR errors directly.
Performance: The judge (Gemini-2.5-Pro) reached 90% accuracy and a Cohen's kappa of 0.816, on par with human experts.
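For readers unfamiliar with the agreement statistic above: Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below computes accuracy and kappa for the three impact labels. The label lists are made up for illustration and are not the paper's data, and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter


def accuracy(rater_a: list[str], rater_b: list[str]) -> float:
    """Fraction of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_observed = accuracy(rater_a, rater_b)

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up labels for six discrepancies (assumption, not the paper's data).
clinician = ["No", "No", "Minimal", "Significant", "Significant", "No"]
llm_judge = ["No", "No", "Minimal", "Significant", "Minimal", "No"]

print(f"accuracy: {accuracy(clinician, llm_judge):.3f}")
print(f"kappa:    {cohens_kappa(clinician, llm_judge):.3f}")
```

Kappa comes out lower than accuracy because some of the matching labels would be expected by chance alone, which is why the two numbers are reported together.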

Bottom line: It's time to go beyond raw accuracy and adopt safety-aware evaluations for clinical ASR, so the right words lead to the right care.

Paper: https://arxiv.org/abs/2511.16544v1
