WER Isn’t Enough: Rethinking ASR Safety in Clinical Care

Voice tech is entering the clinic, but we still judge automatic speech recognition (ASR) mainly by Word Error Rate (WER). That score only counts how many words are wrong; it can’t tell you whether a mistake is harmless or could change a care plan.
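
To make that concrete, here is a minimal sketch of WER as word-level edit distance divided by reference length. The sentences are made-up illustrations, not data from the paper, and the `wer` helper is a hypothetical name for this example only.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example sentences (assumption, not from the paper).
reference = "take ten milligrams of the medication twice daily"
harmless = "take ten milligrams of this medication twice daily"  # "the" -> "this"
risky = "take ten milligrams of the medication once daily"       # "twice" -> "once"

print(wer(reference, harmless))
print(wer(reference, risky))
```

Both transcripts contain exactly one substituted word out of eight, so they get the same WER, even though only the second one changes the dosing instruction.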

  • What the researchers did: Clinicians compared ground-truth and ASR transcripts from real doctor–patient conversations, labeling the clinical impact of each discrepancy as No, Minimal, or Significant.
  • What they found: WER and other popular metrics correlated poorly with true clinical risk.
  • What’s new: An LLM-as-a-Judge, optimized with GEPA, that evaluates the safety impact of ASR errors directly.
  • Performance: The judge (Gemini-2.5-Pro) reached 90% accuracy with a Cohen’s kappa of 0.816, on par with human experts.
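
For readers unfamiliar with Cohen’s kappa: it measures agreement between two raters after discounting the agreement expected by chance, kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for the three impact labels. The label lists are toy data invented for illustration, not the paper’s annotations, and `cohens_kappa` is a hypothetical helper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_chance == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_chance) / (1 - p_chance)


# Toy labels (assumption): a clinician and an LLM judge rating ten ASR discrepancies.
clinician = ["No", "No", "Minimal", "Significant", "No",
             "Minimal", "Significant", "No", "Minimal", "Significant"]
llm_judge = ["No", "No", "Minimal", "Significant", "No",
             "No", "Significant", "No", "Minimal", "Significant"]

print(round(cohens_kappa(clinician, llm_judge), 3))
```

In this toy set the raters agree on 9 of 10 items, but kappa comes out lower than 0.9 because some of that agreement would be expected by chance alone.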

Bottom line: It’s time to go beyond raw accuracy and adopt safety-aware evaluation for clinical ASR, so the right words lead to the right care.

Paper: https://arxiv.org/abs/2511.16544v1

Register: https://www.AiFeta.com

healthtech AI ASR SpeechRecognition ClinicalSafety LLM NLP Healthcare PatientSafety
