WER is Unaware: Why Typical ASR Scores Can Risk Patient Safety
Not all transcription mistakes are equal. In clinic visits, the go-to metric, Word Error Rate (WER), can look fine even when errors change care-critical meaning.
- Expert clinicians compared ground-truth utterances with ASR transcripts in two doctor-patient datasets, labeling the clinical impact of the errors as No, Minimal, or Significant.
- WER and many other metrics aligned poorly with these clinician-assigned risk labels.
- The team built an LLM-as-a-Judge, optimized with GEPA. The best judge (Gemini-2.5-Pro) reached 90% accuracy and strong agreement (Cohen's kappa 0.816) with clinicians.
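To make the two quantities above concrete, here is a minimal Python sketch of how WER and Cohen's kappa are commonly computed. It is not the paper's code: the function names `wer` and `cohen_kappa`, the whitespace tokenization, and the example sentences and labels are all hypothetical and chosen only for illustration. The sentences show how a harmless dropped filler word and a changed dose can receive exactly the same WER.

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical utterances (not from the paper's datasets).
filler_ref = "the patient is uh feeling much better today"
filler_hyp = "the patient is feeling much better today"
dose_ref = "the patient takes fifteen milligrams of warfarin daily"
dose_hyp = "the patient takes fifty milligrams of warfarin daily"

wer_filler = wer(filler_ref, filler_hyp)  # one deleted filler word
wer_dose = wer(dose_ref, dose_hyp)        # one substituted dose word

# Hypothetical impact labels from a clinician and an automated judge.
clinician = ["No", "No", "Minimal", "Significant"]
judge = ["No", "Minimal", "Minimal", "Significant"]
kappa = cohen_kappa(clinician, judge)

print(wer_filler, wer_dose, round(kappa, 3))
```

Both hypothetical transcripts contain one word error in an eight-word reference, so they score the same WER even though only the second changes the dose. That is the kind of gap an impact label (No, Minimal, Significant) is meant to capture.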
Why it matters: evaluating medical ASR by word accuracy alone can miss safety risks. This work offers a validated, scalable way to assess clinical impact directly, helping teams choose safer models and guide fixes where it matters most.
Paper: https://arxiv.org/abs/2511.16544v1
ASR Healthcare PatientSafety SpeechRecognition AIEvaluation LLM NLP MedicalAI Safety