Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation
Clinician-grade guardrails: retrieval-augmented, clinically grounded, and validated on real patient messages.
As patient-portal messaging grows, so does the pressure on clinicians—and the appeal of LLM-drafted replies. But drafts can miss context, introduce clinical inaccuracies, or adopt the wrong tone. This work delivers a practical solution: a retrieval-augmented evaluation pipeline (RAEC) paired with a clinically grounded error ontology to flag and categorize issues at scale.
The ontology spans 5 domains and 59 granular error codes, developed via inductive coding and expert adjudication. RAEC retrieves semantically similar historical message–response pairs to provide rich clinical context, then uses a two-stage DSPy prompting architecture for scalable, interpretable, hierarchical error detection. Crucially, the system evaluates drafts both in isolation and with context, capturing omissions (e.g., missing triage guidance) and workflow appropriateness (e.g., when to escalate).
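For readers who want a feel for the two-stage design, here is a minimal sketch of how such a pipeline could be wired up in DSPy. The class names, field names, and stub retriever are illustrative assumptions, not the paper's implementation; the actual ontology prompts, error codes, and retrieval backend are described in the paper.

```python
import dspy

# Assumed LM setup; swap in your institution's configured model endpoint.
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class DetectErrorDomains(dspy.Signature):
    """Stage 1: flag which high-level error domains (if any) apply to a draft reply."""
    patient_message = dspy.InputField(desc="Incoming patient-portal message")
    draft_reply = dspy.InputField(desc="LLM-drafted clinician reply under evaluation")
    retrieved_context = dspy.InputField(desc="Semantically similar historical message-response pairs")
    error_domains = dspy.OutputField(desc="Applicable high-level domains, or 'none'")

class AssignErrorCodes(dspy.Signature):
    """Stage 2: within each flagged domain, assign granular error codes with brief rationales."""
    patient_message = dspy.InputField()
    draft_reply = dspy.InputField()
    retrieved_context = dspy.InputField()
    flagged_domains = dspy.InputField()
    error_codes = dspy.OutputField(desc="Granular error codes with one-line justifications")

class RAECSketch(dspy.Module):
    """Hierarchical, retrieval-augmented error detection: domains first, then codes."""
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever  # any callable: patient message -> similar historical pairs
        self.stage1 = dspy.ChainOfThought(DetectErrorDomains)
        self.stage2 = dspy.ChainOfThought(AssignErrorCodes)

    def forward(self, patient_message, draft_reply):
        context = self.retriever(patient_message)  # retrieval-augmented clinical context
        domains = self.stage1(
            patient_message=patient_message,
            draft_reply=draft_reply,
            retrieved_context=context,
        )
        return self.stage2(
            patient_message=patient_message,
            draft_reply=draft_reply,
            retrieved_context=context,
            flagged_domains=domains.error_domains,
        )
```

In this sketch, gating stage 2 on the stage-1 domains keeps each prompt focused and the output hierarchy interpretable, which echoes the paper's emphasis on scalable, interpretable, hierarchical error detection.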
In a study of 1,500+ messages, retrieval context improved detection of completeness and workflow errors. Human validation on 100 messages showed markedly stronger agreement and performance for context-enhanced labels versus baseline (concordance 50% vs 33%; F1 0.500 vs 0.256). The result is a set of guardrails that help clinicians safely scale AI assistance without sacrificing clinical fidelity.
Who benefits: health systems deploying LLMs for outbound patient communications, digital front doors, and triage. What’s compelling: measurable uplift, transparent error coding, and a retrieval-first design that respects institutional patterns—all essential for real-world adoption. Future directions include tighter integration with EHR workflows, safety alignment, and continuous learning from clinician feedback.
Paper: http://arxiv.org/abs/2509.22565v1
Register: https://www.AiFeta.com
#AI #Healthcare #LLM #PatientSafety #RAG #Evaluation #DSPy