VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

10,497 examples, 13 task categories: a holistic yardstick for voice-first multimodal assistants.

Voice assistants are rapidly evolving into multimodal agents that must hear, speak, and see. Yet evaluation has lagged behind capability. VoiceAssistant-Eval fills this gap with a comprehensive benchmark of 10,497 curated examples across 13 task categories, spanning natural sounds, music, and spoken dialogue (listening); multi-turn dialogue and role-play imitation (speaking); and heterogeneous images (viewing).
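
The modality-to-task grouping described above can be pictured as a simple taxonomy. The sketch below is illustrative only: the category names and structure mirror this post's description, not the benchmark's official schema or task list.

```python
# Illustrative sketch: modality -> task groups, as described in the post.
# These names are assumptions, not the benchmark's official identifiers.
BENCHMARK_TAXONOMY = {
    "listening": ["natural_sounds", "music", "spoken_dialogue"],
    "speaking": ["multi_turn_dialogue", "role_play_imitation"],
    "viewing": ["heterogeneous_images"],
}

def tasks_for_modality(modality: str) -> list[str]:
    """Return the illustrative task groups for a given modality."""
    return BENCHMARK_TAXONOMY.get(modality, [])

if __name__ == "__main__":
    for modality, tasks in BENCHMARK_TAXONOMY.items():
        print(f"{modality}: {', '.join(tasks)}")
```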

Twenty-one open-source models and GPT-4o-Audio are assessed for response content, speech quality, and cross-modal consistency. Three key findings emerge: (1) proprietary models do not universally dominate; (2) most models speak well but struggle with audio understanding; and (3) carefully designed smaller models can rival much larger ones. Notably, Step-Audio-2-mini (7B) more than doubles the listening accuracy of LLaMA-Omni2-32B-Bilingual.
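
To make the three evaluation axes concrete, here is a minimal sketch of how per-example scores could be aggregated per model. The dataclass fields, score scales, and averaging are assumptions for illustration; the paper defines the actual metrics and judging pipeline.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical per-example scores; the benchmark's real metrics, scales,
# and judges are specified in the paper, not here.
@dataclass
class ExampleScores:
    content: float      # response-content quality, assumed 0-1
    speech: float       # speech quality, assumed 0-1
    consistency: float  # cross-modal consistency, assumed 0-1

def aggregate(scores: list[ExampleScores]) -> dict[str, float]:
    """Average each assumed dimension over a set of examples."""
    return {
        "content": mean(s.content for s in scores),
        "speech": mean(s.speech for s in scores),
        "consistency": mean(s.consistency for s in scores),
    }

# Usage: summarize one hypothetical model over two examples.
model_a = aggregate([ExampleScores(0.82, 0.91, 0.77),
                     ExampleScores(0.74, 0.88, 0.70)])
print(model_a)  # {'content': 0.78, 'speech': 0.895, 'consistency': 0.735}
```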

The benchmark also surfaces hard problems: joint audio-visual reasoning and role-play voice imitation remain challenging. Robustness and safety-alignment gaps persist, underscoring the need for evaluation that captures real-world edge cases and user expectations.

Why it matters: developers can finally compare systems apples-to-apples across modalities, pinpoint failure modes, and prioritize training investments. For product teams, VoiceAssistant-Eval provides measurable targets for improvements in listening comprehension, speech naturalness, and multimodal grounding.

Code and data will be released, creating a shared platform to drive the next generation of voice-first AI.

Paper: http://arxiv.org/abs/2509.22651v1

Register: https://www.AiFeta.com

#AI #Speech #Multimodal #Benchmark #LLM #Audio #ComputerVision
