VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
A 10k-example benchmark testing audio understanding, speech, and visual grounding
Voice-first assistants are advancing fast, but current tests don’t fully capture what they must do: listen, speak, and see. VoiceAssistant-Eval fills that gap with 10,497 curated examples across 13 task categories—spanning natural sounds, music, spoken dialogue (listening); multi-turn conversation and role-play imitation (speaking); and heterogeneous images (viewing). The benchmark evaluates both response content and speech quality, plus consistency across modalities.
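For concreteness, here is a minimal sketch of how a benchmark example and its per-dimension judgments could be represented. The dataclasses and field names are illustrative assumptions, not the released VoiceAssistant-Eval schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical layout of one VoiceAssistant-Eval example.
# Field names are assumptions; the released data may differ.
@dataclass
class EvalExample:
    example_id: str
    category: str                 # one of the 13 task categories
    dimension: str                # "listening" | "speaking" | "viewing"
    audio_path: Optional[str]     # input audio (speech, music, natural sound), if any
    image_path: Optional[str]     # input image for viewing tasks, if any
    prompt_text: Optional[str]    # text instruction or transcript, if any
    reference: str                # reference answer or rubric used by the judge

# Hypothetical per-example judgment covering the three evaluation axes
# named in the post: response content, speech quality, and cross-modal consistency.
@dataclass
class ModelJudgment:
    example_id: str
    content_score: float          # correctness/helpfulness of the response content
    speech_score: float           # quality of the spoken response
    consistency_score: float      # agreement between spoken and textual outputs
```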
Evaluating 21 open-source models alongside GPT-4o-Audio yields three key findings: (1) proprietary models don't always win; (2) most systems speak better than they listen, with audio understanding lagging behind; and (3) smaller, well-designed models can rival much larger ones. For instance, the mid-sized Step-Audio-2-mini (7B) more than doubles the listening accuracy of LLaMA-Omni2-32B-Bilingual.
- Holistic scope: listening, speaking, and viewing in one benchmark.
- Quality and consistency: content, speech naturalness, and cross-modal alignment (see the scoring sketch after this list).
- Actionable findings: audio comprehension and role-play imitation remain challenging; multimodal input integration exposes robustness and safety gaps.
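As referenced in the second bullet, the three judgment axes can be rolled up into a single model-level number for quick comparison. The aggregation and weights below are assumptions for illustration; the paper reports content and speech quality (and consistency) as separate metrics.

```python
def composite_score(judgments, w_content=0.5, w_speech=0.25, w_consistency=0.25):
    """Aggregate per-example ModelJudgment records into one model-level score.

    Weights are illustrative assumptions, not VoiceAssistant-Eval's scoring rule.
    """
    if not judgments:
        return 0.0
    n = len(judgments)
    content = sum(j.content_score for j in judgments) / n
    speech = sum(j.speech_score for j in judgments) / n
    consistency = sum(j.consistency_score for j in judgments) / n
    return w_content * content + w_speech * speech + w_consistency * consistency
```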
By setting a rigorous, multi-skill standard and releasing code and data, VoiceAssistant-Eval provides a common yardstick for researchers and builders to prioritize improvements—particularly around audio understanding, multimodal grounding, and safety alignment for real-world assistants.
Paper: http://arxiv.org/abs/2509.22651v1
Register: https://www.AiFeta.com
#VoiceAI #Evaluation #Multimodal #Speech #AudioUnderstanding #Benchmark #AIAssistant #MMLLM