MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
A dual‑track “brain‑mouth” design for unified multimodal understanding and expressive, streaming voice
MGM‑Omni unifies multimodal reasoning with long‑horizon, controllable speech generation in a single Omni LLM. Rather than bolt speech onto a text reasoning core, it uses a dual‑track, token‑based “brain‑mouth” architecture that cleanly decouples multimodal cognition from real‑time speech synthesis—enabling low‑latency streaming and rich cross‑modal interaction.
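To make the "brain‑mouth" decoupling concrete, here is a minimal sketch of the idea under stated assumptions: a multimodal "brain" streams text tokens while a separate "mouth" consumes them through a queue and emits speech tokens before the full answer is finished. All names (brain_generate, mouth_synthesize) and the toy token formats are hypothetical placeholders, not MGM‑Omni's actual components.

```python
import queue
import threading

def brain_generate(prompt):
    """'Brain' track: stand-in for a multimodal LLM streaming text tokens."""
    for tok in prompt.split():          # pretend each word is one text token
        yield tok

def mouth_synthesize(text_token):
    """'Mouth' track: map one text token to a short run of speech (codec) tokens.
    Real systems emit far more speech tokens per text token; 3 is illustrative."""
    return [f"<speech:{text_token}:{i}>" for i in range(3)]

def run_dual_track(prompt):
    """Decouple cognition from synthesis with a queue, so the mouth can start
    speaking while the brain is still reasoning (low-latency streaming)."""
    token_q = queue.Queue()
    audio_out = []

    def brain_worker():
        for tok in brain_generate(prompt):
            token_q.put(tok)            # hand each text token to the mouth as it appears
        token_q.put(None)               # end-of-stream sentinel

    def mouth_worker():
        while (tok := token_q.get()) is not None:
            audio_out.extend(mouth_synthesize(tok))  # synthesize incrementally

    b = threading.Thread(target=brain_worker)
    m = threading.Thread(target=mouth_worker)
    b.start(); m.start(); b.join(); m.join()
    return audio_out

if __name__ == "__main__":
    print(run_dual_track("streaming speech starts before the answer is complete"))
```

The point of the sketch is only the decoupling: the two tracks run concurrently and communicate through tokens, which is what allows streaming output and keeps the reasoning core independent of the synthesis path.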
On the understanding side, a unified training strategy with a dual audio encoder supports robust, long‑form perception across varied acoustic conditions. For generation, a chunk‑based parallel decoding approach narrows the text–speech token‑rate gap, boosting throughput while preserving stability. The model delivers streaming zero‑shot voice cloning that maintains timbre identity over extended spans, alongside natural, context‑aware prosody.
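The chunk‑based parallel decoding idea can also be illustrated with a small sketch. The assumption here is simply that speech (codec) tokens arrive at a much higher rate than text tokens, so the speech decoder predicts a chunk of them per step rather than one at a time; the rates, chunk size, and function names below are illustrative, not values from the paper.

```python
import math

# Hypothetical rates: the post only states that a text-speech token-rate gap
# exists; these numbers are for illustration.
TEXT_TOKENS_PER_SEC = 5       # rough rate of generated text tokens
SPEECH_TOKENS_PER_SEC = 50    # rough rate of audio codec tokens

# Chunk-based parallel decoding: emit a chunk of K speech tokens per decoding
# step so one step roughly keeps pace with one text token.
CHUNK_SIZE = math.ceil(SPEECH_TOKENS_PER_SEC / TEXT_TOKENS_PER_SEC)  # -> 10

def parallel_decode_step(state, chunk_size=CHUNK_SIZE):
    """Hypothetical decoder step: predict `chunk_size` codec tokens at once.
    A real model would use parallel output heads; here we fake codec ids."""
    return [hash((state, i)) % 1024 for i in range(chunk_size)]

def synthesize(text_tokens):
    speech_tokens = []
    for step, tok in enumerate(text_tokens):
        # one decoding step per text token, producing a whole chunk of speech tokens
        speech_tokens.extend(parallel_decode_step((tok, step)))
    return speech_tokens

if __name__ == "__main__":
    text = "hello there how are you".split()
    codes = synthesize(text)
    print(f"{len(text)} text tokens -> {len(codes)} speech tokens "
          f"({CHUNK_SIZE} per decoding step)")
```

Grouping speech tokens into chunks is what closes the rate gap: throughput scales with the chunk size while each chunk is still conditioned on the evolving text stream, which is how the post describes preserving stability.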
Despite a comparatively data‑efficient training recipe, MGM‑Omni reportedly outperforms existing open‑source systems on timbre preservation, long‑form audio understanding, and omnimodal comprehension. The result is an end‑to‑end paradigm that brings personalized, controllable speech together with vision‑and‑audio understanding, without resorting to brittle cascades.
Why it matters: voice is the most natural interface, but real‑time, expressive, and personalized speech at scale has been elusive. MGM‑Omni offers a practical architecture for assistants, content creation tools, and accessibility solutions that need long‑horizon voice coherence and tight multimodal grounding.
Paper: arXiv: MGM‑Omni
Register: AiFeta
#SpeechAI #Multimodal #LLM #VoiceCloning #Streaming #AudioUnderstanding #TTS