MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
A dual-track “brain–mouth” LLM for omnimodal understanding and low-latency, expressive speech.
MGM-Omni introduces a unified Omni LLM that cleanly decouples multimodal reasoning from real-time speech generation. Its dual-track, token-based “brain–mouth” architecture enables efficient cross-modal interaction while delivering streaming, low-latency speech that preserves voice identity over long horizons.
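To make the dual-track idea concrete, here is a minimal, hypothetical Python sketch of a "brain–mouth" loop: a reasoning model streams text tokens while a lighter speech model turns buffered text into speech tokens without stalling the reasoning track. The class and function names (ReasoningLLM, SpeechLM, dual_track_generate) are illustrative assumptions, not the MGM-Omni API.

```python
# Hypothetical sketch of a dual-track "brain-mouth" loop (not the MGM-Omni API).
from collections import deque

class ReasoningLLM:
    """Stand-in for the omnimodal 'brain': yields text tokens incrementally."""
    def stream_text(self, prompt):
        for tok in ["Sure,", " here", " is", " the", " answer", "."]:
            yield tok

class SpeechLM:
    """Stand-in for the 'mouth': maps buffered text into discrete speech tokens."""
    def speak(self, text_chunk, speaker_embedding=None):
        # A real system would emit audio-codec tokens conditioned on a
        # reference voice; here we fabricate token ids for illustration.
        return [hash((text_chunk, i)) % 1024 for i in range(4)]

def dual_track_generate(prompt, chunk_size=3):
    brain, mouth = ReasoningLLM(), SpeechLM()
    buffer, speech_tokens = deque(), []
    for text_tok in brain.stream_text(prompt):      # track 1: reasoning
        buffer.append(text_tok)
        if len(buffer) >= chunk_size:               # track 2: streaming speech
            speech_tokens += mouth.speak("".join(buffer))
            buffer.clear()
    if buffer:                                      # flush any trailing text
        speech_tokens += mouth.speak("".join(buffer))
    return speech_tokens

print(len(dual_track_generate("Describe the image.")))
```

The point of the split is that the speech track can start emitting audio as soon as the first text chunk is ready, rather than waiting for the full response.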
On the understanding side, a unified training strategy with dual audio encoders equips the model to handle long-form audio across varied acoustic conditions. On the generation side, a chunk-based parallel decoding scheme narrows the text–speech token-rate gap, accelerating inference and enabling zero-shot voice cloning that maintains stable timbre over extended sequences. Unlike cascaded pipelines that bolt on TTS, MGM-Omni treats speech as a first-class modality in an end-to-end system.
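The following sketch illustrates the rate-gap problem that chunk-based parallel decoding addresses: speech-codec tokens arrive far faster than text tokens, so each decoder step predicts a chunk of speech tokens rather than one. The rates, chunk size, and predict_chunk stub are assumptions for illustration only.

```python
# Hypothetical illustration of chunk-based parallel decoding of speech tokens.
TEXT_RATE_HZ = 3          # assumed rough text-token rate of an LLM
SPEECH_RATE_HZ = 25       # assumed discrete audio-codec frame rate
CHUNK = SPEECH_RATE_HZ // TEXT_RATE_HZ + 1   # speech tokens emitted per step

def predict_chunk(context, k=CHUNK):
    """Stand-in for a decoder head that predicts k speech tokens in parallel."""
    return [(len(context) + i) % 1024 for i in range(k)]

def decode_speech(num_seconds=2):
    tokens, steps = [], 0
    target = num_seconds * SPEECH_RATE_HZ
    while len(tokens) < target:
        tokens.extend(predict_chunk(tokens))  # one step, many speech tokens
        steps += 1
    return tokens[:target], steps

tokens, steps = decode_speech()
print(f"{len(tokens)} speech tokens in {steps} decoder steps "
      f"(~{len(tokens) / steps:.1f} tokens/step)")
```

Emitting several speech tokens per step keeps the speech track in lockstep with the slower text track, which is what makes low-latency streaming and long-horizon generation tractable.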
Results show superior timbre preservation, naturalness, and context awareness, along with strong omnimodal and long-form audio understanding. Notably, the model achieves these capabilities with marked data efficiency, pointing to a scalable recipe for personalized, controllable speech agents.
- Dual-track architecture: separates reasoning and speech for responsiveness and control.
- Parallel chunk decoding: faster, smoother long-horizon generation.
- Zero-shot voice cloning: stable timbre in streaming settings.
Why it matters: As assistants evolve from text boxes to companions, voice becomes the interface. MGM-Omni’s design offers an efficient, end-to-end path to voice agents that are personalized, expressive, and genuinely able to understand what they hear.
Paper: arXiv: MGM-Omni
Register: https://www.AiFeta.com
#OmniLLM #Multimodal #SpeechSynthesis #VoiceCloning #Streaming #Personalization #AudioAI #LLM