MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

A dual-track “brain–mouth” LLM for omnimodal understanding and low-latency, expressive speech.

MGM-Omni introduces a unified Omni LLM that cleanly decouples multimodal reasoning from real-time speech generation. Its dual-track, token-based “brain–mouth” architecture enables efficient cross-modal interaction while delivering streaming, low-latency speech that preserves voice identity over long horizons.
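To make the dual-track idea concrete, here is a minimal sketch of a "brain–mouth" streaming loop. All names (brain_stream, mouth_decode, dual_track_generate) and the chunking details are illustrative assumptions, not MGM-Omni's actual code: a reasoning model streams text tokens, and a separate speech model converts each small text chunk into speech-codec tokens as soon as it arrives, so audio starts before the full response is written.

```python
# Hypothetical sketch of a dual-track "brain-mouth" loop (not the paper's code).
from typing import Iterator, List


def brain_stream(prompt: str) -> Iterator[str]:
    """Stand-in for the reasoning LLM ("brain"): yields text tokens as they decode."""
    for token in ["Hello", ",", " how", " can", " I", " help", "?"]:
        yield token


def mouth_decode(text_chunk: List[str], voice_prompt: str) -> List[int]:
    """Stand-in for the speech LLM ("mouth"): maps a text chunk plus a voice
    reference to discrete speech-codec token ids (dummy values here)."""
    return [hash((tok, voice_prompt)) % 1024 for tok in text_chunk]


def dual_track_generate(prompt: str, voice_prompt: str, chunk_size: int = 3):
    """Overlap reasoning and speech: flush a speech chunk every `chunk_size`
    text tokens instead of waiting for the full response."""
    buffer: List[str] = []
    for tok in brain_stream(prompt):
        buffer.append(tok)
        if len(buffer) == chunk_size:
            yield mouth_decode(buffer, voice_prompt)  # speech starts early
            buffer = []
    if buffer:
        yield mouth_decode(buffer, voice_prompt)


for speech_tokens in dual_track_generate("Hi there", voice_prompt="ref_audio.wav"):
    print(speech_tokens)  # a real system would feed these to a codec vocoder
```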

On the understanding side, a unified training strategy with dual audio encoders equips the model to handle long-form audio across varied acoustic conditions. On the generation side, a chunk-based parallel decoding scheme narrows the text–speech token-rate gap, accelerating inference and enabling zero-shot voice cloning that maintains stable timbre over extended sequences. Unlike cascaded pipelines that bolt on TTS, MGM-Omni treats speech as a first-class modality in an end-to-end system.
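The chunk-based parallel decoding can be pictured with a toy head that emits several speech tokens per autoregressive step; since codec tokens are produced at a much higher rate than text tokens, predicting a chunk at once lets the speech track keep pace with the text track. The module below (ChunkSpeechHead, chunk_size=4) is an assumed illustration of the mechanic, not the paper's exact scheme.

```python
# Toy chunk-parallel speech head (assumed mechanics, not MGM-Omni's implementation).
import torch
import torch.nn as nn


class ChunkSpeechHead(nn.Module):
    """Projects one hidden state to `chunk_size` speech-token logits at once."""

    def __init__(self, hidden: int = 256, vocab: int = 1024, chunk_size: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        self.vocab = vocab
        self.proj = nn.Linear(hidden, chunk_size * vocab)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> logits: (batch, chunk_size, vocab)
        return self.proj(h).view(h.size(0), self.chunk_size, self.vocab)


head = ChunkSpeechHead()
hidden_state = torch.randn(1, 256)            # one decoder step's hidden state
chunk_tokens = head(hidden_state).argmax(-1)  # (1, 4): four speech tokens per step
print(chunk_tokens)
```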

Results show superior timbre preservation, naturalness, and context awareness, along with strong omnimodal and long-form audio understanding. Notably, the model achieves these capabilities with marked data efficiency, pointing to a scalable recipe for personalized, controllable speech agents.

  • Dual-track architecture: separates reasoning and speech for responsiveness and control.
  • Parallel chunk decoding: faster, smoother long-horizon generation.
  • Zero-shot voice cloning: stable timbre in streaming settings.

Why it matters: As assistants evolve from text boxes to companions, voice becomes the interface. MGM-Omni’s design offers an efficient, end-to-end path to personalized, expressive, and context-aware voice agents.

Paper: arXiv: MGM-Omni
Register: https://www.AiFeta.com

#OmniLLM #Multimodal #SpeechSynthesis #VoiceCloning #Streaming #Personalization #AudioAI #LLM
