MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

A dual-track “brain–mouth” LLM for omnimodal understanding and low-latency, expressive speech.

MGM-Omni introduces a unified Omni LLM that cleanly decouples multimodal reasoning from real-time speech generation. Its dual-track, token-based “brain–mouth” architecture enables efficient cross-modal interaction while delivering streaming, low-latency speech that preserves voice identity over long horizons.
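To make the dual-track idea concrete, here is a minimal sketch of a "brain–mouth" streaming loop. All names (brain_stream, mouth_decode, dual_track_generate) and the chunking details are illustrative assumptions, not MGM-Omni's actual code: a reasoning model streams text tokens, and a separate speech model converts each small text chunk into speech-codec tokens as soon as it arrives, so audio starts before the full response is written.

```python
# Hypothetical sketch of a dual-track "brain-mouth" loop (not the paper's code).
from typing import Iterator, List


def brain_stream(prompt: str) -> Iterator[str]:
    """Stand-in for the reasoning LLM ("brain"): yields text tokens as they decode."""
    for token in ["Hello", ",", " how", " can", " I", " help", "?"]:
        yield token


def mouth_decode(text_chunk: List[str], voice_prompt: str) -> List[int]:
    """Stand-in for the speech LLM ("mouth"): maps a text chunk plus a voice
    reference to discrete speech-codec token ids (dummy values here)."""
    return [hash((tok, voice_prompt)) % 1024 for tok in text_chunk]


def dual_track_generate(prompt: str, voice_prompt: str, chunk_size: int = 3):
    """Overlap reasoning and speech: flush a speech chunk every `chunk_size`
    text tokens instead of waiting for the full response."""
    buffer: List[str] = []
    for tok in brain_stream(prompt):
        buffer.append(tok)
        if len(buffer) == chunk_size:
            yield mouth_decode(buffer, voice_prompt)  # speech starts early
            buffer = []
    if buffer:
        yield mouth_decode(buffer, voice_prompt)


for speech_tokens in dual_track_generate("Hi there", voice_prompt="ref_audio.wav"):
    print(speech_tokens)  # a real system would feed these to a codec vocoder
```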

On the understanding side, a unified training strategy with dual audio encoders equips the model to handle long-form audio across varied acoustic conditions. On the generation side, a chunk-based parallel decoding scheme narrows the text–speech token-rate gap, accelerating inference and enabling zero-shot voice cloning that maintains stable timbre over extended sequences. Unlike cascaded pipelines that bolt on TTS, MGM-Omni treats speech as a first-class modality in an end-to-end system.
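The chunk-based parallel decoding can be pictured with a toy head that emits several speech tokens per autoregressive step; since codec tokens are produced at a much higher rate than text tokens, predicting a chunk at once lets the speech track keep pace with the text track. The module below (ChunkSpeechHead, chunk_size=4) is an assumed illustration of the mechanic, not the paper's exact scheme.

```python
# Toy chunk-parallel speech head (assumed mechanics, not MGM-Omni's implementation).
import torch
import torch.nn as nn


class ChunkSpeechHead(nn.Module):
    """Projects one hidden state to `chunk_size` speech-token logits at once."""

    def __init__(self, hidden: int = 256, vocab: int = 1024, chunk_size: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        self.vocab = vocab
        self.proj = nn.Linear(hidden, chunk_size * vocab)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden) -> logits: (batch, chunk_size, vocab)
        return self.proj(h).view(h.size(0), self.chunk_size, self.vocab)


head = ChunkSpeechHead()
hidden_state = torch.randn(1, 256)            # one decoder step's hidden state
chunk_tokens = head(hidden_state).argmax(-1)  # (1, 4): four speech tokens per step
print(chunk_tokens)
```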

Results show superior timbre preservation, naturalness, and context awareness, along with strong omnimodal and long-form audio understanding. Notably, the model achieves these capabilities with marked data efficiency, pointing to a scalable recipe for personalized, controllable speech agents.

  • Dual-track architecture: separates reasoning and speech for responsiveness and control.
  • Parallel chunk decoding: faster, smoother long-horizon generation.
  • Zero-shot voice cloning: stable timbre in streaming settings.

Why it matters: As assistants evolve from text boxes to companions, voice becomes the interface. MGM-Omni’s design offers an efficient, end-to-end path to personalized, expressive, and context-aware voice agents.

Paper: arXiv: MGM-Omni
Register: https://www.AiFeta.com

#OmniLLM #Multimodal #SpeechSynthesis #VoiceCloning #Streaming #Personalization #AudioAI #LLM
