MoST: One Open-Source Model for Speech + Text
Meet MoST, a fully open-source AI model that understands both speech and text in a single network. Instead of processing audio tokens and text tokens identically, MoST uses a Modality-Aware Mixture of Experts (MAMoE) to route each token to the right specialists (see the sketch after the list below).
- Modality-specific experts learn the unique patterns of audio and text.
- Shared experts help knowledge flow across both, boosting cross-modal skills.
- Efficient training pipeline: post-train on ASR/TTS, then fine-tune on speech-text instructions, all from open datasets.
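
Curious what modality-aware routing looks like in code? Here is a minimal PyTorch sketch. To be clear, this is illustrative, not the paper's actual implementation: the class name `MAMoELayer`, the expert counts, the top-2 routing, and the per-modality routers are all assumptions; the only ideas taken from the post are modality-specific experts, shared experts, and per-token routing.

```python
import torch
import torch.nn as nn

class MAMoELayer(nn.Module):
    """Sketch of a modality-aware MoE layer: each token is routed only
    among experts of its own modality plus a pool of shared experts.
    (Hypothetical layout; sizes and top-k are illustrative.)"""
    def __init__(self, d_model=512, n_text=4, n_audio=4, n_shared=2, top_k=2):
        super().__init__()
        def make_expert():
            return nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(),
                nn.Linear(4 * d_model, d_model))
        self.text_experts = nn.ModuleList([make_expert() for _ in range(n_text)])
        self.audio_experts = nn.ModuleList([make_expert() for _ in range(n_audio)])
        self.shared_experts = nn.ModuleList([make_expert() for _ in range(n_shared)])
        # One router per modality; each scores its own experts plus the shared pool.
        self.text_router = nn.Linear(d_model, n_text + n_shared)
        self.audio_router = nn.Linear(d_model, n_audio + n_shared)
        self.top_k = top_k

    def _route(self, x, router, experts):
        # x: (n_tokens, d_model); experts = modality-specific + shared pool
        scores = router(x)                              # (n, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(experts):
                mask = idx[:, k] == e                   # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

    def forward(self, x, is_audio):
        # x: (n_tokens, d_model); is_audio: (n_tokens,) bool modality mask
        out = torch.empty_like(x)
        shared = list(self.shared_experts)
        if (~is_audio).any():
            out[~is_audio] = self._route(
                x[~is_audio], self.text_router, list(self.text_experts) + shared)
        if is_audio.any():
            out[is_audio] = self._route(
                x[is_audio], self.audio_router, list(self.audio_experts) + shared)
        return out

# Toy usage: 10 tokens, first 6 text, last 4 audio.
layer = MAMoELayer()
x = torch.randn(10, 512)
is_audio = torch.tensor([False] * 6 + [True] * 4)
y = layer(x, is_audio)  # (10, 512)
```

The key design choice this sketch captures: text tokens never touch audio-only experts (and vice versa), while the shared experts sit in both routing pools so knowledge can flow across modalities.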
Results: MoST outperforms similarly sized models on ASR, TTS, audio language modeling, and spoken question answering. Ablations show that modality-aware routing and shared experts drive the gains.
Why it matters: a practical path to assistants that listen, read, and reply more accurately, using only open data.
Paper: https://arxiv.org/abs/2601.10272 | Code & data: https://github.com/NUS-HPC-AI-Lab/MoST
Register: https://www.AiFeta.com
#AI #SpeechAI #Multimodal #MixtureOfExperts #OpenSource #ASR #TTS #LLM #NLP #Research