MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Sub-billion LLMs that reason well—trained on far less data, with a fully open recipe.
Do small models need massive corpora to reason? MobileLLM-R1 makes a compelling case that they don't. Challenging the assumption that advanced chain-of-thought capabilities require >10T tokens, the authors carefully curate and resample open-source datasets using tailored quality metrics. They show that roughly 2T high-quality tokens are enough to spark strong reasoning, and that a 4.2T-token pretraining run resampled from that pool, followed by established post-training, delivers state-of-the-art results for sub-billion models.
The headline result: MobileLLM-R1-950M scores 15.5 on AIME, versus 0.6 for OLMo-2-1.48B and 0.3 for SmolLM2-1.7B. Perhaps more striking, despite pretraining on only 11.7% of the tokens reportedly used by Qwen3 (4.2T vs. 36T), MobileLLM-R1-950M matches or beats Qwen3-0.6B across multiple reasoning benchmarks.
- Data-first design: Curated and resampled open datasets, guided by custom benefit metrics.
- Efficient pretraining: 4.2T training tokens resampled from a ~2T-token high-quality pool (see the sketch after this list).
- Strong results at small scale: Competitive with or better than larger peers and models trained on proprietary data.
- Reproducibility: Full training recipe, data sources, mixing ratios, and checkpoints released.
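To make the data-first idea concrete, here is a minimal, illustrative sketch of benefit-weighted resampling: each source in a curated pool receives a share of the 4.2T-token budget proportional to its size times a quality/benefit score, so high-benefit sources are repeated for more epochs. The source names, sizes, and scores below are placeholders, not the paper's actual mixture or metrics.

```python
# Minimal sketch (not the released recipe): build a 4.2T-token pretraining mix
# by resampling a ~2T-token curated pool, weighting each source by an assumed
# per-source "benefit" score. All names, sizes, and scores are illustrative.
pool = [
    {"name": "math_web",     "tokens": 0.6e12, "benefit": 1.8},
    {"name": "code",         "tokens": 0.5e12, "benefit": 1.5},
    {"name": "encyclopedic", "tokens": 0.4e12, "benefit": 1.2},
    {"name": "general_web",  "tokens": 0.5e12, "benefit": 0.8},
]

TARGET_TOKENS = 4.2e12  # total tokens seen during pretraining


def mixing_ratios(sources):
    """Benefit-weighted share of the final token budget for each source."""
    total = sum(s["tokens"] * s["benefit"] for s in sources)
    return {s["name"]: s["tokens"] * s["benefit"] / total for s in sources}


ratios = mixing_ratios(pool)
for s in pool:
    budget = ratios[s["name"]] * TARGET_TOKENS
    epochs = budget / s["tokens"]  # >1 means the source is upsampled (repeated)
    print(f'{s["name"]:>12}: {ratios[s["name"]]:6.2%} of mix, ~{epochs:.1f} epochs')
```

The point of the sketch is the design choice, not the numbers: the mixing ratios and repetition counts fall out of the quality scores, rather than being hand-tuned per run.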
Why it matters: If high-quality, well-mixed data can replace brute-force scale, then advanced reasoning becomes accessible to broader communities and practical in on-device settings. MobileLLM-R1 offers a practical blueprint for building capable, interpretable, and efficient reasoners without proprietary data advantages.
Paper: MobileLLM-R1 (arXiv)
Register: https://www.AiFeta.com
#LLM #Reasoning #DataCuration #SubBillion #OpenSource #Efficiency #AIME #NLP