InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

An end-to-end, effectively lossless FP8 recipe that speeds up LLM training for reasoning

Can we train LLMs with strong reasoning faster and cheaper, without sacrificing accuracy? InfiR2 answers with a practical, open FP8 recipe spanning continual pretraining and supervised fine-tuning. The approach uses a fine-grained, hybrid-granularity quantization strategy to preserve numerical fidelity where it matters while exploiting FP8's efficiency where it is safe to do so.
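To make "fine-grained quantization" concrete, here is a minimal PyTorch sketch of per-block FP8 scaling. The block size, function names, and choice of the e4m3 format are illustrative assumptions, not the paper's verified settings:

    import torch

    # A minimal sketch of fine-grained (per-block) FP8 quantization.
    # Block size, names, and the e4m3 format are assumptions, not the
    # paper's verified settings.
    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

    def quantize_fp8_blockwise(x: torch.Tensor, block_size: int = 128):
        """Quantize a 2-D tensor to FP8 with one scale per contiguous block.

        Per-block scales localize the damage from outliers: one large
        activation only coarsens its own block, not the whole tensor,
        which is the intuition behind fine-grained quantization.
        """
        rows, cols = x.shape
        assert cols % block_size == 0, "cols must be divisible by block_size"
        blocks = x.reshape(rows, cols // block_size, block_size)
        # Pick each block's scale so its max magnitude maps to FP8_E4M3_MAX.
        amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        scale = FP8_E4M3_MAX / amax
        q = (blocks * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
        q = q.to(torch.float8_e4m3fn)  # requires PyTorch >= 2.1
        return q.reshape(rows, cols), scale.squeeze(-1)

    def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor,
                                 block_size: int = 128) -> torch.Tensor:
        rows, cols = q.shape
        blocks = q.to(torch.float32).reshape(rows, cols // block_size,
                                             block_size)
        return (blocks / scale.unsqueeze(-1)).reshape(rows, cols)

The design point: a single outlier value only degrades the resolution of its own 128-element block, instead of flattening the dynamic range of the entire tensor, which is why fine-grained scaling preserves fidelity in FP8.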

Across extensive experiments, including continual pretraining on a 160B-token corpus, the recipe remains remarkably stable and essentially lossless versus BF16 on a suite of reasoning evaluations. The efficiency gains are tangible: up to 22% less training time, 14% lower peak memory, and 19% higher throughput, making large-scale training more accessible without a quality trade-off.

  • End-to-end FP8: covers pretraining and SFT coherently.
  • Hybrid quantization: fine-grained control aligns precision with sensitivity (see the sketch after this list).
  • Stable and strong: parity with BF16 on reasoning benchmarks.
  • Efficiency wins: faster training, lower memory, higher throughput.
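
As a rough illustration of the hybrid idea, the sketch below assigns FP8 to compute-heavy linear layers while keeping numerically sensitive modules in BF16. The module list and policy are assumptions drawn from common FP8 practice, not the paper's exact recipe:

    import torch.nn as nn

    # A hedged sketch of hybrid precision assignment. Which modules stay
    # in BF16 is an assumption based on common FP8 practice, not the
    # paper's verified module list.
    SENSITIVE_TYPES = (nn.Embedding, nn.LayerNorm)

    def assign_precision(model: nn.Module) -> dict:
        """Map module names to 'fp8' or 'bf16' under a simple policy:
        compute-heavy Linear layers run their matmuls in FP8, while
        numerically sensitive modules (embeddings, norms, the output
        head) stay in BF16."""
        plan = {}
        for name, module in model.named_modules():
            if isinstance(module, SENSITIVE_TYPES) or name.endswith("lm_head"):
                plan[name] = "bf16"
            elif isinstance(module, nn.Linear):
                plan[name] = "fp8"
        return plan

    # Example: plan = assign_precision(my_transformer); a trainer would
    # then route 'fp8' entries through FP8 GEMM kernels and leave the
    # 'bf16' entries untouched.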

By establishing FP8 as a robust, production-ready alternative to BF16, and by committing to open-source the code, InfiR2 lowers the barrier to innovation for teams building and iterating on reasoning-enhanced models under real-world compute budgets.

Paper: http://arxiv.org/abs/2509.22536v1
Register: https://www.AiFeta.com

#LLM #FP8 #TrainingEfficiency #Quantization #Reasoning #Scaling #DeepLearning
