InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

An end-to-end FP8 recipe that’s stable, lossless vs BF16, and measurably faster for reasoning LLMs.

Training frontier LLMs is costly. FP8 promises efficiency, but real-world adoption has lacked a robust, open recipe. InfiR2 fills that gap with an end-to-end FP8 training methodology spanning continual pretraining and supervised fine-tuning, underpinned by a fine-grained, hybrid-granularity quantization strategy that preserves numerical fidelity.
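The paper's exact hybrid-granularity scheme isn't reproduced here, but the general idea behind fine-grained FP8 quantization can be illustrated with block-wise scaling: each small block of values gets its own scale so that outliers in one block don't destroy precision elsewhere. Below is a minimal NumPy sketch that simulates E4M3 rounding (4 exponent bits, 3 mantissa bits, max value 448) with per-block scales. The function names, the 128-element block size, and the simplifications (no subnormals, no exponent clamping) are all illustrative assumptions, not the authors' implementation:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def round_to_e4m3(x):
    """Simulate E4M3 rounding: keep 1 implicit + 3 mantissa bits.

    Simplified sketch: ignores subnormals and the E4M3 exponent floor.
    """
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    m, e = np.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16.0) / 16.0   # round mantissa to 4 significant bits
    return np.ldexp(m, e)


def fp8_blockwise_roundtrip(w, block=128):
    """Quantize-dequantize with one scale per `block` contiguous values.

    The per-block scale maps each block's max magnitude onto E4M3_MAX,
    which is the "fine-grained" part: a single outlier only degrades
    precision within its own block.
    """
    orig_shape = w.shape
    wb = w.reshape(-1, block)
    scale = np.max(np.abs(wb), axis=1, keepdims=True) / E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid divide-by-zero
    q = round_to_e4m3(wb / scale)               # "stored" FP8 values
    return (q * scale).reshape(orig_shape)      # dequantized back to float


# Per-element relative error stays within ~2**-4 (about 6%), since
# rounding to 4 significant bits loses at most half a ulp of the mantissa.
rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
dq = fp8_blockwise_roundtrip(w)
print(np.max(np.abs(dq - w) / np.abs(w)))
```

Production recipes keep the quantized tensor in a real FP8 storage type and feed it to FP8 matmul kernels; this sketch only models the numerics of the round trip.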

Across extensive experiments, including continual pretraining on a 160B-token corpus, InfiR2 matches BF16 baselines on a suite of reasoning benchmarks while delivering notable efficiency gains: up to 22% shorter training time, 14% lower peak memory, and 19% higher throughput. Crucially, the approach demonstrates stability at scale and near-lossless quality, addressing the practical concerns that have hampered FP8 adoption.

Why it matters: reasoning-enhanced LLMs often demand longer contexts, deeper stacks, and more tokens. An FP8 pipeline that behaves predictably enables faster iteration cycles, cheaper ablations, and broader access for labs and startups. The team commits to releasing code, positioning InfiR2 as a foundation others can build on, adapt to new hardware, and extend to multi-modal stacks.

Expect downstream work to explore FP8-aware optimizers, activation scaling strategies for long-context regimes, and push-button migration guides from BF16.

Paper: http://arxiv.org/abs/2509.22536v1

Register: https://www.AiFeta.com

#AI #LLM #FP8 #Training #Efficiency #Quantization #Reasoning
