CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
RL with verifiable rewards that define caption quality by downstream utility
Supervised captioning has hit a ceiling: it’s costly and tends to memorize specific answers. CapRL reframes the objective with Reinforcement Learning with Verifiable Rewards (RLVR) for open-ended captioning. The key insight: a high-quality caption should be useful—it should enable a vision-free LLM to answer questions about the image. CapRL implements a decoupled, two-stage pipeline: an LVLM generates a caption; then a separate LLM answers multiple-choice questions using only that caption. The reward is the answer accuracy, providing an objective, task-grounded signal for training.
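A minimal sketch of this reward, assuming the paper's setup of caption-conditioned MCQ accuracy. The function and data names (`caprl_reward`, `toy_answer_fn`, the MCQ dicts) are illustrative, and the toy answerer is a stand-in for the real vision-free LLM:

```python
def caprl_reward(caption, mcqs, answer_fn):
    """Verifiable reward: fraction of multiple-choice questions a
    vision-free answerer gets right using only the caption (in [0, 1])."""
    correct = sum(
        answer_fn(caption, q["question"], q["options"]) == q["answer"]
        for q in mcqs
    )
    return correct / len(mcqs)

def toy_answer_fn(caption, question, options):
    # Stand-in for the second-stage LLM: pick an option the caption mentions,
    # else fall back to the first option.
    for opt in options:
        if opt.lower() in caption.lower():
            return opt
    return options[0]

mcqs = [
    {"question": "What animal is shown?", "options": ["dog", "cat"], "answer": "cat"},
    {"question": "What color is the sofa?", "options": ["red", "blue"], "answer": "red"},
]
dense = "A gray cat sleeps on a red sofa near a window."
sparse = "An animal indoors."
print(caprl_reward(dense, mcqs, toy_answer_fn))   # 1.0
print(caprl_reward(sparse, mcqs, toy_answer_fn))  # 0.5
```

Because the reward is just checked answer accuracy, it is objective and cheap to verify, which is what makes it usable as an RLVR signal: denser captions earn higher rewards by letting the answerer resolve more questions.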
By optimizing captions for utility rather than imitation, CapRL stimulates denser, more informative descriptions that generalize. Pretraining on the CapRL-5M dataset (annotated by CapRL-3B) delivers substantial gains across 12 benchmarks. Within the Prism caption-quality framework, CapRL reaches performance comparable to a much larger LVLM (Qwen2.5-VL-72B), outperforming the SFT baseline by an average of 8.4%.
- Utility-driven objective: captions are rewarded for enabling correct QA.
- Decoupled RLVR: LVLM captions; a vision-free LLM verifies via MCQ.
- Scalable data: CapRL-5M supports broad pretraining without proprietary labels.
The result is a practical recipe for building captioners that inform downstream tasks—from VQA to retrieval—while reducing reliance on expensive human or proprietary supervision.
Paper: http://arxiv.org/abs/2509.22647v1
Register: https://www.AiFeta.com
#VisionLanguage #ImageCaptioning #RL #RLVR #LVLM #Evaluation #VQA #MultimodalAI