CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
RL with verifiable rewards that define caption quality by downstream utility
Supervised captioning has hit a ceiling: it’s costly and tends to memorize specific answers. CapRL reframes the objective with Reinforcement Learning with Verifiable Rewards (RLVR) for open-ended captioning. The key insight: a high-quality caption should be useful—it should enable a vision-free LLM to answer questions about the image. CapRL implements a decoupled, two-stage pipeline: an LVLM generates a caption; then a separate LLM answers multiple-choice questions using only that caption. The reward is the answer accuracy, providing an objective, task-grounded signal for training.
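A minimal sketch of this reward, assuming the paper's setup of caption-conditioned MCQ accuracy. The function and data names (`caprl_reward`, `toy_answer_fn`, the MCQ dicts) are illustrative, and the toy answerer is a stand-in for the real vision-free LLM:

```python
def caprl_reward(caption, mcqs, answer_fn):
    """Verifiable reward: fraction of multiple-choice questions a
    vision-free answerer gets right using only the caption (in [0, 1])."""
    correct = sum(
        answer_fn(caption, q["question"], q["options"]) == q["answer"]
        for q in mcqs
    )
    return correct / len(mcqs)

def toy_answer_fn(caption, question, options):
    # Stand-in for the second-stage LLM: pick an option the caption mentions,
    # else fall back to the first option.
    for opt in options:
        if opt.lower() in caption.lower():
            return opt
    return options[0]

mcqs = [
    {"question": "What animal is shown?", "options": ["dog", "cat"], "answer": "cat"},
    {"question": "What color is the sofa?", "options": ["red", "blue"], "answer": "red"},
]
dense = "A gray cat sleeps on a red sofa near a window."
sparse = "An animal indoors."
print(caprl_reward(dense, mcqs, toy_answer_fn))   # 1.0
print(caprl_reward(sparse, mcqs, toy_answer_fn))  # 0.5
```

Because the reward is just checked answer accuracy, it is objective and cheap to verify, which is what makes it usable as an RLVR signal: denser captions earn higher rewards by letting the answerer resolve more questions.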
By optimizing captions for utility rather than imitation, CapRL stimulates denser, more informative descriptions that generalize. Pretraining on the CapRL-5M dataset (annotated by CapRL-3B) delivers substantial gains across 12 benchmarks. Within the Prism caption-quality framework, CapRL reaches performance comparable to a much larger LVLM (Qwen2.5-VL-72B), outperforming the SFT baseline by an average of 8.4%.
- Utility-driven objective: captions are rewarded for enabling correct QA.
- Decoupled RLVR: LVLM captions; a vision-free LLM verifies via MCQ.
- Scalable data: CapRL-5M supports broad pretraining without proprietary labels.
The result is a practical recipe for building captioners that inform downstream tasks—from VQA to retrieval—while reducing reliance on expensive human or proprietary supervision.
Paper: http://arxiv.org/abs/2509.22647v1
Register: https://www.AiFeta.com
#VisionLanguage #ImageCaptioning #RL #RLVR #LVLM #Evaluation #VQA #MultimodalAI