CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Objective rewards for a subjective task: captions judged by how well they enable QA.
Supervised fine-tuning for captioning tends to memorize ground-truth phrasing and generalizes poorly. CapRL brings Reinforcement Learning with Verifiable Rewards (RLVR) to image captioning by redefining a "good" caption in terms of utility: can a vision-free LLM answer questions about the image from the caption alone?
The two-stage pipeline first has an LVLM generate a caption; a separate, vision-free LLM then answers multiple-choice questions about the image using only that caption, and its accuracy becomes the reward. This decoupled setup turns a subjective task into one with verifiable signals, encouraging captions that are informative, precise, and semantically rich, without overfitting to a single phrasing.
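A minimal sketch of how such a decoupled reward could be computed (the `caption_model` and `qa_llm` callables, the `MCQ` structure, and the prompt format are illustrative assumptions, not the paper's actual implementation):

```python
# Sketch of a CapRL-style decoupled reward: caption an image, then score
# the caption by how well a vision-free LLM answers MCQs from it alone.
from dataclasses import dataclass

@dataclass
class MCQ:
    question: str
    options: list[str]  # e.g. ["A) red", "B) blue", "C) green", "D) gray"]
    answer: str         # gold option letter, e.g. "B"

def caption_reward(image, mcqs, caption_model, qa_llm):
    """Stage 1: LVLM captions the image. Stage 2: QA accuracy is the reward."""
    caption = caption_model(image)  # only the captioner ever sees the image

    correct = 0
    for q in mcqs:
        # The QA model is vision-free: the caption is its sole evidence.
        prompt = (
            f"Passage: {caption}\n"
            f"Question: {q.question}\n"
            + "\n".join(q.options)
            + "\nAnswer with a single option letter."
        )
        pred = qa_llm(prompt).strip()[:1].upper()
        correct += pred == q.answer

    # Verifiable reward: fraction of questions answerable from the caption.
    reward = correct / len(mcqs) if mcqs else 0.0
    return caption, reward
```

The key design choice this illustrates is the information bottleneck: because the QA model never sees the image, the caption is rewarded only for the visual facts it actually conveys, not for matching any reference phrasing.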
Pretraining on CapRL-5M (a dataset captioned by the 3B CapRL model) yields substantial gains across 12 benchmarks. Within the Prism framework for caption quality evaluation, CapRL matches the performance of the much larger Qwen2.5-VL-72B and exceeds the baseline by an average of 8.4%. The approach improves open-ended descriptiveness while maintaining factual grounding.
Why it matters: stronger captions amplify downstream utility—from retrieval and VQA to accessibility and content moderation. By rewarding captions that enable reliable QA, CapRL aligns training with practical goals rather than surface form.
Future work may refine reward design (e.g., counterfactual QA, hallucination penalties) and extend to dense region-level descriptions and video.
Paper: http://arxiv.org/abs/2509.22647v1
Register: https://www.AiFeta.com
#AI #ComputerVision #Captioning #RL #LVLM #Evaluation #VQA