Teaching Vision-Language Agents with Deliberate Practice (DPPO)

Building capable embodied AI is hard: real-world data is scarce and training is expensive. This paper introduces Deliberate Practice Policy Optimization (DPPO), a coach-like training loop that helps vision-language agents learn more from less.

How it works: the system alternates between learning from examples (to expand skills) and trial-and-error reinforcement learning (to polish them). It identifies its own weaknesses and focuses practice where it matters most.
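The alternating loop described above can be sketched in miniature. This is a hypothetical illustration only, not the paper's implementation: the skill names, score updates, and the `weakest_skills` selection rule are all stand-ins for the real SFT and RL phases.

```python
import random

def weakest_skills(scores, k=2):
    """Pick the k skills with the lowest scores for targeted practice."""
    return sorted(scores, key=scores.get)[:k]

def imitation_update(scores, skill):
    """Stand-in for supervised learning from examples (skill expansion)."""
    scores[skill] = min(1.0, scores[skill] + 0.10)

def rl_update(scores, skill, rng):
    """Stand-in for trial-and-error refinement with a noisy reward."""
    scores[skill] = min(1.0, scores[skill] + 0.05 * rng.random())

def deliberate_practice(scores, rounds=10, seed=0):
    """Alternate imitation and RL phases, always targeting weak spots."""
    rng = random.Random(seed)
    for _ in range(rounds):
        for skill in weakest_skills(scores):  # focus where it matters most
            imitation_update(scores, skill)   # expand via examples
            rl_update(scores, skill, rng)     # polish via trial and error
    return scores

skills = deliberate_practice({"grasp": 0.3, "navigate": 0.5, "plan": 0.2})
print(skills)
```

The key design idea this toy loop mirrors is that practice effort is reallocated every round toward the currently weakest skills, rather than spread uniformly.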

  • Unified view: framed as preference learning for consistent objectives.
  • Efficiency: maximizes progress with limited, sparse data.
  • Results: Pelican-VL 1.0 trained with DPPO improves on its base model by 20.3% and beats open-source 100B-parameter models by 10.6%.
  • Open source: code and models released for the community.

Bottom line: DPPO is a practical bridge between vision-language models and real-world embodied skills, making versatile agents cheaper and faster to train.

Paper: https://arxiv.org/abs/2511.16602

#AI #EmbodiedAI #Robotics #ReinforcementLearning #VisionLanguage #VLM #MachineLearning #OpenSource
