In-Video Instructions: Visual Signals as Generative Control
What if you could direct a video generator by doodling right on the frames?
The paper introduces In-Video Instruction: instead of long, ambiguous text prompts, you add visual cues (overlaid words, arrows, or motion paths) directly on the frames themselves. Each cue acts as a concrete instruction tied to a specific object.
- Explicit and spatial: instructions sit exactly on the target.
- Unambiguous: different objects can carry different commands.
- Scalable: works in complex, multi-object scenes where text alone falls short.
Across three state-of-the-art generators—Veo 3.1, Kling 2.5, and Wan 2.2—the models reliably read and execute these on-screen directions, especially in multi-object settings.
Example: label a red car "turn left" and sketch a curved arrow; the model animates the car along that path in upcoming frames.
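The example above can be sketched in code. This is a minimal, hypothetical illustration of preparing a conditioning frame: it draws a text label and a motion-path arrow on an image using Pillow. The function name, coordinates, and colors are illustrative assumptions, not the paper's actual pipeline; the downstream image-to-video model call is not shown.

```python
# Hedged sketch: overlay an in-video instruction on a frame with Pillow.
# All names and coordinates are illustrative assumptions.
from PIL import Image, ImageDraw

def add_instruction(frame, label, anchor, path):
    """Draw a text label and a motion-path arrow onto a copy of the frame."""
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    # Text cue placed directly on the target object.
    draw.text(anchor, label, fill="red")
    # Motion path rendered as a polyline ending in a simple arrowhead.
    draw.line(path, fill="red", width=3)
    (x0, y0), (x1, y1) = path[-2], path[-1]
    dx, dy = x1 - x0, y1 - y0
    draw.line([(x1, y1), (x1 - dx // 4 - dy // 4, y1 - dy // 4 + dx // 4)],
              fill="red", width=3)
    draw.line([(x1, y1), (x1 - dx // 4 + dy // 4, y1 - dy // 4 - dx // 4)],
              fill="red", width=3)
    return out

# Build a conditioning frame for an image-to-video generator (model call omitted).
frame = Image.new("RGB", (640, 360), "gray")
curve = [(320, 200), (300, 170), (260, 150), (210, 145)]  # curved left turn
conditioned = add_instruction(frame, "turn left", (330, 210), curve)
conditioned.save("conditioned_frame.png")
```

The annotated frame would then be passed as the first frame to an image-to-video model, which reads the on-screen cue and animates the object accordingly.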
Paper: https://arxiv.org/abs/2511.19401v1
Register: https://www.AiFeta.com
#AI #GenerativeAI #VideoGeneration #ComputerVision #Research #HCI #UX