In-Video Instructions: Visual Signals as Generative Control

What if you could direct a video generator by doodling right on the frames?

The paper introduces In-Video Instruction: instead of long, vague text prompts, you add visual cues—overlaid words, arrows, or motion paths—inside the image. Each cue acts as a concrete instruction tied to a specific object.

  • Explicit and spatial: instructions sit exactly on the target.
  • Unambiguous: different objects can carry different commands.
  • Scalable: works in complex, multi-object scenes where text alone falls short.

Across three state-of-the-art generators—Veo 3.1, Kling 2.5, and Wan 2.2—the models reliably read and execute these on-screen directions, especially in multi-object settings.

Example: label a red car "turn left" and sketch a curved arrow; the model animates the car along that path in upcoming frames.
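An overlay like the one in that example can be produced with a few lines of image editing before the frame is handed to the generator. The sketch below is illustrative only (the helper name, colors, and coordinates are assumptions, not the paper's code); it stamps a text command next to a target and draws a curved-ish motion path with an arrowhead using Pillow:

```python
from PIL import Image, ImageDraw

def add_in_video_instruction(frame, label, label_xy, path_points):
    """Overlay a text command and a motion-path arrow on a frame.

    label_xy anchors the command near the target object;
    path_points traces the desired trajectory.
    (Illustrative helper, not the paper's official code.)
    """
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)
    draw.text(label_xy, label, fill="yellow")          # on-screen command
    draw.line(path_points, fill="yellow", width=3)     # motion path
    # simple arrowhead at the end of the path
    x, y = path_points[-1]
    draw.polygon([(x, y), (x - 10, y - 6), (x - 10, y + 6)], fill="yellow")
    return annotated

# Example: annotate a placeholder frame with a "turn left" cue
frame = Image.new("RGB", (320, 180), "gray")
annotated = add_in_video_instruction(
    frame, "turn left", (40, 30), [(60, 120), (130, 100), (210, 110)]
)
```

The annotated frame then serves as the conditioning image for the video model, which reads the drawn command and path as the instruction.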

Paper: https://arxiv.org/abs/2511.19401v1

Register: https://www.AiFeta.com

#AI #GenerativeAI #VideoGeneration #ComputerVision #Research #HCI #UX
