See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

A training-free VLM approach that turns language into waypoint-guided UAV control

See, Point, Fly (SPF) lets unmanned aerial vehicles (UAVs) follow natural-language commands without any additional training. Instead of treating action prediction as text generation, SPF reframes aerial vision-and-language navigation as a 2D spatial grounding task. The system decomposes an open-ended instruction into a sequence of waypoints annotated directly on the camera image, one per iteration, then converts each 2D waypoint and its adaptive travel distance into a 3D displacement vector that the UAV can execute.
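To make the last step concrete, here is a minimal Python sketch of how a 2D waypoint and a travel distance could become a 3D displacement. It is an illustration, not the paper's implementation: it assumes an ideal pinhole camera with square pixels, a known horizontal field of view, and a camera frame with x right, y down, z forward. The function name and parameters are hypothetical.

```python
import math


def waypoint_to_displacement(u, v, distance_m, width, height, hfov_deg):
    """Map a pixel waypoint (u, v) and a travel distance to a 3D displacement.

    Assumptions (not from the paper): pinhole camera, square pixels,
    camera frame with x right, y down, z forward.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Direction of the ray through the pixel, relative to the image centre.
    ray = (u - width / 2.0, v - height / 2.0, focal_px)
    norm = math.sqrt(sum(c * c for c in ray))

    # Scale the unit ray so the displacement has the requested length.
    return tuple(distance_m * c / norm for c in ray)
```

A waypoint at the image centre yields pure forward motion; a waypoint to the right of centre adds a positive x component while keeping the total length equal to the requested distance.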

By closing the loop between perception, instruction understanding, and control, SPF can follow moving targets in changing environments. Its adaptive distance mechanism speeds up navigation by modulating the step size for efficiency and stability. SPF also generalizes across a range of vision-language models (VLMs), so the underlying model can be swapped as new VLMs emerge.

  • Key idea: treat language-guided navigation as spatial grounding, not as text generation of actions.
  • How it works: iteratively point to 2D waypoints, predict distance, convert to 3D actions.
  • Closed-loop control: continuously updates waypoints to track moving objects and changing scenes.
  • Efficiency: adaptive step sizing speeds up progress while keeping flight stable.
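The loop described in the bullets above can be sketched in a few lines of Python. This is a hedged outline under stated assumptions, not SPF's actual code: the three callables (get_frame, query_vlm, send_displacement) are hypothetical stand-ins for the camera, the VLM, and the flight controller; the VLM is assumed to return a pixel waypoint plus a discrete "how far" label, or None when the task is done; and the linear label-to-metres mapping is only one possible choice of step sizing.

```python
from typing import Any, Callable, Optional, Tuple

# (u, v, distance label) as returned by the hypothetical VLM wrapper.
Waypoint = Tuple[float, float, int]


def step_length(label: int, max_label: int = 5,
                min_m: float = 0.2, max_m: float = 2.0) -> float:
    """Map a discrete distance label to a step length in metres.

    Linear mapping chosen for illustration only; labels are clamped
    to the range [1, max_label].
    """
    label = min(max(label, 1), max_label)
    frac = (label - 1) / (max_label - 1)
    return min_m + (max_m - min_m) * frac


def navigate(get_frame: Callable[[], Any],
             query_vlm: Callable[[Any, str], Optional[Waypoint]],
             send_displacement: Callable[[float, float, float], None],
             instruction: str,
             max_steps: int = 50) -> int:
    """Run the see-point-fly loop and return the number of moves issued."""
    for step in range(max_steps):
        frame = get_frame()                       # see: grab the latest image
        waypoint = query_vlm(frame, instruction)  # point: ask for a 2D waypoint
        if waypoint is None:                      # VLM signals task completion
            return step
        u, v, label = waypoint
        send_displacement(u, v, step_length(label))  # fly: short move, then re-observe
    return max_steps
```

Because the waypoint is re-queried on every fresh frame, a moving target simply shows up at a different pixel location in the next iteration, which is what makes the loop closed.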

The results are striking. On a DRL simulation benchmark, SPF sets a new state of the art, outperforming the previous best method by 63 percentage points (absolute). Extensive real-world tests show large gains over strong baselines, and ablation studies validate each design choice. Because SPF is training-free and VLM-agnostic, teams can deploy it in new settings, from infrastructure inspection to search-and-rescue, without costly data collection or fine-tuning.

Ready to see waypoints turn words into flight?

Paper: http://arxiv.org/abs/2509.22653v1
Register: https://www.AiFeta.com

#VLM #Robotics #UAV #Navigation #ComputerVision #Multimodal #AerialAI #SpatialGrounding
