See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation
Turning natural-language intent into flight paths with waypoint grounding—no training required.
See, Point, Fly (SPF) reimagines vision-and-language navigation for drones by treating action selection as spatial grounding—not text generation. Instead of “talking” a UAV through step-by-step actions, SPF asks a vision-language model (VLM) to iteratively mark 2D waypoints on the live camera feed. These waypoints, paired with an adaptively chosen travel distance, are then lifted into 3D displacement vectors that the drone can execute immediately.
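To make the lifting step concrete, here is a minimal sketch of how a 2D waypoint plus a travel distance could become a 3D displacement. It is not the authors' implementation. It assumes an ideal pinhole camera with square pixels, a principal point at the image centre, and a known horizontal field of view. The function name `waypoint_to_displacement` and all of its parameters are hypothetical.

```python
import math


def waypoint_to_displacement(u, v, distance_m, width, height, hfov_deg):
    """Lift a pixel waypoint (u, v) and a travel distance into a 3D displacement.

    Assumption (not from the paper): ideal pinhole camera, square pixels,
    principal point at the image centre. Returns (right, down, forward)
    in metres, expressed in the camera frame.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    f = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Direction of the ray that passes through the chosen pixel.
    x = (u - width / 2.0) / f
    y = (v - height / 2.0) / f
    z = 1.0

    # Normalise the ray, then scale it to the requested travel distance.
    norm = math.sqrt(x * x + y * y + z * z)
    scale = distance_m / norm
    return (x * scale, y * scale, z * scale)
```

A waypoint at the image centre yields a purely forward displacement, while off-centre waypoints add lateral and vertical components. Converting the camera-frame vector into velocity or position commands for a specific drone is platform dependent and is left out here.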
The result is a training-free, closed-loop controller that follows natural, free-form instructions—across goals, environments, and even moving targets. SPF’s adaptive distance adjustment accelerates progress in open spaces yet tightens control when precision matters. Because the VLM only needs to ground language in image space, SPF generalizes across different VLM backbones without custom fine-tuning.
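The closed-loop behaviour with adaptive distance can be sketched as follows. This is an illustration under stated assumptions, not SPF's actual controller: the VLM is represented by a hypothetical callable `query_vlm` that returns a pixel waypoint plus a "clearance" score in [0, 1], and the power-law mapping in `adaptive_step` is one plausible way to take longer steps in open space and shorter ones near obstacles or targets. The step limits and exponent are made-up defaults.

```python
def adaptive_step(clearance, min_step=0.3, max_step=3.0, gamma=2.0):
    """Map a clearance score in [0, 1] to a travel distance in metres.

    Assumption: a power-law curve, so steps stay short until the scene
    is judged clearly open. All default values are illustrative only.
    """
    c = min(max(clearance, 0.0), 1.0)  # clamp out-of-range scores
    return min_step + (max_step - min_step) * c ** gamma


def run_closed_loop(get_frame, query_vlm, send_waypoint, max_iters=50):
    """Repeat see -> point -> fly until the VLM reports the task is done.

    Hypothetical callables:
      get_frame()      -> current camera image
      query_vlm(frame) -> None when finished, else
                          {"u": int, "v": int, "clearance": float}
      send_waypoint(u, v, distance_m) -> executes one move
    Returns the number of moves that were executed.
    """
    for moves in range(max_iters):
        frame = get_frame()          # re-observe after every move
        result = query_vlm(frame)
        if result is None:           # task reported complete
            return moves
        distance = adaptive_step(result["clearance"])
        send_waypoint(result["u"], result["v"], distance)
    return max_iters
```

Because the frame is re-queried after every move, a target that shifts between steps is simply re-grounded in the next image, which is what makes following moving targets possible without any explicit tracking model.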
Results: on a DRL simulation benchmark, SPF sets a new state of the art, outperforming the previous best method by an absolute margin of 63%. Extensive real-world trials show consistent gains over strong baselines, and ablations clarify how waypoint grounding, distance adaptation, and closed-loop control each contribute.
Why it matters: from search-and-rescue and infrastructure inspection to agriculture, filming, and security, operators can steer drones with natural language that translates into grounded spatial actions. SPF reduces the brittleness of text-only action generation, removes the cost of training, and can pursue moving targets in changing scenes.
Practical notes: SPF assumes a forward-facing camera and reasonable visual observability; extreme lighting, occlusions, or severe domain shifts may require additional sensing or safety bounds.
Paper: http://arxiv.org/abs/2509.22653v1
#AI #Robotics #UAV #VLM #Navigation #ComputerVision #Autonomy