Be My Eyes: Small 'eyes', big 'brain'—a modular path to multimodal AI
LLMs are great thinkers—but they’re mostly text-only. BeMyEyes is a new way to give them “sight” without building giant, expensive multimodal models.
- Two agents, one goal: a lean Perceiver (a vision-language model) examines images or other non-text inputs, while a powerful text-only Reasoner LLM thinks through the answer; the two collaborate through conversation (see the sketch after this list).
- Smart training: synthetic data and supervised fine-tuning teach the Perceiver how to best brief the Reasoner.
- Why it matters: Keeps the broad knowledge and reasoning of frontier LLMs, avoids heavy multimodal training, and makes adding new domains/modalities flexible.
- Results: an all-open-source stack (text-only DeepSeek-R1 as the Reasoner, Qwen2.5-VL-7B as the Perceiver) outperforms large proprietary systems such as GPT-4o on many knowledge-heavy multimodal tasks.
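For intuition, here is a minimal Python sketch of how such a Perceiver/Reasoner conversation loop could look. Everything here is an illustrative assumption rather than the paper's actual interface: the function names, the prompts, and the "ASK:"/"ANSWER:" convention are made up, and the model wrappers are placeholders you would replace with real Qwen2.5-VL and DeepSeek-R1 clients.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "perceiver" or "reasoner"
    content: str

def call_perceiver(image_path: str, request: str) -> str:
    # Placeholder: swap in a real VLM client (e.g. Qwen2.5-VL-7B served locally).
    return f"[VLM description of {image_path} for request: {request!r}]"

def call_reasoner(transcript: list[Turn], question: str) -> str:
    # Placeholder: swap in a real text-only LLM client (e.g. DeepSeek-R1).
    # A real Reasoner would read the transcript; this stub answers immediately
    # so the demo stays self-contained.
    return "ANSWER: (placeholder answer derived from the perceiver's brief)"

def be_my_eyes(image_path: str, question: str, max_rounds: int = 3) -> str:
    """Alternate perception and reasoning until the Reasoner commits to an answer."""
    transcript: list[Turn] = []
    # Round 0: the Perceiver briefs the Reasoner on what it sees.
    brief = call_perceiver(image_path, f"Describe what is relevant to: {question}")
    transcript.append(Turn("perceiver", brief))

    for _ in range(max_rounds):
        reply = call_reasoner(transcript, question)
        transcript.append(Turn("reasoner", reply))
        # Assumed convention: the Reasoner prefixes follow-up visual questions
        # with "ASK:" and final answers with "ANSWER:".
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        follow_up = reply.removeprefix("ASK:").strip()
        transcript.append(Turn("perceiver", call_perceiver(image_path, follow_up)))

    return transcript[-1].content  # give up after max_rounds

if __name__ == "__main__":
    print(be_my_eyes("chart.png", "What trend does the chart show?"))
```

The key property the sketch tries to capture: the Reasoner never sees pixels, only the Perceiver's text, so the frontier LLM's knowledge and reasoning stay untouched and adding a new modality only requires a small Perceiver.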
BeMyEyes shows a modular, scalable path for future multimodal AI—mix and match the best “eyes” with the best “brains.”
Paper: https://arxiv.org/abs/2511.19417v1
Register: https://www.AiFeta.com
#AI #multimodal #LLM #VLM #opensource #ComputerVision #agents #DeepSeek #Qwen #BeMyEyes #research