Be My Eyes: Small 'eyes', big 'brain'—a modular path to multimodal AI

LLMs are great thinkers—but they’re mostly text-only. BeMyEyes is a new way to give them “sight” without building giant, expensive multimodal models.

  • Two agents, one goal: a lean Perceiver (vision-language model) inspects images or other modalities, while a powerful Reasoner LLM thinks through the answer. They collaborate via conversation (see the sketch after this list).
  • Smart training: synthetic data and supervised fine-tuning teach the Perceiver how to best brief the Reasoner.
  • Why it matters: Keeps the broad knowledge and reasoning of frontier LLMs, avoids heavy multimodal training, and makes adding new domains/modalities flexible.
  • Results: An all open-source stack (text-only DeepSeek-R1 as the Reasoner, Qwen2.5-VL-7B as the Perceiver) outperforms large proprietary systems such as GPT-4o on many knowledge-heavy multimodal tasks.
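
The two-agent setup in the first bullet can be pictured as a short control loop: the Reasoner never sees pixels, it only reads the Perceiver's text briefings, and it keeps asking for more visual detail until it can answer. Here is a minimal Python sketch of that pattern, assuming hypothetical `call_perceiver` / `call_reasoner` helpers; these are stand-ins for whatever inference endpoints serve the VLM and the text-only LLM, not the paper's actual prompts or protocol.

```python
# Minimal sketch of the Perceiver/Reasoner collaboration loop described above.
# call_perceiver and call_reasoner are HYPOTHETICAL stand-ins: wire them to
# your own VLM (e.g. Qwen2.5-VL-7B) and text-only LLM (e.g. DeepSeek-R1).

def call_perceiver(image_path: str, question: str) -> str:
    """Hypothetical: ask the small VLM to describe the image w.r.t. the question."""
    raise NotImplementedError("connect to a VLM inference endpoint")

def call_reasoner(transcript: list[dict]) -> dict:
    """Hypothetical: send the running conversation to the text-only reasoner.

    Returns either {'ask': <follow-up visual question>} or {'answer': <final answer>}.
    """
    raise NotImplementedError("connect to an LLM inference endpoint")

def be_my_eyes(image_path: str, user_question: str, max_turns: int = 4) -> str:
    # The Reasoner only ever consumes text: the Perceiver's briefings.
    transcript = [{"role": "user", "content": user_question}]
    query = user_question
    for _ in range(max_turns):
        briefing = call_perceiver(image_path, query)       # "eyes": image -> text
        transcript.append({"role": "perceiver", "content": briefing})
        step = call_reasoner(transcript)                    # "brain": reason over text
        if "answer" in step:
            return step["answer"]
        query = step["ask"]                                 # ask for more visual detail
        transcript.append({"role": "reasoner", "content": query})
    # Out of turns: force a final answer from whatever has been gathered.
    return call_reasoner(transcript).get("answer", "")
```

Because the Reasoner only consumes text, it can be swapped for any frontier LLM without multimodal retraining, which is the modularity the post highlights.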

BeMyEyes shows a modular, scalable path for future multimodal AI—mix and match the best “eyes” with the best “brains.”

Paper: https://arxiv.org/abs/2511.19417v1

Register: https://www.AiFeta.com

#AI #Multimodal #LLM #VLM #OpenSource #ComputerVision #Agents #DeepSeek #Qwen #BeMyEyes #Research