DeepEyesV2: Teaching AI to Use Tools

AI that sees, thinks—and uses tools

Meet DeepEyesV2, a multimodal “agentic” model that doesn’t just read text and look at images—it can call external tools like code runners and web search, then weave the results into its reasoning.
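To make "calling a tool and weaving the result back into the reasoning" concrete, here is a minimal Python sketch of an agentic tool-call loop. The tag format, the `fake_model` stub, and the loop structure are illustrative assumptions for this post, not DeepEyesV2's actual interface or prompt format.

```python
import io
import contextlib

def run_python(code: str) -> str:
    """Execute a model-emitted code snippet and capture its stdout (sandboxing omitted)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"error: {e}"
    return buf.getvalue().strip()

def fake_model(context: str) -> str:
    """Stand-in for the multimodal model: first emits a tool call, then a final answer."""
    if "<tool_result>" not in context:
        return "<tool>python</tool><code>print(17 * 23)</code>"
    return "<answer>17 * 23 = 391</answer>"

def agent_loop(question: str, max_turns: int = 4) -> str:
    """Alternate between model turns and tool executions until an answer appears."""
    context = question
    for _ in range(max_turns):
        reply = fake_model(context)
        if "<answer>" in reply:
            return reply
        # Extract the code block, run it, and feed the result back into the context.
        code = reply.split("<code>")[1].split("</code>")[0]
        result = run_python(code)
        context += f"\n{reply}\n<tool_result>{result}</tool_result>"
    return "<answer>no answer within turn budget</answer>"

print(agent_loop("What is 17 * 23? Use a tool if helpful."))
```

The same loop generalizes to other tools (image cropping, web search) by swapping the executor behind the tool tag.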

Key ideas:

  • Two-stage training: a cold-start phase teaches basic tool-use patterns; reinforcement learning then refines when and how to invoke tools.
  • Curated data that rewards tool use, not just perception—so the model learns when tools actually help.
  • RealX-Bench: a new benchmark that tests real-world multimodal reasoning requiring perception, search, and logic.

What they found: direct reinforcement learning wasn’t enough to spark reliable tool use. The two-stage pipeline led to task-adaptive behavior—image operations for perception tasks, calculators/code for math and logic—and enabled more complex, context-aware tool chains.
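To show the stage ordering at a glance, here is a toy sketch of the cold-start-then-RL pipeline described above. The `Policy` class, the numbers, and the update rule are placeholders invented for illustration; they are not the paper's training recipe.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """Toy stand-in for the multimodal policy; real training updates model weights."""
    tool_use_rate: float = 0.0

def cold_start_sft(policy: Policy) -> Policy:
    # Stage 1: imitate curated tool-use trajectories so tool calls appear at all.
    policy.tool_use_rate = 0.5  # placeholder value
    return policy

def rl_refine(policy: Policy, episodes: int = 100) -> Policy:
    # Stage 2: reinforce tool calls only when they improve the final answer,
    # nudging the policy toward task-adaptive invocation.
    for _ in range(episodes):
        tool_helped = True  # placeholder for a rollout plus answer verification
        policy.tool_use_rate += 0.001 if tool_helped else -0.001
    policy.tool_use_rate = min(policy.tool_use_rate, 1.0)
    return policy

policy = rl_refine(cold_start_sft(Policy()))
print(f"tool-use rate after both stages: {policy.tool_use_rate:.2f}")
```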

Results: DeepEyesV2 performs well on RealX-Bench and other benchmarks spanning real-world understanding, mathematical reasoning, and search-heavy tasks.

By Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu. Paper: http://arxiv.org/abs/2511.05271v1

Register: https://www.AiFeta.com

#ai #multimodal #agentic #tooluse #reinforcementlearning #computervision #llm #benchmarks #research
