From Web to Pixels: Bringing Agentic Search into Visual Perception
Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, et al.
TLDR
This paper introduces WebEye, a benchmark, and Pixel-Searcher, a model, for visual perception tasks requiring external knowledge and agentic search.
Key contributions
- Formalizes "Perception Deep Research" for open-world visual perception requiring external facts.
- Introduces WebEye, a new object-anchored benchmark with verifiable evidence and knowledge-intensive queries.
- Proposes Pixel-Searcher, an agentic search-to-pixel workflow for resolving hidden target identities.
- Pixel-Searcher achieves strong open-source performance across WebEye's three task views.
Why it matters
This work addresses a critical gap in visual perception by tackling open-world scenarios where identifying a visible object requires external knowledge. It provides both a challenging benchmark and a strong baseline model, pushing the field toward more practical, knowledge-intensive visual AI.
Original Abstract
Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.
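The abstract describes a two-stage agentic loop: first resolve the hidden target's identity through external search, then bind that identity to pixels. A minimal sketch of that control flow is below; every name in it (`web_search`, `resolve_identity`, `ground_in_image`, the hop limit) is a hypothetical stand-in for illustration, not the paper's actual implementation or API.

```python
# Hypothetical sketch of a search-to-pixel agentic loop: resolve the
# target's identity via external evidence, then ground it in the image.
# All function names and logic here are illustrative assumptions.

def web_search(query: str) -> list[str]:
    # Stand-in for an external search tool returning evidence snippets.
    return [f"evidence for: {query}"]

def resolve_identity(question: str, max_hops: int = 3) -> str:
    """Iteratively gather evidence until the hidden target is named."""
    evidence: list[str] = []
    query = question
    for _ in range(max_hops):
        evidence.extend(web_search(query))
        # A real agent would use an LLM to judge whether the identity is
        # resolved, or emit a refined follow-up query; we stub that out.
        if evidence:
            return evidence[-1]
        query = f"{question} (refined)"
    return question  # fall back to the original query

def ground_in_image(identity: str) -> dict:
    # Stand-in for the perception step binding the resolved identity
    # to a box, mask, or grounded answer.
    return {"target": identity, "box": [0, 0, 100, 100]}

def pixel_search(question: str) -> dict:
    return ground_in_image(resolve_identity(question))

result = pixel_search("Which landmark hosted the most recent summit?")
print(result["box"])
```

The key structural point the sketch captures is the ordering: identity resolution (possibly multi-hop) completes before any localization is attempted, matching the paper's framing that evidence acquisition and identity resolution precede visual instance binding.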