ArXiv TLDR

Count Anything at Any Granularity

2605.10887

Chang Liu, Haoning Wu, Weidi Xie

cs.CV

TLDR

This paper reformulates open-world object counting as multi-grained counting, and introduces KubriCount, a large automatically generated dataset, and HieraCount, a model that substantially improves multi-grained counting accuracy.

Key contributions

  • Redefines open-world counting as "multi-grained": visual exemplars specify target appearance, while fine-grained text (with optional negative prompts) specifies the intended semantic granularity.
  • Introduces KubriCount, a large counting dataset built with a fully automatic scaling pipeline, supporting both training and multi-grained evaluation.
  • Reveals that both multimodal LLMs and specialist counting models fail to follow prompts under fine-grained distinctions.
  • Presents HieraCount, a model that jointly leverages text and visual exemplars and significantly improves multi-grained counting accuracy.
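As a rough illustration of what an explicit-granularity query might look like, here is a minimal sketch in Python. The class and field names (`CountingQuery`, `exemplar_boxes`, `negative_text`) are hypothetical, not taken from the paper's code; only the five granularity levels and the prompt components (text, visual exemplars, optional negatives) come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Granularity(Enum):
    """The five explicit semantic levels named in the abstract."""
    IDENTITY = 1       # a specific identity ("this exact mug")
    ATTRIBUTE = 2      # an attribute ("red mugs")
    INSTANCE_TYPE = 3  # an instance type ("espresso mugs")
    CATEGORY = 4       # a category ("mugs")
    CONCEPT = 5        # an abstract concept ("containers")

@dataclass
class CountingQuery:
    """Hypothetical multi-grained counting prompt: text plus optional
    visual exemplars and a negative prompt pin down what to count."""
    text: str
    granularity: Granularity
    exemplar_boxes: list = field(default_factory=list)  # boxes marking target appearance
    negative_text: Optional[str] = None                 # what NOT to count

# Example: count red mugs, excluding blue ones, at the attribute level.
query = CountingQuery(
    text="red mugs",
    granularity=Granularity.ATTRIBUTE,
    exemplar_boxes=[(12, 40, 80, 110)],
    negative_text="blue mugs",
)
```

The point of such a structure is that "what to count" is no longer a single category-level match: the same image admits different correct counts depending on the chosen granularity level.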

Why it matters

This paper addresses a key limitation in open-world object counting by making granularity explicit rather than leaving it implicit in category-level matching. It provides both a task reformulation and a comprehensive dataset, paving the way for counting systems that follow user intent, and it exposes prompt-following failures that current VLMs must overcome in real-world applications.

Original Abstract

Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.
