RACANet: Reliability-Aware Crowd Anchor Network for RGB-T Crowd Counting

April 27, 2026 · arXiv:2604.24543

Jinghao Shi, Mengqi Lei, Kunliang He, Yun Li, Wei Bao, and one additional author

cs.CV

TLDR

RACANet improves RGB-T crowd counting by explicitly modeling local spatial discrepancies and modality reliability with a two-stage fusion framework.
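The following minimal pure-Python sketch shows one way to make "local spatial discrepancy" and "modality reliability" concrete. It is not the authors' implementation. The cosine-based discrepancy, the exp(-d / tau) weighting, the squared-difference penalty, and the names discrepancy_map, consistency_loss, and tau are all illustrative assumptions. Features are given as one vector per spatial position (a flattened feature map). Reliability is one score per position for each modality.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors (eps avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def discrepancy_map(rgb_feats, t_feats):
    """Per-position discrepancy d = 1 - cos(f_rgb, f_t).

    d is near 0 where the RGB and thermal features agree and grows toward 2
    as they point in opposite directions.
    """
    return [1.0 - cosine(a, b) for a, b in zip(rgb_feats, t_feats)]


def consistency_loss(rgb_rel, t_rel, disc, tau=0.5):
    """Hypothetical discrepancy-aware consistency term.

    Each position is weighted by exp(-d / tau), so consistent positions
    (small d) dominate. There, the penalty pulls the two modality
    reliability scores toward each other.
    """
    weights = [math.exp(-d / tau) for d in disc]
    num = sum(w * (r - s) ** 2 for w, r, s in zip(weights, rgb_rel, t_rel))
    return num / (sum(weights) + 1e-8)


# Toy example: position 0 is consistent across modalities, position 1 is not.
rgb = [[1.0, 0.0], [0.0, 1.0]]
thermal = [[1.0, 0.0], [1.0, 0.0]]
disc = discrepancy_map(rgb, thermal)  # approximately [0.0, 1.0]
loss = consistency_loss([0.9, 0.9], [0.5, 0.1], disc)
```

With these choices, positions where the two modalities disagree contribute little to the penalty, so their reliability scores are free to diverge. Consistent regions are pushed toward matching reliability.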

Key contributions

Introduces RACANet, a two-stage reliability-aware network for RGB-T crowd counting.
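Employs a lightweight cross-modal alignment pretraining stage that learns semantic correspondences through crowd-prior supervision and local bidirectional soft matching.
Proposes a Local Anchor Fusion Module (LAFM) that builds local semantic anchors from highly reliable regions and uses local attention to redistribute features adaptively at the pixel level.
Introduces a discrepancy-aware consistency constraint that dynamically coordinates modality reliability in regions where the RGB and thermal representations are consistent.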
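The small pure-Python sketch below illustrates the local anchor idea behind LAFM. It is only a hedged reading of the description, not the published module. The following are all assumptions:

- the reliability threshold;
- the reliability-weighted mean that forms each anchor;
- the scaled dot-product attention restricted to windows that contain a position;
- the residual addition;
- every function name.

Positions are flattened to a 1-D list, and each window is a list of position indices. The sketch also assumes a single fused feature map with one reliability score per position is already available.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a non-empty list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def local_anchors(feats, reliability, windows, thresh=0.5):
    """One anchor per window: reliability-weighted mean of the features whose
    reliability exceeds `thresh` (falls back to every position in the window)."""
    dim = len(feats[0])
    anchors = []
    for idx in windows:
        keep = [i for i in idx if reliability[i] > thresh] or list(idx)
        total = sum(reliability[i] for i in keep) + 1e-8
        anchors.append(
            [sum(reliability[i] * feats[i][c] for i in keep) / total for c in range(dim)]
        )
    return anchors


def redistribute(feats, anchors, windows):
    """Each position attends only to the anchors of windows containing it
    (scaled dot-product attention) and adds the result to its own feature."""
    dim = len(feats[0])
    out = []
    for i, f in enumerate(feats):
        local = [anchors[k] for k, idx in enumerate(windows) if i in idx]
        if not local:  # position not covered by any window: pass through unchanged
            out.append(list(f))
            continue
        scores = [sum(x * y for x, y in zip(f, a)) / math.sqrt(dim) for a in local]
        attn = softmax(scores)
        mixed = [sum(w * a[c] for w, a in zip(attn, local)) for c in range(dim)]
        out.append([x + m for x, m in zip(f, mixed)])
    return out


# Toy example: 4 positions, 2-D features, two overlapping windows.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reliability = [0.9, 0.2, 0.8, 0.7]
windows = [[0, 1, 2], [1, 2, 3]]
anchors = local_anchors(feats, reliability, windows)
fused = redistribute(feats, anchors, windows)
```

In this toy run, position 1 has reliability 0.2, which is below the threshold. It contributes to neither anchor, but it still receives features redistributed from both. That is one way to realize "aggregate from highly reliable regions, then redistribute at the pixel level" under the assumptions above.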

Why it matters

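Most existing RGB-T crowd counting methods fuse the two modalities implicitly. They do not explicitly model local spatial discrepancies or characterize modality reliability at the level of individual positions. RACANet addresses these limitations with a two-stage fusion framework designed to make the fusion process more interpretable. The authors report that it outperforms existing methods on the RGBT-CC and Drone-RGBT benchmarks.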

Original Abstract

RGB-Thermal (T) crowd counting aims to integrate visible-spectrum and thermal infrared information to improve the robustness of crowd density estimation in complex scenes. Although existing studies generally improve counting accuracy through cross-modal feature fusion, most current methods rely on implicit cross-modal fusion strategies and lack explicit modeling of local spatial discrepancies as well as fine-grained characterization of modality reliability at the positional level, thereby limiting the accuracy and interpretability of the fusion process. To address these issues, this paper proposes a two-stage fusion framework, RACANet, a Reliability-Aware Crowd Anchor Network for RGB-T crowd counting. First, we introduce a lightweight cross-modal alignment pretraining stage, which explicitly learns cross-modal semantic correspondences through crowd-prior supervision and local bidirectional soft matching. Then, based on the priors learned during pretraining, a Local Anchor Fusion Module (LAFM) is introduced in the formal training stage. This module generates local semantic anchors by aggregating features from highly reliable regions and further enables adaptive pixel-level feature redistribution with a local attention mechanism. In addition, we propose a discrepancy-aware consistency constraint to dynamically coordinate the reliability of regions where modal representations are consistent. Experiments conducted on two widely used benchmark datasets, RGBT-CC and Drone-RGBT, demonstrate that RACANet outperforms existing methods. The anonymous code is available at https://anonymous.4open.science/r/RACANet-9985.
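The pretraining stage's local bidirectional soft matching can be sketched in a few lines of pure Python. This is an illustrative assumption, not the authors' code. The following choices are made for the example only:

- the 1-D neighbourhood of size `radius`;
- the temperature-scaled softmax over cosine similarities;
- the 1 - cos alignment loss;
- a per-position crowd-prior weight standing in for crowd-prior supervision.

Every name is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors (eps avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)


def softmax(xs, temp):
    """Temperature-scaled, numerically stable softmax over a non-empty list."""
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def soft_match(src, dst, radius, temp):
    """For each position i in `src`, softly match it against `dst` positions
    with |i - j| <= radius and return the attention-weighted `dst` feature."""
    matched = []
    for i, f in enumerate(src):
        js = [j for j in range(len(dst)) if abs(i - j) <= radius]
        w = softmax([cosine(f, dst[j]) for j in js], temp)
        matched.append([sum(wk * dst[j][c] for wk, j in zip(w, js)) for c in range(len(f))])
    return matched


def bidirectional_alignment_loss(rgb, thermal, crowd_prior, radius=1, temp=0.1):
    """Hypothetical pretraining loss: 1 - cos between each feature and its soft
    match in the other modality, in both directions (RGB->T and T->RGB),
    weighted per position by a crowd-prior score so crowd regions dominate."""
    r2t = soft_match(rgb, thermal, radius, temp)
    t2r = soft_match(thermal, rgb, radius, temp)
    total, norm = 0.0, 0.0
    for p, a, a_hat, b, b_hat in zip(crowd_prior, rgb, r2t, thermal, t2r):
        total += p * ((1 - cosine(a, a_hat)) + (1 - cosine(b, b_hat)))
        norm += 2 * p
    return total / (norm + 1e-8)


# Toy example: thermal features are the RGB features shifted by one position.
rgb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
thermal = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
prior = [1.0, 1.0, 1.0, 1.0]
loss_local = bidirectional_alignment_loss(rgb, thermal, prior, radius=1)
loss_rigid = bidirectional_alignment_loss(rgb, thermal, prior, radius=0)
```

With `radius=1`, the soft matching finds each feature's counterpart one position away, so `loss_local` is close to 0. With `radius=0`, the sketch is forced to compare misaligned positions and `loss_rigid` is close to 1. This shows how a local, soft correspondence tolerates a small spatial offset between modalities that strict position-wise matching does not.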
