Interpreting V1 Population Activity via Image-Neural Latent Representation Alignment
Xin Wang, Zhuangzhi Gao, Hongyi Qin, Zhongli Wu, Feixiang Zhou + 1 more
TLDR
DINA aligns image and V1 neural representations to interpret visual computations, revealing that decoding relies on coarse, low-level visual structure.
Key contributions
- Introduces DINA, an interpretable contrastive framework for analyzing V1 population visual computations.
- DINA aligns visual stimuli and V1 responses in a shared latent space using a dual-tower architecture.
- Reveals V1 decoding performance is primarily supported by coarse, low-level visual structure, not semantics.
- Alignable feature maps capture shape and texture cues from distributed image regions via sparse neurons.
Why it matters
This paper introduces DINA, a novel framework that enhances visual stimulus decoding from V1 activity while providing crucial interpretability. It reveals that V1's decoding performance relies on coarse, low-level visual structures, not fine details or semantic categories. This advances our understanding of primary visual cortex computations.
Original Abstract
Understanding the neural mechanisms underlying visual computation has long been a central challenge in neuroscience. Recent alignment-based approaches have improved the accuracy of decoding visual stimuli from brain activity, yet they provide limited insight into the neural computations that give rise to these improvements. To address this gap, we propose Dual-Tower Image-Neural Alignment (DINA), an interpretable contrastive framework for analyzing population-level visual computations in primary visual cortex (V1). DINA jointly trains a biologically motivated dual-tower architecture that aligns visual stimuli and corresponding V1 population responses in a shared latent space at the level of intermediate feature maps, enabling both accurate decoding and direct access to interpretable feature maps. Evaluated on large-scale two-photon calcium imaging data from mouse V1, DINA achieves accurate neural-based decoding while revealing that decoding performance is primarily supported by coarse, low-level visual structure, rather than by semantic category information or fine-grained details. Further analysis reveals that alignable feature maps emerge from multiple spatially distributed image regions, capturing both shape and texture cues, and are predominantly reconstructed by sparse subsets of strongly responsive neurons and their functional interactions. Together, these results confirm that, beyond enabling accurate decoding, DINA provides a principled framework for probing the computational mechanisms underlying visual processing in V1.
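The core mechanism the abstract describes, two towers projecting images and V1 population responses into a shared latent space and pulling matching pairs together contrastively, can be sketched in a few lines. This is a minimal illustration only, not the DINA implementation: the dimensions, the single-layer "towers" (`W_img`, `W_neu`), and the symmetric InfoNCE objective are all assumptions standing in for the paper's deeper, biologically motivated architecture and its feature-map-level alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened image features, V1 population responses,
# shared latent dimension, and contrastive batch size.
IMG_DIM, NEURAL_DIM, LATENT_DIM, BATCH = 64, 128, 16, 8

# Each "tower" is reduced to one linear projection for illustration;
# the paper's towers are deeper networks aligned at intermediate feature maps.
W_img = rng.normal(scale=0.1, size=(IMG_DIM, LATENT_DIM))
W_neu = rng.normal(scale=0.1, size=(NEURAL_DIM, LATENT_DIM))

def embed(x, W):
    """Project inputs into the shared latent space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_loss(z_img, z_neu, temperature=0.1):
    """Symmetric InfoNCE: the i-th image should match the i-th response."""
    logits = (z_img @ z_neu.T) / temperature   # (BATCH, BATCH) similarity matrix
    labels = np.arange(len(logits))            # matching pairs lie on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()    # cross-entropy toward the diagonal

    return 0.5 * (xent(logits) + xent(logits.T))

# Stand-in data: random "stimuli" and "population responses" for the same trials.
images = rng.normal(size=(BATCH, IMG_DIM))
responses = rng.normal(size=(BATCH, NEURAL_DIM))
loss = contrastive_loss(embed(images, W_img), embed(responses, W_neu))
print(float(loss))
```

Training would lower this loss by updating both towers so that each trial's image embedding is closer to its own neural embedding than to any other trial's, which is what makes the learned latent space usable for decoding.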