ArXiv TLDR

DeepTaxon: An Interpretable Retrieval-Augmented Multimodal Framework for Unified Species Identification and Discovery

arXiv: 2604.24029

Jiawei Wang, Ming Lei, Yaning Yang, Xinyan Lin, Yuquan Le + 6 more

cs.CV, cs.CL, cs.IR, cs.MM

TLDR

DeepTaxon unifies species identification and discovery using an interpretable retrieval-augmented multimodal framework.

Key contributions

  • Unifies species identification and discovery, treating discovery as an explicit retrieval-based decision.
  • Employs a retrieval-augmented multimodal framework with interpretable chain-of-thought reasoning.
  • Trains with supervised fine-tuning on synthetic data, then reinforcement learning on hard samples.
  • Achieves consistent improvements in both identification and discovery across diverse datasets.

Why it matters

This paper addresses a fundamental challenge in biodiversity research by unifying species identification and discovery. DeepTaxon offers an interpretable and scalable solution, improving performance on both tasks. By recasting discovery as an explicit retrieval-based decision rather than threshold-based rejection, it provides a robust tool for large-scale biological classification.

Original Abstract

Identifying species in biology among tens of thousands of visually similar taxa while discovering unknown species in open-world environments remains a fundamental challenge in biodiversity research. Current methods treat identification and discovery as separate problems, with classification models assuming closed sets and discovery relying on threshold-based rejection. Here we present DeepTaxon, a retrieval-augmented multimodal framework that unifies species identification and discovery through interpretable reasoning over retrieved visual evidence. Given a query image, DeepTaxon retrieves the top-$k$ candidate species with $n$ exemplar images each from a retrieval index and performs chain-of-thought comparative reasoning. Critically, we redefine discovery as an explicit, retrieval-based decision problem rather than an implicit parametric memory problem. A sample is novel if and only if the retrieval index lacks sufficient evidence for identification, so each retrieval naturally yields a classification or discovery label without manual annotation, thereby providing automatic supervision for both tasks. We train the framework via supervised fine-tuning on synthetic retrieval-augmented data, followed by reinforcement learning on hard samples, converting high-recall retrieval into high-precision decisions that scale to massive taxonomic vocabularies. Extensive experiments on a large-scale in-distribution benchmark and six out-of-distribution datasets demonstrate consistent improvements in both identification and discovery. Ablation studies further reveal effective test-time scaling with candidate count $k$ and exemplar count $n$, strong zero-shot transfer to unseen domains, and consistent performance across retrieval encoders, establishing an interpretable solution for biodiversity research.
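The decision rule in the abstract — retrieve the top-$k$ candidates, then label the query as a known species or a novel one depending on whether the index holds sufficient evidence — can be sketched as follows. This is a minimal illustrative sketch, not the paper's method: all names and the fixed similarity threshold are assumptions, whereas DeepTaxon makes this decision via chain-of-thought reasoning of a multimodal model over retrieved exemplar images.

```python
import numpy as np

def retrieve_and_decide(query, index_embeddings, index_labels,
                        k=3, sim_threshold=0.8):
    """Toy version of the retrieval-based identify-or-discover decision.

    query            : (d,) embedding of the query image
    index_embeddings : (N, d) embeddings of exemplar images in the index
    index_labels     : length-N list of species names for each exemplar

    Returns ("known", species) or ("novel", None). The cosine-similarity
    cutoff stands in for the model's judgment of "sufficient evidence".
    """
    # Cosine similarity between the query and every indexed exemplar.
    q = query / np.linalg.norm(query)
    E = index_embeddings / np.linalg.norm(index_embeddings,
                                          axis=1, keepdims=True)
    sims = E @ q

    # Top-k candidate exemplars, most similar first.
    top = np.argsort(sims)[::-1][:k]

    # Discovery: the index lacks sufficient evidence for identification.
    if sims[top[0]] < sim_threshold:
        return ("novel", None)

    # Identification: return the best-supported candidate species.
    return ("known", index_labels[top[0]])
```

Note how this framing yields automatic supervision, as the abstract claims: whether the index contains the query's species determines the correct label ("known" with a species name, or "novel") without any manual annotation.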
