ArXiv TLDR

Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees

arXiv:2604.12060

Nicolas Huynh, Krzysztof Kacprzyk, Ryan Sheridan, David Bentley, Mihaela van der Schaar

cs.LG, cs.AI, q-bio.GN

TLDR

DEFT uses large language models to dynamically generate interpretable, high-level features for DNA sequence classification in decision trees.

Key contributions

  • Introduces DEFT, a framework for interpretable DNA sequence classification via dynamic feature generation.
  • Leverages LLMs to propose biologically-informed features tailored to local sequence distributions.
  • Iteratively refines the proposed features with a reflection mechanism during decision tree construction.
  • Discovers human-interpretable and highly predictive features across diverse genomic tasks.
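
The propose-then-refine loop described above can be sketched at a single tree node. This is a minimal illustration under loud assumptions, not the authors' implementation: `propose_motifs` and `reflect` are hypothetical stand-ins for the LLM proposal and reflection steps (here simple k-mer heuristics), features are restricted to "sequence contains motif", and the toy sequences are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_gain(seqs, labels, motif):
    """Impurity reduction from splitting on 'sequence contains motif'."""
    left = [y for s, y in zip(seqs, labels) if motif in s]
    right = [y for s, y in zip(seqs, labels) if motif not in s]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

def propose_motifs(seqs, labels, k=2, top=3):
    """Hypothetical stand-in for the LLM proposal step:
    the k-mers that occur in the most positive sequences at this node."""
    counts = Counter()
    for s, y in zip(seqs, labels):
        if y == 1:
            counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    # Sort by count, then alphabetically, so ties break deterministically.
    return sorted(counts, key=lambda m: (-counts[m], m))[:top]

def reflect(best_motif):
    """Hypothetical stand-in for the reflection step:
    extend the current best motif by one base on either side."""
    return [b + best_motif for b in "ACGT"] + [best_motif + b for b in "ACGT"]

def best_feature_at_node(seqs, labels, rounds=3):
    """Propose features, then refine the best one while the split improves."""
    candidates = propose_motifs(seqs, labels)
    best = max(candidates, key=lambda m: split_gain(seqs, labels, m))
    for _ in range(rounds):
        refined = max(reflect(best), key=lambda m: split_gain(seqs, labels, m))
        if split_gain(seqs, labels, refined) <= split_gain(seqs, labels, best):
            break  # refinement no longer helps
        best = refined
    return best, split_gain(seqs, labels, best)

# Invented toy data: positives contain TATA somewhere, negatives do not.
seqs = ["GGTATACC", "TATAGGCC", "CCGGTATA", "ACTATAGC",
        "GGTTAACC", "ATATGGCC", "CCGGATAT", "ACTAATGC"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

motif, gain = best_feature_at_node(seqs, labels)
print(motif, gain)
```

On this toy data the loop starts from a weak 2-mer and extends it until the split separates the two classes. In DEFT itself the proposals come from a language model conditioned on the sequences reaching the node, which the heuristics here only imitate.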

Why it matters

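This paper tackles the interpretability challenge in DNA sequence analysis. Deep neural networks predict well but operate as black boxes, while axis-aligned decision trees are transparent but split on one raw feature at a time, which limits their expressivity and leads to very deep trees. DEFT aims to bridge this gap by generating biologically informed, high-level features during tree construction. The authors report that the resulting features are both human-interpretable and highly predictive, which matters for understanding gene regulation and disease mechanisms.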

Original Abstract

The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.
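
The expressivity limitation mentioned in the abstract can be illustrated with a toy comparison. The sketch below is an assumption-laden example rather than anything from the paper: the sequences are invented, the raw features are one-hot "base b at position i" indicators, and the high-level feature is a single hypothetical "contains TATA" indicator. It compares the best training accuracy a depth-1 tree can reach with each feature set.

```python
def best_single_split_accuracy(columns, labels):
    """Best training accuracy of a depth-1 tree over binary feature columns."""
    best = 0.0
    for column in columns:
        correct = 0
        for side in (0, 1):
            ys = [y for x, y in zip(column, labels) if x == side]
            # Each side of the split predicts its majority class.
            correct += max(sum(ys), len(ys) - sum(ys))
        best = max(best, correct / len(labels))
    return best

# Invented toy data: the motif TATA appears at varying positions in positives.
seqs = ["GGTATACC", "TATAGGCC", "CCGGTATA", "ACTATAGC",
        "GGTTAACC", "ATATGGCC", "CCGGATAT", "ACTAATGC"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Raw axis-aligned features: one indicator per (position, base) pair.
raw_columns = [[int(s[i] == b) for s in seqs]
               for i in range(len(seqs[0])) for b in "ACGT"]

# One high-level feature: does the sequence contain the motif anywhere?
motif_columns = [[int("TATA" in s) for s in seqs]]

raw_acc = best_single_split_accuracy(raw_columns, labels)
motif_acc = best_single_split_accuracy(motif_columns, labels)
print(raw_acc, motif_acc)
```

Because the motif shifts position from sequence to sequence, no single positional indicator separates the classes, so a tree over raw features needs additional splits, while the one motif feature separates them at depth 1.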
