ArXiv TLDR

Entropy, Disagreement, and the Limits of Foundation Models in Genomics

arXiv:2604.04287

Maxime Rochkoulets, Lovro Vrček, Mile Šikić

cs.LG · cs.CL · q-bio.GN

TLDR

The high entropy of genomic sequences leads to poor performance and unstable representations in foundation models, suggesting that self-supervised training from sequences alone may be insufficient for genomic data.

Key contributions

  • High entropy in genomic sequences causes near-uniform output distributions and model disagreement.
  • Genomic foundation models exhibit unstable static embeddings due to high sequence entropy.
  • Models trained on DNA concentrate Fisher information in embedding layers, failing to exploit inter-token relationships.
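The first bullet hinges on an information-theoretic gap: from the point of view of next-token prediction, near-random DNA sits close to its maximum entropy of log2(4) = 2 bits per symbol, whereas natural-language text is far more predictable given context. A minimal sketch (illustrative only, not the paper's methodology; the uniform-random DNA stand-in and the sample text are assumptions) of that gap using empirical conditional entropy:

```python
import math
import random
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy (bits/symbol) of a sequence of symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def next_symbol_entropy(seq):
    """Conditional entropy H(X_t | X_{t-1}) = H(bigrams) - H(unigrams):
    a rough proxy for how unpredictable the next symbol is given context."""
    bigrams = list(zip(seq, seq[1:]))
    return entropy_bits(bigrams) - entropy_bits(seq[:-1])

random.seed(0)
# Structureless stand-in for high-entropy DNA; real genomes are not uniform,
# but are still close to the 2-bit ceiling compared with text.
dna = "".join(random.choice("ACGT") for _ in range(20000))
text = "the quick brown fox jumps over the lazy dog " * 500

print(f"DNA  next-symbol entropy: {next_symbol_entropy(dna):.3f} bits (max 2.000)")
print(f"Text next-symbol entropy: {next_symbol_entropy(text):.3f} bits (max {math.log2(27):.3f})")
```

With the DNA sample near its 2-bit ceiling and the text well below its ceiling, a masked- or next-token objective has little learnable signal per position in the genomic case, which is consistent with the near-uniform output distributions the paper reports.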

Why it matters

This paper offers a fundamental explanation for the limited success of foundation models in genomics, attributing it to the high entropy of genomic sequences. It challenges the assumptions behind current self-supervised training methodologies for genomic data and points future work toward training signals beyond raw sequences.

Original Abstract

Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences -- from the point of view of unseen token prediction -- leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
