Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
Younhun Kim, Georg K. Gerber, Travis E. Gibson
TLDR
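This paper uses Set-Aggregated Genome Embeddings (SAGE), built on genomic language models (GLMs), to predict community-level abundance profiles from the raw DNA sequences of community members, and reports better generalization to novel genomes than classical bioinformatics approaches.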
Key contributions
- Introduces Set-Aggregated Genome Embeddings (SAGE) for microbiome abundance prediction.
- Leverages genomic language models (GLMs) for few-shot learning from raw DNA sequences.
- Demonstrates improved generalization on novel genomes over classical bioinformatics methods.
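- Shows through model ablation that community-level latent representations improve performance, and that intermediate transformations between latent representations and the choice of GLM embedding both affect results.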
Why it matters
This work tests whether a community-level property, here the abundance profile, can be predicted directly from the raw DNA sequences of a community's members. The reported gains on novel genomes over classical bioinformatics approaches suggest that GLM-based embeddings may be useful for communities containing previously unseen organisms. Predictions of this kind could in turn support microbiome research in health, agriculture, and environmental science.
Original Abstract
Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach to show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and demonstrate the differences between GLM embedding choices.
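The abstract does not spell out the architecture, so the following is only a minimal sketch of what set aggregation over genome embeddings could look like, in the style of a permutation-invariant (Deep Sets-like) model. Everything named here is an assumption for illustration: the function names `init_params` and `predict_abundance`, the layer sizes, the mean-pooling aggregator, and the softmax output are hypothetical, the weights are random and untrained, and the input embeddings are random stand-ins for real GLM outputs.

```python
import numpy as np

def init_params(d, k, seed=0):
    """Random, untrained weights (hypothetical shapes, for illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        "W_phi": rng.normal(scale=0.1, size=(d, k)),   # per-genome transformation
        "W_rho": rng.normal(scale=0.1, size=(k, k)),   # transformation of the community latent
        "w_out": rng.normal(scale=0.1, size=(2 * k,)), # scores each genome given its community
    }

def predict_abundance(genome_embeddings, params):
    """Map an (n_genomes, d) array of per-genome embeddings to relative abundances.

    Assumed design: embed each genome, mean-pool into one community-level
    latent vector, transform it, then score each genome against that vector.
    """
    h = np.tanh(genome_embeddings @ params["W_phi"])   # (n, k) per-genome latents
    community = h.mean(axis=0)                         # (k,) order-independent set aggregation
    context = np.tanh(community @ params["W_rho"])     # (k,) intermediate transformation
    paired = np.concatenate([h, np.broadcast_to(context, h.shape)], axis=1)  # (n, 2k)
    scores = paired @ params["w_out"]                  # (n,) one score per genome
    scores = scores - scores.max()                     # stabilize the softmax
    weights = np.exp(scores)
    return weights / weights.sum()                     # non-negative, sums to 1

# Stand-in for GLM embeddings of a 5-member community (random, not real data).
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(5, 16))
params = init_params(d=16, k=8)
abundance = predict_abundance(embeddings, params)
```

Because the community vector is a mean over members, reordering the input genomes only reorders the predicted abundances, which is the property that makes the model operate on a set rather than a sequence. The same function also accepts communities of different sizes without changes to the weights.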