ArXiv TLDR

OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning

2605.06728

Maciej Sypetkowski, Joanna Krawczyk, Łukasz Smoliński, Remigiusz Kinas, Przemysław Pietrzak + 2 more

q-bio.GN · cs.AI · q-bio.CB

TLDR

OmicsLM is a multimodal LLM that connects quantitative omics data with natural language for biological reasoning, matching specialized omics models on profile-level tasks while outperforming both them and general LLMs on language-guided reasoning over expression data.

Key contributions

  • Introduces OmicsLM, a multimodal LLM linking quantitative omics profiles with natural-language biological tasks.
  • Represents transcriptomic data as compact continuous representations within the LLM context for multi-sample processing.
  • Trained on 5.5M examples across 70+ task types, covering diverse biological reasoning challenges.
  • Proposes GEO-OmicsQA, a new benchmark for language-guided, multi-sample biological question answering.

Why it matters

OmicsLM bridges the gap between quantitative omics measurements and natural-language biological reasoning, enabling more intuitive and flexible analysis of complex transcriptomic datasets within a single model. The accompanying GEO-OmicsQA benchmark also fills a critical evaluation gap: language-guided reasoning over multiple real expression profiles at once.

Original Abstract

Interpreting transcriptomic data is one of the most common analytical tasks in modern biology. Yet most current models either consume expression profiles without producing natural-language biological explanations, or reason in language without direct access to quantitative omics measurements. We introduce OmicsLM, a multimodal LLM that connects quantitative omics profiles with natural-language biological tasks. OmicsLM represents each transcriptomic profile as a compact continuous representation within the LLM context. This interface preserves quantitative expression signal while allowing natural-language instructions, explicit gene mentions, and multiple interleaved biological samples to be processed together in one model context. We train OmicsLM on more than 5.5 million instruction-following examples spanning over 70 task types, combining continuous transcriptomic inputs, experimental data rendered through diverse language templates, and free-text biological knowledge and question-answering data. This mixture covers cell type annotation, perturbation prediction, clinical prediction, pathway reasoning, and open-ended biological question answering. Existing benchmarks evaluate either profile-level prediction or text-only biological QA, leaving language-guided, multi-sample reasoning over real expression profiles unmeasured. To close this gap, we introduce GEO-OmicsQA, a benchmark for multi-sample biological question answering built from real Gene Expression Omnibus (GEO) studies. We demonstrate that OmicsLM can use expression profiles directly and perform comparably to specialized omics models on profile-level tasks, while outperforming both omics-specialized models and general LLMs on language-guided biological reasoning over expression data.
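The interface described in the abstract — mapping each transcriptomic profile to a compact continuous representation that sits alongside text tokens in the LLM context — can be illustrated with a minimal sketch. This is not the paper's actual architecture; all dimensions, the linear projection, and the placeholder text embedder below are hypothetical stand-ins for whatever learned encoder and tokenizer OmicsLM uses.

```python
import numpy as np

# Hypothetical sizes (not from the paper): 2000 genes, model width 64,
# 4 continuous "soft tokens" per biological sample.
N_GENES, D_MODEL, N_SOFT = 2000, 64, 4

rng = np.random.default_rng(0)
# In a real system this projection would be learned end-to-end.
W_proj = rng.standard_normal((N_GENES, N_SOFT * D_MODEL)) * 0.01

def profile_to_soft_tokens(expr: np.ndarray) -> np.ndarray:
    """Compress one expression profile into N_SOFT continuous context vectors."""
    return (expr @ W_proj).reshape(N_SOFT, D_MODEL)

def embed_text(tokens: list[str]) -> np.ndarray:
    """Stand-in for the LLM's token-embedding lookup."""
    return rng.standard_normal((len(tokens), D_MODEL))

# Interleave an instruction with two samples in one model context,
# as the abstract describes for multi-sample processing.
sample_a = rng.random(N_GENES)
sample_b = rng.random(N_GENES)
context = np.concatenate([
    embed_text(["Compare", "the", "two", "samples", ":"]),  # 5 text tokens
    profile_to_soft_tokens(sample_a),                        # 4 soft tokens
    embed_text(["versus"]),                                  # 1 text token
    profile_to_soft_tokens(sample_b),                        # 4 soft tokens
])
print(context.shape)  # (14, 64): one sequence the LLM can attend over
```

The key property the sketch captures is that quantitative signal and language share one sequence, so instructions, explicit gene mentions, and multiple samples can all be attended over jointly.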
