MicroFuse: Protein-to-Genome Expert Fusion for Microbial Operon Reasoning
TLDR
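MicroFuse fuses ProstT5 protein representations with Bacformer genome-context representations through a four-expert Mixture-of-Experts module to predict microbial operon co-membership, outperforming ProstT5-only, Bacformer-only, and Concat MLP baselines on OG-Operon100K.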
Key contributions
- Introduces MicroFuse, a protein-to-genome expert fusion framework for microbial operon prediction.
- Utilizes a four-expert Mixture-of-Experts (MoE) module to handle agreement and conflict between modalities.
- Develops OG-Operon100K, a new 100,000-pair benchmark for microbial operon co-membership.
- Achieves the strongest AUROC, AUPRC, mAP, and mAR among the compared baselines on OG-Operon100K, with the largest gains in ambiguous cases where protein identity alone is misleading.
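To make the four-expert idea concrete, here is a minimal sketch of soft-routed expert fusion in plain Python. It is not the authors' implementation. The summary does not say what each expert receives as input, so the choices here are assumptions: the agreement expert sees the elementwise product of the two embeddings and the conflict expert sees their absolute difference. The names `fuse`, `init_params`, and `EXPERT_NAMES`, the layer sizes, and the tanh activation are likewise hypothetical.

```python
import math
import random

# Expert names follow the abstract; everything else is illustrative.
EXPERT_NAMES = ["protein", "genome_context", "agreement", "conflict"]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_linear(rng, n_out, n_in):
    """Random weights and zero bias for a dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def init_params(dim, hidden, seed=0):
    """One dense layer per expert plus a router over both embeddings."""
    rng = random.Random(seed)
    experts = {name: init_linear(rng, hidden, dim) for name in EXPERT_NAMES}
    router = init_linear(rng, len(EXPERT_NAMES), 2 * dim)
    return {"experts": experts, "router": router}

def fuse(protein, genome, params):
    """Return (fused vector, gate weights) for one gene pair."""
    # ASSUMPTION: product models agreement, absolute difference models conflict.
    inputs = {
        "protein": protein,
        "genome_context": genome,
        "agreement": [p * g for p, g in zip(protein, genome)],
        "conflict": [abs(p - g) for p, g in zip(protein, genome)],
    }
    outputs = [
        [math.tanh(v) for v in linear(*params["experts"][name], inputs[name])]
        for name in EXPERT_NAMES
    ]
    # Soft router: every expert contributes, weighted by a softmax gate.
    gate = softmax(linear(*params["router"], protein + genome))
    hidden = len(outputs[0])
    fused = [sum(g * out[i] for g, out in zip(gate, outputs))
             for i in range(hidden)]
    return fused, gate

params = init_params(dim=4, hidden=3)
fused, gate = fuse([0.5, -1.0, 0.2, 0.0], [0.4, 1.0, -0.3, 0.1], params)
print(dict(zip(EXPERT_NAMES, gate)))
```

The gate weights always sum to 1, so the fused vector is a convex combination of the expert outputs. In a trained model, one would expect the router to shift weight toward the conflict expert when the two views disagree.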
Why it matters
Operon prediction depends both on what adjacent genes encode and on how they are arranged in the genome, and these two signals can disagree. Instead of concatenating the modalities, MicroFuse gives agreement and conflict their own experts. Its largest gains come on the hard sequence-conflict subset, where protein identity alone is misleading and earlier single-modality or concatenation baselines fall short.
Original Abstract
Predicting microbial operon co-membership requires integrating two complementary biological signals: protein-scale molecular identity and genome-context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property -- protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein-to-genome expert fusion framework that integrates structure-aware protein representations from ProstT5 with genome-context representations from Bacformer through a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router. Training combines binary cross-entropy with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. We further construct OG-Operon100K, a 100,000-pair scaffold-level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG-Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5-only, Bacformer-only, and Concat MLP baselines. Ablations identify cross-modal contrastive alignment as the dominant component, and a hard sequence-conflict subset reveals MicroFuse's largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.
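The symmetric cross-modal InfoNCE term mentioned in the abstract can be illustrated with a small, dependency-free sketch. This is the generic form of the loss, not the paper's exact formulation. The temperature of 0.1, the function names, and the assumption that row i of each batch describes the same gene pair are all illustrative choices.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(protein_embs, genome_embs, temperature=0.1):
    """Average of protein-to-genome and genome-to-protein InfoNCE.

    ASSUMPTION: protein_embs[i] and genome_embs[i] are the two views of the
    same example; every other row in the batch serves as a negative.
    """
    p = [l2_normalize(v) for v in protein_embs]
    g = [l2_normalize(v) for v in genome_embs]
    n = len(p)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(pi, gj)) / temperature for gj in g]
           for pi in p]
    # Protein -> genome: the correct "class" for row i is column i.
    loss_p2g = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Genome -> protein: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    loss_g2p = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_p2g + loss_g2p)

# Toy batch: three orthogonal embeddings, first paired correctly, then shifted.
views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = symmetric_info_nce(views, views)
shifted_loss = symmetric_info_nce(views, views[1:] + views[:1])
print(aligned_loss < shifted_loss)
```

The loss is low when each protein embedding is most similar to its own genome-context embedding and high when the pairing is scrambled, which is the alignment pressure the ablations identify as the dominant component.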