HyperEvoGen: Exploring deep phylogeny using non-Euclidean variational inference

2604.22997

Jason Lamanna, Erfan Mowlaei, Xinghua Shi, Sudhir Kumar, Vincenzo Carnevale

q-bio.QM

TLDR

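HyperEvoGen is a hyperbolic (Poincaré) VAE that models protein evolution from single-family alignments. It targets the saturation of conventional distance measures at deep divergence. On simulation benchmarks it reconstructs ancestral sequences more accurately than conventional baselines and generates higher-quality sequences with less training time than Potts models.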

Key contributions

  • Introduces HyperEvoGen, a Poincaré VAE with hyperbolic latent geometry for protein evolution.
  • Learns evolutionarily meaningful representations whose latent distances aim to preserve phylogenetic structure and scale with true evolutionary divergence.
  • Achieves more accurate ancestral reconstructions than conventional baselines on Potts-coupled simulation benchmarks.
  • Generates higher-quality protein sequences with less training time than Potts models.
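
To make "hyperbolic latent geometry" concrete, here is a minimal sketch of the standard geodesic distance on the Poincaré ball with curvature -1. It is the textbook formula, not code from the paper. The function name `poincare_distance` and the example points are hypothetical, and the paper's actual curvature and parameterization may differ.

```python
import math


def _squared_norm(x):
    """Squared Euclidean norm of a coordinate sequence."""
    return sum(c * c for c in x)


def poincare_distance(u, v):
    """Geodesic distance between two points of the open unit ball.

    Textbook formula for curvature -1 (an assumption; the paper may use a
    different curvature):
        d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2) (1 - |v|^2)))
    """
    if len(u) != len(v):
        raise ValueError("points must have the same dimension")
    nu, nv = _squared_norm(u), _squared_norm(v)
    if nu >= 1.0 or nv >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nu) * (1.0 - nv)))


# Hypothetical 2-D embeddings: a root-like point at the origin and two
# leaf-like points close to the boundary.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

direct = poincare_distance(leaf_a, leaf_b)
via_root = poincare_distance(leaf_a, root) + poincare_distance(root, leaf_b)
print(f"leaf-to-leaf: {direct:.3f}, path through root: {via_root:.3f}")
```

For points near the boundary, the direct distance comes close to the length of the path through the origin. This tree-like behavior is why hyperbolic embeddings are commonly used for hierarchical data such as phylogenies.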

Why it matters

HyperEvoGen uses hyperbolic geometry to address a limitation of traditional distance measures such as p-distance and Jukes-Cantor, which saturate at deep evolutionary divergence. The authors report a fast, scalable approach to ancestral reconstruction and sequence generation that supports large-family evolutionary studies and accelerates protein design applications.

Original Abstract

Homologous proteins evolve from a common ancestral sequence, constrained by intricate patterns of co-evolving residues. Accurate reconstruction of evolutionary histories remains a challenge, primarily due to the inability of the existing approaches to capture long-range coevolutionary ties and lack of a precise metric to represent the evolutionary distance between sequences. Standard approaches are based on p-distance or substitution-corrected measures such as Jukes-Cantor. These methods saturate in cases of deep evolutionary divergence, losing all evolutionary signal after enough time. We present HyperEvoGen, a Poincaré variational autoencoder with adversarial training, hyperbolic latent geometry, and a compound loss function that learns evolutionarily meaningful representations from single-family alignments. The arrangement of protein sequences in HyperEvoGen's hyperbolic embedding aims to preserve phylogenetic structure and produce latent distances which scale with true evolutionary divergence. HyperEvoGen enables fast, scalable modeling of protein evolution while preserving hierarchical relatedness in a geometry-aware representation. On Potts-coupled simulation benchmarks, it produces more accurate ancestral reconstructions than conventional baselines, and offers higher-quality sequence generation with less training time than Potts models. This combination of accuracy and throughput supports large-family evolutionary studies and accelerates design-oriented applications.
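
The saturation mentioned in the abstract can be illustrated with a short sketch of p-distance and the Jukes-Cantor correction. This is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, gaps are assumed to be written as "-", and the 20-state form of Jukes-Cantor is assumed for proteins (the nucleotide form uses 4 states).

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing sites among aligned, ungapped positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p, n_states=20):
    """Jukes-Cantor corrected distance for an observed p-distance.

    d = -b * ln(1 - p / b), with b = (n_states - 1) / n_states.
    Returns infinity once p reaches the saturation limit b.
    """
    b = (n_states - 1) / n_states
    if p >= b:
        return math.inf
    return -b * math.log(1.0 - p / b)


def expected_p(d, n_states=20):
    """Expected p-distance after d substitutions per site under the same model."""
    b = (n_states - 1) / n_states
    return b * (1.0 - math.exp(-d / b))


# The observed p-distance flattens out as the true divergence grows,
# so deep divergences become nearly indistinguishable.
for d in (0.1, 1.0, 5.0, 10.0, 20.0):
    print(f"true d = {d:5.1f}  expected p = {expected_p(d):.6f}")
```

Under this model the expected p-distance approaches 19/20 for proteins, so the corrected distance becomes extremely sensitive to sampling noise, or undefined, for deeply diverged pairs.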
