Expanding functional protein sequence space using high entropy generative models
Roberto Netti, Emily Hinds, Francesco Calvanese, Rama Ranganathan, Martin Weigt + 1 more
TLDR
High-entropy Boltzmann Machines generate functional proteins while exploring vastly larger sequence spaces, providing a better representation of evolutionary fitness landscapes.
Key contributions
- Compared fully connected and sparse Boltzmann Machines for protein design, using the Chorismate Mutase enzyme family as a model system.
- Identified a maximum-entropy model (meDCA) balancing constraint satisfaction and flexibility.
- All models generated functional enzymes, but meDCA explored 15 orders of magnitude more viable sequence space.
- High-entropy models minimize overfitting and better capture local neutral spaces of natural proteins.
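The "15 orders of magnitude" claim follows directly from the model entropy: a model with entropy S (in nats) effectively samples about exp(S) distinct viable sequences, so an entropy gap of 15·ln(10) nats corresponds to a 10^15-fold larger space. A minimal sketch (the entropy values below are hypothetical placeholders, not figures from the paper):

```python
import math

def effective_space_size(entropy_nats: float) -> float:
    """Effective number of sequences a model can sample: exp(S)."""
    return math.exp(entropy_nats)

# Hypothetical entropies (nats) for a low- and a high-entropy model.
S_low = 100.0
S_high = S_low + 15 * math.log(10)  # gap of 15 orders of magnitude

ratio = effective_space_size(S_high) / effective_space_size(S_low)
print(f"viable-space ratio: {ratio:.3g}")  # ~1e+15
```

This is why a modest difference in per-site entropy compounds into an enormous difference in explorable sequence space.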
Why it matters
This paper shows that high-entropy Boltzmann Machines are superior for protein design, allowing exploration of vastly larger functional sequence spaces. This improves our understanding of evolutionary fitness landscapes and could accelerate the design of novel proteins.
Original Abstract
Boltzmann Machines trained on evolutionary sequence data have emerged as a powerful paradigm for the data-driven design of artificial proteins. However, the relationship between model architecture, specifically parameter density, and experimental performance remains poorly understood. Here, we investigate this relationship using the Chorismate Mutase enzyme family as a model system. We compare standard fully connected Boltzmann Machines for Direct Coupling Analysis (bmDCA) with sparse models generated via progressive edge activation (eaDCA) and edge decimation (edDCA). We identify a maximum-entropy model (meDCA) along the decimation trajectory that represents an optimal balance between constraint satisfaction and the flexibility of the probability distribution. We synthesized and tested artificial sequences from all models using an in vivo complementation assay, finding that all architectures, regardless of sparsity, generate functional enzymes with high success rates, even at significant divergence from natural sequences. Despite this functional equivalence, we demonstrate that the meDCA model samples a viable sequence space that is more than fifteen orders of magnitude larger than its low-entropy counterparts. Furthermore, comparative analyses reveal that high-entropy models systematically minimize overfitting and better capture the local neutral spaces surrounding natural proteins. These findings suggest that while various models satisfying coevolutionary statistics can generate functional sequences, high-entropy Boltzmann Machines provide a superior representation of the underlying evolutionary fitness landscape.
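The Boltzmann Machines described in the abstract are Potts models over aligned sequences: each sequence a = (a_1, …, a_L) gets an energy E(a) = −Σ_i h_i(a_i) − Σ_{i<j} J_ij(a_i, a_j), and new sequences are drawn from p(a) ∝ exp(−E(a)), typically by Gibbs sampling. A minimal illustrative sketch with random placeholder parameters (real bmDCA/eaDCA/edDCA models fit h and J to a natural sequence alignment; nothing below reproduces the paper's actual models):

```python
import numpy as np

rng = np.random.default_rng(0)
L, q = 8, 21  # toy sequence length; 20 amino acids + gap

# Hypothetical random Potts parameters (fields h, couplings J).
h = rng.normal(scale=0.1, size=(L, q))
J = rng.normal(scale=0.05, size=(L, L, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2  # symmetrize: J_ij(a,b) = J_ji(b,a)

def energy(seq):
    """Potts energy: E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    e = -h[np.arange(L), seq].sum()
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return e

def gibbs_step(seq):
    """Resample one site from its conditional p(a_i | rest)."""
    i = rng.integers(L)
    logits = h[i].copy()
    for j in range(L):
        if j != i:
            logits += J[i, j, :, seq[j]]
    p = np.exp(logits - logits.max())
    p /= p.sum()
    seq[i] = rng.choice(q, p=p)
    return seq

# Draw an artificial sequence by iterating single-site updates.
seq = rng.integers(q, size=L)
for _ in range(200):
    seq = gibbs_step(seq)
```

Sparse variants (eaDCA, edDCA) differ only in which J_ij entries are allowed to be nonzero; the sampling machinery is the same.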