ENSEMBITS: an alphabet of protein conformational ensembles
TLDR
Ensembits is the first tokenizer for protein conformational ensembles, capturing dynamic motions and alternative states for protein language modeling.
Key contributions
- Introduces Ensembits, the first tokenizer for protein conformational ensembles, capturing dynamic motions.
- Addresses challenges in tokenizing dynamics, including variable-size ensembles and data sparsity.
- Outperforms existing methods in RMSF prediction and per-residue motion amplitude.
- Matches or exceeds static tokenizers on EC, GO, and binding site prediction with less data.
Why it matters
Existing protein structure tokenizers miss crucial dynamic information. Ensembits provides the first discrete vocabulary for protein dynamics, crucial for advancing protein language modeling and design in the shift to ensemble generation.
Original Abstract
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits address challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariance encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on an token-conditioned ANOVA test on per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics token from one single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offer the discrete vocabulary needed to bring dynamics into protein language modeling and design.
📬 Weekly AI Paper Digest
Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.