ArXiv TLDR

GoForth: Language Models for RNA Design under Structure, Sequence, and Coding Constraints

2605.07608

Michael Lindsey

q-bio.QM, q-bio.BM

TLDR

GoForth is a forward-trained RNA language model that generates candidate RNA sequences conditioned on structure, sequence, and coding constraints.

Key contributions

  • Introduces GoForth, a forward-trained RNA language model for inverse sequence design.
  • Conditions on multiple constraints: target folds, fixed bases, and coding restrictions.
  • Separates sequence prior, forward folding sampler, and reward oracle for modular design.
  • Generates candidates quickly and with high quality for mixed design specifications, and yields semantic embeddings of design tasks along with a learned notion of designability.
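
To make the idea of a mixed design specification concrete, here is a minimal Python sketch of a checker that tests one candidate sequence against three constraint types at once. It is not GoForth's interface: the function names, the `N` wildcard for free bases, the `?` marker for unspecified structure positions, and the `cds_start`/`protein` parameters are all assumptions made for illustration. The structure check only verifies that the requested base pairs are chemically compatible (Watson-Crick or G-U wobble); it does not verify that the sequence actually folds into the target.

```python
from itertools import product

# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

# Standard genetic code, codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def parse_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string.

    '.' means unpaired and '?' (hypothetical marker) means unspecified;
    both are ignored here. Brackets are assumed to be balanced.
    """
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def satisfies(seq, structure, fixed, cds_start=0, protein=""):
    """Check a candidate against structure, sequence, and coding constraints."""
    if not (len(seq) == len(structure) == len(fixed)):
        return False
    # Sequence constraint: every non-'N' position must match exactly.
    if any(f != "N" and f != s for f, s in zip(fixed, seq)):
        return False
    # Structure constraint: every requested pair must be able to form.
    if any((seq[i], seq[j]) not in PAIRS for i, j in parse_pairs(structure)):
        return False
    # Coding constraint: the window starting at cds_start must translate to protein.
    if protein:
        region = seq[cds_start:cds_start + 3 * len(protein)]
        if len(region) != 3 * len(protein):
            return False
        translated = "".join(
            CODON_TABLE.get(region[k:k + 3], "?") for k in range(0, len(region), 3)
        )
        if translated != protein:
            return False
    return True
```

For example, `satisfies("GGGAUGAAACCC", "(((......)))", "NNNAUGNNNNNN", cds_start=3, protein="MK")` accepts a short hairpin whose loop carries a fixed `AUG` and encodes the peptide `MK`. Many different sequences pass the same check, which is why the paper frames design as sampling from a conditional distribution rather than finding a single answer.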

Why it matters

Practical RNA design queries often combine several constraints at once, such as target folds or motifs, fixed bases, and coding restrictions, and may admit many acceptable sequences. GoForth treats design as conditional generative modeling and keeps the sequence prior, forward folding sampler, and reward oracle separate. The authors report fast, high-quality candidate generation for mixed design specifications.

Original Abstract

RNA inverse sequence design has broad biological and engineering applications, but computational methods for practical design queries remain limited. Such queries may impose several constraints at once, including target folds or motifs, fixed bases, and coding restrictions, while leaving arbitrary sequence and structure in unspecified regions. Because these constraints may permit many acceptable sequences, we study RNA design as a conditional generative modeling problem. The basic object is a conditional law over RNA sequences given a user-specified condition, with full inverse folding as a special case. We introduce GoForth, a forward-trained RNA language model that conditions on structure, sequence, and coding targets. The formulation separates three ingredients that are often entangled in RNA design: a sequence prior, a forward folding sampler, and a reward or likelihood oracle. We train encoder-decoder models on witnessed folds rather than on outputs from an inverse-design teacher and validate our methodology on full inverse-folding benchmarks, as well as tasks involving constraints on structure, sequence, and coding. The resulting models achieve fast and high-quality candidate generation for mixed RNA design specifications. Moreover, they furnish useful semantic embeddings of design tasks and a robust learned notion of designability.
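
One possible reading of "forward-trained on witnessed folds" is that training pairs are built by sampling a sequence from a prior, folding it forward, and then hiding part of the result to form a condition that the sampled sequence is guaranteed to satisfy. The Python sketch below illustrates only that data-generation idea under loudly stated assumptions: the uniform prior, the `toy_fold` stand-in (a greedy hairpin rule, not a real folding algorithm), and the `?`/`N` masking scheme are hypothetical and are not taken from the paper. The reward or likelihood oracle is not modeled here.

```python
import random

# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def sample_prior(rng, length):
    """Assumed sequence prior: independent uniform bases."""
    return "".join(rng.choice("ACGU") for _ in range(length))


def toy_fold(seq):
    """Toy stand-in for a forward folding sampler (NOT a real folder).

    Pairs the outermost bases inward while they can pair, always leaving
    at least three unpaired bases in the loop.
    """
    s = ["."] * len(seq)
    i, j = 0, len(seq) - 1
    while j - i > 3 and (seq[i], seq[j]) in PAIRS:
        s[i], s[j] = "(", ")"
        i += 1
        j -= 1
    return "".join(s)


def make_training_pair(rng, length=20, keep_structure=0.5, reveal_base=0.2):
    """Build one (condition, target) pair from a witnessed fold.

    The condition keeps each structure symbol with probability
    keep_structure (else '?') and reveals each base with probability
    reveal_base (else 'N'), so the target sequence always satisfies it.
    """
    seq = sample_prior(rng, length)
    structure = toy_fold(seq)
    cond_structure = "".join(
        c if rng.random() < keep_structure else "?" for c in structure
    )
    cond_sequence = "".join(
        b if rng.random() < reveal_base else "N" for b in seq
    )
    return (cond_structure, cond_sequence), seq, structure


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        condition, target, witnessed = make_training_pair(rng)
        print(condition, "->", target, "| witnessed fold:", witnessed)
```

Because each target is produced by folding forward, no inverse-design teacher is needed to label the data. With `keep_structure=1.0` and `reveal_base=0.0`, the condition reduces to the complete structure, which corresponds to the full inverse-folding special case mentioned in the abstract.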
