ArXiv TLDR

Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction

arXiv:2605.06644

Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit

cs.LG

TLDR

This paper presents a chromophore-centered 3D mechanism graph algorithm that predicts fluorescent protein quantum yield (QY) from structure, reaching R = 0.772 in random cross-validation on a 531-protein benchmark.

Key contributions

  • Developed a chromophore-centered 3D mechanism graph algorithm for fluorescent protein quantum yield (QY) prediction.
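  • Achieved the best random-CV QY prediction among model-based baselines on a 531-protein benchmark (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), ahead of Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731).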
  • Showed its clearest advantage under homology control in the remote bucket (<50% similarity), with R = 0.697 versus 0.633, 0.575 and 0.408.
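  • Offers intrinsically interpretable features: each one encodes a contact channel, a seed signal and a target chromophore (CRO) region, and the stable selected features recover band-specific mechanisms governing QY.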
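
The idea that a feature is interpretable because its name already says which channel, signal and region it describes can be illustrated with a small sketch. This is not the authors' code: only the three region names (phenolate, bridge, imidazolinone) come from the abstract, while the channel names, signal names and the `channel|signal|region` naming scheme are hypothetical placeholders.

```python
from itertools import product
from typing import NamedTuple

# Region names are taken from the abstract. The channel and signal
# vocabularies below are hypothetical, chosen only for illustration.
CHANNELS = ["aromatic", "hbond", "charge"]
SIGNALS = ["packing", "flexibility"]
REGIONS = ["phenolate", "bridge", "imidazolinone"]


class Feature(NamedTuple):
    """One feature = (contact channel, seed signal, target CRO region)."""
    channel: str
    signal: str
    region: str

    @property
    def name(self) -> str:
        # Hypothetical naming scheme: the three parts joined by '|'.
        return f"{self.channel}|{self.signal}|{self.region}"


def parse_feature(name: str) -> Feature:
    """Recover the (channel, signal, region) triple from a feature name."""
    channel, signal, region = name.split("|")
    return Feature(channel, signal, region)


def describe(name: str) -> str:
    """Turn a feature name into a readable mechanistic statement."""
    f = parse_feature(name)
    return (f"{f.signal} signal carried by {f.channel} contacts "
            f"onto the {f.region} region")


# Enumerate every channel x signal x region combination.
FEATURES = [Feature(c, s, r) for c, s, r in product(CHANNELS, SIGNALS, REGIONS)]

print(len(FEATURES))
print(describe("aromatic|packing|phenolate"))
```

Because the triple is recoverable from the name alone, a ranking of features by importance can be read as a ranking of mechanisms without a separate post hoc attribution step.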

Why it matters

Quantum yield is a key determinant of fluorescent protein brightness, which matters for bioimaging and biosensing. By modeling how local physical signals act on specific chromophore regions, the method outperforms the protein language model baselines it was compared against (ESM-C and SaProt), most clearly on remote homologs, and its features can be read directly as mechanisms, which is useful for protein engineering and design.

Original Abstract

Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions. We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins. Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
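
The evaluation described in the abstract (band-specific ExtraTrees regression scored by Pearson R, MAE and a top-K precision) can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the authors' pipeline: the data are synthetic, the 5-fold split, `n_estimators=100`, and the `bright_threshold` of 0.6 are arbitrary illustrative choices, and reading "Bright P@5" as the fraction of truly bright proteins among the top 5 predictions is an assumption about the metric.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted QY."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def precision_at_k(y_true, y_pred, k=5, bright_threshold=0.6):
    """Fraction of the k highest-predicted proteins that are truly bright.

    bright_threshold is a hypothetical cut-off, not taken from the paper.
    """
    top = np.argsort(y_pred)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top] >= bright_threshold))


def band_specific_cv(X, y, bands, n_splits=5, seed=0):
    """Train one ExtraTrees model per emission band inside each fold and
    return out-of-fold predictions for every protein."""
    oof = np.full(len(y), np.nan)
    for band in np.unique(bands):
        idx = np.flatnonzero(bands == band)
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train, test in kf.split(idx):
            model = ExtraTreesRegressor(n_estimators=100, random_state=seed)
            model.fit(X[idx[train]], y[idx[train]])
            oof[idx[test]] = model.predict(X[idx[test]])
    return oof


# Synthetic stand-in data: 300 "proteins", 52 features, three bands.
# None of these values come from the paper's benchmark.
rng = np.random.default_rng(0)
n, d = 300, 52
X = rng.normal(size=(n, d))
bands = rng.choice(["GFP-like", "Red", "Far-red"], size=n)
y = np.clip(0.5 + 0.2 * X[:, 0] + 0.05 * rng.normal(size=n), 0.0, 1.0)

oof = band_specific_cv(X, y, bands)
r = pearson_r(y, oof)
mae = float(np.mean(np.abs(y - oof)))
p5 = precision_at_k(y, oof, k=5)
print(f"R = {r:.3f}, MAE = {mae:.3f}, P@5 = {p5:.3f}")
```

A homology-controlled variant would replace `KFold` with a split that keeps similar sequences in the same fold, for example scikit-learn's `GroupKFold` with cluster labels as groups. How the paper builds its similarity buckets is not specified in the abstract.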
