ArXiv TLDR

Graph-Theoretic Models for the Prediction of Molecular Measurements

2604.19840

Anna Niane, Prudence Djagba

cs.LG, q-bio.QM

TLDR

Enhanced graph-theoretic models predict molecular properties accurately, matching deep learning performance while being computationally efficient and accessible.

Key contributions

  • Evaluated a baseline graph-theoretic model, finding poor generalization (avg R^2=0.24) on diverse MoleculeNet datasets.
  • Developed an enhancement framework using regularization, diverse descriptors, and ensemble learning for molecular prediction.
  • Achieved significant performance gains (avg R^2=0.79), matching or outperforming GCNs and GNN hybrids.
  • The framework is highly efficient, requiring no GPU and training in under five minutes with open-source tools.

Why it matters

This research significantly advances classical graph-theoretic models for molecular property prediction, demonstrating that sophisticated feature engineering and ensemble methods can rival deep learning approaches. Its low computational cost and open-source nature make it highly accessible, particularly for researchers with limited resources.

Original Abstract

Graph-theoretic approaches offer simplicity, interpretability, and low computational cost for molecular property prediction. Among these, the model proposed by Mukwembi and Nyabadza, based on the external activity $D(G)$ and internal activity $ζ(G)$ indices, achieved strong results on a small flavonoid dataset. However, its ability to generalize to larger and chemically diverse datasets has not been tested. This study evaluates the baseline $D(G)$-$ζ(G)$ polynomial model on five benchmark datasets from MoleculeNet, covering biological activity (BACE, 1,513 molecules), lipophilicity (LogP synthetic, 14,610 molecules; LogP experimental, 753 molecules), aqueous solubility (ESOL, 1,128 molecules), and hydration free energy (SAMPL, 642 molecules). The baseline model achieves an average $R^2 = 0.24$, confirming limited transferability. To address this, a systematic enhancement framework is proposed, progressively incorporating Ridge regularization, additional graph descriptors, physicochemical properties, ensemble learning with Gradient Boosting, Lasso feature selection, and a hybrid approach combining topological indices with Morgan fingerprints. The enhanced models raise the average best $R^2$ to 0.79, with individual improvements ranging from 165% to 274%. All improvements are statistically significant ($p < 0.001$). A direct comparison with a Graph Convolutional Network under identical experimental conditions shows that the enhanced classical models match or outperform deep learning on all five datasets. Comparison with the recent GNN+PGM hybrid of Djagba et al. further confirms competitiveness, with the enhanced models achieving the best results on two datasets and tying on one. The entire framework requires no GPU, trains in under five minutes, and uses only open-source tools, making it accessible for researchers in resource-limited settings.
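The enhancement sequence described in the abstract (regularized linear baseline, then Lasso feature selection feeding a Gradient Boosting ensemble) can be sketched with standard scikit-learn components. This is a minimal illustration, not the authors' code: the synthetic matrix below merely stands in for their descriptor table of topological indices, physicochemical properties, and Morgan fingerprints.

```python
# Hedged sketch of the kind of pipeline the abstract describes; the
# synthetic data is a hypothetical stand-in for real molecular descriptors.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rows = molecules, columns = assumed descriptors (indices, properties,
# fingerprint bits); the target mimics a continuous property like LogP.
X, y = make_regression(n_samples=500, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# Step 1: Ridge-regularized linear model as the improved baseline.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_tr, y_tr)

# Later steps: Lasso-based feature selection feeding a Gradient
# Boosting ensemble, mirroring the progression in the abstract.
enhanced = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.1, max_iter=10_000)),
    GradientBoostingRegressor(random_state=0),
).fit(X_tr, y_tr)

print(f"Ridge R^2:    {r2_score(y_te, ridge.predict(X_te)):.2f}")
print(f"Enhanced R^2: {r2_score(y_te, enhanced.predict(X_te)):.2f}")
```

On real molecular datasets the relative ranking of the two models depends on the descriptors used; the point here is only the pipeline shape, which runs on CPU in seconds, consistent with the paper's no-GPU claim.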

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.