ArXiv TLDR

Bolek: A Multimodal Language Model for Molecular Reasoning

2605.02745

Frederic Grabowski, Jacek Szczerbiński, Maciej Jaśkowski, Kalina Jasińska-Kobus, Paweł Dąbrowski-Tumański + 2 more

cs.LG cs.AI q-bio.BM

TLDR

Bolek is a compact multimodal language model that provides auditable, grounded molecular reasoning by integrating structural data into an instruction-tuned text decoder.

Key contributions

  • Introduces Bolek, a compact multimodal LLM that grounds molecular reasoning in structural data via Morgan fingerprint injection.
  • Achieves significant gains over baselines (Qwen3-4B-Instruct, TxGemma-9B-Chat) on molecular classification tasks, raising mean ROC/PR AUC from 0.55 to 0.76.
  • Produces grounded, auditable explanations that cite molecular descriptors in strong agreement with RDKit-computed values.
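The descriptor-grounding check described above can be sketched in plain Python: extract the numeric descriptor values a model cites in its chains-of-thought, pair them with reference values computed by a cheminformatics toolkit, and measure rank agreement with Spearman's rho. The helper names and the five example value pairs below are hypothetical; the paper reports rho = 0.87-0.91 for TPSA, MolLogP, and MolWt but does not publish the extraction code.

```python
def ranks(xs):
    """1-based ranks of xs, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Hypothetical TPSA values: cited in chains-of-thought vs. RDKit reference.
cited = [61.8, 37.3, 87.7, 20.2, 49.8]
reference = [60.9, 38.1, 86.7, 21.5, 50.3]
rho = spearman_rho(cited, reference)
print(round(rho, 2))  # → 1.0 (identical ordering in this toy example)
```

Rank correlation is a natural fit here: it rewards the model for getting the relative ordering of descriptor values right even when the exact cited numbers drift slightly from the RDKit ground truth.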

Why it matters

This paper presents an approach that makes AI-driven molecular reasoning more transparent and trustworthy. By grounding explanations in verifiable molecular features, Bolek makes drug-discovery decisions easier to audit, and it does so with a compact model that outperforms larger baselines.

Original Abstract

Molecular property models increasingly support high-stakes drug-discovery decisions, but their outputs are often difficult to audit: classical predictors return scores without rationale, while language models can produce fluent explanations weakly grounded in the input molecule. We introduce Bolek, a compact multimodal language model that grounds natural-language reasoning in molecular structure by injecting a Morgan fingerprint embedding into an instruction-tuned text decoder. Bolek is fine-tuned on molecular alignment tasks, including molecule description, RDKit descriptor prediction, and substructure detection, and on downstream reasoning over 15 TDC binary classification tasks using synthetic chains-of-thought anchored in concrete molecular features. Across these tasks, Bolek outperforms its Qwen3-4B-Instruct base on all endpoints in yes/no mode and on 13 of 15 in chain-of-thought mode, raising mean ROC/PR AUC from 0.55 to 0.76. It also outperforms TxGemma-9B-Chat on 13 of 15 binary classification tasks despite being less than half its size. Bolek's explanations are more grounded than those of the baseline LLMs: it cites numerical descriptors 10-100x more often per chain-of-thought, and the cited values agree strongly with RDKit for key descriptors such as TPSA, MolLogP, and MolWt (Spearman rho = 0.87-0.91). Generalisation extends beyond the training panel: on 15 unseen TDC classification endpoints, Bolek matches TxGemma on five, and it produces non-trivial rank correlations on three held-out regression endpoints despite never seeing downstream regression during training. These results suggest that targeted modality injection and reasoning supervision tied to verifiable molecular features can yield compact, auditable molecular reasoning models.
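The abstract's core mechanism, "injecting a Morgan fingerprint embedding into an instruction-tuned text decoder", can be illustrated with a minimal NumPy sketch. The paper does not specify the injection scheme in this summary, so the sketch below assumes one common design: a learned linear projection maps the binary fingerprint into the decoder's embedding space and the result is prepended as a single soft token. All sizes, names, and the random projection are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64      # decoder hidden size (illustrative; real decoders use thousands)
FP_BITS = 2048   # Morgan fingerprint length

# Hypothetical learned projection from fingerprint space to embedding space.
W = rng.normal(0.0, 0.02, size=(FP_BITS, HIDDEN))
b = np.zeros(HIDDEN)

def inject_fingerprint(fp_bits: np.ndarray, token_embeddings: np.ndarray) -> np.ndarray:
    """Project a binary Morgan fingerprint to one soft token and prepend it
    to the text-token embeddings fed to the decoder."""
    mol_token = fp_bits.astype(np.float64) @ W + b            # shape (HIDDEN,)
    return np.vstack([mol_token[None, :], token_embeddings])  # shape (T+1, HIDDEN)

# Toy inputs: a sparse random "fingerprint" and 5 text-token embeddings.
fp = (rng.random(FP_BITS) < 0.05).astype(np.int8)
text_tokens = rng.normal(size=(5, HIDDEN))
seq = inject_fingerprint(fp, text_tokens)
print(seq.shape)  # → (6, 64): one molecule token plus five text tokens
```

The appeal of this kind of targeted modality injection is its economy: the decoder's text interface is untouched, and only the small projection (plus fine-tuning on the alignment tasks the abstract lists) ties the language stream to the molecular structure.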
