ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway
Jueon Park, Wonjune Jang, Chanhwi Kim, Yein Park, Jaewoo Kang
TLDR
ToxReason is a new benchmark that evaluates LLMs' mechanistic chemical toxicity reasoning via Adverse Outcome Pathways, showing that reasoning-aware training improves both mechanistic reasoning and toxicity prediction.
Key contributions
- Introduces ToxReason, a novel benchmark for mechanistic chemical toxicity reasoning across organs.
- ToxReason evaluates LLMs by requiring inference of toxic outcomes and their underlying AOP mechanisms.
- Reveals that high predictive performance in LLMs doesn't guarantee reliable mechanistic toxicity reasoning.
- Demonstrates that reasoning-aware training significantly enhances both mechanistic reasoning and toxicity prediction.
Why it matters
This paper addresses a critical gap in evaluating LLMs for chemical toxicity: current models can produce fluent explanations that are not biologically faithful. ToxReason provides a framework for assessing mechanistic reasoning, which is crucial for building trustworthy toxicity prediction models. The results further show that training models to reason mechanistically improves prediction itself, not just explanation quality.
Original Abstract
Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.
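To make the evaluation idea concrete, here is a minimal hypothetical sketch of how one might score a single AOP-grounded item: the model must get both the binary toxicity label and the mechanistic chain (MIE → key events → AO) right. The function name, the example pathway, and the overlap-based scoring are illustrative assumptions, not ToxReason's actual protocol.

```python
def score_item(pred_label, pred_chain, gold_label, gold_chain):
    """Return (label_correct, chain_overlap) for one benchmark item."""
    label_correct = pred_label == gold_label
    # Fraction of reference pathway steps the model recovered (order-insensitive).
    overlap = len(set(pred_chain) & set(gold_chain)) / len(gold_chain)
    return label_correct, overlap

# Illustrative item: a drug with a known cholestatic hepatotoxicity pathway.
gold_chain = [
    "MIE: BSEP inhibition",
    "KE: bile acid accumulation",
    "KE: hepatocyte injury",
    "AO: cholestatic liver injury",
]
pred_chain = [
    "MIE: BSEP inhibition",
    "KE: bile acid accumulation",
    "AO: cholestatic liver injury",
]

ok, overlap = score_item("toxic", pred_chain, "toxic", gold_chain)
print(ok, overlap)  # True 0.75 — correct label, but one key event missed
```

This separation of label accuracy from chain overlap mirrors the paper's central point: a model can score well on the first while failing the second.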