Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows

May 11, 2026 (arXiv:2605.11221)

Yaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan, Rocío Mercado

q-bio.QM, cs.LG

TLDR

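An expert-in-the-loop LLM workflow automates targeted protein degradation (TPD) data extraction from publications, reaching record-level F1 of 0.98 on molecular glues and expanding existing molecular glue and PROTAC databases by 81% and 92%, respectively.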

Key contributions

Developed an expert-in-the-loop LLM workflow for targeted protein degradation (TPD) data extraction.
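Achieved record-level F1 = 0.98 for molecular glue extraction using only seven expert-annotated publications, and maintained record-level F1 > 0.93 on PROTACs by terminology substitution alone.
Expanded existing molecular glue and PROTAC databases by 81% and 92%, respectively, with 92% and 82.5% of the newly recovered records validated as correct on expert review.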
Successfully extracted critical kinetic and assay-context information for TPD modeling.

Why it matters

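Manual curation is the bottleneck for structured assay data in targeted protein degradation, where a single record combines compound identity, degradation target, recruiter, assay context, and endpoint values scattered across text, tables, and supplements. This paper presents an LLM-driven workflow that automates that extraction with expert oversight and substantially enlarges existing molecular glue and PROTAC databases. The added records, including kinetic and assay-context fields, could support more robust predictive modeling for drug discovery.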

Original Abstract

Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level $F_1 = 0.98$ and transferred to PROTACs by terminology substitution alone, maintaining record-level $F_1 > 0.93$. Applied at scale, it expanded molecular glue and PROTAC databases by 81% and 92% records, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
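The record-level F1 reported in the abstract can be illustrated with a minimal sketch. The abstract does not say how predicted and ground-truth records are matched, so the code below assumes exact equality on all fields. The `Record` type, the `record_f1` function, and the example values are hypothetical and are not taken from the released evaluation code.

```python
from typing import Iterable, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical TPD assay record; field names follow the abstract's list."""
    compound: str
    target: str
    recruiter: str
    endpoint: str
    value: float


def record_f1(predicted: Iterable[Record],
              truth: Iterable[Record]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting a predicted record as correct
    only if an identical record exists in the ground truth (assumed rule)."""
    pred, gold = set(predicted), set(truth)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: three expert-annotated records, two of them recovered.
gold = [
    Record("cmpd-1", "BRD4", "CRBN", "DC50_nM", 12.0),
    Record("cmpd-2", "BRD4", "VHL", "DC50_nM", 30.0),
    Record("cmpd-3", "IKZF1", "CRBN", "Dmax_pct", 95.0),
]
pred = gold[:2]

precision, recall, f1 = record_f1(pred, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice. A real evaluation would likely need to normalize compound identifiers and allow a tolerance on numeric endpoints, which this sketch leaves out.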
