ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation
Cong Liu, Milong Ren, Jiaqi Guan, Chengyue Gong, Jinyuan Sun, et al.
TLDR
ProtDBench is a unified, throughput-aware benchmark framework for fair evaluation and comparison of protein binder design methods.
Key contributions
- Introduces ProtDBench, a standardized and throughput-aware framework for protein binder design evaluation.
- Defines unified benchmark tasks, evaluation protocols, and success criteria for systematic analysis.
- Reveals significant verifier-dependent bias in structure prediction models using wet-lab annotated data.
- Benchmarks generative methods with novel throughput-aware and cluster-level success metrics.
Why it matters
Current protein binder design metrics are inconsistent across studies, hindering method comparison. ProtDBench addresses this with a standardized, throughput-aware evaluation framework, enabling fair, reproducible benchmarking and exposing verifier-dependent biases that would otherwise obscure true method performance.
Original Abstract
Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware trade-offs between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.
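To make the three metric families from the abstract concrete, here is a minimal, hypothetical sketch of how per-sequence, throughput-aware, and cluster-level success rates could be computed. This is not the ProtDBench API; the data layout, function names, and the per-design runtime parameter are all assumptions for illustration only.

```python
# Illustrative sketch only — not the official ProtDBench implementation.
# A "design" is assumed to carry a structural cluster label and a boolean
# verdict from whatever success filter the evaluation protocol applies.
from collections import defaultdict

def per_sequence_success_rate(designs):
    """Fraction of individual designs that pass the success filter."""
    if not designs:
        return 0.0
    return sum(d["success"] for d in designs) / len(designs)

def throughput_aware_successes(designs, seconds_per_design, budget_hours=24):
    """Expected number of successes within a fixed wall-clock budget:
    the number of designs a method can produce in the budget, scaled by
    its per-sequence success rate. Ties success rate to compute cost."""
    n_affordable = int(budget_hours * 3600 // seconds_per_design)
    return n_affordable * per_sequence_success_rate(designs)

def cluster_level_success_rate(designs):
    """A structural cluster counts as a hit if ANY of its members succeeds,
    so redundant near-duplicate successes are not over-rewarded."""
    clusters = defaultdict(list)
    for d in designs:
        clusters[d["cluster"]].append(d["success"])
    return sum(any(members) for members in clusters.values()) / len(clusters)

designs = [
    {"cluster": "A", "success": True},
    {"cluster": "A", "success": True},   # redundant hit in cluster A
    {"cluster": "B", "success": False},
    {"cluster": "C", "success": True},
]
print(per_sequence_success_rate(designs))   # 3 of 4 designs succeed
print(cluster_level_success_rate(designs))  # 2 of 3 clusters have a hit
```

The contrast between the last two numbers is the point: a method that emits many near-identical successful binders scores well per-sequence but gains little at the cluster level, which is how the benchmark accounts for structural diversity.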