Building informative materials datasets beyond targeted objectives

May 6, 2026 · arXiv:2605.05104

Rafael Espinosa Castañeda, Ashley Dale, Hongchen Wang, Yonatan Kurniawan, Hao Wan + 4 more

cond-mat.mtrl-sci, cs.AI, cs.DB, cs.LG, stat.AP

TLDR

A new framework uses diversity-aware selection to build materials datasets that are highly informative for both targeted and untargeted properties.

Key contributions

Introduces a framework for building materials datasets optimized for target properties.
Employs diversity-aware selection to ensure broad coverage of the materials space.
Improves untargeted property prediction by up to 10% compared to random sampling.
Achieves up to 25% gains for targeted property prediction over random sampling.
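The summary does not say how the diversity-aware selection is carried out, so the following is only a minimal sketch of one generic way to trade a target-property score against coverage of feature space. Everything in it is an assumption made for illustration: the function name `select_diverse`, the `alpha` weight, the use of Euclidean max-min distance as the diversity term, and the toy feature vectors. It should not be read as the authors' method.

```python
import numpy as np


def select_diverse(features, utility, k, alpha=0.5):
    """Greedily pick k candidates, mixing target utility with max-min distance.

    Hypothetical helper: alpha=1.0 ranks by utility only,
    alpha=0.0 ranks by distance to the already selected set only.
    """
    features = np.asarray(features, dtype=float)
    utility = np.asarray(utility, dtype=float)
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of candidates")

    # Rescale utility to [0, 1] so it is comparable with the distance term.
    span = utility.max() - utility.min()
    u = (utility - utility.min()) / span if span > 0 else np.zeros(n)

    # Seed with the highest-utility candidate.
    selected = [int(np.argmax(u))]
    # Distance from every candidate to its nearest selected point.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)

    while len(selected) < k:
        d_max = min_dist.max()
        d = min_dist / d_max if d_max > 0 else np.zeros(n)
        score = alpha * u + (1.0 - alpha) * d
        score[selected] = -np.inf  # never pick the same candidate twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected


# Toy candidates (assumed data): a high-utility cluster near the origin
# and a low-utility cluster far away.
toy_features = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
toy_utility = [1.0, 0.9, 0.8, 0.1, 0.2, 0.0]

utility_only = select_diverse(toy_features, toy_utility, k=3, alpha=1.0)
balanced = select_diverse(toy_features, toy_utility, k=3, alpha=0.5)
print("utility only:", utility_only)
print("balanced:", balanced)
```

On this toy input, ranking by utility alone keeps every pick inside the high-utility cluster, while the mixed score also reaches into the distant cluster. That mirrors the qualitative claim in the summary that purely targeted selection can leave parts of the materials space uncovered.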

Why it matters

Materials data collection is costly, and current methods often create datasets with limited future utility. This framework ensures datasets remain broadly informative for both current and future objectives, mitigating cold-start problems.

Original Abstract

Materials science data collection can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns can generate datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10%. For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.

