ArXiv TLDR

DataMaster: Towards Autonomous Data Engineering for Machine Learning

arXiv: 2605.10906

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei + 10 more

cs.LG cs.AI

TLDR

DataMaster automates data engineering for ML with an agent framework that combines tree-structured search, a shared data pool, and cumulative memory to boost downstream model performance.

Key contributions

  • Introduces DataMaster, an autonomous agent for optimizing ML data engineering.
  • Uses a DataTree for structured search, a Data Pool for shared data, and Global Memory for findings.
  • Improves ML model performance by optimizing data discovery, selection, and transformation.
  • Improves medal rate by 32.27% over the initial score on MLE-Bench Lite and surpasses the instruct model on GPQA (31.02% vs. 30.35%).

Why it matters

Manual data engineering is a key bottleneck in ML. DataMaster automates it by optimizing the data rather than the algorithm: discovering, selecting, and transforming data to strengthen a fixed training pipeline. The framework streamlines ML development and shows strong results on two benchmarks.

Original Abstract

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).
