Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang + 7 more
TLDR
SciCrafter, a Minecraft benchmark built on redstone circuit tasks, shows that frontier models plateau at roughly 26% success on the discovery-to-application loop, and that for these models the bottleneck is starting to shift toward identifying knowledge gaps.
Key contributions
- Introduces SciCrafter, a Minecraft benchmark for evaluating AI's discovery-to-application loop using redstone circuits.
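- Frontier AI models (GPT-5.2, Gemini-3-Pro, Claude-Opus-4.5), run under a general-purpose code agent scaffold, all plateau at ~26% success on SciCrafter tasks.
- Decomposes the loop into four capacities (knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application), showing that knowledge application remains the largest gap across all models while knowledge gap identification is becoming a major hurdle for frontier models.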
- SciCrafter is released as a diagnostic tool for future research on AI systems navigating the full discovery-to-application loop.
Why it matters
SciCrafter tests whether AI agents can discover causal regularities and then apply them to build working systems, using parameterized redstone circuit tasks in Minecraft. Its findings point to a shift in the bottleneck for frontier models: knowledge application is still the largest gap, but identifying which knowledge is missing is becoming a major hurdle. That makes the benchmark a useful diagnostic for research on general intelligence.
Original Abstract
Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle--indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.
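The abstract's diagnostic idea, using each intervention's marginal contribution as a proxy for the corresponding capacity gap, can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes a marginal contribution is simply the success rate with one intervention minus the unaided baseline, and the function names and all intervention success rates below are hypothetical placeholders rather than results from the paper. Only the ~26% baseline comes from the abstract.

```python
from typing import Dict

# The four capacities named in the abstract.
CAPACITIES = (
    "knowledge_gap_identification",
    "experimental_discovery",
    "knowledge_consolidation",
    "knowledge_application",
)


def marginal_contributions(
    baseline: float, with_intervention: Dict[str, float]
) -> Dict[str, float]:
    """Return the success-rate gain of each intervention over the baseline.

    Assumption (not from the paper): each intervention is evaluated on its
    own, and its gain over the unaided baseline is read as the size of the
    corresponding capacity gap.
    """
    gains = {}
    for capacity in CAPACITIES:
        rate = with_intervention[capacity]
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"success rate for {capacity} must be in [0, 1]")
        gains[capacity] = rate - baseline
    return gains


def largest_gap(gains: Dict[str, float]) -> str:
    """Name the capacity whose intervention helped the most."""
    return max(gains, key=gains.get)


# Baseline of ~26% is from the abstract; every other number here is a
# made-up placeholder used only to exercise the functions.
example_rates = {
    "knowledge_gap_identification": 0.40,
    "experimental_discovery": 0.30,
    "knowledge_consolidation": 0.28,
    "knowledge_application": 0.50,
}
example_gains = marginal_contributions(0.26, example_rates)
print(largest_gap(example_gains))
```

A larger gain under this reading means the model benefits more from help with that capacity, which is why the gain is treated as a proxy for the gap.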