ArXiv TLDR

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

2604.21910

Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, Michal Kuszewski

cs.AI

TLDR

This paper introduces an agentic AI architecture that automates the translation of natural language research questions into scientific workflows.

Key contributions

  • Introduces a three-layer agentic AI architecture to automate scientific workflow generation from natural language.
  • The semantic layer uses an LLM for intent extraction, the deterministic layer generates reproducible workflow DAGs, and the knowledge layer draws on expert-authored "Skills."
  • "Skills" raise full-match intent accuracy from 44% to 83% and reduce data transfer by 92%.
  • Demonstrates efficient end-to-end query completion on Kubernetes with minimal LLM overhead and cost.
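The three-layer decomposition above can be sketched as follows. This is a minimal illustration, not the paper's actual API: the `Intent` fields, function names, and the hard-coded query mapping are all invented for the example. The key property it demonstrates is the one the paper claims: non-determinism is confined to the semantic layer, so workflow generation is a pure function of the intent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    """Structured intent produced by the semantic (LLM) layer.
    Fields here are hypothetical examples for a population-genetics query."""
    analysis: str        # e.g. "allele_frequency"
    population: str      # e.g. "EUR"
    chromosomes: tuple   # e.g. (1, 2)

def extract_intent(query: str) -> Intent:
    """Semantic layer stub. In the real system an LLM, guided by Skills
    (e.g. a vocabulary mapping "Europeans" -> "EUR"), parses the query;
    here one mapping is hard-coded so the example is self-contained."""
    return Intent(analysis="allele_frequency",
                  population="EUR",
                  chromosomes=(1, 2))

def generate_workflow(intent: Intent) -> list:
    """Deterministic layer: a pure function of the intent, so identical
    intents always yield identical workflow DAGs (a task list here)."""
    tasks = [f"extract_chr{c}_{intent.population}" for c in intent.chromosomes]
    tasks.append(f"{intent.analysis}_merge")
    return tasks

query = "What is the allele frequency in Europeans on chromosomes 1 and 2?"
print(generate_workflow(extract_intent(query)))
```

Because `generate_workflow` contains no LLM call, re-running the same intent reproduces the same DAG byte-for-byte, which is the reproducibility guarantee the deterministic layer provides.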

Why it matters

This paper bridges the gap between research questions and executable scientific workflows using agentic AI. It reduces the manual effort and specialized expertise needed to specify workflows, making scientific research more accessible and reproducible. By confining LLM non-determinism to intent extraction, the architecture guarantees that identical intents always yield identical workflows.

Original Abstract

Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.
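As a rough illustration of the knowledge layer, a "Skill" might look like the markdown document below. The headings and entries are invented for this example; the abstract specifies only that Skills encode vocabulary mappings, parameter constraints, and optimization strategies (the population codes AFR, AMR, EAS, EUR, SAS are the 1000 Genomes super-population labels).

```markdown
# Skill: 1000 Genomes population queries

## Vocabulary mappings
- "Europeans" -> population code EUR
- "allele frequency" -> analysis type `frequency`

## Parameter constraints
- chromosome must be one of 1-22, X, Y
- population code must be one of AFR, AMR, EAS, EUR, SAS

## Optimization strategies
- Defer workflow generation until the target chromosomes are known,
  so only the required data partitions are transferred.
```

Because Skills are plain markdown, domain experts can extend the system's vocabulary and constraints without touching the LLM prompt logic or the workflow generators.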
