Artificial Intelligence

Research on AI systems, knowledge representation, planning, and general intelligence.

cs.AI · 1428 papers

Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

A new confidence-guided diffusion augmentation framework significantly boosts Bangla compound character recognition by synthesizing and filtering high-quality data.

2605.10916 · May 11, 2026 · Md. Sultan Al Rayhan, Maheen Islam

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Shepherd is a functional programming model for meta-agents that uses a Git-like execution trace for fast state forking and replay.

2605.10913 · May 11, 2026 · Simon Yu, Derek Chong, Ananjan Nandi +4

Engineering Robustness into Personal Agents with the AI Workflow Store

This paper introduces an AI Workflow Store to integrate rigorous software engineering into AI agents, creating robust, reusable workflows instead of brittle on-the-fly systems.

2605.10907 · May 11, 2026 · Roxana Geambasu, Mariana Raykova, Pierre Tholoniat +3

DataMaster: Towards Autonomous Data Engineering for Machine Learning

DataMaster automates data engineering for ML, using a novel agent framework with tree search, shared data, and memory to boost model performance.

2605.10906 · May 11, 2026 · Yaxin Du, Xiyuan Yang, Zhifan Zhou +12

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

This paper introduces a diagnostic framework to analyze on-policy distillation, revealing it helps more on incorrect rollouts and that optimal context varies.

2605.10889 · May 11, 2026 · Mohammadreza Armandpour, Fatih Ilhan, David Harrison +6

Shields to Guarantee Probabilistic Safety in MDPs

This paper extends classical safety shields to guarantee probabilistic safety in Markov Decision Processes, introducing new constructions.

2605.10888 · May 11, 2026 · Linus Heck, Filip Macák, Roman Andriushchenko +2

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

LoKA introduces a system-model co-design framework to make FP8 low-precision arithmetic practical and efficient for large recommendation models.

2605.10886 · May 11, 2026 · Liang Luo, Yinbin Ma, Quanyu Zhu +20

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

AssayBench is a new benchmark for phenotypic screen prediction in virtual cell models, evaluating LLMs and agents on diverse cellular phenotypes.

2605.10876 · May 11, 2026 · Edward De Brouwer, Carl Edwards, Alexander Wu +9

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench is a new unified multimodal benchmark for evaluating AI models in generating editable CAD programs from various inputs.

2605.10873 · May 11, 2026 · Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme +3

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

This paper introduces a decision-centric rate-distortion framework for agent memory, proposing DeMem to optimize memory by preserving distinctions crucial for decisions.

2605.10870 · May 11, 2026 · Mingxi Zou, Zhihan Guo, Langzhang Liang +6

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

BEACON is a large, multimodal dataset from competitive Valorant gameplay for continuous authentication and behavioral fingerprinting research.

2605.10867 · May 11, 2026 · Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal +3

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

BenchCAD is a new industry-standard benchmark for evaluating MLLMs on generating executable parametric CAD programs, revealing current models' limitations.

2605.10865 · May 11, 2026 · Haozhe Zhang, Kaichen Liu, Miaomiao Chen +4

The Generalized Turing Test: A Foundation for Comparing Intelligence

The Generalized Turing Test (GTT) offers a formal, dataset-agnostic framework to compare AI agent intelligence via indistinguishability.

2605.10851 · May 11, 2026 · Daniel Mitropolsky, Susan S. Hong, Riccardo Neumarker +2

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Pi-Serini demonstrates that well-tuned lexical retrieval with capable LLMs can effectively support deep agentic search, outperforming dense retrievers.

2605.10848 · May 11, 2026 · Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

DISCA is a training-free, inference-time method that culturally aligns LLMs by leveraging within-country sociodemographic disagreement, improving fairness.

2605.10843 · May 11, 2026 · Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen +5

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

Clin-JEPA is a multi-phase co-training framework for JEPA pretraining on EHR patient trajectories, enabling accurate forecasting and risk prediction.

2605.10840 · May 11, 2026 · Yixuan Yang, Mehak Arora, Ryan Zhang +10

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

This paper introduces a new evaluation protocol for AI pentesting agents, shifting from task completion to realistic vulnerability discovery.

2605.10834 · May 11, 2026 · Pedro Conde, Henrique Branquinho, Valerio Mazzone +3

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

This paper introduces MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection, and VISTA, a model that outperforms GPT-5.4.

2605.10833 · May 11, 2026 · Xiran Zhao, Jing Jin, Yan Bai +6

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

SLIM enhances LLM molecular editing by using sparse latent steering to precisely control properties and improve success rates.

2605.10831 · May 11, 2026 · Mingxu Zhang, Yuhan Li, Lujundong Li +3

ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

ALAM learns algebraically consistent latent transitions from action-free videos, significantly boosting VLA policy performance on complex robot manipulation tasks.

2605.10819 · May 11, 2026 · Zuojin Tang, Haoyun Liu, Xinyuan Chang +11

