Artificial Intelligence
Research on AI systems, knowledge representation, planning, and general intelligence.
cs.AI · 1428 papers
Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
A new confidence-guided diffusion augmentation framework significantly boosts Bangla compound character recognition by synthesizing and filtering high-quality data.
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a functional programming model for meta-agents that uses a Git-like execution trace for fast state forking and replay.
Engineering Robustness into Personal Agents with the AI Workflow Store
This paper introduces an AI Workflow Store to integrate rigorous software engineering into AI agents, creating robust, reusable workflows instead of brittle on-the-fly systems.
DataMaster: Towards Autonomous Data Engineering for Machine Learning
DataMaster automates data engineering for ML, using a novel agent framework with tree search, shared data, and memory to boost model performance.
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
This paper introduces a diagnostic framework for analyzing on-policy distillation, revealing that it helps more on incorrect rollouts and that the optimal context varies.
Shields to Guarantee Probabilistic Safety in MDPs
This paper extends classical safety shields to guarantee probabilistic safety in Markov Decision Processes, introducing new constructions.
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA introduces a system-model co-design framework to make FP8 low-precision arithmetic practical and efficient for large recommendation models.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new benchmark for phenotypic screen prediction in virtual cell models, evaluating LLMs and agents on diverse cellular phenotypes.
CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation
CADBench is a new unified multimodal benchmark for evaluating AI models in generating editable CAD programs from various inputs.
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
This paper introduces a decision-centric rate-distortion framework for agent memory, proposing DeMem to optimize memory by preserving distinctions crucial for decisions.
BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data
BEACON is a large, multimodal dataset from competitive Valorant gameplay for continuous authentication and behavioral fingerprinting research.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
BenchCAD is a new industry-standard benchmark for evaluating MLLMs on generating executable parametric CAD programs, revealing current models' limitations.
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test (GTT) offers a formal, dataset-agnostic framework to compare AI agent intelligence via indistinguishability.
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Pi-Serini demonstrates that well-tuned lexical retrieval with capable LLMs can effectively support deep agentic search, outperforming dense retrievers.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA is a training-free, inference-time method that culturally aligns LLMs by leveraging within-country sociodemographic disagreement, improving fairness.
Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories
Clin-JEPA is a multi-phase co-training framework for JEPA pretraining on EHR patient trajectories, enabling accurate forecasting and risk prediction.
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
This paper introduces a new evaluation protocol for AI pentesting agents, shifting from task completion to realistic vulnerability discovery.
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
This paper introduces MMVIAD, the first continuous multi-view video dataset for industrial anomaly detection, and VISTA, a model that outperforms GPT-5.4.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM enhances LLM molecular editing by using sparse latent steering to precisely control properties and improve success rates.
ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
ALAM learns algebraically consistent latent transitions from action-free videos, significantly boosting VLA policy performance on complex robot manipulation tasks.