Top AI Papers This Week
The top 10 AI/ML papers from arXiv in the last 7 days, ranked by trending research topics and summarized in one line each.
📬 Weekly AI Paper Digest
Get the week's top 10 AI/ML arXiv papers: summarized, scored, and delivered to your inbox every Monday.
- #1 cs.RO dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models
  dVLA-RL introduces a reinforcement learning approach for discrete diffusion Vision-Language-Action models that optimizes denoising trajectories.
  - Enables reinforcement learning (RL) for discrete diffusion VLAs (dVLAs) by optimizing denoising trajectories.
  - Formulates the denoising process as an MDP, addressing the intractable marginal action probability of dVLAs.
  - Introduces unified step scheduling, which adapts denoising steps for efficient multi-task learning.
- #2 cs.LG, cs.AI, cs.AR The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model
  This paper presents a roofline-inspired scaling model for predicting the energy consumption of Transformer fine-tuning across multiple GPUs.
  - Develops a framework to model the energy consumption of Transformer training across multiple GPUs.
  - Relates measured energy to proxies for compute, memory traffic, and hardware efficiency.
  - Introduces a roofline-inspired hardware-efficiency factor to account for parallelism effects.
- #3 cs.SE, cs.AI, cs.CR AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming
  AutoSpec uses Inductive Logic Programming to automatically evolve and refine safety rules for LLM agents, reducing false positives while maintaining high recall.
  - Introduces AutoSpec, a framework for automatically evolving safety rules for LLM agents.
  - Leverages Inductive Logic Programming (ILP) to learn rule edits from user annotations and counterexamples.
  - Reduces false positives by up to 94% and achieves high F1 scores (0.98, 0.93) in agent safety.
- #4 cs.CV, cs.AI TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs
  TriViewBench introduces a controlled benchmark to reveal MLLM limitations in multi-view structural reasoning, showing severe performance drops with complexity.
  - Introduces TriViewBench, a novel 3-view benchmark with synthetic 3D scenes for controlled complexity scaling.
  - Evaluates 18 MLLMs, revealing a consistent capability hierarchy and monotonic performance degradation with complexity.
  - Identifies distinct failure modes in object counting: undercounting (occlusion) and overcounting (cross-view confusion).
- #5 cs.AI, cs.CR, cs.LG The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems
  This paper introduces the Unfireable Safety Kernel, an external, execution-time AI alignment mechanism that prevents AI agents from bypassing safety controls.
  - Introduces the notion of "escapable AI systems," in which agents can bypass internal safety controls.
  - Proposes an "Unfireable Safety Kernel" for execution-time AI alignment with four architectural control properties.
  - Implements the kernel in Rust, with machine-checked fail-closed invariants.
- #6 cs.AI, cs.LG InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy
  InvestPhilBench is a new multi-layer dynamic benchmark and scoring pipeline evaluating LLM procedural reasoning in expert investment philosophy.
  - Introduces InvestPhilBench, a multi-layer dynamic benchmark with 8 cognitive tiers for LLM investment reasoning.
  - Comprises 118 primary-source-verified principle cards, 25 decision frameworks, and 243 QA questions.
  - Features the Benchmark Automated Scoring Pipeline (BASP) with 5 metrics and a Failure Mode Detection Protocol.
- #7 cs.RO ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models
  ForesightSafety-VLA is a new diagnostic benchmark for Vision-Language-Action (VLA) models, focusing on embodied safety evaluation.
  - Introduces ForesightSafety-VLA, a diagnostic benchmark for evaluating embodied safety in VLA models.
  - Defines a 13-category safety taxonomy across physical interaction, instruction, and perception.
  - Evaluates policies under scene structure, language, and visual variations to diagnose failure sources.
- #8 cs.RO LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
  LIBERO-Safety introduces a benchmark and dataset for evaluating physical and semantic safety in VLA models, revealing generalization-safety trade-offs.
  - Introduces LIBERO-Safety, a parametric benchmark for procedurally generating safety-critical VLA scenarios.
  - Develops a novel keypose-driven pipeline for scalable data generation, overcoming teleoperation bottlenecks.
  - Curates a large-scale dataset of 19,664 strictly collision-free demonstrations with domain randomization.
- #9 cs.AI, cs.CL, cs.CV VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct
  VeriEvol scales multimodal mathematical reasoning by generating harder, image-grounded prompts and ensuring answer reliability through a novel verification framework.
  - Introduces VeriEvol, an iterative framework for verifiable data construction in visual mathematical reasoning.
  - Uses a type-aware evolution module to generate harder, image-grounded math prompts.
  - Employs HTV-Agent, a verifier that ensures answer reliability through multi-source falsification.
- #10 cs.IR Scaling Dense Retrieval with LLM-Annotated Training Data: Structured Mining and Progressive Curriculum for E-Commerce Sponsored Search
  A new pipeline uses LLM-annotated data and curriculum learning to scale dense retrieval for e-commerce search, significantly boosting performance over click-based methods.
  - Multi-channel retrieval mining from three production systems for diverse training signals.
  - LLM cascade for graded-relevance annotation, achieving 89.1% agreement with human labels.
  - Three-stage progressive curriculum training with 240M+ examples across five difficulty levels.