ArXiv TLDR
Weekly · June 28, 2026

Top AI Papers This Week

The top 10 AI/ML papers from arXiv in the last 7 days, ranked by trending research topics and summarized in one line each.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week, summarized, scored, and delivered to your inbox every Monday.

  1. #1 cs.RO
    dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

    dVLA-RL introduces a novel reinforcement learning approach for discrete diffusion Vision-Language-Action models by optimizing denoising trajectories.

    • Enables Reinforcement Learning (RL) for Discrete Diffusion VLAs (dVLAs) by optimizing denoising trajectories.
    • Formulates the denoising process as an MDP, addressing the intractable marginal action probabilities of dVLAs.
    • Introduces unified step scheduling, adapting denoising steps for efficient multi-task learning.
  2. #2 cs.LG, cs.AI, cs.AR
    The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

    This paper presents a roofline-inspired scaling model to accurately predict the energy consumption of Transformer fine-tuning across multiple GPUs.

    • Develops a framework to model the energy consumption of Transformer training across multiple GPUs.
    • Relates measured energy to proxies for compute, memory traffic, and hardware efficiency.
    • Introduces a roofline-inspired hardware-efficiency factor to account for parallelism effects.
  3. #3 cs.SE, cs.AI, cs.CR
    AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming

    AutoSpec uses Inductive Logic Programming to automatically evolve and refine safety rules for LLM agents, reducing false positives while maintaining high recall.

    • Introduces AutoSpec, a framework for automatically evolving safety rules for LLM agents.
    • Leverages Inductive Logic Programming (ILP) to learn rule edits from user annotations and counterexamples.
    • Reduces false positives by up to 94% and achieves high F1 scores (0.98, 0.93) in agent safety.
  4. #4 cs.CV, cs.AI
    TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

    TriViewBench introduces a controlled benchmark to reveal MLLM limitations in multi-view structural reasoning, showing severe performance drops with complexity.

    • Introduces TriViewBench, a novel 3-view benchmark with synthetic 3D scenes for controlled complexity scaling.
    • Evaluates 18 MLLMs, revealing a consistent capability hierarchy and monotonic performance degradation with complexity.
    • Identifies distinct failure modes in object counting: undercounting (occlusion) and overcounting (cross-view confusion).
  5. #5 cs.AI, cs.CR, cs.LG
    The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

    This paper introduces the Unfireable Safety Kernel, an external, execution-time AI alignment mechanism that prevents AI agents from bypassing safety controls.

    • Introduces the notion of "escapable AI systems," in which agents can bypass internal safety controls.
    • Proposes an "Unfireable Safety Kernel" for execution-time AI alignment with four architectural control properties.
    • Implements the kernel in Rust, featuring machine-checked fail-closed invariants for robust security.
  6. #6 cs.AI, cs.LG
    InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy

    InvestPhilBench is a new multi-layer dynamic benchmark and scoring pipeline evaluating LLM procedural reasoning in expert investment philosophy.

    • Introduces InvestPhilBench, a multi-layer dynamic benchmark with 8 cognitive tiers for LLM investment reasoning.
    • Comprises 118 primary-source-verified principle cards, 25 decision frameworks, and 243 QA questions.
    • Features the Benchmark Automated Scoring Pipeline (BASP) with 5 metrics and a Failure Mode Detection Protocol.
  7. #7 cs.RO
    ForesightSafety-VLA: A Unified Diagnostic Safety Benchmark for Vision-Language-Action Models

    ForesightSafety-VLA is a new diagnostic benchmark for Vision-Language-Action (VLA) models, focusing on embodied safety evaluation.

    • Introduces ForesightSafety-VLA, a diagnostic benchmark for evaluating embodied safety in VLA models.
    • Defines a 13-category safety taxonomy across physical interaction, instruction, and perception.
    • Evaluates policies under scene structure, language, and visual variations to diagnose failure sources.
  8. #8 cs.RO
    LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

    LIBERO-Safety introduces a benchmark and dataset for evaluating physical and semantic safety in VLA models, revealing generalization-safety trade-offs.

    • Introduces LIBERO-Safety, a parametric benchmark for procedurally generating safety-critical VLA scenarios.
    • Develops a novel keypose-driven pipeline for scalable data generation, overcoming teleoperation bottlenecks.
    • Curates a large-scale dataset of 19,664 strictly collision-free demonstrations with domain randomization.
  9. #9 cs.AI, cs.CL, cs.CV
    VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

    VeriEvol scales multimodal mathematical reasoning by generating harder, image-grounded prompts and ensuring answer reliability through a novel verification framework.

    • Introduces VeriEvol, an iterative framework for verifiable data construction in visual mathematical reasoning.
    • Uses a type-aware evolution module to generate harder, image-grounded math prompts.
    • Employs HTV-Agent, a verifier that ensures answer reliability through multi-source falsification.
  10. #10 cs.IR
    Scaling Dense Retrieval with LLM-Annotated Training Data: Structured Mining and Progressive Curriculum for E-Commerce Sponsored Search

    A new pipeline uses LLM-annotated data and curriculum learning to scale dense retrieval for e-commerce search, significantly boosting performance over click-based methods.

    • Multi-channel retrieval mining from three production systems for diverse training signals.
    • LLM cascade for graded-relevance annotation, achieving 89.1% agreement with human labels.
    • Three-stage progressive curriculum training with 240M+ examples across five difficulty levels.
