Top AI Papers This Week
The top 10 AI/ML papers from arXiv in the last 7 days, ranked by trending research topics and summarized in one line each.
- #1 [cs.SE, cs.AI] Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
LLM agents struggle significantly with structural constraints in backend code generation, showing "constraint decay" as requirements accumulate.
- Systematic study on LLM agents' ability to handle structural constraints in multi-file backend code.
- Introduces "constraint decay": agent performance substantially declines as structural requirements accumulate.
- Agents perform better in minimal frameworks (Flask) but struggle in convention-heavy ones (FastAPI, Django).
- #2 [cs.CV, cs.AI, cs.LG] Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
GAP proposes a granular alignment paradigm to stabilize visual latent reasoning in MLLMs by addressing feature-space mismatches, improving performance.
- Identifies and addresses a feature-space mismatch in MLLMs causing unstable visual latent reasoning.
- Introduces GAP with feature-level alignment using a PCA-aligned latent head for input-compatible latents.
- Incorporates context-level alignment with auxiliary visual supervision and capacity-guided selective supervision.
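The paper's architecture is not detailed in this summary, but the idea of a "PCA-aligned" head can be sketched generically: fit PCA on the visual input embeddings, then project reasoning latents onto that basis so they live in the same subspace as the inputs. Everything below (`fit_pca`, `align_latents`, the random stand-in data) is an illustrative assumption, not the paper's actual method.

```python
import numpy as np

def fit_pca(X, k):
    """Fit a k-component PCA on rows of X; return (mean, components)."""
    mu = X.mean(axis=0)
    # SVD of the centered data gives principal directions in the rows of Vt
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def align_latents(Z, mu, components):
    """Project latents Z onto the PCA basis of the input embeddings and
    reconstruct, so aligned latents lie in the input-embedding subspace."""
    coords = (Z - mu) @ components.T   # coordinates in the PCA basis
    return coords @ components + mu    # back to embedding space

rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 64))    # stand-in visual input embeddings
latents = rng.normal(size=(8, 64))     # stand-in reasoning latents
mu, comps = fit_pca(inputs, k=16)
aligned = align_latents(latents, mu, comps)
print(aligned.shape)  # (8, 64)
```

Because the projection is idempotent, re-aligning an already aligned latent is a no-op, which is one way such a head could keep latents "input-compatible" across reasoning steps.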
- #3 [q-bio.GN, cs.AI, q-bio.CB] OmicsLM: A Multimodal Large Language Model for Multi-Sample Omics Reasoning
OmicsLM is a multimodal LLM that connects quantitative omics data with natural language for biological reasoning, outperforming existing models.
- Introduces OmicsLM, a multimodal LLM linking quantitative omics profiles with natural-language biological tasks.
- Represents transcriptomic data as compact continuous representations within the LLM context for multi-sample processing.
- Trained on 5.5M examples across 70+ task types, covering diverse biological reasoning challenges.
- #4 [cs.CV] OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
OpenSearch-VL provides an open-source recipe for training frontier multimodal deep search agents, achieving state-of-the-art performance.
- Curated high-quality training data (SearchVL-SFT-36k, SearchVL-RL-8k) reducing retrieval shortcuts.
- Developed a diverse tool environment integrating text/image search, OCR, and image processing.
- Introduced a fatal-aware GRPO algorithm to manage cascading tool failures in multi-turn interactions.
- #5 [cs.CV, cs.AI, cs.IR] Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery
Open-SAT improves open-vocabulary satellite image retrieval by using LLMs to refine query embeddings at inference time, achieving significant F1 score gains.
- Refines VLM-generated query embeddings using LLMs for better alignment with satellite imagery.
- Operates as a training-free, inference-time algorithm, avoiding additional model training.
- Leverages contextual information from LLMs about objects and their surroundings for enhanced retrieval.
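The summary does not specify the refinement procedure, but the generic shape of a training-free, inference-time refinement can be sketched: blend the original query embedding with embeddings of LLM-generated context phrases. Here `embed` is a stand-in for a real text encoder (e.g. a CLIP-style model), implemented as a deterministic toy hash so the sketch runs without model weights; the blending weight `alpha` and the example phrases are likewise assumptions.

```python
import numpy as np

def embed(text, dim=32):
    """Toy stand-in for a text encoder: a unit vector seeded by the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def refine_query(query, context_phrases, alpha=0.7):
    """Blend the query embedding with the mean embedding of LLM-supplied
    context phrases, then renormalize. No training involved."""
    q = embed(query)
    ctx = np.mean([embed(p) for p in context_phrases], axis=0)
    refined = alpha * q + (1 - alpha) * ctx
    return refined / np.linalg.norm(refined)

refined = refine_query(
    "airplane",
    ["parked on an airport apron", "near a runway and taxiways"],
)
print(refined.shape)  # (32,)
```

The appeal of this family of methods is exactly what the bullets note: the retrieval model stays frozen, and all of the adaptation happens in the query embedding at inference time.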
- #6 [cs.CL] Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents
This paper introduces a framework to evaluate LLM research agents' source attribution, revealing high link validity but low factual accuracy in citations.
- Developed a novel framework using an AST parser to evaluate LLM citation quality at scale.
- Evaluates citations across three dimensions: link accessibility, content relevance, and factual accuracy.
- Benchmarked 14 LLMs, finding high link validity (>94%) and relevance (>80%) but low factual accuracy (39-77%).
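The paper uses an AST parser; as a much simpler illustration of the first step in such a pipeline, the sketch below pulls markdown-style citations `[text](url)` out of an agent-written report with a regex, before any accessibility, relevance, or factual checks would run. The regex approach and the example report are assumptions for illustration only.

```python
import re

# Matches markdown links of the form [anchor text](http(s)://url)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")

def extract_citations(report: str):
    """Return (anchor_text, url) pairs for every markdown link in the report."""
    return LINK_RE.findall(report)

report = (
    "Transformer models [Vaswani et al.](https://arxiv.org/abs/1706.03762) "
    "underpin modern LLMs [survey](https://example.org/llm-survey)."
)
print(extract_citations(report))
# [('Vaswani et al.', 'https://arxiv.org/abs/1706.03762'),
#  ('survey', 'https://example.org/llm-survey')]
```

A proper AST-based parser, as the paper describes, would additionally handle nested markup and malformed links that a regex silently mangles.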
- #7 [cs.CR, cs.AI] CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
CyBiasBench reveals LLM cyber-attack agents exhibit inherent biases, concentrating efforts on specific attack families regardless of prompts.
- Introduced CyBiasBench, a 630-session benchmark to quantify attack-selection bias in LLM cyber agents.
- Found explicit, agent-specific biases in attack family allocation, independent of attack success rates.
- Identified a "bias momentum effect" where agents resist steering, without improving attack performance.
- #8 [cs.NE] The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
This paper proposes the Causally Emergent Alignment Hypothesis, showing that causal emergence in RL agents predicts final reward and aligns with learning.
- Introduces the Causally Emergent Alignment Hypothesis for RL agents.
- Shows causal emergence (CE) in latent spaces predicts final reward early in training.
- Demonstrates CE dynamics align with reward improvement across diverse RL tasks.
- #9 [cs.CL, q-bio.NC] Meow-Omni 1: A Multimodal Large Language Model for Feline Ethology
Meow-Omni 1 is the first quad-modal MLLM for feline ethology, fusing video, audio, physiology, and text to achieve SOTA intent recognition.
- Introduces Meow-Omni 1, the first open-source quad-modal MLLM for computational ethology.
- Fuses video, audio, physiological time-series, and text for deeper feline intent understanding.
- Achieves state-of-the-art 71.16% intent-recognition accuracy on the new MeowBench benchmark.
- #10 [cs.CR, cs.CL] LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
LITMUS benchmarks LLM agent behavioral jailbreaks in real OS environments, revealing critical safety gaps and a new "Execution Hallucination" phenomenon.
- Introduces LITMUS, a benchmark for LLM agent behavioral jailbreaks in real OS environments.
- Utilizes semantic-physical dual verification and OS-level state rollback for robust testing.
- Reveals agents execute dangerous operations in 40.64% of cases and exhibit "Execution Hallucination."
📬 Weekly AI Paper Digest
Get the top 10 AI/ML arXiv papers from the week, summarized, scored, and delivered to your inbox every Monday.