Natural Language Processing
Research on language models, text understanding, generation, and computational linguistics.
cs.CL · 805 papers
WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
WARDEN is a novel two-stage system that transcribes and translates endangered Wardaman to English using only 6 hours of audio, outperforming larger models.
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
EVA-Bench is a new end-to-end framework for evaluating voice agents using realistic bot-to-bot audio simulations and novel composite metrics.
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
TFlow enables multi-agent LLMs to communicate via transient weight perturbations, boosting efficiency and accuracy over text-based methods.
Negation Neglect: When models fail to learn negations in training
LLMs finetuned on documents that flag claims as false often learn to believe those claims are true, a phenomenon called Negation Neglect.
An LLM-Based System for Argument Reconstruction
This paper introduces an LLM-based system that reconstructs natural language arguments into abstract argument graphs, showing potential for scalable analysis.
Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
This paper introduces a novel method for detecting step-level hallucinations in LLMs by analyzing hidden-state transport geometry during a single forward pass.
Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching
This paper compares dense and MoE transformers at tiny scale, finding MoE outperforms dense when matching active parameters but not total parameters.
Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs
Omnimodal LLMs struggle to reject false textual claims contradicting sensory input, revealing a "Representation-Action Gap" in grounding.
Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
Fine-tuning compact 8B LLMs with expert curricula generates children's English stories with controllable difficulty and safety, outperforming larger models.
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
RTLC, a three-stage prompting paradigm inspired by the Feynman Learning Technique, significantly boosts LLM-as-judge accuracy on JudgeBench without fine-tuning.
Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas
A new intent-focused propaganda taxonomy and hierarchical prompting (HiPP) significantly improve robust propaganda classification, especially after fine-tuning.
Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Low-rank pre-training methods yield geometrically distinct solutions from full-rank models and each other, even with similar perplexity, requiring deeper evaluation metrics.
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile is an optimizing compiler for structured LLM workflows that explores design space at compile-time to find efficient, reusable configurations.
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
A new on-policy distillation method, "Prefix Teach, Suffix Fade," improves strong-to-weak model training by focusing supervision on locally teachable trajectory segments.
Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
RDPO improves multi-objective and mixed-reward RL by decorrelating rewards and stabilizing advantage allocation for diverse reward types.
Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction
This paper introduces edit-level majority voting to reduce over-correction in LLM-based grammatical error correction, improving performance.
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations
Automatic metrics and LLM judges poorly evaluate creativity in literary translations, often penalizing creative solutions and showing bias towards machine output.
Inducing Artificial Uncertainty in Language Models
A new method induces artificial uncertainty in language models on easy data, improving their calibration and uncertainty quantification on challenging tasks.
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
RealICU is a new benchmark for evaluating LLM agents on long-context ICU data, revealing recall-safety tradeoffs and anchoring biases in existing models.
Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models
An on-device PII substitution pipeline uses locale-conditioned few-shot prompting to prevent SLM regurgitation, though rule-based methods aid downstream NER more.