Diagnosing CFG Interpretation in LLMs

April 22, 2026

arXiv:2604.20811

cs.AI

TLDR

This paper diagnoses how LLMs interpret context-free grammars, revealing failures in structural semantics and reliance on semantic bootstrapping.

Key contributions

Evaluates LLMs' in-context interpretation of novel context-free grammars (CFGs).
Introduces RoboGrid to stress-test LLMs on CFG syntax, behavior, and semantics.
Reveals that LLMs often maintain surface syntax but fail to preserve structural semantics, with performance collapsing under deep recursion and high branching.
Demonstrates LLMs rely on semantic bootstrapping from keywords, not pure symbolic induction.

Why it matters

This work matters for agentic systems in which LLMs must adhere to dynamically defined, machine-interpretable interfaces. It shows that LLMs struggle with hierarchical state-tracking and symbolic induction, both of which are needed to build reliable, grammar-agnostic agents.

Original Abstract

As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.
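To make the distinction between syntax, behavior, and recursion depth concrete, here is a minimal Python sketch. It is not RoboGrid's implementation: the toy grammar, the keywords (`move`, `left`, `right`, `repeat`), and the function names `parse`, `run`, and `depth` are all assumptions made for illustration. `parse` checks syntactic validity, `run` executes a program on an unbounded grid so its final state can be compared with an expected one, and `depth` measures nesting.

```python
from typing import List, Tuple, Union

# Hypothetical toy grammar, assumed for illustration (not taken from the paper):
#   program := stmt*
#   stmt    := "move" | "left" | "right" | "repeat" INT "{" program "}"
Node = Union[str, Tuple[int, list]]


def _parse_program(tokens: List[str], pos: int):
    """Parse statements until end of input or a closing brace."""
    body: List[Node] = []
    while pos < len(tokens) and tokens[pos] != "}":
        tok = tokens[pos]
        if tok in ("move", "left", "right"):
            body.append(tok)
            pos += 1
        elif tok == "repeat":
            if (pos + 2 >= len(tokens)
                    or not tokens[pos + 1].isdecimal()
                    or tokens[pos + 2] != "{"):
                raise ValueError(f"malformed repeat at token {pos}")
            count = int(tokens[pos + 1])
            inner, pos = _parse_program(tokens, pos + 3)
            if pos >= len(tokens) or tokens[pos] != "}":
                raise ValueError("missing closing brace")
            body.append((count, inner))
            pos += 1  # consume the closing brace
        else:
            raise ValueError(f"unknown token {tok!r} at {pos}")
    return body, pos


def parse(source: str) -> List[Node]:
    """Syntax check: return the parse tree or raise ValueError."""
    tokens = source.split()
    program, pos = _parse_program(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r} at {pos}")
    return program


def run(program: List[Node], state=(0, 0, 0)):
    """Behavior check: execute and return (x, y, heading); 0=N, 1=E, 2=S, 3=W."""
    x, y, h = state
    for node in program:
        if node == "move":
            dx, dy = [(0, 1), (1, 0), (0, -1), (-1, 0)][h]
            x, y = x + dx, y + dy
        elif node == "left":
            h = (h - 1) % 4
        elif node == "right":
            h = (h + 1) % 4
        else:
            count, body = node
            for _ in range(count):
                x, y, h = run(body, (x, y, h))
    return x, y, h


def depth(program: List[Node]) -> int:
    """Maximum nesting depth of repeat blocks."""
    return max((1 + depth(n[1]) for n in program if isinstance(n, tuple)),
               default=0)
```

Under these assumptions, a model output can be syntactically valid (it parses) yet behaviorally wrong (its final state differs from the target), which is the kind of gap the abstract describes. Swapping the keywords for meaningless tokens would give a rough analogue of the "Alien" lexicon condition.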
