ArXiv TLDR

Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

arXiv: 2604.07304

Eduard Frankford, Erik Cikalleshi, Ruth Breu

cs.SE, cs.AI

TLDR

This paper proposes a Hybrid Socratic Framework that uses chatbots to assess students' code understanding in automated programming assessment systems, addressing the challenge that LLMs let students submit functionally correct code without understanding it.

Key contributions

  • Conducted a scoping review of conversational assessment approaches in programming education.
  • Identified three dominant architectural families: rule-based, LLM-based, and hybrid systems.
  • Proposed a Hybrid Socratic Framework for integrating conversational verification into APASs.
  • Discussed practical safeguards against LLM-generated explanations and over-reliance.

Why it matters

LLMs undermine traditional programming assessment by letting students produce correct code without understanding it. This paper contributes a framework for verifying genuine code understanding, supporting deeper learning and academic integrity in AI-assisted education.

Original Abstract

Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
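
As a rough illustration of the guardrail idea described in the abstract (tying conversational prompts to runtime facts and asking randomized trace questions about concrete execution states), here is a minimal Python sketch. It is not code from the paper; the function names (run_submission, make_trace_question, check_trace_answer) and the toy student submission are assumptions, used only to show how a randomized trace question can be generated from an observed execution state and graded deterministically before any LLM-based dialogue begins.

```python
import random


def run_submission(func, test_input):
    """Execute the student's function on a concrete input and record the result
    as a runtime fact that later questions can be anchored to."""
    return func(test_input)


def make_trace_question(func_name, test_input, observed_output):
    """Build a randomized trace question tied to an observed execution state."""
    question = (f"Your function {func_name} was just called with n={test_input}. "
                f"Without re-running it, what value does it return?")
    return question, observed_output


def check_trace_answer(student_answer, expected):
    """Deterministic guardrail: grade the answer against the recorded runtime fact,
    so no LLM judgment is needed to decide whether the explanation matches reality."""
    return student_answer.strip() == str(expected)


# --- Illustrative usage with a toy student submission ---
def sum_of_squares(n):
    return sum(i * i for i in range(n + 1))


test_input = random.randint(3, 7)          # randomized so canned answers do not transfer
observed = run_submission(sum_of_squares, test_input)
question, expected = make_trace_question("sum_of_squares", test_input, observed)

print(question)
print(check_trace_answer(str(expected), expected))  # a correct trace answer passes
```

In a full APAS, the conversational layer would wrap checks like this in scaffolded Socratic questioning and knowledge tracking; the point of the sketch is only that the ground truth for each question comes from deterministic execution of the submitted code, not from the model.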
