The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness
Rose Niousha, Samantha Boatright Smith, Bita Akram, Peter Brusilovsky, Arto Hellas, et al.
TLDR
This paper introduces a new evaluation framework for AI tutors that focuses on students' behavioral responses to feedback, applied to over 10,000 code submissions.
Key contributions
- Identifies a "missing evaluation axis" for AI tutors: student behavioral responses to feedback.
- Proposes a framework to measure whether students act on tutor feedback and whether they apply it correctly.
- Applies the framework to 10,235 student code submissions, comparing two deployed AI tutors.
- Shows that these behavioral signals predict students' perceived helpfulness of feedback better than pedagogical quality alone.
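To make the behavioral dimension concrete, the sketch below buckets a student's response to one round of tutor feedback by comparing the submission that received feedback with the next submission (if any). The category names, the `Attempt` record, and the use of autograder pass/fail as the correctness signal are illustrative assumptions, not the paper's actual framework.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    """One code submission: the source and whether it passed the autograder."""
    code: str
    passed: bool

def classify_response(before: Attempt, after: Optional[Attempt]) -> str:
    """Bucket a student's behavioral response to tutor feedback.

    Hypothetical categories illustrating the kind of signal a
    behavioral evaluation might extract per feedback message:
      - "no_action":          no follow-up submission at all
      - "no_change":          resubmitted identical code
      - "acted_correctly":    changed the code and now passes
      - "acted_incorrectly":  changed the code but still fails
    """
    if after is None:
        return "no_action"
    if after.code == before.code:
        return "no_change"
    return "acted_correctly" if after.passed else "acted_incorrectly"
```

Aggregating these labels per tutor (e.g., the fraction of feedback messages followed by "acted_correctly") yields the kind of engagement-based comparison the paper argues pedagogy-only evaluation misses.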
Why it matters
This paper introduces a crucial behavioral dimension to AI tutor evaluation, moving beyond just pedagogical quality. By analyzing student actions, it provides a more complete and actionable picture of tutor performance, enabling better design and deployment of effective AI learning tools.
Original Abstract
Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framework and apply it to 10,235 code submissions with corresponding AI tutor feedback from an introductory undergraduate programming course to measure whether students act on tutor feedback and whether those actions are applied correctly. Using this framework to compare two deployed AI tutors across different semesters in a large-scale introductory computer science course reveals substantial differences in student engagement patterns that are not captured by pedagogy-only evaluation. Moreover, these engagement-based behavioral signals are more strongly associated with student perception of helpful feedback than pedagogical quality alone, providing a more complete and actionable picture of AI tutor performance.