Ensuring Reliability in Programming Knowledge Tracing: A Re-evaluation of Attention-augmented Models and Experimental Protocols
TLDR
This study re-evaluates attention-augmented PKT models, showing that reported gains often stem from experimental design flaws, and proposes a robust evaluation protocol.
Key contributions
- Identified that reported performance gains in PKT models often reflect model configuration and sequence construction choices rather than genuine architectural improvements.
- Demonstrated that improper ordering of student attempts (e.g., ignoring ServerTimestamp) violates temporal causality and leads to overly optimistic results (see the first sketch after this list).
- Proposed a consistent evaluation protocol: hyperparameters are chosen via grid search on a single designated fold, then held fixed across all cross-validation folds (second sketch below).
- Showed that, under controlled and consistent settings, the performance gap between attention-enhanced models and standard DKT narrows substantially.
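On the ordering issue, here is a minimal sketch of causality-preserving sequence construction for a ProgSnap2-style event log such as CodeWorkout's. The file name and column names (`MainTable.csv`, `SubjectID`, `ServerTimestamp`, `ProblemID`, `Score`) are assumptions about the export format, not details confirmed by the paper.

```python
import pandas as pd

# Load the event log (file and column names assumed; adjust to your export).
events = pd.read_csv("MainTable.csv")

# Parse timestamps so ordering is chronological rather than lexicographic.
events["ServerTimestamp"] = pd.to_datetime(events["ServerTimestamp"])

# Causality-preserving ordering: sort each student's attempts by the
# server-side timestamp before building model input sequences. Relying on
# raw row order instead can leak future attempts into the past and inflate
# evaluation metrics.
events = events.sort_values(["SubjectID", "ServerTimestamp"])

# Group into per-student attempt sequences for the knowledge-tracing model.
sequences = {
    student: group[["ProblemID", "Score"]].to_records(index=False).tolist()
    for student, group in events.groupby("SubjectID", sort=False)
}
```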
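And a sketch of the evaluation protocol as summarized above: tune hyperparameters on one designated fold, then score every fold under those same fixed settings. A `LogisticRegression` on synthetic data stands in for the actual PKT model, and the grid values are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-ins; replace with real student features/labels and model.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.integers(0, 2, size=500)

def fit_and_score(params, train_idx, test_idx):
    """Train with the given hyperparameters and return test AUC."""
    model = LogisticRegression(C=params["C"], max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    return roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])

folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# Step 1: grid search guided by a single designated fold (here, fold 0).
grid = {"C": [0.01, 0.1, 1.0, 10.0]}
tune_train, tune_test = folds[0]
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: fit_and_score(params, tune_train, tune_test),
)

# Step 2: hold the chosen hyperparameters fixed and report performance
# across all folds, so every fold is evaluated under identical settings.
scores = [fit_and_score(best, tr, te) for tr, te in folds]
print(f"params={best}, AUC={np.mean(scores):.3f} ± {np.std(scores):.3f}")
```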
Why it matters
This paper is crucial for reliable research in Programming Knowledge Tracing. It exposes common experimental design flaws that inflate reported model performance and provides a rigorous protocol for consistent, comparable evaluations, ensuring that future advances are built on solid, reproducible foundations.
Original Abstract
Programming Knowledge Tracing (PKT) has recently advanced through hybrid approaches that integrate attention-based feature modeling for code representation with RNN-based sequential prediction. While these models report strong empirical performance, their reliability can be sensitive to subtle implementation and experimental design choices. This study revisits representative PKT models and shows that reported gains can be substantially influenced by model configuration and sequence construction practices. We identify issues in attention dimension settings that affect performance estimates, and demonstrate that improper ordering of student attempts, such as ignoring ServerTimestamp, can violate temporal causality and lead to overly optimistic results. To ensure consistent evaluation, hyperparameters are selected via grid search guided by a single designated fold and then fixed uniformly across all folds during cross-validation. We further analyze the role of assignment-wise characteristics and systematically explore the impact of maximum sequence length. Using this protocol, we re-evaluate PKT models on the CodeWorkout dataset. Our results show that, under controlled and consistent settings, the performance gap between attention-enhanced models and standard DKT is significantly reduced, and increased architectural complexity does not consistently translate into superior performance. Beyond individual model comparisons, this work provides practical guidance for reliable and comparable evaluation in programming knowledge tracing.
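The abstract also mentions systematically exploring the maximum sequence length. As a rough, hypothetical illustration of that sweep (the helper and the cutoff values below are ours, not the paper's), each student's attempt sequence is capped and the model is re-evaluated at each cap; whether the cap keeps the earliest or the most recent attempts is itself a design choice that can shift results.

```python
def cap_sequence(attempts, max_len, keep="last"):
    """Truncate one student's attempt sequence to at most max_len steps.

    keep="last" retains the most recent attempts; keep="first" the earliest.
    """
    if len(attempts) <= max_len:
        return attempts
    return attempts[-max_len:] if keep == "last" else attempts[:max_len]

# Toy stand-in for per-student sequences (see the ordering sketch above).
sequences = {"s1": list(range(7)), "s2": list(range(3))}
for max_len in (2, 5):  # candidate caps; real sweeps would use larger values
    capped = {s: cap_sequence(seq, max_len) for s, seq in sequences.items()}
    # ...train and evaluate the PKT model on `capped` here...
    print(max_len, capped)
```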