ArXiv TLDR

Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs

2604.27618

Naomi Esposito, Anthony Tricarico, Luisa Porzio, Ali Aghazadeh Ardebili, Massimo Stella

cs.AI cs.CY cs.HC cs.LG cs.SI

TLDR

MEDS is a new dataset mapping LLM math reasoning, anxiety, and confidence across 28,000 personas to improve AI math education.

Key contributions

  • Introduces MEDS, a dataset of 28,000 personas from 14 LLMs, each shadowing either a human or an AI assistant on math tasks.
  • Includes four task types: open math interviews, psychometric tests, cognitive networks, and 18 high-school math questions.
  • Integrates self-efficacy, math anxiety, and cognitive networks beyond just math proficiency scores.
  • Reveals LLM peculiarities like human-like negative attitudes, logical fallacies, and math overconfidence.

Why it matters

This paper introduces MEDS, a dataset that goes beyond traditional score-only math benchmarks by assessing LLM math performance alongside psychological factors such as anxiety and self-confidence. It surfaces LLM biases and reasoning patterns, providing data valuable for developing more effective and safer AI math tutors and for advancing learning analytics.

Original Abstract

To enhance LLMs' impact on math education, we need data on their mathematical prowess and biases across prompts. To fill this gap, we introduce MEDS (Math Education Digital Shadows) as a dataset mapping how large language models reason about and report mathematics across human- and AI-like conditions. MEDS involves 28,000 personas from 14 LLMs (from families like Mistral, Qwen, DeepSeek, Granite, Phi and Grok) shadowing either humans or AI assistants. Each record/shadow includes a set of prompts along with psychological/sociodemographic persona metadata and four types of math tasks: (i) open math interview, (ii) three psychometric tests about math perceptions with explanations, (iii) cognitive networks capturing math attitudes, and (iv) 18 high-school math test questions together with their reasoning and confidence scores. MEDS differs from traditional score-only math benchmarks because it integrates concepts of self-efficacy, math anxiety, and cognitive network science besides math proficiency scores. Data validation shows that the sampled LLMs exhibit schema integrity and consistent personas, together with family-specific peculiarities like human-like negative math attitudes, logical fallacies, and math overconfidence. MEDS will benefit learning analytics experts, cognitive scientists, and developers of safer AI tutors in mathematics.
