The Manokhin Probability Matrix: A Diagnostic Framework for Classifier Probability Quality
TLDR
The Manokhin Probability Matrix offers a two-dimensional diagnostic framework that separates classifier reliability (calibration) from resolution (discrimination), guiding targeted improvements.
Key contributions
- Introduces the Manokhin Probability Matrix, a 2D framework separating classifier reliability (calibration) and resolution (discrimination).
- Classifies models into four archetypes (Eagle, Bull, Sloth, Mole) based on their calibration and discrimination performance, each with a specific improvement prescription.
- Reports an empirical study across 21 classifiers, 5 post-hoc calibrators, and 30 real-world tasks that assigns specific models to each archetype and quantifies the effects of post-hoc calibration.
- Demonstrates that post-hoc calibration cannot improve discriminatory power, making discrimination the primary optimization target.
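The archetype assignment behind these contributions can be sketched in a few lines. The decision thresholds below (|Z| < 1.96 for calibration, an AUC cutoff of 0.7 for discrimination) are illustrative assumptions, not the paper's exact placement rule, which positions classifiers by AUC-ROC expected rank across tasks.

```python
import numpy as np

def spiegelhalter_z(y, p):
    """Spiegelhalter's Z-statistic: near 0 for a well-calibrated model."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    num = np.sum((y - p) * (1 - 2 * p))
    var = np.sum((1 - 2 * p) ** 2 * p * (1 - p))
    return num / np.sqrt(var)

def auc_roc(y, p):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y, p = np.asarray(y), np.asarray(p, float)
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def archetype(y, p, z_crit=1.96, auc_cut=0.7):
    """Place a classifier in one of the four quadrants of the matrix."""
    calibrated = abs(spiegelhalter_z(y, p)) < z_crit
    discriminates = auc_roc(y, p) > auc_cut
    if calibrated and discriminates:
        return "Eagle"
    if discriminates:
        return "Bull"   # fixable via post-hoc calibration
    if calibrated:
        return "Sloth"  # calibrated but uninformative
    return "Mole"
```

For example, predictions that separate the classes but are systematically shifted upward come out as "Bull", while predictions whose stated probabilities match the observed frequencies come out as "Eagle".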
Why it matters
This paper offers a crucial diagnostic tool for understanding and improving classifier performance beyond the aggregate Brier score. By separating calibration from discrimination, it provides clear, actionable strategies for model development: because post-hoc calibration cannot add discriminatory power, discrimination should be optimized first and calibration fixed afterwards.
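One classical way to make this separation concrete is Murphy's decomposition of the Brier score into reliability (calibration error), resolution (discriminatory power), and irreducible uncertainty. The binned sketch below is a standard textbook construction, not the paper's exact procedure.

```python
import numpy as np

def murphy_decomposition(y, p, n_bins=10):
    """Murphy (1973): Brier = reliability - resolution + uncertainty,
    computed over binned forecast probabilities. The identity is exact
    when forecasts within each bin are constant."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    base = y.mean()                       # climatological base rate
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    rel = res = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            w = mask.mean()               # bin weight n_b / n
            obs = y[mask].mean()          # observed frequency in bin
            rel += w * (p[mask].mean() - obs) ** 2
            res += w * (obs - base) ** 2
    unc = base * (1 - base)
    return rel, res, unc
```

A perfectly calibrated but weakly resolving model (a Sloth) has reliability near zero and small resolution; a Bull has the opposite profile.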
Original Abstract
The Brier score conflates two distinct properties of probabilistic predictions: reliability (calibration error) and resolution (discriminatory power). We introduce the Manokhin Probability Matrix, a BCG-style two-dimensional diagnostic framework that separates them. Classifiers are placed on a 2x2 grid by Spiegelhalter Z-statistic and AUC-ROC expected rank, then assigned to one of four archetypes: Eagle (good on both axes), Bull (strong discrimination, poor calibration), Sloth (well-calibrated, weak discriminator), and Mole (poor on both). Each archetype carries a distinct prescription. We populate the matrix from a large-scale empirical study spanning 21 classifiers, 5 post-hoc calibrators, and 30 real-world binary classification tasks from the TabArena-v0.1 suite. The assignment is unambiguous. CatBoost, TabICL, EBM, TabPFN, GBC, and Random Forest are Eagles. XGBoost, LightGBM, and HGB are Bulls; Venn-Abers calibration cuts log-loss by 6.5 to 12.6% on Bulls but degrades Eagles by 2.1%. SVM, LR, LDA, and the empirical base-rate predictor are Sloths. MLP, KNN, Naive Bayes, and ExtraTrees are Moles. A theoretical asymmetry follows: no order-preserving post-hoc calibrator can add discriminatory power (Proposition 1), so calibration is the fixable part and discrimination is the hard part. The practical rule is direct: do not optimise aggregate Brier score without first decomposing it; optimise discrimination first, then fix calibration post-hoc. Code and raw experimental data are available at https://github.com/valeman/classifier_calibration.
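Proposition 1 (no order-preserving calibrator can add discriminatory power) is easy to check numerically: any strictly increasing transform of the scores leaves every pairwise ranking, and hence the AUC, unchanged. The sigmoid map below stands in for a generic monotone calibrator, and the synthetic data is purely illustrative.

```python
import numpy as np

def auc_roc(y, p):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
# scores correlated with the label but deliberately miscalibrated
p = np.clip(rng.normal(0.35 + 0.3 * y, 0.15), 0.01, 0.99)

# a strictly increasing recalibration map preserves the ranking exactly
recal = 1.0 / (1.0 + np.exp(-(4.0 * p - 2.0)))
assert np.isclose(auc_roc(y, p), auc_roc(y, recal))
```

The same invariance holds for Platt scaling, isotonic regression on distinct scores, and any other monotone map, which is why such calibrators can only fix reliability, never resolution.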