Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation
TLDR
This paper reveals that common semi-simulated benchmarks and counterfactual metrics for treatment effect estimation don't align with real-world performance.
Key contributions
- Counterfactual metrics do not reliably select the same estimators that observable metrics prefer, even on the same semi-simulated benchmarks.
- Model rankings from semi-simulated benchmarks do not transfer to real-world datasets.
- Simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models (see the sketch after this list).
- The authors suggest incorporating observable metrics and real-data validation for more realistic assessment.
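To make "simple meta-learner with a strong base model" concrete, here is a minimal T-learner sketch using gradient boosting as the base learner. The synthetic data, model choice, and hyperparameters are illustrative assumptions on our part, not the paper's benchmark setup.

```python
# Minimal T-learner sketch: fit one outcome model per treatment arm,
# then take the difference of their predictions as the CATE estimate.
# GradientBoostingRegressor stands in for the "strong base model";
# the data below is a toy simulation, not the paper's benchmark.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))            # covariates
t = rng.integers(0, 2, size=n)         # binary treatment assignment
tau = X[:, 0]                          # true (synthetic) individual effect
y = X.sum(axis=1) + t * tau + rng.normal(scale=0.5, size=n)  # observed outcome

# Fit a separate regressor on treated and control units.
model_treated = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
model_control = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])

# CATE estimate: difference of the two arms' predictions.
cate_hat = model_treated.predict(X) - model_control.predict(X)
print("corr(true effect, estimate):", np.corrcoef(tau, cate_hat)[0, 1])
```

The meta-learner's only "causal" ingredient is the arm split; all the modeling power comes from the base learner, which is why a strong off-the-shelf regressor can carry it.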
Why it matters
The study exposes a critical disconnect between academic evaluation practice and the real-world performance of treatment effect models. It highlights the need for more realistic evaluation, using observable metrics and real data, to foster practical progress in causal inference.
Original Abstract
Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but it would benefit from incorporating observable metrics and real-data validation.
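To illustrate the two evaluation regimes the abstract contrasts, here is a hedged sketch of a counterfactual metric (PEHE, which requires the true individual effect and is therefore only computable on (semi-)simulated data) next to an observable ranking metric (a simple Qini-style area, computable from observed outcomes alone). The exact metric formulas and toy data below are our assumptions for illustration; the paper's benchmark protocol may differ.

```python
import numpy as np

def pehe(tau_true, tau_hat):
    # Counterfactual metric: RMSE against the true individual effect,
    # which is only known when outcomes are (semi-)simulated.
    return float(np.sqrt(np.mean((tau_true - tau_hat) ** 2)))

def qini_auc(y, t, score):
    # Observable metric: area under a simple Qini-style uplift curve,
    # computed from observed outcomes y, treatment flags t, and a ranking score.
    order = np.argsort(-score)                 # rank units by predicted uplift
    y, t = y[order], t[order]
    cum_yt = np.cumsum(y * t)                  # cumulative treated outcomes
    cum_yc = np.cumsum(y * (1 - t))            # cumulative control outcomes
    n_t = np.cumsum(t)
    n_c = np.cumsum(1 - t)
    # Incremental gain at each cutoff, rescaling control to the treated count.
    gain = cum_yt - cum_yc * n_t / np.maximum(n_c, 1)
    return float(np.trapz(gain) / len(y))

# Toy check: PEHE is scorable here only because tau is simulated and known;
# the Qini area needs nothing beyond (y, t, score), so it works on real data.
rng = np.random.default_rng(1)
n = 1000
tau = rng.normal(size=n)                       # true ITE (simulation only)
score = tau + rng.normal(scale=0.5, size=n)    # a noisy effect estimate
t = rng.integers(0, 2, size=n)
y = t * tau + rng.normal(size=n)               # observed outcome
print("PEHE (counterfactual):", pehe(tau, score))
print("Qini AUC (observable):", qini_auc(y, t, score))
```

The paper's first finding is, in these terms, that models ranked best by `pehe` are not reliably the ones ranked best by observable metrics like `qini_auc`, even on the same semi-simulated data.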