Generative Synthetic Data for Causal Inference: Pitfalls, Remedies, and Opportunities
TLDR
Fully generative synthetic data can distort causal effects even when its predictive fidelity is high; this paper proposes a hybrid framework that better preserves the average treatment effect (ATE).
Key contributions
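- Shows that fully generative tabular synthesizers (GAN- and LLM-based) can substantially distort causal estimands such as the ATE despite strong train-on-synthetic-test-on-real performance.
- Formalizes this failure through sensitivity and tradeoff results: preserving the ATE requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression.
- Proposes a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms.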
- Develops a synthetic simulation engine for pre-analysis estimator evaluation under realistic covariate structure.
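As a rough illustration of the hybrid idea in the third contribution, the sketch below synthesizes covariates on their own, fits treatment and outcome models separately on the real data, and then assembles (W, A, Y) triplets. Everything here is an assumption made for illustration: the toy data-generating process with a true ATE of 2.0, the bootstrap-plus-jitter stand-in for a covariate synthesizer, the linear and logistic nuisance models, and the helper names `fit_logistic` and `or_ate`. It is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, a, iters=25):
    """Logistic regression via Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (a - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def or_ate(W, A, Y):
    """Outcome-regression ATE from a linear model Y ~ 1 + A + W (coefficient on A)."""
    X = np.column_stack([np.ones(len(A)), A, W])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[1]

# Toy "real" data (assumed process, true ATE = 2.0).
n = 5000
W = rng.normal(size=(n, 2))
A = rng.binomial(1, sigmoid(0.5 * W[:, 0] - 0.5 * W[:, 1]))
Y = 2.0 * A + W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)

# Step 1: synthesize covariates only. Bootstrap + small Gaussian jitter is a
# placeholder for whatever covariate synthesizer is actually used.
idx = rng.integers(0, n, size=n)
W_syn = W[idx] + rng.normal(scale=0.1, size=(n, 2))

# Step 2: learn the nuisance models separately on the real data.
ones = np.ones(n)
beta_g = fit_logistic(np.column_stack([ones, W]), A)      # treatment mechanism
Xq = np.column_stack([ones, A, W])
beta_q, *_ = np.linalg.lstsq(Xq, Y, rcond=None)           # outcome regression
resid_sd = np.std(Y - Xq @ beta_q)

# Step 3: assemble synthetic (W, A, Y) triplets from the fitted mechanisms.
g_syn = sigmoid(np.column_stack([ones, W_syn]) @ beta_g)
A_syn = rng.binomial(1, g_syn)
Y_syn = (np.column_stack([ones, A_syn, W_syn]) @ beta_q
         + rng.normal(scale=resid_sd, size=n))

ate_real = or_ate(W, A, Y)
ate_syn = or_ate(W_syn, A_syn, Y_syn)
print(f"ATE on real data:      {ate_real:.3f}")
print(f"ATE on synthetic data: {ate_syn:.3f}")
```

Because the treatment-effect contrast is carried by the separately fitted outcome model rather than by a joint generator, the synthetic ATE in this toy setup tracks the real-data estimate closely. With misspecified nuisance models or a poor covariate synthesizer, that would not be guaranteed.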
Why it matters
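Synthetic data is attractive for privacy-preserving release, augmentation, and simulation, but fully generative synthesizers can distort causal effects. This paper identifies these pitfalls and offers a hybrid framework that substantially improves ATE preservation relative to fully generative baselines. It also provides a simulation engine for comparing estimators before running a causal analysis.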
Original Abstract
Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic-test-on-real performance while substantially distorting causal estimands such as the average treatment effect (ATE). We formalize this failure through sensitivity and tradeoff results showing that ATE preservation requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression. Motivated by this observation, we propose a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms, using distance-to-closest-record diagnostics to monitor covariate synthesis and separately learned nuisance models to construct (W, A, Y) triplets. We further study targeted synthetic augmentation for practical positivity problems and characterize when added overlap support helps by improving conditional-effect estimation more than it shifts the covariate distribution. Finally, we develop a synthetic simulation engine for pre-analysis estimator evaluation, enabling finite-sample comparison of OR, IPW, AIPW, and TMLE under realistic covariate structure. Across experiments, hybrid synthetic data substantially improve ATE preservation relative to fully generative baselines and provide a practical diagnostic tool for robust causal analysis.
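The pre-analysis estimator comparison described in the abstract can be sketched as a small Monte Carlo study. The example below is a minimal illustration under stated assumptions, not the paper's simulation engine: the data come from a toy Gaussian process with a known ATE of 1.0 instead of synthesized covariates, the nuisance models are a linear outcome regression and a Newton-fitted logistic propensity model, propensities are clipped to [0.01, 0.99], and TMLE is omitted for brevity. The names `simulate`, `estimators`, and `fit_logistic` are hypothetical helpers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, a, iters=25):
    """Logistic regression via Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (a - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def simulate(n, rng):
    """Toy data-generating process with a constant treatment effect of 1.0."""
    W = rng.normal(size=(n, 2))
    A = rng.binomial(1, sigmoid(0.6 * W[:, 0] - 0.4 * W[:, 1]))
    Y = 1.0 * A + W[:, 0] + 0.5 * W[:, 1] + rng.normal(size=n)
    return W, A, Y

def estimators(W, A, Y):
    """Return OR, IPW, and AIPW estimates of the ATE on one dataset."""
    n = len(A)
    ones, zeros = np.ones(n), np.zeros(n)
    # Outcome regression: predict Y under A=1 and A=0 for every unit.
    bq, *_ = np.linalg.lstsq(np.column_stack([ones, A, W]), Y, rcond=None)
    Q1 = np.column_stack([ones, ones, W]) @ bq
    Q0 = np.column_stack([ones, zeros, W]) @ bq
    # Propensity score, clipped to avoid extreme inverse weights.
    Xg = np.column_stack([ones, W])
    g = np.clip(sigmoid(Xg @ fit_logistic(Xg, A)), 0.01, 0.99)
    return {
        "OR": np.mean(Q1 - Q0),
        "IPW": np.mean(A * Y / g - (1 - A) * Y / (1 - g)),
        "AIPW": np.mean(Q1 - Q0 + A * (Y - Q1) / g - (1 - A) * (Y - Q0) / (1 - g)),
    }

rng = np.random.default_rng(1)
true_ate, reps, n = 1.0, 200, 500
draws = [estimators(*simulate(n, rng)) for _ in range(reps)]

summary = {}
for name in ["OR", "IPW", "AIPW"]:
    est = np.array([d[name] for d in draws])
    summary[name] = {"bias": est.mean() - true_ate, "sd": est.std(ddof=1)}
    print(f"{name:5s} bias={summary[name]['bias']:+.3f}  sd={summary[name]['sd']:.3f}")
```

Replacing `simulate` with a generator built from synthesized covariates and fitted nuisance models would bring the comparison closer to the "realistic covariate structure" setting the abstract describes.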