ArXiv TLDR

Improving Machine Learning Performance with Synthetic Augmentation

arXiv: 2604.14498

Mel Sohm, Charles Dezons, Sami Sellami, Oscar Ninou, Axel Pincon

cs.AI · cs.LG · stat.ML

TLDR

This paper formalizes synthetic data augmentation as a shift of the effective training distribution, revealing a structural bias-variance trade-off: augmentation helps only variance-dominant financial ML tasks and degrades performance in bias-dominant ones.

Key contributions

  • Formalizes synthetic augmentation as modifying the effective training distribution.
  • Identifies a structural bias-variance trade-off induced by synthetic data augmentation.
  • Introduces a size-matched null augmentation and a block permutation test for robust inference (sketched in code after this list).
  • Demonstrates that synthetic data benefits variance-dominant tasks but degrades performance in bias-dominant settings.
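
The block permutation test in the third bullet can be made concrete with a short sketch. The code below is an illustrative reading, not the paper's implementation: it sign-flips contiguous blocks of the paired loss differential between a baseline model and an augmented model, which preserves within-block temporal dependence under the null and yields a finite-sample p-value.

```python
# Hedged sketch of a block permutation test for a paired loss differential.
# Block sign-flipping is one standard non-parametric construction that stays
# valid under weak temporal dependence; the paper's exact test may differ.
import numpy as np

def block_permutation_pvalue(loss_base, loss_aug, block_len=20, n_perm=2000, seed=0):
    """Two-sided p-value for H0: augmentation does not change expected loss.

    loss_base, loss_aug : per-period test losses of the baseline and the
        augmented model on the same evaluation sequence (equal length).
    block_len : block size; signs are flipped jointly within a block so that
        within-block dependence is preserved under the null.
    """
    d = np.asarray(loss_aug) - np.asarray(loss_base)  # paired differential
    n = len(d)
    n_blocks = int(np.ceil(n / block_len))
    observed = d.mean()

    rng = np.random.default_rng(seed)
    null_stats = np.empty(n_perm)
    for i in range(n_perm):
        # One random +/-1 sign per block, broadcast over its observations.
        signs = rng.choice([-1.0, 1.0], size=n_blocks)
        null_stats[i] = (d * np.repeat(signs, block_len)[:n]).mean()

    # Finite-sample p-value that counts the observed statistic itself.
    return (1 + np.sum(np.abs(null_stats) >= abs(observed))) / (n_perm + 1)
```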

Why it matters

This research clarifies the statistical role of synthetic data augmentation in financial machine learning. By formalizing the bias-variance trade-off that augmentation induces, it offers practical guidance on when synthetic data helps and when it hurts, preventing performance degradation in critical applications.

Original Abstract

Synthetic augmentation is increasingly used to mitigate data scarcity in financial machine learning, yet its statistical role remains poorly understood. We formalize synthetic augmentation as a modification of the effective training distribution and show that it induces a structural bias-variance trade-off: while additional samples may reduce estimation error, they may also shift the population objective whenever the synthetic distribution deviates from regions relevant under evaluation. To isolate informational gains from mechanical sample-size effects, we introduce a size-matched null augmentation and a finite-sample, non-parametric block permutation test that remains valid under weak temporal dependence. We evaluate this framework in both controlled Markov-switching environments and real financial datasets, including high-frequency option trade data and a daily equity panel. Across generators spanning bootstrap, copula-based models, variational autoencoders, diffusion models, and TimeGAN, we vary augmentation ratio, model capacity, task type, regime rarity, and signal-to-noise ratio. We show that synthetic augmentation is beneficial only in variance-dominant regimes, such as persistent volatility forecasting, while it degrades performance in bias-dominant settings, including near-efficient directional prediction. Rare-regime targeting can improve domain-specific metrics but may conflict with unconditional permutation inference. Our results provide a structural perspective on when synthetic data improves financial learning performance and when it induces persistent distributional distortion.
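
The size-matched null augmentation described in the abstract is the key control: a generator's informational value is judged against adding the same number of rows resampled from the real data, so mechanical sample-size effects cancel. The sketch below illustrates that idea under an assumed block-resampling scheme; the function name and details are hypothetical, not the paper's code.

```python
# Hedged sketch of size-matched null augmentation: pair every synthetic-
# augmented training set with a null set of identical size built from the
# real data itself. Contiguous blocks are resampled so the null respects
# the weak temporal dependence the paper's inference is designed around.
import numpy as np

def augmented_and_null_sets(X_real, y_real, X_synth, y_synth, block_len=20, seed=0):
    """Return (augmented, null) training sets of identical size."""
    rng = np.random.default_rng(seed)
    n_add = len(X_synth)

    # Draw random contiguous blocks of real rows until n_add rows are added.
    n_blocks = int(np.ceil(n_add / block_len))
    starts = rng.integers(0, len(X_real) - block_len, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n_add]

    X_aug = np.concatenate([X_real, X_synth])    # real + generator output
    y_aug = np.concatenate([y_real, y_synth])
    X_null = np.concatenate([X_real, X_real[idx]])  # real + resampled real
    y_null = np.concatenate([y_real, y_real[idx]])
    return (X_aug, y_aug), (X_null, y_null)
```

Training one model on each set and comparing their test losses (for example with the permutation test sketched earlier) then isolates what the generator adds beyond sheer sample count.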

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.