Comparative analysis of missing data imputation methods for CSST survey: Impact on photometric redshift estimation performance
Ling Wang, Zhu Chen, Zhijian Luo, Liping Fu, Zuhui Fan + 40 more
TLDR
Evaluates imputation methods to improve photometric redshift accuracy amid missing data in CSST surveys.
Key contributions
- Benchmarked representative ML and DL imputation methods, then applied the two leading performers (KNN and SAITS) to CSST mock data for photo-z estimation.
- KNN excels under ideal MCAR conditions with complete training data.
- SAITS outperforms KNN with incomplete training or mixed missingness scenarios.
- General imputation models are detrimental on MNAR data arising from flux limits; the authors call for architectures that separate stochastic missingness from physical non-detections.
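The difference between the missingness mechanisms named in the list above can be illustrated with a small sketch. It is not taken from the paper: the magnitude distribution, the 30% missing rate, the limiting magnitude of 24.5, and all function names are assumptions chosen for illustration. It shows that values missing completely at random leave the observed mean roughly unchanged, while a flux limit removes the faint objects and biases the observed sample, which is why the missing entries themselves carry physical information.

```python
import random
import statistics

def simulate_magnitudes(n, seed=0):
    """Draw toy magnitudes from a Gaussian (assumed distribution, not CSST data)."""
    rng = random.Random(seed)
    return [rng.gauss(24.0, 1.0) for _ in range(n)]

def mask_mcar(values, missing_rate, seed=1):
    """MCAR: each value is dropped with the same probability, independent of its size."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_rate else v for v in values]

def mask_mnar_flux_limit(values, limiting_mag):
    """MNAR: objects fainter than the limit (larger magnitude) become non-detections."""
    return [None if v > limiting_mag else v for v in values]

def observed_mean(values):
    """Mean over the entries that were actually observed."""
    return statistics.fmean(v for v in values if v is not None)

mags = simulate_magnitudes(20000)
true_mean = statistics.fmean(mags)
mcar_mean = observed_mean(mask_mcar(mags, 0.3))
mnar_mean = observed_mean(mask_mnar_flux_limit(mags, 24.5))

print(f"true mean:          {true_mean:.3f}")
print(f"MCAR observed mean: {mcar_mean:.3f}")
print(f"MNAR observed mean: {mnar_mean:.3f}")
```

In this toy setup the MNAR sample is brighter on average than the full population, so any imputer trained only on detected values would tend to fill non-detections with magnitudes that are too bright.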
Why it matters
Accurate photometric redshifts are vital for cosmology but are degraded by missing photometric bands. This study guides method choice under realistic conditions, showing that both a mismatch between training and testing missingness patterns (domain shift) and the missingness mechanism itself affect photo-z performance.
Original Abstract
Improving the accuracy of photometric redshifts (photo-$z$) is essential for reliable statistical studies of cosmology and galaxy evolution. However, missing photometric bands are a common observational challenge that can significantly degrade photo-$z$ estimation accuracy. In this work, we present a systematic evaluation of data imputation methods aimed at improving photo-$z$ performance. We benchmark a range of representative machine learning (ML) and deep learning (DL) architectures, identifying k-nearest neighbors (KNN) and the attention-based SAITS model as the leading performers. These models are then applied to China Space Station Survey Telescope (CSST) mock data to assess their performance under realistic observational conditions. Our results show that KNN yields the highest accuracy under idealized missing completely at random (MCAR) conditions with complete training sets, whereas robustness tests reveal that SAITS significantly outperforms KNN when training data is incomplete or when applied to realistic mixed-mechanism scenarios. We find that domain consistency between training and testing missingness patterns is a prerequisite for optimal performance, highlighting the risks of domain shift in supervised regression tasks. Furthermore, our analysis demonstrates that while general imputation models are highly effective for MCAR and missing at random (MAR) data, they are detrimental when applied to missing not at random (MNAR) data arising from flux limits, as statistical models fail to capture the physical information inherent in these non-detections. Consequently, we advocate for more sophisticated architectures capable of disentangling stochastic missingness from physical non-detections to address these distinct mechanisms individually.
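As a rough illustration of the idealized setting described in the abstract (a complete training set and a test object with one missing band), the following is a minimal sketch of KNN imputation. It is not the authors' pipeline: the three-band toy catalogue, the noise level, the choice of k = 5, and the helper names `make_catalogue` and `knn_impute` are all assumptions made for this example.

```python
import math
import random

def make_catalogue(n, seed=0):
    """Toy three-band catalogue: bands are correlated through a magnitude m and a colour c."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        m = rng.uniform(20.0, 25.0)
        c = rng.uniform(0.0, 1.0)
        rows.append([m + rng.gauss(0, 0.02),
                     m - c + rng.gauss(0, 0.02),
                     m - 2 * c + rng.gauss(0, 0.02)])
    return rows

def knn_impute(row, train, k=5):
    """Fill None entries of `row` with the mean of the k nearest complete training rows.

    Distances use only the bands that are observed in `row`.
    """
    observed = [i for i, v in enumerate(row) if v is not None]
    if not observed:
        raise ValueError("row has no observed bands to match on")

    def dist(t):
        return math.sqrt(sum((row[i] - t[i]) ** 2 for i in observed))

    neighbours = sorted(train, key=dist)[:k]
    return [v if v is not None else sum(t[i] for t in neighbours) / len(neighbours)
            for i, v in enumerate(row)]

train = make_catalogue(2000)          # complete training set
test_row = [22.0, 21.5, None]         # third band missing; noiseless value would be 21.0
filled = knn_impute(test_row, train, k=5)
print(filled)
```

The sketch relies on the missing band being predictable from the observed ones, which is reasonable when the value is missing at random. It does not address flux-limited non-detections, where the abstract reports that this kind of statistical imputation is detrimental.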