ArXiv TLDR

Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes

2604.19722

Jake Lee

cs.LG, cs.AI

TLDR

Adaptive MSD-Splitting (AMSD) improves C4.5 and Random Forests by adapting bin widths to feature skewness, yielding a 2-4% accuracy gain over standard MSD-Splitting while keeping its near-O(N) split cost.

Key contributions

  • Introduces Adaptive MSD-Splitting (AMSD) for robust discretization of skewed continuous attributes.
  • Dynamically adjusts binning intervals based on feature skewness, preserving discriminative resolution.
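  • Integrates AMSD into Random Forests (RF-AMSD), reaching state-of-the-art accuracy at a fraction of the standard computational cost.
  • Yields a 2-4% accuracy improvement over standard MSD-Splitting while keeping near-identical O(N) time complexity, compared with O(N log N) for exhaustive search.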
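
The binning idea in the contributions above can be sketched as follows. This is a minimal illustration, not the paper's method: the summary does not give the exact adjustment formula, so the rule `k = base_k / (1 + |skewness|)` and all function names here are assumptions, as is the choice of two cut points at mean ± k·std. The sketch also narrows the interval symmetrically around the mean, which is a simplification of "narrowing intervals in dense regions".

```python
import numpy as np

def msd_cut_points(x, k=1.0):
    """Fixed cut points at mean - k*std and mean + k*std (assumed layout)."""
    mu, sigma = np.mean(x), np.std(x)
    return np.array([mu - k * sigma, mu + k * sigma])

def sample_skewness(x):
    """Third standardized moment (population form); 0.0 for constant data."""
    mu, sigma = np.mean(x), np.std(x)
    if sigma == 0:
        return 0.0
    return float(np.mean(((x - mu) / sigma) ** 3))

def adaptive_cut_points(x, base_k=1.0):
    """Hypothetical adaptive rule: shrink the multiplier as |skewness| grows.

    NOTE: this specific formula is an assumption for illustration only.
    """
    k = base_k / (1.0 + abs(sample_skewness(x)))
    return msd_cut_points(x, k=k)

def discretize(x, cuts):
    """Map each value to a bin index using the given ascending cut points."""
    return np.digitize(x, cuts)

# A small right-skewed sample: most values are low, one is far out.
skewed = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0])
fixed_bins = discretize(skewed, msd_cut_points(skewed))
adaptive_bins = discretize(skewed, adaptive_cut_points(skewed))
```

On this sample the fixed one-standard-deviation cutoffs place all six low values in the same bin, while the narrower adaptive interval separates them into two bins. For symmetric data the skewness is zero and both rules give the same cut points.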

Why it matters

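Discretizing continuous attributes is a computational bottleneck in decision tree induction, and fixed one-standard-deviation cutoffs can lose information on skewed data, which is common in real-world biomedical and financial datasets. AMSD adapts bin widths to skewness, improving accuracy over standard MSD-Splitting while keeping its near-linear cost. This matters for large-scale machine learning, particularly ensemble methods such as Random Forests.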

Original Abstract

The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique -- which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm -- we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical O(N) time complexity reductions compared to the O(N log N) exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.
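
To make the complexity contrast in the abstract concrete, here is a hedged sketch of why statistical binning can run in O(N) while a sort-based threshold search needs O(N log N). The function names are hypothetical and the code is not taken from the paper. It computes mean, standard deviation, and skewness in a single pass from raw power sums, and contrasts that with the candidate thresholds of an exhaustive search, which requires sorting the values first.

```python
import math

def one_pass_moments(values):
    """Return (mean, std, skewness) from one pass over the data, O(N).

    Uses raw power sums, which is simple but can be numerically
    unstable for large magnitudes; a sketch, not production code.
    """
    n = 0
    s1 = s2 = s3 = 0.0
    for v in values:
        n += 1
        s1 += v
        s2 += v * v
        s3 += v * v * v
    if n == 0:
        raise ValueError("values must be non-empty")
    mean = s1 / n
    var = max(s2 / n - mean * mean, 0.0)
    std = math.sqrt(var)
    if std == 0.0:
        return mean, 0.0, 0.0
    # E[(X - mu)^3] = E[X^3] - 3*mu*E[X^2] + 2*mu^3
    third = s3 / n - 3.0 * mean * (s2 / n) + 2.0 * mean ** 3
    return mean, std, third / std ** 3

def exhaustive_thresholds(values):
    """Candidate thresholds of a sort-based search: midpoints between
    adjacent distinct values. The sort makes this O(N log N)."""
    u = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(u, u[1:])]

mean, std, skew = one_pass_moments([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0])
candidates = exhaustive_thresholds([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 10.0])
```

The statistical approach needs only the three moments to place its cut points, so the number of thresholds to score stays constant per attribute, whereas the exhaustive search scores one candidate per gap between distinct values.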
