Model-agnostic information transfer and fusion for classification with label noise
Zhu Guojun, Zhang Sanguo, Ren Mingyang
TLDR
This paper introduces a model-agnostic nonparametric framework to effectively classify data with label noise by purifying large noisy datasets using small clean ones.
Key contributions
- Tackles label noise in large datasets by combining them with small, expert-verified clean datasets.
- Introduces a generic, model-agnostic nonparametric framework for robust classification.
- Uses clean data to "purify" noisy samples and effectively manage remaining ambiguities.
- Backed by rigorous statistical theory and evaluated through simulations and a real-world medical image analysis task (pneumonia diagnosis).
Why it matters
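Label noise is a major obstacle in machine learning, particularly when a large, automatically labeled dataset must be combined with a small, expert-verified one. Existing statistical transfer learning methods assume parametric similarity between the two data sources and rely on parametric models; the distribution shift between noisy and clean data violates the first assumption, and the second is ill-suited to complex data such as images. The model-agnostic nonparametric framework proposed here is designed to avoid both restrictions, which matters in domains like medical imaging where expert-verified labels are available only in small quantities.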
Original Abstract
Label noise presents a fundamental challenge in modern machine learning, especially when large-scale datasets are generated via automated processes. An increasingly common and important data paradigm, particularly in domains like medical imaging, involves learning from a large dataset with coarse, noisy labels supplemented by a small, expert-verified, clean dataset. This setting constitutes a typical information transfer and fusion problem. However, the significant distribution shift between the noisy and clean data violates the core overall parametric similarity assumptions of existing statistical transfer learning methods, while their reliance on parametric models is ill-suited for complex data like images. To address these limitations, this paper develops a generic model-agnostic nonparametric framework for classification with label noise, which applies to a broad class of classifiers. Our approach leverages the small clean dataset to "purify" the large noisy one and carefully manages the remaining ambiguous samples. This framework is underpinned by a rigorous statistical theory. Its empirical performance is demonstrated through simulations and a real-world application to medical image analysis for pneumonia diagnosis.
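Illustrative sketch
The abstract does not spell out the purification procedure, so the following is only a minimal illustration of the general idea of using a small clean set to "purify" a large noisy one and set aside ambiguous samples. It is not the authors' algorithm. The k-nearest-neighbour scorer, the function names `knn_proba` and `purify`, the threshold `tau`, and the synthetic two-cluster data are all assumptions made for this example.

```python
import numpy as np

def knn_proba(X_ref, y_ref, X_query, k=5, n_classes=2):
    """Estimate class probabilities as label frequencies among the
    k nearest reference (clean) points. Hypothetical helper."""
    # Pairwise squared Euclidean distances, shape (n_query, n_ref).
    d2 = ((X_query[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest clean points
    votes = y_ref[nn]                    # their labels, shape (n_query, k)
    return np.stack([(votes == c).mean(axis=1) for c in range(n_classes)], axis=1)

def purify(X_clean, y_clean, X_noisy, y_noisy, k=5, tau=0.8, n_classes=2):
    """Split noisy samples into keep / relabel / ambiguous using a
    nonparametric scorer fitted on the clean set (assumed scheme)."""
    proba = knn_proba(X_clean, y_clean, X_noisy, k, n_classes)
    pred = proba.argmax(axis=1)
    confident = proba.max(axis=1) >= tau
    keep = confident & (pred == y_noisy)      # clean data agrees with the noisy label
    relabel = confident & (pred != y_noisy)   # clean data confidently disagrees
    ambiguous = ~confident                    # left for separate handling
    y_purified = np.where(relabel, pred, y_noisy)
    return y_purified, keep, relabel, ambiguous

# Synthetic example: two well-separated clusters, 30% of noisy labels flipped.
rng = np.random.default_rng(0)
X_clean = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
y_clean = np.repeat([0, 1], 20)
X_noisy = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(5, 0.5, (100, 2))])
y_true = np.repeat([0, 1], 100)
flip = rng.random(200) < 0.3
y_noisy = np.where(flip, 1 - y_true, y_true)

y_purified, keep, relabel, ambiguous = purify(X_clean, y_clean, X_noisy, y_noisy)
print("relabelled:", int(relabel.sum()), "ambiguous:", int(ambiguous.sum()))
```

In this toy setting the clusters are far apart, so every flipped label is confidently corrected and nothing is marked ambiguous. With overlapping classes or distribution shift between the clean and noisy data, the ambiguous set would be non-empty, and how to treat it is exactly the part the paper addresses with its own method.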