Feature Dimensionality Outweighs Model Complexity in Breast Cancer Subtype Classification Using TCGA-BRCA Gene Expression Data

May 7, 2026 · arXiv:2605.06562

cs.LG, q-bio.GN

TLDR

This study finds that the number of selected features matters more than model complexity for breast cancer subtype classification on TCGA-BRCA gene expression data, with logistic regression giving the most stable and balanced results across subtypes.

Key contributions

Evaluated model complexity and feature selection for breast cancer subtype classification.
Compared Logistic Regression, Random Forest, and SVM on TCGA-BRCA gene expression data.
Logistic Regression achieved stable, balanced performance, improving rare subtype detection.
Random Forest struggled with minority subtypes; SVM was sensitive to feature count.

Why it matters

For high-dimensional, small-sample biological data, the paper indicates that a simple model such as logistic regression, paired with careful feature selection, can be a better choice than more complex models. It also shows why the evaluation metric matters: all three models reached high accuracy, but macro F1 exposed large differences in how well each one handled rare subtypes. The practical takeaway is to tune feature dimensionality before reaching for a more complex model.
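The gap between accuracy and macro F1 can be illustrated with a minimal sketch. The labels below are hypothetical toy data (not TCGA-BRCA results), and the helper names `accuracy` and `macro_f1` are chosen here for illustration. The example assumes a classifier that always predicts the common class.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels: 90 samples of a common subtype, 10 of a rare one.
y_true = ["common"] * 90 + ["rare"] * 10
# A degenerate classifier that never predicts the rare subtype.
y_pred = ["common"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 0.900
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.474
```

Accuracy stays at 0.900 even though the rare class is missed entirely, while macro F1 drops to about 0.474 because the rare class contributes an F1 of zero to the average.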

Original Abstract

Accurate classification of breast cancer subtypes from gene expression data is critical for diagnosis and treatment selection. However, such datasets are characterized by high dimensionality and limited sample size, posing challenges for machine learning models. In this study, we evaluate the impact of model complexity and feature selection on subtype classification performance using TCGA-BRCA gene expression data. Logistic regression, random forest, and support vector machine (SVM) models were trained using varying numbers of highly variable genes (50 to 20,518). Performance was evaluated using stratified 5-fold cross-validation and assessed with accuracy and macro F1 score. While all models achieved high accuracy, macro F1 analysis revealed substantial differences in subtype-level performance. Logistic regression demonstrated the most stable and balanced performance across subtypes, including improved detection of rare classes. Random forest underperformed on minority subtypes despite strong overall accuracy, while SVM showed sensitivity to feature dimensionality. These findings highlight the importance of model simplicity, evaluation metrics, and feature selection in high-dimensional biological classification tasks.
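Two steps named in the abstract, selecting highly variable genes and stratified 5-fold cross-validation, can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the abstract does not state the variance measure used or whether gene selection was repeated inside each fold. The function names, gene names, and data below are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression, k):
    """Return the k gene names with the highest variance across samples.

    expression: dict mapping gene name -> list of expression values,
    one value per sample (a hypothetical layout for this sketch).
    """
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:k]


def stratified_folds(labels, n_folds=5):
    """Assign sample indices to folds so each fold keeps the class proportions.

    Samples are dealt round-robin within each class. Library implementations
    usually shuffle first; shuffling is omitted here to keep the sketch
    deterministic.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical toy expression matrix: 3 genes measured in 4 samples.
expression = {
    "GENE_A": [1, 1, 1, 1],  # constant, carries no signal
    "GENE_B": [0, 5, 0, 5],  # highest variance
    "GENE_C": [2, 3, 2, 3],  # modest variance
}
selected = top_variable_genes(expression, k=2)

# Hypothetical labels: 10 samples of a common subtype, 5 of a rare one.
labels = ["common"] * 10 + ["rare"] * 5
folds = stratified_folds(labels, n_folds=5)
```

With these toy inputs, `selected` holds `GENE_B` and `GENE_C`, and each of the five folds contains two common samples and one rare sample. In a real pipeline, ranking genes using only the training portion of each fold avoids leaking information from the held-out samples.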


