An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing

April 4, 2026 | arXiv:2604.04981

Philipp Röchner, Clarissa Krämer, Johannes U Mayer, Franz Rothlauf, Steffen Albrecht + 1 more

q-bio.GN, cs.LG, cs.NE

TLDR

A new imbalanced dataset with two distinct feature representations is introduced to improve quality control of next-generation sequencing data.

Key contributions

Dataset derived from 37,491 NGS samples for quality control studies.
Includes two feature types: 34 QC-tool features (QC-34) and a variable number of ENCODE blocklist (BL) features, ranging from 8 to 1,183.
Provides binary quality labels from automated QC and domain experts for supervised learning.
Enables studying how different feature types and granularities impact quality problem detection.

Why it matters

Existing NGS repositories lack sufficient quality-related features for automated QC tool development. This dataset fills that gap by providing comprehensive feature representations, enabling researchers to develop and compare robust quality control algorithms. It facilitates understanding how different feature types impact quality problem detection.

Original Abstract

Next-generation sequencing (NGS) is a key technique for studying the DNA and RNA of organisms. However, identifying quality problems in NGS data across different experimental settings remains challenging. To develop automated quality-control tools, researchers require datasets with features that capture the characteristics of quality problems. Existing NGS repositories, however, offer only a limited number of quality-related features. To address this gap, we propose a dataset derived from 37,491 NGS samples with two types of quality-related feature representations. The first type consists of 34 features derived from quality control tools (QC-34 features). The second type has a variable number of features ranging from eight to 1,183. These features were derived from read counts in problematic genomic regions identified by the ENCODE blocklist (BL features). All features describe the same human and mouse samples from five genomic assays, allowing direct comparison of feature representations. The proposed dataset includes a binary quality label, derived from automated quality control and domain experts. Among all samples, 3.2% are of low quality. Supervised machine learning algorithms accurately predicted quality labels from the features, confirming the relevance of the provided feature representations. The proposed feature representations enable researchers to study how different feature types (QC-34 vs. BL features) and granularities (varying number of BL features) affect the detection of quality problems.
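
To make the class imbalance concrete, the following minimal sketch works through the arithmetic implied by the two figures in the abstract (37,491 samples, 3.2% low quality). The function names are hypothetical and chosen for illustration only. Balanced accuracy is used here as one common imbalance-aware metric; the abstract does not state which metrics the authors report. Because the 3.2% share is rounded in the source, the derived counts are approximate.

```python
# Figures taken from the abstract. The low-quality share is rounded there,
# so the derived class counts are approximations, not reported values.
N_SAMPLES = 37_491
LOW_QUALITY_SHARE = 0.032


def class_counts(n_samples: int, minority_share: float) -> tuple[int, int]:
    """Return approximate (low_quality, high_quality) sample counts."""
    low = round(n_samples * minority_share)
    return low, n_samples - low


def majority_baseline_accuracy(minority_share: float) -> float:
    """Accuracy of a classifier that always predicts the majority class."""
    return 1.0 - minority_share


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the recall on the positive and on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


low, high = class_counts(N_SAMPLES, LOW_QUALITY_SHARE)
print(f"approx. low-quality samples:  {low}")
print(f"approx. high-quality samples: {high}")

# A trivial classifier that labels every sample "high quality" ...
baseline = majority_baseline_accuracy(LOW_QUALITY_SHARE)
print(f"plain accuracy of that classifier:    {baseline:.3f}")

# ... detects none of the low-quality samples (tp = 0, fn = low).
trivial = balanced_accuracy(tp=0, fn=low, tn=high, fp=0)
print(f"balanced accuracy of that classifier: {trivial:.3f}")
```

Under these assumptions, roughly 1,200 samples carry the low-quality label. A classifier that never flags a problem would still reach a plain accuracy of about 0.968 but a balanced accuracy of only 0.5, which is why plain accuracy alone says little on a dataset this skewed.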


