Supregraph: Enabling Information-Optimal Assembly Graph Representation of a Read Set
TLDR
Supregraphs offer an information-optimal assembly graph representation, overcoming limitations of de Bruijn and overlap graphs for genome assembly.
Key contributions
- Introduces "supregraphs," a new mathematical model for genome assembly graphs.
- Provides a formal framework for correct read-set-to-graph conversion, assuming error-free reads.
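- Shows that supregraphs can be built by iteratively transforming de Bruijn graphs with the multiplexing procedure previously used in the assemblers LJA and Verkko.
- Demonstrates that, under a set of natural assumptions, supregraphs provide a foundation for constructing theoretically optimal genome assemblies.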
Why it matters
Both standard assembly graph models have known limitations: de Bruijn graphs do not represent all the information contained in the reads, and overlap graphs often introduce artificial contig breaks because contained reads must be discarded first. This paper defines what a correct conversion of an error-free read set into an assembly graph is, and proves that supregraphs satisfy that definition. Under the paper's assumptions, this gives assemblers a formal starting point for constructing theoretically optimal assemblies.
Original Abstract
The first step in any genome assembly algorithm entails the conversion from the domain of strings and overlaps to the language of graphs and paths, typically using one of the two conventional methods: de Bruijn graphs or overlap graphs. However, both standard approaches are known to have limitations. De Bruijn graphs fail to represent complete information from reads, while the overlap graphs often produce artificial breaks in contigs due to the necessity to discard contained reads as a preliminary step. In this work we present a mathematical model for genome assembly that provides a formal framework to determine what constitutes a correct conversion of a read set into an assembly graph under the assumption of error-free reads. We prove that a correct representation of a read set exists in the form of a new class of assembly graphs, which we call supregraphs. We show that supregraphs can be constructed by iteratively transforming de Bruijn graphs using the multiplexing procedure, previously employed in the genome assemblers LJA and Verkko. Finally, we demonstrate that, under a set of natural assumptions, supregraphs provide a foundation for constructing theoretically optimal genome assemblies.
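The claim that de Bruijn graphs fail to represent complete information from reads can be illustrated with a small sketch. It is not the paper's supregraph construction or the multiplexing procedure; the function name `de_bruijn_edges` and the toy reads are assumptions chosen for illustration. Two different error-free read sets produce the same de Bruijn graph at k = 3, so the graph alone cannot tell which reads were observed.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Return the multiset of de Bruijn edges for a read set.

    Each k-mer in a read becomes one edge from its (k-1)-mer prefix
    to its (k-1)-mer suffix. (Hypothetical helper, for illustration only.)
    """
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


# Two read sets that pair the flanks of the shared substring "TGG" differently.
reads_a = ["ATGGC", "CTGGA"]
reads_b = ["ATGGA", "CTGGC"]

# At k = 3 both sets yield identical edges, including multiplicities,
# so the pairing information carried by the reads is lost.
same_at_k3 = de_bruijn_edges(reads_a, 3) == de_bruijn_edges(reads_b, 3)

# At k = 5 each read is a single k-mer, so the graphs differ.
same_at_k5 = de_bruijn_edges(reads_a, 5) == de_bruijn_edges(reads_b, 5)

print(same_at_k3, same_at_k5)
```

Raising k recovers the distinction in this toy case, but a single fixed k cannot do so for every repeat in a real read set, which is the kind of gap a read-aware graph representation is meant to close.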