
dc.contributor.author: Muchina, Peter
dc.date.accessioned: 2022-11-14T08:43:15Z
dc.date.available: 2022-11-14T08:43:15Z
dc.date.issued: 2022-04-11
dc.identifier.other: PARALLELIZING LONG-READ DE NOVO GENOME ASSEMBLERS ON MIXED ARCHITECTURES (CPU/GPU) WITH COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
dc.identifier.other: Peter Kimani Muchina
dc.identifier.uri: http://elibrary.pu.ac.ke/handle/123456789/1006
dc.description.abstract: De novo genome assembly (assembly without a reference genome) is computationally hard when millions to billions of short reads must be assembled into contigs and then into scaffolds. Current de novo assemblers (SGA, SPAdes, Velvet, ABySS, or SOAPdenovo) find overlaps between reads or build de Bruijn graphs on k-mers, but doing so on relatively large genomes (say, tens or hundreds of megabases or more) requires vast amounts of both RAM and CPU cycles. The millions or billions of reads now routinely generated by Next-Generation Sequencing runs (e.g., on Illumina MiSeq, HiSeq, and NovaSeq) have rendered virtually intractable the overlap-layout-consensus (OLC) methods, which were originally designed for Sanger sequencing data, i.e., for relatively few reads with excellent accuracy. With the advent of Next-Generation Sequencing, OLC methods were therefore superseded by de Bruijn graph-based methods. Since the inception of GPU-based computing (where vast arrays of individually slower graphics processors are available, but with high bandwidth to main memory and high parallelism), some avenues have recently been explored for accelerating de Bruijn-based and OLC-based assembly methods on mixed architectures (CPU + GPU). The advent of long-read sequencing technologies (Oxford Nanopore Technologies and Pacific Biosciences) has significantly impacted genomics research. It is now possible to generate reads tens or hundreds of kilobases long (the current record being around 2.4 Mbp for a single read). This revives the applicability of OLC methods, and there is now hope that such methods (which were in full use when the first human reference genome was constructed from Sanger sequencing reads) will give better accuracy than other graph-based methods on long reads.
However, the computational tractability of OLC methods on large datasets still needs to be adequately addressed, and it could benefit significantly from the most recent mixed architectures and the software stacks that enable developers to produce accelerated code. This study explores the development of a new pipeline for accelerated long-read de novo assembly on GPU architectures. The pipeline combines GPU-accelerated software with mainstream CPU-based de novo assembly software: Cudamapper (overlap), Miniasm (layout), and Minipolish (consensus). The overlap step runs on the GPU, while the layout and consensus steps run on the CPU. The pipeline was tested on two GPU-enabled platforms, the Jetson Nano and the GeForce RTX 2070, with ASFV (≈ 190 kb) and Salmonella enterica (≈ 5 Mb) as the test datasets. A quality assessment using QUAST established that the pipeline scales up to the level of other mainstream de novo assemblers such as Flye, Raven, and Redbean. The pipeline is based on the OLC strategy, which gives a more accurate genome than the de Bruijn graph (dBG) approach. In addition, the tests cover a small, portable, and lightweight platform, the Jetson Nano, alongside the GeForce RTX 2070. The study concludes that GPU-based assembly methods are well suited to the genome assembly problem because of their efficiency and accuracy. [en_US]
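The three-stage coupling described in the abstract (overlap on the GPU, then layout and consensus on the CPU) can be sketched as a chain of command lines in which each stage consumes the previous stage's output. This is a minimal illustration, not the thesis's actual scripts: the `Stage` class, `build_pipeline`, `run_pipeline`, and all file names are hypothetical, and the command-line forms shown for `cudamapper`, `miniasm`, and `minipolish` are assumptions that should be checked against each tool's own help output and accepted input formats.

```python
from dataclasses import dataclass
from pathlib import Path
import subprocess


@dataclass(frozen=True)
class Stage:
    name: str     # OLC role of the stage: overlap, layout, or consensus
    device: str   # where the abstract says the stage runs
    argv: list    # command line (assumed form, verify against the tool's help)
    stdout: Path  # file that captures the tool's standard output


def build_pipeline(reads: Path, workdir: Path, threads: int = 4) -> list:
    """Describe the overlap -> layout -> consensus chain without running it."""
    overlaps = workdir / "overlaps.paf"  # hypothetical file names
    layout = workdir / "layout.gfa"
    polished = workdir / "polished.gfa"
    return [
        # Overlap: all-vs-all mapping of the reads against themselves (assumed usage).
        Stage("overlap", "GPU", ["cudamapper", str(reads), str(reads)], overlaps),
        # Layout: build the assembly graph from the reads and their overlaps.
        Stage("layout", "CPU", ["miniasm", "-f", str(reads), str(overlaps)], layout),
        # Consensus: polish the layout graph with the original reads.
        Stage("consensus", "CPU",
              ["minipolish", "-t", str(threads), str(reads), str(layout)], polished),
    ]


def run_pipeline(stages: list) -> None:
    """Run each stage in order, redirecting its standard output to a file."""
    for stage in stages:
        stage.stdout.parent.mkdir(parents=True, exist_ok=True)
        with open(stage.stdout, "w") as out:
            subprocess.run(stage.argv, stdout=out, check=True)
```

Calling `build_pipeline(Path("reads.fastq"), Path("out"))` only returns the stage descriptions; `run_pipeline` would be needed to execute them, and it requires the three tools to be installed and on the `PATH`.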
dc.description.sponsorship: Pwani University [en_US]
dc.language.iso: en [en_US]
dc.publisher: Pwani University [en_US]
dc.subject: COMPUTE UNIFIED DEVICE [en_US]
dc.subject: PARALLELIZING LONG-READ [en_US]
dc.subject: ARCHITECTURE [en_US]
dc.subject: PARALLELIZING LONG-READ DE NOVO GENOME [en_US]
dc.title: PARALLELIZING LONG-READ DE NOVO GENOME ASSEMBLERS ON MIXED ARCHITECTURES (CPU/GPU) WITH COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA) [en_US]
dc.type: Thesis [en_US]

