Oxford Nanopore Technologies (ONT) sequencing has advanced considerably in recent years and is now a key platform in genomics. As the technology has matured, so has the bioinformatics analysis of ONT data: researchers have developed specialized tools and algorithms that exploit its distinctive characteristics, such as long read lengths and ionic current signals. This article explores recent bioinformatics advances in base calling, base modification detection, error correction, assembly, and alignment of ONT data.
The bioinformatics analysis of ONT data typically involves a multi-step pipeline to transform raw electrical signals into meaningful genomic information. The pipeline includes base calling, error correction, alignment, variant calling, and additional steps for specialized analyses, such as detecting modifications and assessing transcriptome complexity.
Workflow of bioinformatics analysis – CD Genomics
Base calling is a fundamental step in ONT data analysis, converting the raw ionic current signals into DNA base sequences. Early versions of base callers had relatively high error rates, hindering downstream analyses. However, with continuous improvements, modern base callers, such as Guppy and Chiron, have significantly enhanced accuracy and now offer real-time base calling capabilities.
Furthermore, ONT technology is uniquely suited to detect epigenetic modifications, such as DNA methylation. Specialized algorithms, including Tombo and DeepSignal, have been developed to identify base modifications by analyzing specific changes in the ionic current signal associated with modified bases. This epigenetic information is crucial for understanding gene regulation and other biological processes.
Bioinformatics pipeline of ONT sequencing. (Lood et al., 2020)
One of the key advantages of ONT sequencing is its ability to detect DNA and RNA modifications directly. Because modified bases shift the ionic current differently from unmodified bases, ONT data can reveal both epigenetic DNA modifications and post-transcriptional RNA modifications. This section covers the methods and tools developed to detect these modifications from ONT sequencing data.
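The idea of telling modified from unmodified bases by their current shift can be illustrated with a minimal sketch. It assumes the signal has already been assigned to reference positions and compares a native sample against a modification-free control using Welch's t-statistic. The function names, the threshold, and the current values are all hypothetical, and this is not the method of any specific tool.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    standard_error = sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


def flag_shifted_positions(native, control, threshold=4.0):
    """Return positions whose current levels differ strongly between samples.

    native and control map a reference position to the list of per-read mean
    current levels (pA) observed there. The threshold is an arbitrary
    illustrative cutoff, not a calibrated significance level.
    """
    flagged = []
    for position in sorted(native.keys() & control.keys()):
        if abs(welch_t(native[position], control[position])) >= threshold:
            flagged.append(position)
    return flagged


# Made-up current levels (pA), for illustration only.
native = {
    100: [80.1, 79.8, 80.4, 80.0, 79.9],
    101: [92.5, 93.1, 92.8, 93.4, 92.9],  # shifted relative to the control
}
control = {
    100: [80.0, 80.2, 79.7, 80.3, 79.9],
    101: [88.0, 88.4, 87.9, 88.2, 88.1],
}

flagged = flag_shifted_positions(native, control)
print(flagged)  # [101]
```

A per-position test like this only works at the bulk level, because it pools many reads. Calling a modification on a single molecule needs a model of the signal itself.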
DNA Modification Detection
ONT sequencing enables the direct detection of certain DNA modifications, such as 5-methylcytosine (5mC), 6-methyladenine (6mA), and N4-methylcytosine (4mC), at different levels of resolution, ranging from bulk-level detection to the single-molecule level. Several tools, such as Tombo and DeepSignal, have been developed to identify DNA modifications from ONT data.
RNA Modification Detection
Direct detection of RNA modifications with ONT sequencing has also shown promise, although the resolution varies. Earlier work used PacBio sequencing to detect N6-methyladenosine (m6A) in RNA molecules. More recently, ONT direct RNA sequencing has produced data of sufficient quality to make RNA modification detection feasible, and several pilot studies have detected RNA modifications at the bulk level using a variety of methodologies. Single-nucleotide resolution at the single-molecule level, however, has yet to be demonstrated.
While the average accuracy of ONT sequencing is improving, certain subsets of reads or read fragments still have very low accuracy, and the error rates of both 1D reads and 2D/1D² reads remain higher than those of short reads from next-generation sequencing platforms. Error correction is therefore a critical step before many downstream analyses, such as genome assembly and gene isoform identification. It rescues low-accuracy reads, which increases sensitivity, and it improves the quality of the results, for example by allowing breakpoints to be determined at single-nucleotide resolution.
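Error correction algorithms for ONT data fall into two main types: hybrid correction, which uses high-accuracy short reads to correct the long reads, and self-correction, which uses only the long reads themselves.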
Recent benchmark studies have shown that existing hybrid error correction tools, such as FMLRC, LSC, and LorDEC, can reduce the long-read error rate to approximately 1-4%, a level similar to that of short reads, provided that short-read coverage is sufficient. Self-correction, by contrast, reduces the error rate only to approximately 3-6%, which may be attributable to non-random systematic errors in ONT data.
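Why systematic errors limit self-correction can be shown with a toy consensus. The sketch below assumes the reads are already aligned column by column and takes a simple majority vote. Real self-correction tools build the alignment themselves and use more elaborate consensus models, so this illustrates the principle only. The sequences are made up, and the homopolymer deletion is an assumed example of a systematic error.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority-vote consensus over reads that are already aligned.

    Every string must have the same length; '-' marks a gap. Columns whose
    most common symbol is a gap are dropped from the consensus.
    """
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


truth = "ACGTTTGCA"

# Random errors: each read is wrong at a different column.
random_errors = [
    "ACGTTTGCA",
    "ACCTTTGCA",
    "ACGTTTGAA",
    "AC-TTTGCA",
    "ACGTATGCA",
]

# Systematic error: most reads drop the same base of the T homopolymer.
systematic_errors = [
    "ACGTT-GCA",
    "ACGTT-GCA",
    "ACGTTTGCA",
    "ACGTT-GCA",
    "ACGT-TGCA",
]

random_consensus = column_consensus(random_errors)
systematic_consensus = column_consensus(systematic_errors)
print(random_consensus == truth)      # True: random errors are outvoted
print(systematic_consensus == truth)  # False: the shared deletion wins the vote
```

Independent short reads do not share the long reads' error profile. That is one plausible reason hybrid correction reaches lower error rates than self-correction.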
The long read lengths offered by ONT sequencing make it ideal for de novo genome assembly of complex organisms. De novo assembly tools like Canu and Flye have been adapted to specifically handle ONT data. These tools take advantage of the long reads to span repetitive regions and resolve complex genomic structures, providing more contiguous and accurate genome assemblies.
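The benefit of read length for repeats comes down to a simple interval check. A read can place a repeat unambiguously only if it covers the whole repeat plus some unique flanking sequence on both sides. The sketch below makes that condition explicit. The function name, the coordinates, and the 50-base anchor length are illustrative assumptions, not parameters taken from Canu or Flye.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=50):
    """True if the read covers the repeat plus min_anchor unique bases per side.

    Coordinates are 0-based, half-open intervals on the genome.
    """
    return (read_start <= repeat_start - min_anchor
            and read_end >= repeat_end + min_anchor)


repeat = (10_000, 16_000)      # a hypothetical 6 kb repeat

short_read = (12_000, 12_150)  # 150 bp read lying inside the repeat
long_read = (8_500, 20_500)    # 12 kb read covering the repeat and both flanks

short_spans = spans_repeat(*short_read, *repeat)
long_spans = spans_repeat(*long_read, *repeat)
print(short_spans, long_spans)  # False True
```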
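Genome polishing further refines the assembled genome by correcting residual errors and misassemblies, leading to high-quality reference genomes. Pilon, a widely used polisher, works from accurate short reads aligned to the draft assembly. Polishing from the ONT reads themselves is done with tools such as Racon, Medaka, or Nanopolish.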
Aligning error-prone long reads, such as those generated by ONT sequencing, poses challenges that aligners built for traditional short-read data do not face: the error rates are higher and the reads are far longer. With the rise of long-read sequencing technologies, specialized aligners have been developed to handle these characteristics.
GraphMap, published in 2016, was the first aligner designed explicitly for ONT reads. It handles high error rates by progressively refining candidate alignments and uses fast graph traversal algorithms to align long reads with high speed and precision.
Minimap2 was developed to cater to the increasing read lengths beyond 100 kb in ONT sequencing. Using a seed-chain-align procedure, minimap2 achieved remarkable performance, running faster than other long-read aligners like LAST, NGMLR, and GraphMap, while still maintaining high accuracy. Notably, minimap2 also supports splice-aware alignment for ONT cDNA or direct RNA-sequencing reads, making it well-suited for transcriptomic applications.
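The seeding and chaining stages of a seed-chain-align procedure can be sketched in a few lines. minimap2 seeds with minimizers, a sampled subset of k-mers. The toy version below picks the lexicographically smallest k-mer in each window, whereas minimap2 orders k-mers by a hash, handles both strands, and chains anchors with a scoring function. Here "chaining" is reduced to picking the most common diagonal. The function names, the values of `k` and `w`, and the sequences are illustrative assumptions.

```python
from collections import Counter


def minimizers(sequence, k=5, w=4):
    """Return the set of (position, k-mer) minimizers of a sequence.

    In each window of w consecutive k-mers, the lexicographically smallest
    k-mer is kept (leftmost on ties).
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return selected


def shared_seeds(reference, read, k=5, w=4):
    """Return sorted (ref_pos, read_pos) anchors where minimizers match."""
    ref_index = {}
    for position, kmer in minimizers(reference, k, w):
        ref_index.setdefault(kmer, []).append(position)
    anchors = []
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in ref_index.get(kmer, []):
            anchors.append((ref_pos, read_pos))
    return sorted(anchors)


reference = "GATTACAGATTTCCGGAACTGACCTTGA"
read = reference[6:24]  # an error-free read taken from offset 6

anchors = shared_seeds(reference, read)

# Crude stand-in for chaining: anchors of one alignment share a diagonal.
diagonals = Counter(ref_pos - read_pos for ref_pos, read_pos in anchors)
best_diagonal, support = diagonals.most_common(1)[0]
print(best_diagonal)  # 6
```

Only the region around the best chain is then passed to base-level alignment. Restricting the expensive step this way is what keeps the approach fast on very long reads.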
In the realm of transcriptome data, aligners like GMAP (published in 2005) and the short-read aligner STAR, adapted for long reads, have been widely used for splice-aware alignment of error-prone transcriptome long reads to reference genomes. These aligners help identify exon-exon junctions and detect alternative splicing events, providing valuable insights into gene expression and isoform diversity.
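Splice-aware aligners that write SAM/BAM output record each intron as an `N` operation (a skipped reference region) in the alignment's CIGAR string, so exon-exon junctions can be read directly from it. The sketch below parses a CIGAR string with the standard library only. The function name and the example alignment are hypothetical. Positions are 0-based and half-open, whereas the SAM POS field is 1-based.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(ref_start, cigar):
    """Return (intron_start, intron_end) pairs implied by N operations.

    ref_start is the 0-based reference position of the first aligned base;
    the returned intervals are 0-based and half-open.
    """
    junctions = []
    position = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((position, position + length))
        if op in "MDN=X":  # operations that consume the reference
            position += length
    return junctions


# A made-up three-exon alignment with a soft clip and a small insertion.
junctions = splice_junctions(1000, "10S200M1500N150M2I100M800N300M")
print(junctions)  # [(1200, 2700), (2950, 3750)]
```

Counting how many reads support each junction, and which junctions occur together on the same read, is the starting point for detecting alternative splicing events.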
Additional aligners have been developed specifically for ONT transcriptome data, including Graphmap2 and deSALT. Graphmap2 has demonstrated superior alignment rates over minimap2, particularly for ONT direct RNA-sequencing reads with dense base modifications, making it well-suited for analyzing heavily modified RNA molecules.