next generation sequencing research papers

Search Menu
Advance Articles
Clinical Case Studies
Journal Club
Clinical Chemistry Podcasts
Clinical Trainee Council
Special Issues
Clinical Chemistry Guide to Scientific Writing
Clinical Chemistry Guide to Manuscript Review
Author Guidelines
Submission Site
Self-Archiving Policy
Call for Papers
Why Publish?
About Clinical Chemistry
Editorial Board
Advertising & Corporate Services
Journals on Oxford Academic
Books on Oxford Academic

Article Contents

Fundamentals of ngs platforms, the impact of ngs on basic research, ngs data analysis, a clinical future for ngs, technologies on the horizon, conclusions.

< Previous

Next-Generation Sequencing: From Basic Research to Diagnostics

Article contents
Figures & tables
Supplementary Data

Karl V Voelkerding, Shale A Dames, Jacob D Durtschi, Next-Generation Sequencing: From Basic Research to Diagnostics, Clinical Chemistry , Volume 55, Issue 4, 1 April 2009, Pages 641–658, https://doi.org/10.1373/clinchem.2008.112789

Permissions Icon Permissions

Background: For the past 30 years, the Sanger method has been the dominant approach and gold standard for DNA sequencing. The commercial launch of the first massively parallel pyrosequencing platform in 2005 ushered in the new era of high-throughput genomic analysis now referred to as next-generation sequencing (NGS).

Content: This review describes fundamental principles of commercially available NGS platforms. Although the platforms differ in their engineering configurations and sequencing chemistries, they share a technical paradigm in that sequencing of spatially separated, clonally amplified DNA templates or single DNA molecules is performed in a flow cell in a massively parallel manner. Through iterative cycles of polymerase-mediated nucleotide extensions or, in one approach, through successive oligonucleotide ligations, sequence outputs in the range of hundreds of megabases to gigabases are now obtained routinely. Highlighted in this review are the impact of NGS on basic research, bioinformatics considerations, and translation of this technology into clinical diagnostics. Also presented is a view into future technologies, including real-time single-molecule DNA sequencing and nanopore-based sequencing.

Summary: In the relatively short time frame since 2005, NGS has fundamentally altered genomics research and allowed investigators to conduct experiments that were previously not technically feasible or affordable. The various technologies that constitute this new paradigm continue to evolve, and further improvements in technology robustness and process streamlining will pave the path for translation into clinical diagnostics.

In 1977, 2 landmark articles describing methods for DNA sequencing were published. Allan Maxam and Walter Gilbert reported an approach in which terminally labeled DNA fragments were subjected to base-specific chemical cleavage and the reaction products were separated by gel electrophoresis( 1 ). In an alternative approach, Frederick Sanger and colleagues described the use of chain-terminating dideoxynucleotide analogs that caused base-specific termination of primed DNA synthesis( 2 ). Refinement and commercialization of the latter method led to its broad dissemination throughout the research community and, ultimately, into clinical diagnostics. In an industrial, high-throughput configuration, Sanger technology was used in the sequencing of the first human genome, which was completed in 2003 through the Human Genome Project, a 13-year effort with an estimated cost of $2.7 billion. In 2008, by comparison, a human genome was sequenced over a 5-month period for approximately $1.5 million( 3 ). The latter accomplishment highlights the capabilities of the rapidly evolving field of “next-generation” sequencing (NGS) 1 technologies that have emerged during the past 5 years. Currently, 5 NGS platforms are commercially available, with additional platforms on the horizon. To add to this pace, the US National Human Genome Research Institute (NHGRI) announced funding in August 2008 for a series of projects as part of its Revolutionary Genome Sequencing Technologies program, which has as its goal the sequencing of a human genome for $1000 or less ( http://www.genome.gov/27527585 ). This review describes NGS technologies, reviews their impact on basic research, and explores how they have the translational potential to substantially impact molecular diagnostics.

NGS platforms share a common technological feature—massively parallel sequencing of clonally amplified or single DNA molecules that are spatially separated in a flow cell. This design is a paradigm shift from that of Sanger sequencing, which is based on the electrophoretic separation of chain-termination products produced in individual sequencing reactions. In NGS, sequencing is performed by repeated cycles of polymerase-mediated nucleotide extensions or, in one format, by iterative cycles of oligonucleotide ligation. As a massively parallel process, NGS generates hundreds of megabases to gigabases of nucleotide-sequence output in a single instrument run, depending on the platform. These platforms are reviewed next.

roche/454 life sciences

The 454 technology ( http://www.454.com ) is derived from the technological convergence of pyrosequencing and emulsion PCR. In 1993, Nyren et al. described a sequencing approach based on chemiluminescent detection of pyrophosphate released during polymerase-mediated deoxynucleoside triphosphate (dNTP) incorporation( 4 ). Refinement by Ronaghi et al. served as the foundation for the commercial development of pyrosequencing( 5 )( 6 ). On a separate front, Tawfik and Griffiths described single-molecule PCR in microcompartments consisting of water-in-oil emulsions( 7 ). In 2000, Jonathan Rothberg founded 454 Life Sciences, which developed the first commercially available NGS platform, the GS 20, launched in 2005. Combining single-molecule emulsion PCR with pyrosequencing, Margulies and colleagues at 454 Life Sciences performed shotgun sequencing of the entire 580 069 bp of the Mycoplasma genitalia genome at 96% coverage and 99.96% accuracy in a single GS 20 run( 8 ). In 2007, Roche Applied Science acquired 454 Life Sciences and introduced the second version of the 454 instrument, the GS FLX. Sharing the same core technology as the GS 20, the GS FLX flow cell is referred to as a “picotiter well” plate, which is made from a fused fiber-optic bundle. In its newest configuration, approximately 3.4 × 10 6 picoliter-scale sequencing-reaction wells are etched into the plate surface, and the well walls have a metal coating to improve signal-to-noise discrimination. For sequencing ( Fig. 1 ), a library of template DNA is prepared by fragmentation via nebulization or sonication. Fragments several hundred base pairs in length are end-repaired and ligated to adapter oligonucleotides. The library is then diluted to single-molecule concentration, denatured, and hybridized to individual beads containing sequences complementary to adapter oligonucleotides. The beads are compartmentalized into water-in-oil microvesicles, where clonal expansion of single DNA molecules bound to the beads occurs during emulsion PCR. After amplification, the emulsion is disrupted, and the beads containing clonally amplified template DNA are enriched. The beads are again separated by limiting dilution, deposited into individual picotiter-plate wells, and combined with sequencing enzymes. Loaded into the GS FLX, the picotiter plate functions as a flow cell wherein iterative pyrosequencing is performed by successive flow addition of the 4 dNTPs. A nucleotide-incorporation event in a well containing clonally amplified template produces pyrophosphate release and picotiter-plate well–localized luminescence, which is transmitted through the fiber-optic plate and recorded on a charge-coupled device camera. With the flow of each dNTP reagent, wells are imaged, analyzed for their signal-to-noise ratio, filtered according to quality criteria, and subsequently algorithmically translated into a linear sequence output. With the newest chemistry, termed “Titanium,” a single GS FLX run generates approximately 1 × 10 6 sequence reads, with read lengths of ≥400 bases yielding up to 500 million base pairs (Mb) of sequence. A recognized strength of the 454 technology is the longer read length, which facilitates de novo assembly of genomes( 9 ). An outstanding concern has been the accurate determination of homopolymers >3–4 bases in length. A 6-base homopolymer should theoretically yield twice the luminescence of a 3-base homopolymer. Operationally, this luminescence yield varies, and estimates of homopolymer length are less accurate with increasing length( 8 )( 10 ). 454 has reported that the metal coating of the walls of picotiter wells mentioned above improves the accuracy of homopolymer determination. Sequence coverage depth and accuracy for the 454 technology is discussed below in the NGS Data Analysis section.

Roche 454 GS FLX sequencing.

Template DNA is fragmented, end-repaired, ligated to adapters, and clonally amplified by emulsion PCR. After amplification, the beads are deposited into picotiter-plate wells with sequencing enzymes. The picotiter plate functions as a flow cell where iterative pyrosequencing is performed. A nucleotide-incorporation event results in pyrophosphate (PP i ) release and well-localized luminescence. APS, adenosine 5′-phosphosulfate.

illumina/solexa

In 1997, British chemists Shankar Balasubramanian and David Klenerman conceptualized an approach for sequencing single DNA molecules attached to microspheres. They founded Solexa in 1998, and their goal during early development of sequencing single DNA molecules was not achieved, requiring a shift toward sequencing clonally amplified templates. By 2006, the Solexa Genome Analyzer, the first “short read” sequencing platform, was commercially launched. Acquired by Illumina ( http://www.Illumina.com ) in 2006, the Genome Analyzer uses a flow cell consisting of an optically transparent slide with 8 individual lanes on the surfaces of which are bound oligonucleotide anchors ( Fig. 2 ). Template DNA is fragmented into lengths of several hundred base pairs and end-repaired to generate 5′-phosphorylated blunt ends. The polymerase activity of Klenow fragment is used to add a single A base to the 3′ end of the blunt phosphorylated DNA fragments. This addition prepares the DNA fragments for ligation to oligonucleotide adapters, which have an overhang of a single T base at their 3′ end to increase ligation efficiency. The adapter oligonucleotides are complementary to the flow-cell anchors. Under limiting-dilution conditions, adapter-modified, single-stranded template DNA is added to the flow cell and immobilized by hybridization to the anchors. In contrast to emulsion PCR, DNA templates are amplified in the flow cell by “bridge” amplification, which relies on captured DNA strands “arching” over and hybridizing to an adjacent anchor oligonucleotide. Multiple amplification cycles convert the single-molecule DNA template to a clonally amplified arching “cluster,” with each cluster containing approximately 1000 clonal molecules. Approximately 50 × 10 6 separate clusters can be generated per flow cell. For sequencing, the clusters are denatured, and a subsequent chemical cleavage reaction and wash leave only forward strands for single-end sequencing. Sequencing of the forward strands is initiated by hybridizing a primer complementary to the adapter sequences, which is followed by addition of polymerase and a mixture of 4 differently colored fluorescent reversible dye terminators. The terminators are incorporated according to sequence complementarity in each strand in a clonal cluster. After incorporation, excess reagents are washed away, the clusters are optically interrogated, and the fluorescence is recorded. With successive chemical steps, the reversible dye terminators are unblocked, the fluorescent labels are cleaved and washed away, and the next sequencing cycle is performed. This iterative, sequencing-by-synthesis process requires approximately 2.5 days to generate read lengths of 36 bases. With 50 × 10 6 clusters per flow cell, the overall sequence output is >1 billion base pairs (Gb) per analytical run( 11 ). The newest platform, the Genome Analyzer II, has optical modifications enabling analysis of higher cluster densities. Coupled with ongoing improvements in sequencing chemistry and projected read lengths of 50-plus bases, further increases in output should be realized. Illumina and other NGS technologies have devised strategies to sequence both ends of template molecules. Such “paired-end” sequencing provides positional information that facilitates alignment and assembly, especially for short reads( 12 )( 13 ). A technical concern of Illumina sequencing is that base-call accuracy decreases with increasing read length( 14 ). This phenomenon is primarily due to “dephasing noise.” During a given sequencing cycle, nucleotides can be under- or overincorporated, or block removal can fail. With successive cycles, these aberrations accumulate to produce a heterogeneous population in a cluster of strands of varying lengths. This heterogeneity decreases signal purity and reduces precision in base calling, especially at the 3′ ends of reads. Modifications in sequencing chemistry and algorithms for data-image analysis and interpretation are being pursued to mitigate dephasing( 15 ). Investigators at the Wellcome Trust Sanger Institute, who have extensive experience with the Illumina platform, have published a series of technical improvements for library preparation, including methods for increasing the reproducibility of fragmentation by adaptive focused acoustic wave sonication, enhanced efficiency of adapter ligation by use of an alternate ligase, and reducing the G+C bias that has been observed in Illumina reads via a modified gel-extraction protocol( 16 ).

Illumina Genome Analyzer sequencing.

Adapter-modified, single-stranded DNA is added to the flow cell and immobilized by hybridization. Bridge amplification generates clonally amplified clusters. Clusters are denatured and cleaved; sequencing is initiated with addition of primer, polymerase (POL) and 4 reversible dye terminators. Postincorporation fluorescence is recorded. The fluor and block are removed before the next synthesis cycle.

applied biosystems/solid

The SOLiD (Supported Oligonucleotide Ligation and Detection) System 2.0 platform, which is distributed by Applied Biosystems ( http://www.solid.appliedbiosystems.com ), is a short-read sequencing technology based on ligation. This approach was developed in the laboratory of George Church and reported in 2005 along with the resequencing of the Escherichia coli genome( 17 ). Applied Biosystems refined the technology and released the SOLiD instrumentation in 2007. Sample preparation shares similarities with the 454 technology in that DNA fragments are ligated to oligonucleotide adapters, attached to beads, and clonally amplified by emulsion PCR. Beads with clonally amplified template are immobilized onto a derivitized-glass flow-cell surface, and sequencing is begun by annealing a primer oligonucleotide complementary to the adapter at the adapter–template junction ( Fig. 3 ). Instead of providing a 3′ hydroxyl group for polymerase-mediated extension, the primer is oriented to provide a 5′ phosphate group for ligation to interrogation probes during the first “ligation sequencing” step. Each interrogation probe is an octamer, which consists of (in the 3′-to-5′ direction) 2 probe-specific bases followed by 6 degenerate bases with one of 4 fluorescent labels linked to the 5′ end. The 2 probe-specific bases consist of one of 16 possible 2-base combinations (for example TT, GT, and so forth). In the first ligation-sequencing step, thermostable ligase and interrogation probes representing the 16 possible 2-base combinations are present. The probes compete for annealing to the template sequences immediately adjacent to the primer. After annealing, a ligation step is performed, followed by wash removal of unbound probe. Fluorescence signals are optically collected before cleavage of the ligated probes, and a wash is performed to remove the fluor and regenerate the 5′ phosphate group. In the subsequent sequencing steps, interrogation probes are ligated to the 5′ phosphate group of the preceding pentamer. Seven cycles of ligation, referred to as a “round,” are performed to extend the first primer. The synthesized strand is then denatured, and a new sequencing primer offset by 1 base in the adapter sequence (n − 1) is annealed. Five rounds total are performed, each time with a new primer with a successive offset (n − 2, n − 3, and so on). By this approach, each template nucleotide is sequenced twice. A 6-day instrument run generates sequence read lengths of 35 bases. Sequence is inferred by interpreting the ligation results for the 16 possible 2 base–combination interrogation probes. With the use of offset primers, several bases of the adapter are sequenced. This information provides a sequence reference starting point that is used in conjunction with the color space–coding scheme to algorithmically deconvolute the downstream template sequence (see Fig. 1 in the Data Supplement that accompanies the online version of this review at http://www.clinchem.org/content/vol55/issue4 ). Placing 2 flow-cell slides in the instrument per analytical run produces a combined output of 4 Gb of sequence or greater. Unextended strands are capped before the ligation to mitigate signal deterioration due to dephasing. Capping coupled with high-fidelity ligation chemistry and interrogation of each nucleotide base twice during independent ligation cycles yields a company-reported sequence consensus accuracy of 99.9% for a known target at a 15-fold sequence coverage over sequence reads of 25 nucleotides. On an independent track, the Church laboratory has collaborated with Danaher Motion and Dover Systems to develop and introduce an alternative sequencing-by-ligation platform, the Polonator G.007 ( http://www.polonator.org ). Table 1 summarizes GS FLX, Genome Analyzer, and SOLiD platform features.

Applied Biosystems SOLiD sequencing by ligation.

Top: SOLiD color-space coding. Each interrogation probe is an octamer, which consists of (3′-to-5′ direction) 2 probe-specific bases followed by 6 degenerate bases (nnnzzz) with one of 4 fluorescent labels linked to the 5′ end. The 2 probe-specific bases consist of one of 16 possible 2-base combinations. Bottom: (A), The P1 adapter and template with annealed primer (n) is interrogated by probes representing the 16 possible 2-base combinations. In this example, the 2 specific bases complementary to the template are AT. (B), After annealing and ligation of the probe, fluorescence is recorded before cleavage of the last 3 degenerate probe bases. The 5′ end of the cleaved probe is phosphorylated (not shown) before the second sequencing step. (C), Annealing and ligation of the next probe. (D), Complete extension of primer (n) through the first round consisting of 7 cycles of ligation. (E), The product extended from primer (n) is denatured from the adapter/template, and the second round of sequencing is performed with primer (n − 1). With the use of progressively offset primers, in this example (n − 1), adapter bases are sequenced, and this known sequence is used in conjunction with the color-space coding for determining the template sequence by deconvolution (see Fig. 1 in the online Data Supplement). In this technology, template bases are interrogated twice.

Comparison of NGS platforms.

helicos biosciences and single-molecule sequencing

The first single-molecule sequencing platform, the HeliScope, is now available from Helicos BioSciences ( http://www.helicosbio.com ) with a company-reported sequence output of 1 Gb/day. This technology stems from the work of Braslavsky et al., published in 2003( 18 ). Having obviated clonal amplification of template, the method involves fragmenting sample DNA and polyadenylation at the 3′ end, with the final adenosine fluorescently labeled. Denatured polyadenylated strands are hybridized to poly(dT) oligonucleotides immobilized on a flow-cell surface at a capture density of up to 100 × 10 6 template strands per square centimeter. After the positional coordinates of the captured strands are recorded by a charge-coupled device camera, the label is cleaved and washed away before sequencing. For sequencing, polymerase and one of 4 Cy5-labeled dNTPs are added to the flow cell, which is imaged to determine incorporation into individual strands. After label cleavage and washing, the process is repeated with the next Cy5-labeled dNTP. Each sequencing cycle, which consists of the successive addition of polymerase and each of the 4 labeled dNTPs, is termed a “quad.” The number of sequencing quads performed is approximately 25–30, with read lengths of up to 45–50 bases having been achieved. The Helicos platform was used to sequence the 6407-base genome of bacteriophage M13( 19 ). This study demonstrated both the potential and important technical issues that may be relevant to all single-molecule sequencing methods that are based on sequencing by synthesis. First, sequencing accuracy was appreciably improved when template molecules were sequenced twice (“2-pass” sequencing). Second, the accuracy of sequencing homopolymers was compromised by the polymerase adding additional bases of the same identity in a homopolymeric stretch in a given dNTP addition. Helicos has since developed proprietary labeled dNTPs, termed “virtual terminators,” which the company reports reduce polymerase processivity so that only single bases are added, improving the accuracy of homopolymer sequencing. Interestingly, the percentage of strands in which longer read lengths can be achieved (e.g., 50 nucleotides) is substantially lower than that obtained with shorter (e.g., 25 nucleotides) read-length sequencing, possibly reflecting secondary structures (e.g., hairpins) assumed by the template molecules.

In the short 4 years since the first commercial platform became available, NGS has markedly accelerated multiple areas of genomics research, enabling experiments that previously were not technically feasible or affordable. We describe major applications of NGS and then review the analysis of NGS data.

genomic analysis

The high-throughput capacity of NGS has been leveraged to sequence entire genomes, from microbes to humans( 3 )( 8 )( 9 )( 11 )( 20 )( 21 )( 22 )( 23 )( 24 ), including the recent sequencing of the genome of cytogenetically normal acute myeloid leukemia cells, which identified novel, tumor-specific gene mutations( 25 ). The longer read lengths of the 454 technology, compared with the Illumina and SOLiD short-read technologies, facilitate the assembly of genomes in the absence of a reference genome (i.e., de novo assembly). For resequencing, both long- and short-read technologies have been used successfully. In one comparative study, the 454, Illumina, and SOLiD technologies all accurately detected single-nucleotide variations when coverage depth was ≥15-fold per allele( 20 ) (the critical issue of coverage depth is discussed further in the NGS Data Analysis section). The 454 read lengths provide nucleotide haplotype information over a range of several hundred base pairs and are predicted to be better suited for detecting larger insertions and deletions and for producing alignments in areas containing repetitive sequences. Further studies are needed to compare technology performance for detecting insertions and deletions. Each platform has an optional strategy for sequencing both ends of DNA libraries (paired-end sequencing). In addition to effectively doubling sequence output, knowing that reads are associated with each other on a given fragment augments alignment and assembly, especially for short reads. Paired-end sequencing has been used to map genomic structural variation, including deletions, insertions, and rearrangements( 12 )( 13 )( 26 )( 27 ). The ability to sequence complete human genomes at a substantially reduced cost with NGS has energized an international effort to sequence thousands of human genomes over the next decade ( http://www.1000genomes.org ), which will lead to the characterization and cataloging of human genetic variation at an unprecedented level.

targeted genomic resequencing

Sequencing of genomic subregions and gene sets is being used to identify polymorphisms and mutations in genes implicated in cancer and in regions of the human genome that linkage and whole-genome association studies have implicated in disease( 28 )( 29 ). Especially in the latter setting, regions of interest can be hundreds of kb’s to several Mb in size. To best use NGS for sequencing such candidate regions, several genomic-enrichment steps, both traditional and novel, are being incorporated into overall experimental designs. Overlapping long-range PCR amplicons (approximately 5–10 kb) can be used for up to several hundred kb’s, but this approach is not practical for larger genomic regions. More recently, enrichment has been achieved by hybridizing fragmented, denatured human genomic DNA to oligonucleotide capture probes complementary to the region of interest and subsequently eluting the enriched DNA( 30 )( 31 )( 32 )( 33 ). Capture probes can be immobilized on a solid surface (Roche NimbleGen, http://www.nimblegen.com ; Agilent Technologies, http://www.agilent.com ; and Febit, http://www.febit.com ) or used in solution (Agilent). Current NimbleGen arrays contain 350 000 oligonucleotides of 60–90 bp in length that are typically spaced 5–20 nucleotides apart, with oligonucleotides complementary to repetitive regions being excluded. For enrichment, 5–20 μg of genomic DNA is fragmented and ligated to oligonucleotide linkers containing universal PCR priming sites. This material is denatured, hybridized to an array for 3 days, and eluted, with the enriched DNA amplified by the PCR before NGS library preparation. In reported studies, up to 5 Mb of sequence has been captured on the 350K array, with 60%–75% of sequencing reads mapping to targeted regions; other reads mapping to nontargeted regions reflect nonspecific capture. In development by NimbleGen is the use of an array of 2.1 × 10 6 features for capturing larger genomic regions. Agilent’s solution-based technology uses oligonucleotides up to 170 bases in length, with each end containing sequences for universal PCR priming and with primer sites containing a restriction endonuclease–recognition sequence. The oligonucleotide library is amplified by the PCR, digested with restriction enzymes, and ligated to adapters containing the T7 polymerase promoter site. In vitro transcription is performed with biotinylated UTP to generate single-stranded biotinylated cRNA capture sequences. For capture, 3 μg of fragmented, denatured genomic DNA is hybridized with cRNA sequences for 24 h in solution. After hybridization, duplexes consisting of single-stranded DNA and cRNA are bound to streptavidin-coated magnetic beads; the cRNA is then enzymatically digested, leaving enriched single-stranded DNA that is subsequently processed for NGS. An alternative enrichment approach developed by RainDance Technologies ( http://www.raindancetechnologies.com ) uses a novel microfluidics technology in which individual pairs of PCR primers for the genomic regions of interest are segregated in water in emulsion droplets and then pooled to create a “primer library.” Separately, emulsion droplets containing genomic DNA and PCR reagents are prepared. Two separate droplet streams are created, one with primer-library droplets and the other with droplets containing genomic DNA/PCR reagents. The 2 streams are merged and primer-library droplets and genomic DNA/PCR reagent droplets are paired in a 1:1 ratio. As paired droplets proceed through the microfluidic channel, they pass an electrical impulse that causes them to physically coalesce. The coalesced droplets containing individual primer pairs and genomic DNA/PCR reagents are deposited in a 96-well plate and amplified by the PCR. After amplification, the emulsions are disrupted, and the amplicons are pooled and processed for NGS.

metagenomics

NGS has had a tremendous impact on the study of microbial diversity in environmental and clinical samples. Operationally, genomic DNA is extracted from the sample of interest, converted to an NGS library and sequenced. The sequence output is aligned to known reference sequences for microorganisms that are predicted to be present in the sample. Closely related species can be discerned, and more distantly related species can be inferred. In addition, de novo assembly of the data set can yield information to support the presence of known and potentially new species. Qualitative genomic information is obtained, and analysis of the relative abundance of the sequence reads can be used to derive quantitative information on individual microbial species. To date, most NGS-based metagenomic analyses have used the 454 technology and its associated longer read lengths to facilitate alignment to microbial reference genomes and for de novo assembly of previously uncharacterized microbial genomes. Examples of metagenomic studies include the analysis of microbial populations in the ocean( 34 )( 35 ) and soil( 36 ), the identification of a novel arenavirus in transplantation patients( 37 ), and the characterization of microflora present in the human oral cavity( 38 ) and the guts of obese and lean twins( 39 ).

transcriptome sequencing

NGS has provided a powerful new approach, termed “RNA-Seq,” for mapping and quantifying transcripts in biological samples. Total, ribosomal RNA–depleted, or poly(A)+ RNA is isolated and converted to cDNA. A typical protocol would involve the generation of first-strand cDNA via random hexamer–primed reverse transcription and subsequent generation of second-strand cDNA with RNase H and DNA polymerase. The cDNA is then fragmented and ligated to NGS adapters. For small RNAs such as microRNAs (miRNAs) and short interfering RNAs, preferential isolation via a small RNA–enrichment method, size selection on an electrophoresis gel, or a combination of these approaches is commonly used. RNA ligase is used to join adapter sequences to the RNA; this step is often followed by a PCR amplification step before NGS processing. After sequencing, reads are aligned to a reference genome, compared with known transcript sequences, or assembled de novo to construct a genome-scale transcription map. Although RNA-Seq is in its early stages as a technology, it has already shown some advantages over gene expression arrays( 40 ). First, arrays depend on tiling existing genomic sequences, whereas RNA-Seq is not constrained by this limitation, allowing characterization of transcription without prior knowledge of the genomic sites of transcription origin. RNA-Seq is capable of single-base resolution and, compared with arrays, demonstrates a greater ability to distinguish RNA isoforms, determine allelic expression, and reveal sequence variants. Expression levels are deduced from the total number of reads that map to the exons of a gene, normalized by the length of exons that can be uniquely mapped. Results obtained with this approach have shown close correlation with those of quantitative PCR and RNA-spiking experiments. The dynamic range of RNA-Seq for determining expression levels is 3–4 orders of magnitude, compared with 2 orders of magnitude for expression arrays. In this context, RNA-Seq has shown improved performance for the quantitative detection of both highly produced transcripts and transcripts produced at low levels. RNA-Seq is being used to confirm and revise gene annotation, including 5′ and 3′, and exon/intron boundaries; the latter is achieved by mapping reads to exon junctions defined by GT-AG splicing consensus sites. Both qualitative and quantitative information regarding splicing diversity can be deduced. RNA-Seq has been applied to a variety of organisms, including Saccharomyces cerevisiae , Arabidopsis thaliana , mice, and human cells( 40 )( 41 )( 42 )( 43 )( 44 )( 45 )( 46 )( 47 )( 48 )( 49 )( 50 )( 51 ).

mapping of dna-binding proteins and chromatin analysis

The delineation of regulatory proteins associated with genomes was substantially accelerated by the introduction of chromatin immunoprecipitation and microarray hybridization (ChIP-on-chip) technology( 52 ). In this approach, proteins in contact with genomic DNA are chemically cross-linked (typically with mild formaldehyde treatment) to their binding sites, and the DNA is fragmented by sonication or digestion with micrococcal nuclease. The proteins cross-linked with DNA are immunoprecipitated with antibodies specific for the proteins of interest. The DNA in the immunoprecipitate is purified and hybridized to an oligonucleotide array consisting of sequences from the genome, allowing identification of the protein-binding sites. This approach has been successfully used to identify binding sites for transcription factors and histone proteins. ChIP-on-chip technology is now being supplanted in a variety of experimental settings with ChIP-Seq, in which the DNA harvested from the immunoprecipitate is converted into a library for NGS. The obtained reads are mapped to the reference genome of interest to generate a genome-wide protein-binding map( 53 )( 54 )( 55 ). Studies to date that have examined the genomic-binding sites of the human NRSF (neuron restrictive silencer factor) and STAT1 (signal transducer and activator of transcription 1) proteins indicate the resolution of ChIP-Seq to be greater than for ChIP-on-chip, as evidenced by confirmation of previously identified binding sites and identification of novel binding sites( 56 )( 57 ). Analogous to RNA-Seq, ChIP-Seq has the important advantage of not requiring prior knowledge of genomic locations of protein binding. In addition to the study of transcription factors, NGS is being used to map genomic methylation. One approach involves traditional bisulfite conversion of DNA followed by NGS, which has been applied to the study of entire genomes or genomic subregions( 58 )( 59 ). Ongoing studies are attempting to develop a variant of ChIP-Seq in which genomic methylation is assayed by coupling immunoprecipitation with a monoclonal antibody directed against methylated cytosine and subsequent NGS( 60 ).

NGS experiments generate unprecedented volumes of data, which present challenges and opportunities for data management, storage, and, most importantly, analysis( 61 ). NGS data begin as large sets of tiled fluorescence or luminescence images of the flow-cell surface recorded after each iterative sequencing step ( Fig. 4 ). This volume of data requires a resource-intensive data-pipeline system for data storage, management, and processing. Data volumes generated during single runs of the 454 GS FLX, Illumina, and SOLiD instruments are approximately 15 GB, 1 TB, and 15 TB, respectively. The main processing feature of the data pipeline is the computationally intensive conversion of image data into sequence reads, known as base calling. First, individual beads or clusters are identified and localized in an image series. Image parameters such as intensity, background, and noise are then used in a platform-dependant algorithm to generate read sequences and error probability–related quality scores for each base. Although many researchers use the base calls generated by the platform-specific data-pipeline software, alternative base-calling programs that use more advanced software and statistical techniques have been developed. Features of these alternative programs include the incorporation of ambiguous bases into reads, improved removal of poor-quality bases from read ends( 62 ), and the use of data sets for software training( 15 ). Incorporation of these features has been shown to reduce read error and improve alignment, especially as platforms are pushed to generate longer reads. These advantages, however, must be weighed against the substantial computer resources required by the large volumes of image data.

Pseudocolor image from the Illumina flow cell.

Each fluorescence signal originates from a clonally amplified template cluster. Top panel illustrates 4 emission wavelengths of fluorescent labels depicted in red, green, blue, and yellow. Images are processed to identify individual clusters and to remove noise or interference. The lower panel is a composite image of the 4 fluorescence channels.

Miscall probabilities of 0.1 (10%), 0.01 (1%), and 0.001 (0.1%) yield phred scores of 10, 20, and 30, respectively. The NGS error rates estimated by quality values depend on several factors, including signal-to-noise levels, cross talk from nearby beads or clusters, and dephasing. Substantial effort has been made to understand and improve the accuracy of quality scores and the underlying error sources( 10 )( 14 ), including inaccuracies in homopolymer run lengths on the 454 platform and base-substitution error biases with the Illumina format. Study of these error traits has led to examples of software that require no additional base calling but that improve quality-score accuracies and thus improve sequencing accuracy( 65 )( 66 ). Quality values are an important tool for rejecting low-quality reads, trimming low-quality bases, improving alignment accuracy, and determining consensus-sequence and variant calls( 67 ).

Alignment and assembly are substantially more difficult for NGS data than for Sanger data because of the shorter reads lengths in the former. One limitation of short-read alignment and assembly is the inability to uniquely align large portions of a read set when the read length becomes too short. Similarly, the number of uniquely aligned reads is reduced when aligning to larger, more complex genomes or reference sequences because of their having a higher probability of repetitive sequences. A case in point is a modeling study that indicated that 97% of the E. coli genome can be uniquely aligned with 18-bp reads but that only 90% of the human genome can be uniquely aligned with 30-bp reads( 68 )( 69 ). Unique alignment or assembly is reduced not only by the presence of repeat sequences but also by shared homologies within closely related gene families and pseudogenes. Nonunique read alignment is handled in software by read distribution between multiple alignment positions or leaving alignment gaps. De novo assembly will reject these reads, leading to shorter and more numerous assembled contigs. These factors are relevant when choosing an appropriate sequencing platform with its associated read length, particularly for de novo assembly( 9 ).

Error rates for individual NGS reads are higher than for Sanger sequencing. The higher accuracy of Sanger sequencing reflects not only the maturity of the chemistry but also the fact that a Sanger trace peak represents highly redundant, multiple terminated extension reactions. Accuracy in NGS is achieved by sequencing a given region multiple times, enabled by the massively parallel process, with each sequence contributing to “coverage” depth. Through this process, a “consensus” sequence is derived. To assemble, align, and analyze NGS data requires an adequate number of overlapping reads, or coverage. In practice, coverage across a sequenced region is variable, and factors other than the Poisson-like randomness of library preparation that may contribute to this variability include differential ligation of adapters to template sequences and differential amplification during clonal template generation( 11 )( 70 ). Beyond sequence errors, inadequate coverage can cause failure to detect actual nucleotide variation, leading to false-negative results for heterozygotes( 3 )( 11 ). Studies have shown that coverages of less than 20- to 30-fold begin to reduce the accuracy of single-nucleotide polymorphism calls in data on the 454 platform( 65 ). For the Illumina system, higher coverage depths (50- to 60-fold) have been used in an effort to improve short-read alignment, assembly, and accuracy, although coverage in the 20- to 30-fold range may be sufficient for certain resequencing applications( 14 ). As noted above, one comparative study of a yeast genome showed that the 454, Illumina, and SOLiD technologies all accurately detected single-nucleotide variations when the coverage depth was ≥15-fold per allele( 20 ). Coverage gaps can occur when sequences are not aligned because of substantial variance from a reference. Alignment of repetitive sequences in repeat regions of a target sequence can also affect the apparent coverage. Reads that align equally well at multiple sites can be randomly distributed to the sites or in some cases discarded, depending on the alignment software. In de novo–assembly software, reads with ambiguous alignments are typically discarded, yielding multiple aligned read groups, or contigs, with no information regarding relative order.

A large variety of software programs for alignment and assembly have been developed and made available to the research community (see Table 1 in the online Data Supplement). Most use the Linux operating system, and a few are available for Windows. Many require a 64-bit operating system and can use ≥16 MB of RAM and multiple central-processing unit cores. The range of data volumes, hardware, software packages, and settings leads to processing times from a few minutes to multiple hours, emphasizing the need for sufficient computational power. Although a growing set of variations in alignment and assembly algorithms are available, there remains the trade-off between speed and accuracy in which many but not all possible alignments are evaluated, with a balance having to be struck between ideal alignment and computational efficiency.

NGS software features vary with the application and in general may include alignment, de novo assembly, alignment viewing, and variant-discovery programs. In addition some NGS statistical data-analysis tools are being developed (such as JMP Genomics; SAS Institute). Software packages available for alignment and assembly to a reference sequence include Zoom( 71 ), MAQ( 67 ), Mosaik( 72 ), SOAP( 73 ), and SHRiMP ( http://compbio.cs.toronto.edu/shrimp/ ), which supports SOLiD color-space analysis. Software for de novo assembly includes Edina( 70 ), EULER-SR( 74 ), SHARCGS( 75 ), SSAKE( 69 ), Velvet( 76 ), and SOAPdenovo ( http://soap.genomics.org.cn/ ). Recently released commercial software for alignment and de novo assembly includes packages from DNAStar ( www.dnastar.com ), SoftGenetics ( www.softgenetics.com ), and CLC bio ( www.clcbio.com ) that feature data viewers that allow the user to see read alignments, coverage depth, genome annotations, and variant analysis. Fig. 5 presents some examples of NGS data viewed in 2 different software systems.

Examples of NGS data viewed in 2 different software systems.

(A), Roche Amplicon Variant Analyzer software displaying GS FLX data from the CFTR gene [cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)]. Lower pane shows reference sequence (green) above 18 of 68 aligned reads. Column highlighted in yellow and blue shows a heterozygous single-nucleotide polymorphism (SNP). Single T/A insertions (red) may represent errors. Upper pane shows percent variation from reference (vertical bars) and coverage (pale blue line). (B), DNAStar SeqMan Pro software displaying Illumina data from Mycobacterium massiliense . Lower pane shows reference sequence above aligned reads. Green and red arrows show direction of sequencing; base calls at variance with reference are indicated (red). Three columns in agreement (red) indicate presumptive SNPs. Other bases in red may be errors. Upper pane shows read coverage, with relative alignment positions above the graph. See the Acknowledgments for disclosure information on the CFTR -gene analysis performed on residual, deidentified DNA.

RNA-Seq data analysis poses unique challenges and requires sequence alignment across spliced regions of transcripts as well as poly(A) tails. Current software has made strong inroads, however, with incorporation of motif recognition at splice junctions and identification of intron–exon borders through regions of low alignment coverage( 41 ). Deciphering multiple transcript isoforms involves mapping reads to known and putative splicing junctions and, in one approach, requires that each isoform be supported by multiple independent splice-junction reads with independent start sites( 51 ). ERANGE software has been used in the analysis of the mouse transcriptome( 43 ). ERANGE maps unique reads to their genomic site of origin and maps reads that match to several sites, or multireads, to a most likely site of origin. Reads that do not map to a known exon are grouped together by homology into candidate exons or parts of exons. The near proportional nature of NGS transcriptome data allows quantification of RNA production from the coverage of the assembled or aligned data. ERANGE uses normalized counts of unique reads, spliced reads, and multireads to quantify transcripts. Additional analytical considerations are needed for miRNA studies, including RNA secondary-structure analysis for hairpins, alignment to known miRNA databases, and searches within the NGS data set for complementary miRNA strands, as described in studies of developing rice grains( 77 ) and chicken embryos( 78 ).

Research with ChIP-Seq has led to analysis methods and software that exploit the advantages over ChIP-in-chip, namely a larger, more information-rich data set. The single-base resolution of the data allows improved estimation of binding-site positions in the programs QuEST( 79 ) and MACS( 80 ). Aligned data at the protein-binding regions typically have 2 characteristic offset peaks, each of which is populated by only forward or only reverse reads. These peaks are hallmarks of the immunoprecipitated short ChIP-Seq DNA fragments with a binding site near the center and are used by the software to estimate binding-site location near the mean peak position. Additional program features include advancements in statistical analysis to minimize miscalled binding sites, error probability estimation, and motif analysis (see Table 1 in the online Data Supplement).

From the impact that NGS has made at the basic-research level, we can anticipate its translation into molecular diagnostics. Key issues that will need to be addressed in this transition will include complexity of technical procedures, robustness, accuracy, and cost. By all these measures, NGS platforms will benefit from continued process streamlining, automation, chemistry refinements, cost reductions, and improved data handling. The cost of NGS is currently substantial in terms of the investment in capital equipment (from approximately $600 000 for the Roche/454 Life Science, Illumina, and Applied Biosystems SOLiD platforms to $1.35 million for the HeliScope platform) and costs of sequencing reagents (from approximately $3500–$4500 for the Illumina, Applied Biosystems, and Roche/454 platforms to $18 000 for the HeliScope platform). Nonetheless, the cost per base is substantially lower than for Sanger sequencing, and combined with the tremendous output, it is straightforward to see why genome centers, core facilities, and commercial contract-sequencing enterprises have readily adopted this new technology. Work flow considerations include the fact that preparation of a sample library requires multiple molecular biology steps and 2–4 days to complete, depending on the platform. In addition to the required molecular biology expertise, data analysis requires expertise in bioinformatics facilitated by a knowledge of Linux operating systems. Leveraging the high-throughput capacity of NGS platforms can be facilitated by analyzing multiple samples with separate flow-cell lanes or compartments. In addition, unique identifier sequences or “bar codes” can be ligated to individual samples, which can subsequently be pooled and sequenced. After sequencing, sequences of individual samples are derived by data deconvolution( 81 )( 82 )( 83 ).

The transition of NGS into clinical diagnostics is in the early stages of development in large reference laboratories and is being leveraged for applications that require large amounts of sequence information, relative quantification, and high-sensitivity detection. Examples that meet these criteria include the aforementioned detection of mutations in tumor cells from biopsies or in the circulation. In the area of mitochondrial disorders, NGS can be used to sequence the entire 16.5-kb mitochondrial genome, determine mutation heteroplasmy percentage, and analyze nuclear genes whose protein products affect mitochondrial metabolism—all in a single analytical run. In the authors’ laboratory, sequencing of mycobacterial genomes is ongoing as an approach to refine organism identification and support clinical epidemiologic investigations. HIV quasi-species detection and relative quantification have been demonstrated and can be used to monitor emerging drug resistance( 84 ). For human genetics, there is an increasing need to analyze multiple genes that, when mutated, lead to overlapping physical findings and clinical phenotypes. For example, 16 different genes are implicated in the pathogenesis of hypertrophic cardiomyopathy( 85 )( 86 ). For a comprehensive diagnostic evaluation in such settings, it will be necessary to sequence upwards of 100 000 to 200 000 bp. The coupling of NGS with the genomic-enrichment techniques described above offers a promising approach to this technical challenge.

Recently, investigative groups led by Y.M. Dennis Lo and Stephen Quake have applied NGS to the detection of fetal chromosomal aneuploidy( 87 )( 88 ). Prior work had demonstrated that cell-free fetal nucleic acids (DNA and RNA) are present in maternal blood during pregnancy, along with maternally derived cell-free nucleic acids. Several analytical approaches that use cell-free fetal nucleic acids have been developed to determine fetal aneuploidy, including the analysis of placental mRNA derived from the chromosomes of interest (e.g., chromosome 21) and the determination of relative chromosomal dosage via digital PCR analysis of a large number of target chromosomal loci compared with reference chromosomal loci( 89 )( 90 )( 91 ). Building on the concept of relative chromosome dosage, the Lo and Quake groups have independently shown the feasibility of converting cell-free DNA from maternal blood into an Illumina library, followed by sequencing and mapping the reads to the reference human genome. Counting the number of reads that map to each chromosome allows the relative dosage of each chromosome to be ascertained. If fetal aneuploidy is present, the number of sequence reads mapping to the affected chromosome would be expected to be statistically overrepresented in the data set. This expectation was confirmed in trisomy 21 pregnancies, with additional supporting evidence obtained for trisomy 18 and 13 pregnancies. These studies open a new avenue for assessing fetal aneuploidy and provide a foundation for NGS-based analysis of cell-free DNA in both nonpathologic and pathophysiological states.

New single-molecule sequencing technologies in development may decrease sequencing time, reduce costs, and streamline sample preparation. Real-time sequencing by synthesis is being developed by VisiGen ( http://www.visigenbio.com ) and Pacific Biosciences ( http://www.pacificbiosciences.com ). VisiGen’s approach uses DNA polymerase modified with a fluorescent donor molecule. Attached to a glass slide surface, the polymerase directs strand extension from primed DNA templates. Nucleotides are modified with fluorescent acceptor molecules, and light energy is used during incorporation to invoke fluorescence resonance energy transfer between polymerase and nucleotide fluorescent moieties, the latter being in the γ-phosphate position and cleaved away during incorporation. The company envisions its platform will consist of a massively parallel array of tethered DNA polymerases that will generate 1 × 10 6 bp of sequence per second.

Pacific Biosciences performs single-molecule real-time sequencing and uses phospholinked fluorescently labeled dNTPs. DNA sequencing is performed in thousands of reaction wells 50–100 nm in diameter that are fabricated with a thin metal cladding film deposited on an optical waveguide consisting of a solid, transparent silicon dioxide substrate. Each reaction well is a nanophotonic chamber in which only the bottom third is visualized, producing a detection volume of approximately 20 zL (20 × 10 −21 L). DNA polymerase/template complexes are immobilized to the well bottoms, and 4 differently labeled dNTPs are added. As the DNA polymerase incorporates complementary nucleotides, each base is held within the detection volume for tens of milliseconds, orders of magnitude longer than the amount of time it takes for a nucleotide to diffuse in and out of the detection volume. Laser excitation enables the incorporation events in individual wells to be captured through the optical waveguide, with the fluorescent color detected reflecting the identity of the dNTP incorporated. For sequencing, Pacific Biosciences uses a modified phi29 DNA polymerase that has enhanced kinetic properties for incorporating the system’s phospholinked fluorescently labeled dNTPs. In addition, phi29 DNA polymerase is highly processive, with strand-displacement activity. By taking advantage of these properties, Pacific Biosciences has demonstrated sequencing reads exceeding 4000 bases when a circularized single-stranded DNA molecule is used as template. In this configuration, the phi29 DNA polymerase carries out multiple laps of DNA strand-displacement synthesis around the circular template. The mean DNA-synthesis rate was determined to be approximately 4 bases/s. The observed errors, including deletions, insertions, and mismatches can be addressed by developing a consensus sequence read derived from the multiple rounds of template sequencing. Further refinement of the chemistries and platform instrumentation are ongoing, with a 2010 target date for commercial launch( 92 )( 93 )( 94 ).

Farther out toward the horizon is sequencing based on monitoring the passage of DNA molecules through nanopores 2–5 nm or greater in diameter. Nanopores are being fabricated in inorganic membranes (solid-state nanopores), assembled from protein channels in lipid membranes, or configured in polymer-based nanofluidic channels. In some configurations, current is applied across nanopore membranes to drive the translocation of negatively charged DNA molecules through pores while monitoring changes in membrane electrical conductance measured in the picoampere range. NABsys ( http://www.nabsys.com ) is pursuing a combination of nanopores with sequencing by hybridization in which single-stranded DNA molecules are hybridized with a library of hexamers of known sequence. The hybridized DNA is interrogated through a nanopore, with the current changes being different in regions of hexamer hybridization. The patterns of hybridization are used to map annealing regions and determine sequence. Oxford Nanopore Technologies ( http://www.nanoporetech.com ) is developing nanopore-based sequencing that uses an α-hemolysin protein channel in reconstituted lipid bilayers. The nanopores are situated in individual array wells, and single DNA molecules are introduced into the wells and progressively digested by exonuclease. The released single-nucleotide bases enter the nanopore and alter the electrical current, creating a characteristic current change for each individual base( 95 )( 96 ). Although the technology is seemingly futuristic, considerable NHGRI funding is being directed toward a variety of nanopore technologies under development as part of the goal of achieving the $1000 genome. For further descriptions of nanopore technologies, the reader is referred to recent reviews( 97 )( 98 ).

The past few years have witnessed the emergence of NGS technologies that share a common basis, massively parallel sequencing of clonally amplified DNA molecules. In 2008, the first NGS platform based on single-molecule DNA sequencing was launched. On the horizon are real-time single-molecule DNA-sequencing technologies and approaches based on nanopores. NGS has had a substantial impact on basic genomics research in terms of scale and feasibility. Over the next several years, NGS is anticipated to transition into clinical-diagnostics use. Essential elements to make this transition successful will be the requirement of streamlining the processes, especially sample preparation, coupled with improvements in technology robustness and characterization of accuracy through validation studies. The large amounts of sequence-data output will pose a bioinformatics challenge for the clinical laboratory. In addition to data processing, the interpretation of sequencing results will require further characterization of the genomic variation present in the regions analyzed. Although considerable work lies ahead to implement NGS into clinical diagnostics, the potential applications are exciting and numerous.

Author Contributions: All authors confirmed they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Authors’ Disclosures of Potential Conflicts of Interest: No authors declared any potential conflicts of interest.

Role of Sponsor: The funding organizations played a direct role in the preparation of the manuscript and in the final approval of the manuscript.

Acknowledgments: The analysis of the CFTR gene illustrated in Fig. 5 was performed on residual, deidentified DNA under the approval of University of Utah Institutional Review Board, human subjects protocol number 7275.

Nonstandard abbreviations: NGS, next-generation sequencing; NHGRI, National Human Genome Research Institute; dNTP, deoxynucleoside triphosphate; Mb, million base pairs; Gb, billion base pairs; miRNA, microRNA.

S.A. Dames and J.D. Durtschi contributed equally to the review.

Maxam AM, Gilbert W. A new method for sequencing DNA. Proc Natl Acad Sci U S A 1977 ; 74 : 560 -564.

Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977 ; 74 : 5463 -5467.

Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 2008 ; 452 : 872 -876.

Nyren P, Pettersson B, Uhlen M. Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal Biochem 1993 ; 208 : 171 -175.

Ronaghi M, Karamohamed S, Pettersson B, Uhlen M, Nyren P. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 1996 ; 242 : 84 -89.

Ronaghi M, Uhlen M, Nyren P. A sequencing method based on real-time pyrophosphate. Science 1998 ; 281 : 363 -365.

Tawfik DS, Griffiths AD. Man-made cell-like compartments for molecular evolution. Nat Biotechnol 1998 ; 16 : 652 -656.

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005 ; 437 : 376 -380.

Pearson BM, Gaskin DJ, Segers RP, Wells JM, Nuijten PJ, van Vliet AH. The complete genome sequence of Campylobacter jejuni strain 81116 (NCTC11828). J Bacteriol 2007 ; 189 : 8402 -8403.

Huse SM, Huber JA, Morrison HG, Sogin ML, Welch DM. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol 2007 ; 8 : R143 .

Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008 ; 456 : 53 -59.

Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007 ; 318 : 420 -426.

Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008 ; 40 : 722 -729.

Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 2008 ; 36 : e105 .

Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon GJ. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods 2008 ; 5 : 679 -682.

Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, et al. A large genome center’s improvements to the Illumina sequencing system. Nat Methods 2008 ; 5 : 1005 -1010.

Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 2005 ; 309 : 1728 -1732.

Braslavsky I, Hebert B, Kartalov E, Quake SR. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A 2003 ; 100 : 3960 -3964.

Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I, et al. Single-molecule DNA sequencing of a viral genome. Science 2008 ; 320 : 106 -109.

Smith DR, Quinlan AR, Peckham HE, Makowsky K, Tao W, Woolf B, et al. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Res 2008 ; 18 : 1638 -1642.

Quinn NL, Levenkova N, Chow W, Bouffard P, Boroevich KA, Knight JR, et al. Assessing the feasibility of GS FLX Pyrosequencing for sequencing the Atlantic salmon genome. BMC Genomics 2008 ; 9 : 404 .

Satkoski JA, Malhi R, Kanthaswamy S, Tito R, Malladi V, Smith D. Pyrosequencing as a method for SNP identification in the rhesus macaque (Macaca mulatta) . BMC Genomics 2008 ; 9 : 256 .

Borneman AR, Forgan AH, Pretorius IS, Chambers PJ. Comparative genome analysis of a Saccharomyces cerevisiae wine strain. FEMS Yeast Res 2008 ; 8 : 1185 -1195.

Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature 2008 ; 456 : 60 -65.

Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 2008 ; 456 : 66 -72.

Kim PM, Lam HY, Urban AE, Korbel JO, Affourtit J, Grubert F, et al. Analysis of copy number variants and segmental duplications in the human genome: evidence for a change in the process of formation in recent evolutionary history. Genome Res 2008 ; 18 : 1865 -1874.

Chen J, Kim YC, Jung YC, Xuan Z, Dworkin G, Zhang Y, et al. Scanning the human genome at kilobase resolution. Genome Res 2008 ; 18 : 751 -762.

Yeager M, Xiao N, Hayes RB, Bouffard P, Desany B, Burdett L, et al. Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum Genet 2008 ; 124 : 161 -170.

Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008 ; 455 : 1069 -1075.

Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007 ; 4 : 903 -905.

Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007 ; 39 : 1522 -1527.

Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 2007 ; 4 : 907 -909.

Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, et al. Multiplex amplification of large sets of human exons. Nat Methods 2007 ; 4 : 931 -936.

Huber JA, Mark Welch DB, Morrison HG, Huse SM, Neal PR, Butterfield DA, Sogin ML. Microbial population structures in the deep marine biosphere. Science 2007 ; 318 : 97 -100.

Sogin ML, Morrison HG, Huber JA, Mark Welch D, Huse SM, Neal PR, et al. Microbial diversity in the deep sea and the underexplored “rare biosphere.”. Proc Natl Acad Sci U S A 2006 ; 103 : 12115 -12120.

Urich T, Lanzen A, Qi J, Huson DH, Schleper C, Schuster SC. Simultaneous assessment of soil microbial community structure and function through analysis of the meta-transcriptome. PLoS ONE 2008 ; 3 : e2527 .

Palacios G, Druce J, Du L, Tran T, Birch C, Briese T, et al. A new arenavirus in a cluster of fatal transplant-associated diseases. N Engl J Med 2008 ; 358 : 991 -998.

Keijser BJ, Zaura E, Huse SM, van der Vossen JM, Schuren FH, Montijn RC, et al. Pyrosequencing analysis of the oral microflora of healthy adults. J Dent Res 2008 ; 87 : 1016 -1020.

Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, et al. A core gut microbiome in obese and lean twins. Nature 2008 ; 457 : 480 -484.

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009 ; 10 : 57 -63.

Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008 ; 320 : 1344 -1349.

Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, et al. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 2008 ; 453 : 1239 -1243.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008 ; 5 : 621 -628.

Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR. Highly integrated single-base resolution maps of the epigenome in Arabidopsis . Cell 2008 ; 133 : 523 -536.

Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 2008 ; 5 : 613 -619.

Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008 ; 18 : 1509 -1517.

Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, et al. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques 2008 ; 45 : 81 -94.

Morin RD, O'Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, et al. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 2008 ; 18 : 610 -621.

Emrich SJ, Barbazuk WB, Li L, Schnable PS. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res 2007 ; 17 : 69 -73.

Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 2008 ; 40 : 1413 -1415.

Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, et al. Alternative isoform regulation in human tissue transcriptomes. Nature 2008 ; 456 : 470 -476.

Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, et al. Genome-wide location and function of DNA binding proteins. Science 2000 ; 290 : 2306 -2309.

Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell 2007 ; 129 : 823 -837.

Schones DE, Zhao K. Genome-wide approaches to studying chromatin modifications. Nat Rev Genet 2008 ; 9 : 179 -191.

Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, et al. A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome Res 2008 ; 18 : 1051 -1063.

Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science 2007 ; 316 : 1497 -1502.

Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 2007 ; 4 : 651 -657.

Korshunova Y, Maloney RK, Lakey N, Citek RW, Bacher B, Budiman A, et al. Massively parallel bisulphite pyrosequencing reveals the molecular complexity of breast cancer-associated cytosine-methylation patterns obtained from tissue and serum DNA. Genome Res 2008 ; 18 : 19 -29.

Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 2008 ; 452 : 215 -219.

Marguerat S, Wilhelm BT, Bahler J. Next-generation sequencing: applications beyond genomes. Biochem Soc Trans 2008 ; 36 : 1091 -1096.

Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends Genet 2008 ; 24 : 142 -149.

Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 2008 ; 9 : 431 .

Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998 ; 8 : 186 -194.

Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998 ; 8 : 175 -185.

Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, et al. Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res 2008 ; 18 : 763 -770.

Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 2008 ; 9 : 128 .

Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008 ; 18 : 1851 -1858.

Whiteford N, Haslam N, Weber G, Prugel-Bennett A, Essex JW, Roach PL, et al. An analysis of the feasibility of short read sequencing. Nucleic Acids Res 2005 ; 33 : e171 .

Warren RL, Sutton GG, Jones SJ, Holt RA. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007 ; 23 : 500 -501.

Hernandez D, Francois P, Farinelli L, Osteras M, Schrenzel J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res 2008 ; 18 : 802 -809.

Lin H, Zhang Z, Zhang MQ, Ma B, Li M. ZOOM! Zillions Of Oligos Mapped. Bioinformatics 2008 ; 24 : 2431 -2437.

Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics 2008 ; 24 : 713 -714.

Chaisson MJ, Pevzner PA. Short read fragment assembly of bacterial genomes. Genome Res 2008 ; 18 : 324 -330.

Dohm JC, Lottaz C, Borodina T, Himmelbauer H. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res 2007 ; 17 : 1697 -1706.

Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008 ; 18 : 821 -829.

Zhu QH, Spriggs A, Matthew L, Fan L, Kennedy G, Gubler F, Helliwell C. A diverse set of microRNAs and microRNA-like small RNAs in developing rice grains. Genome Res 2008 ; 18 : 1456 -1465.

Glazov EA, Cottee PA, Barris WC, Moore RJ, Dalrymple BP, Tizard ML. A microRNA catalog of the developing chicken embryo identified by a deep sequencing approach. Genome Res 2008 ; 18 : 957 -964.

Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, et al. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods 2008 ; 5 : 829 -834.

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008 ; 9 : R137 .

Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, Nielsen R, Willerslev E. The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS ONE 2007 ; 2 : e197 .

Meyer M, Stenzel U, Hofreiter M. Parallel tagged sequencing on the 454 platform. Nat Protoc 2008 ; 3 : 267 -278.

Meyer M, Stenzel U, Myles S, Prufer K, Hofreiter M. Targeted high-throughput sequencing of tagged nucleic acid samples. Nucleic Acids Res 2007 ; 35 : e97 .

Wang C, Mitsuya Y, Gharizadeh B, Ronaghi M, Shafer RW. Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res 2007 ; 17 : 1195 -1201.

Fokstuen S, Lyle R, Munoz A, Gehrig C, Lerch R, Perrot A, et al. A DNA resequencing array for pathogenic mutation detection in hypertrophic cardiomyopathy. Hum Mutat 2008 ; 29 : 879 -885.

Morita H, Rehm HL, Menesses A, McDonough B, Roberts AE, Kucherlapati R, et al. Shared genetic causes of cardiac hypertrophy in children and adults. N Engl J Med 2008 ; 358 : 1899 -1908.

Chiu RW, Chan KC, Gao Y, Lau VY, Zheng W, Leung TY, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proc Natl Acad Sci U S A 2008 ; 105 : 20458 -20463.

Fan HC, Blumenfeld YJ, Chitkara U, Hudgins L, Quake SR. Noninvasive diagnosis of fetal aneuploidy by shotgun sequencing DNA from maternal blood. Proc Natl Acad Sci U S A 2008 ; 105 : 16266 -16271.

Fan HC, Quake SR. Detection of aneuploidy with digital polymerase chain reaction. Anal Chem 2007 ; 79 : 7576 -7579.

Lo YM, Lun FM, Chan KC, Tsui NB, Chong KC, Lau TK, et al. Digital PCR for the molecular detection of fetal chromosomal aneuploidy. Proc Natl Acad Sci U S A 2007 ; 104 : 13116 -13121.

Dennis Lo YM, Chiu RW. Prenatal diagnosis: progress through plasma nucleic acids. Nat Rev Genet 2007 ; 8 : 71 -77.

Korlach J, Marks PJ, Cicero RL, Gray JJ, Murphy DL, Roitman DB, et al. Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci U S A 2008 ; 105 : 1176 -1181.

Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 2003 ; 299 : 682 -686.

Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Real-time DNA sequencing from single polymerase molecules. Science 2008 ; 323 : 133 -138.

Astier Y, Braha O, Bayley H. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 2006 ; 128 : 1705 -1710.

Wu HC, Astier Y, Maglia G, Mikhailova E, Bayley H. Protein nanopores with covalently attached molecular adapters. J Am Chem Soc 2007 ; 129 : 16142 -16148.

Gupta PK. Single-molecule DNA sequencing technologies for future genomics research. Trends Biotechnol 2008 ; 26 : 602 -611.

Rhee M, Burns MA. Nanopore sequencing technology: research trends and applications. Trends Biotechnol 2006 ; 24 : 580 -586.

Supplementary data

Email alerts, citing articles via.

Recommend to Your Librarian
Advertising and Corporate Services
Journals Career Network

Affiliations

Online ISSN 1530-8561
Print ISSN 0009-9147
Copyright © 2024 Association for Diagnostics & Laboratory Medicine
About Oxford Academic
Publish journals with us
University press partners
What we publish
New features
Open access
Institutional account management
Rights and permissions
Get help with access
Accessibility
Advertising
Media enquiries
Oxford University Press
Oxford Languages
University of Oxford

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

Copyright © 2024 Oxford University Press
Cookie settings
Cookie policy
Privacy policy
Legal notice

This Feature Is Available To Subscribers Only

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

next generation sequencing research papers

Academia.edu no longer supports Internet Explorer.

To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser .

We're Hiring!
Help Center

Next-generation Sequencing

Most Cited Papers
Most Downloaded Papers
Newest Papers
Save to Library
Biocuration Follow Following
Cancer Genetics Follow Following
Cryptic Species Follow Following
Nudibranch Follow Following
Small Molecule Inhibitor Follow Following
Coral Physiology Follow Following
RAD Seq Follow Following
Cancer biomarker Follow Following
Cancer molecular biology Follow Following
Molecular targeting therapy Follow Following

Enter the email address you signed up with and we'll email you a reset link.

Academia.edu Publishing
We're Hiring!
Help Center
Find new research papers in:
Health Sciences
Earth Sciences
Cognitive Science
Mathematics
Computer Science
Academia ©2024

ORIGINAL RESEARCH article

Next generation sequencing-aided screening, isolation, molecular identification, and antimicrobial potential for bacterial endophytes from the medicinal plant, elephantorrhiza elephantina.

1 North-West University, Mafikeng, South Africa
2 University of Venda, Thohoyandou, Limpopo, South Africa
3 University of Johannesburg, Johannesburg, Gauteng, South Africa

The final, formatted version of the article will be published soon.

Select one of your emails

You have multiple emails registered with Frontiers:

Notify me on publication

Please enter your email address:

If you already have an account, please login

You don't have a Frontiers account ? You can register here

Elephantorrhiza elephantina, a wild plant in southern Africa, is utilized in traditional medicine for various ailments, leading to its endangerment and listing on the Red List of South African Plants. To date, there have been no reports on bacterial endophytes from this plant, their classes of secondary metabolites, and potential medicinal properties. This study presents (i) taxonomic characterization of bacterial endophytes in leaf and root tissues using 16S rRNA, (ii) bacterial isolation, morphological, and phylogenetic characterization, (iii) bacterial growth, metabolite extraction, and LC-MS-based metabolite fingerprinting, and (iv) antimicrobial testing of bacterial crude extracts. Next-generation sequencing yielded 693 and 2459 DNA read counts for the rhizomes and leaves, respectively, detecting phyla including Proteobacteria,

Keywords: Elephantorriza elephantina, Next-generation sequencing, Taxonomy, bacterial endophytes, phylogeny, secondary metabolites, Antimicrobial activity

Received: 08 Feb 2024; Accepted: 06 May 2024.

Copyright: © 2024 Tlou, Ndou, Mabona, Khwathisi, Ateba, Madala and Serepa-Dlamini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Matsobane Tlou, North-West University, Mafikeng, South Africa

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

View all journals
My Account Login
Explore content
About the journal
Publish with us
Sign up for alerts
Open access
Published: 05 April 2023

Systematic comparison of tools used for m 6 A mapping from nanopore direct RNA sequencing

Zhen-Dong Zhong 1 na1 ,
Ying-Yuan Xie 1 na1 ,
Hong-Xuan Chen 1 ,
Ye-Lin Lan 1 ,
Xue-Hong Liu 1 ,
Jing-Yun Ji 1 ,
Lingmei Jin ORCID: orcid.org/0000-0003-2909-2730 2 ,
Jiekai Chen ORCID: orcid.org/0000-0001-5168-7074 2 ,
Daniel W. Mak 3 ,
Zhang Zhang 1 &
Guan-Zheng Luo ORCID: orcid.org/0000-0002-6797-9319 1

Nature Communications volume 14 , Article number: 1906 ( 2023 ) Cite this article

10k Accesses

14 Citations

21 Altmetric

Metrics details

Data processing
RNA sequencing

N6-methyladenosine (m6A) has been increasingly recognized as a new and important regulator of gene expression. To date, transcriptome-wide m6A detection primarily relies on well-established methods using next-generation sequencing (NGS) platform. However, direct RNA sequencing (DRS) using the Oxford Nanopore Technologies (ONT) platform has recently emerged as a promising alternative method to study m6A. While multiple computational tools are being developed to facilitate the direct detection of nucleotide modifications, little is known about the capabilities and limitations of these tools. Here, we systematically compare ten tools used for mapping m6A from ONT DRS data. We find that most tools present a trade-off between precision and recall, and integrating results from multiple tools greatly improve performance. Using a negative control could improve precision by subtracting certain intrinsic bias. We also observed variation in detection capabilities and quantitative information among motifs, and identified sequencing depth and m6A stoichiometry as potential factors affecting performance. Our study provides insight into the computational tools currently used for mapping m6A based on ONT DRS data and highlights the potential for further improving these tools, which may serve as the basis for future research.

Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore

GLORI for absolute quantification of transcriptome-wide m6A at single-base resolution

RNA modifications detection by comparative Nanopore direct RNA sequencing

Introduction.

Over 150 types of RNA modifications have been discovered across all domains of life 1 . The most abundant and best-characterized type of modification found in the eukaryotic mRNA is N6-methyladenosine (m6A) 2 , 3 . The discovery of the m6A demethylase fat mass- and obesity-associated protein (FTO) highlights the reversible and dynamic nature of this RNA modification, which may serve as a novel regulator in gene expression 4 . Subsequent studies have revealed that m6A is modulated by a complex network composed of “writer”, “eraser” and “reader” proteins 3 , through which it can co-transcriptionally and post-transcriptionally influence essential RNA metabolic processes, including pre-mRNA splicing and polyadenylation as well as mRNA export, decay and translation 5 . Recent functional studies have demonstrated the essential role of m6A in physiological and pathological processes in various organisms, further advancing our knowledge and understanding of RNA biology 6 , 7 .

Owing to the rapid development of m6A detection methods that have been developed on the next-generation sequencing (NGS) platform, the m6A landscape has since been documented in a number of different organisms 8 . Currently, there are two main NGS-based strategies that are employed in transcriptome-wide m6A mapping: (i) antibody-dependent methods, such as MeRIP-seq and miCLIP 9 , 10 , and (ii) antibody-independent methods, such as m6A-REF-seq/MAZTER-seq 11 , 12 and DART-seq 13 . The most widely used MeRIP-seq protocol employs antibodies to enrich RNA fragments containing m6A; however, this approach suffers from the drawback of high false-positive rates as a result of non-specific antibody binding 14 , 15 , 16 and hence, inaccurate stoichiometric estimates. Meanwhile, m6A-REF-seq/MAZTER-seq utilizes an endonuclease that explicitly cleaves unmethylated ACA motifs but leaves the (m6A)CA intact. This approach precisely identifies m6A sites with single-base resolution and accurately quantifies methylation rates, yet it is limited to a specific sequence context. For example, while the endonuclease MazF specifically recognizes ACA motifs, it only covers approximately 16-25% of m6A sites of the whole transcriptome 11 . In DART-seq, the efficiency and specificity of exogenously expressed APOBEC1-YTH fusion proteins, which are known to induce C-to-U deamination at sites adjacent to m6A methylated sites, have not yet been thoroughly characterized. Most importantly, all of these NGS-based methods are limited to the detection of aggregate RNA molecule information and are incapable in accurately quantifying the stoichiometry at the molecular level. Accordingly, a novel m6A detection method with both high resolution and quantification capability is greatly needed to overcome these limitations.

In the last decade, third-generation sequencing (TGS) technologies represented by two platforms, namely Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have emerged as promising alternatives for mapping nucleotide modifications 17 . PacBio sequencing identifies DNA modifications using the interpulse duration (IPD), an approach requiring high sequencing coverage but suffers from low discriminatory power 18 . On the other hand, the ONT sequencing platform directly sequences DNA or RNA based on the principle of monitoring shifts in electric current caused by the traversing of a single molecule across a nanopore embedded within a synthetic polymer membrane. The effect on the flow of the current through the pore by modified bases may be distinct from their unmodified counterpart, thereby allowing for the detection of nucleotide modifications based on the identified electric current differentials 19 , 20 .

A pioneering study published in 2018 reported for the first time the direct sequencing of native RNA molecules using the ONT platform 20 . Since then, at least ten computational tools have been developed to map the location of m6A methylation and to determine its stoichiometry from direct RNA sequencing (DRS) (Supplementary Fig. 1a and Supplementary Table 1 ). These tools can generally be divided into two classes based on the strategies employed to identify modified nucleotides (Fig. 1a ). The first class relies on the detection of electric current differentials. Briefly, a continuous current is subdivided into “events” and then assigned to each nucleotide in a reference sequence using one of the two widely-used algorithms, namely Nanopolish-eventalign 19 or Tombo-resquiggle 21 . For each nucleotide, current features such as the median, mean, standard deviation and dwell-time (i.e., the time a 5-mer sequence resided in the nanopore) are extracted and used as inputs for one of three different classification methods: statistical testing (e.g., used in Tombo 21 ), machine learning (e.g., used in MINES 22 , Nanom6A 23 and m6Anet 24 ) or clustering (e.g., used in Nanocompore 25 and xPore 26 ) (Fig. 1a ). The second class of tools utilizes base-calling “errors” resulting from the change in current caused by the presence of RNA modifications. Base-calling “errors” may represent mismatches, insertions, deletions or variability of base quality scores. These types of information are collected and classified using alignment results. To identify modified bases among the “errors”, Epinano 27 uses a pre-trained machine learning model, while other tools (e.g., DiffErr 28 , DRUMMER 29 and ELIGOS 30 ) perform statistical testing to compare “errors” with an internal model or control sample. Given the high false-positive rate in defining “errors”, a control sample with no (or low levels of) RNA modification is usually necessary.

a Pipeline and strategies employed by various m6A mapping tools. Information on the requirement for a negative control, minimum coverage and motif specificity for each tool is shown in brackets. “√” and “x” indicate the requirement for a negative control, while “-” indicates that a negative control is optional. “N” indicates a lack of motif limitations, while “D” and “R” indicate DRACH and RRACH, respectively. Note that MINES detects m6A in AGACT and GGACH motifs. b m6A detection and quantification by LC-MS/MS using WT and Mettl3 KO samples. c Violin plots of ONT data features for WT and KO samples, the upper and lower limits represent the 75th and 25th percentiles, respectively, while the center line represents the median; upper and lower whiskers indicate ±1.5× the interquartile range (IQR). The number of reads is indicated in parentheses. d Metagene distribution plot with motif preferences of m6A peaks identified from the MeRIP-seq data.

As ONT DRS sequencing captures both isoform and nucleotide modification information, it has been successfully applied in dissecting the role of m6A in shaping transcriptome complexity, such as through alternative splicing 29 , 31 , alternative polyadenylation 23 , circular RNA biogenesis 32 and other transcriptional events 33 . Despite a number of highly sophisticated state-of-the-art computational tools that have been developed to detect and quantify m6A, a thorough evaluation and comparison of these tools is currently lacking. To address this gap, we compared ten computational tools used for m6A mapping from ONT DRS data and quantitatively evaluated their performance based on two continuous evaluation metrics (Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves), followed by the identification of m6A under a cutoff that is associated with the best F1 score. We found that most tools presented a trade-off between precision and recall, and their performance largely improved with the integration of the results from multiple tools. We also assessed the intrinsic preference for m6A-irrelevant sites and demonstrated that introduction of a m6A-free negative control improved the performances of most tools. In addition, we found that the detection capability varied among motifs, as current differentials were less easily detected in certain sequence contexts. For the quantitative tools, a wide discrepancy was observed among their stoichiometric estimates, albeit with acceptable results in some of the motifs. We showed that this comprehensive comparison between the multiple tools serves as a guide for best practices in m6A mapping and quantification, as well as conducting of further analysis on the ONT DRS platform.

Datasets used for ONT tool comparisons

In order to compare the computational tools currently available for m6A detection (Fig. 1a , Supplementary Fig. 1a , and Supplementary Table 1 ), we conducted ONT DRS for mRNA from wild-type (WT) mouse embryonic stem cells (mESCs) and corresponding m6A-deficient samples with the major m6A methyltransferase Mettl3 knockout (KO). The mRNA derived from Mettl3 KO samples were confirmed to be virtually absent of m6A by liquid chromatography-mass spectrometry (LC-MS/MS) (Fig. 1b ). We obtained 1.31 and 1.32 million reads for the WT and KO samples, respectively, with similar median read length of 1 kb (Fig. 1c ). The average read quality scores were greater than 10 and the median identities of aligned reads were 90.7% and 92.1% for WT and KO samples, respectively (Fig. 1c ). The average quality and alignment identity of reads from WT samples were slightly lower than that from KO samples, suggesting that the presence of m6A may cause electric current variability, and thus lead to more base-calling “errors” 27 (Fig. 1c ). We also calculated the coverage of aligned reads to annotated transcripts and found that a large proportion of reads were almost full-length in both samples (Supplementary Fig. 1b ). Overall, the quality of the DRS data was sufficient for m6A detection.

To obtain validation datasets, we performed the well-characterized MeRIP-seq assay on WT and Mettl3 KO samples. Peaks identified in WT samples were distributed around stop codons with a consensus sequence RRACH (R = A/G, H = A/T/C), while peaks from KO samples were randomly distributed (Supplementary Fig. 1c ). Given the absence of m6A methylation in the KO samples, these m6A-irrelevant peaks were mainly due to the non-specific antibody binding. After excluding the m6A-irrelevant peaks, a total of 18,069 high-confidence m6A peaks were identified (Fig. 1d ). The ONT DRS platform detected 903,637 and 895,031 RRACH motifs in WT and KO samples, respectively, and more than 40% (370,000) were supported by at least five reads (Supplementary Fig. 1d ). The m6A sites identified in the same cell line by three independent NGS-based methods, i.e., m6A-REF-seq 12 , miCLIP 34 and miCLIP2 35 , were also included as validation sets. To further validate our results, two published DRS datasets derived from WT and Mettl3 KO mESCs samples were included as replicates 30 (Supplementary Table 2 ). In addition, we also included published DRS datasets from WT Arabidopsis (Col-0) and mutants defective of VIRILIZER ( vir-1 ) as validation in distant species 28 (Supplementary Table 2 ).

A comparison between the performance of ten computational tools

In order to assess the aforementioned ten tools’ capability to identify m6A sites, we evaluated their performance based on m6A sites identified using MeRIP-seq, miCLIP and miCLIP2 assays as ground truth (see Methods). In considering the limitations of each NGS-based method, we took the m6A sites identified by each NGS-based method or their intersection and union as validation sets separately. Since some tools are limited to RRACH motifs, we only retained the sites found in this specific sequence context (Supplementary Table 3 ). We then applied each computational tool to predict m6A and ranked the sites according to the confidence scores reported by each tool. As some tools require (DiffErr, DRUMMER, xPore, Nanocompore) or support (ELIGOS_diff, Epinano_delta, Tombo_com) to use negative control sample, while others (ELIGOS, Epinano, Tombo, m6Anet, MINES and Nanom6A) do not, we therefore evaluated their performances in two separate modes which are hereinafter referred to as the compare-mode and single-mode.

We first quantified their performance using two continuous evaluation metrics, the area under the ROC curve (ROC AUC) and PR curve (PR AUC). Among the tools using single-mode, m6Anet outperformed others by achieving a ROC AUC of 0.68-0.94 and PR AUC of 0.20-0.52, followed by ELIGOS, Epinano and Nanom6A, while Tombo achieved the lowest ROC AUC and PR AUC (Fig. 2a, b and Supplementary Fig. 2a ). As MINES trained its models on 12 RRACH motifs but filtered the output m6A sites except that located on four sequences (GGACT, GGACA, GGACC, AGACT), we modified its code to output all sites. Even though MINES achieved the second lowest ROC AUC, the PR AUC was relatively high (Fig. 2a, b and Supplementary Fig. 2a ). Notably, all tools performed much better on four motifs (GGACT, GGACA, GGACC, AGACT), though their ranking remained unchanged (Supplementary Fig. 3 ). In the compare-mode, xPore achieved the best ROC AUC of 0.75-0.91 and PR AUC of 0.27-0.67, which were slightly higher than those of Epinano_delta. Three error-based tools, ELIGOS_diff, DRUMMER and DiffErr, achieved almost the same performance, which may be due to the similar principle they follow (Fig. 2c, d and Supplementary Fig. 2b ). When using different validation sets, the ranking of each tool remained consistent, indicating that the relative performance of the tools is not affected by the choice of validation set (Fig. 2a–d and Supplementary Fig. 2a, b ). In addition, overall ranking was not changed though the ROC and PR AUC of all tools improved by taking the maximum of the predicted probability within each peak from MeRIP-seq results (see Methods) (Supplementary Fig. 2c ).

a Receiver operating characteristic (ROC) curve (top) and Precision Recall (PR) curve (bottom) for candidate sites detected by single-mode tools using the miCLIP data as ground truth. Area under the curve (AUC) is indicated in parentheses. b ROC AUC (top) and PR AUC (bottom) for candidate sites detected by single-mode tools using different NGS-based datasets as ground truth. c ROC curve (top) and PR curve (bottom) for candidate sites detected by compare-mode tools using the miCLIP data as ground truth. AUC is indicated in parentheses. d ROC AUC (top) and PR AUC (bottom) for candidate sites detected by compare-mode tools using different NGS-based datasets as ground truth. e Precision curve for top 20,000 m6A sites detected by single-mode tools using the miCLIP data as ground truth. f Percentage for m6A sites supported by various number of NGS-based methods. Top 2000 m6A sites detected by single-mode tools are counted. g Precision curve for top 20,000 m6A sites detected by compare-mode tools using the miCLIP data as ground truth. h Percentage for m6A sites supported by various number of NGS-based methods. Top 2000 m6A sites detected by compare-mode tools are counted. i Metagene plots (left) of the transcriptome-wide m6A distribution for top 2000 m6A sites detected by single-mode tools. The sequence motifs of these sites plotted by Seqlogo are shown on the right. j Metagene plots (left) of the transcriptome-wide m6A distribution for top 2000 m6A sites detected by compare-mode tools. The sequence motifs of these sites plotted by Seqlogo are shown on the right.

We next assessed the precision of the top 20,000 m6A sites predicted by each tool. Among the tools in single-mode, m6Anet constantly showed the highest precision, with 90% of the top 2000 sites or 80% of the top 5000 sites being supported by at least one NGS-based method (Fig. 2e, f and Supplementary Fig. 2d, e ). In the compare-mode, xPore achieved the best precision for top 4000 m6A sites, but its performance decreased sharply with the inclusion of additional sites. In contrast, Epinano_delta constantly showed the best precision for top 4000 to 20,000 m6A sites (Fig. 2g, h and Supplementary Fig. 2f, g ). When analyzing the relative location of top m6A sites, we found that those with higher precision showed greater enrichment around stop codons, a characteristic distribution of m6A (Fig. 2i, j ). This was consistent with previous findings in mammalian species 9 , 10 . The presence of the GGACU consensus sequence was more pronounced in m6A sites identified by tools with higher precision, also supporting previous observations about the distribution of m6A (Fig. 2i, j ).

To further confirm the evaluation of each tool, we repeated the aforementioned process on a published ONT DRS dataset that was derived from WT and Mettl3 KO mESCs samples 30 . We found that all single-mode tools achieved nearly identical performance to their performance using our data; tools in the compare-mode, on the other hand, generally performed worse, though their ranking remained unchanged (Supplementary Fig. 4a–d ). Notably, the published mESCs was found to be an incomplete Mettl3 KO cell line with an estimated ~40% of m6A still present 30 , 36 . This implies that a perfect negative control absent of any m6A would be crucial for these compare-mode tools. Likewise, these tools showed superior m6A detection capability in the aforementioned motifs (GGACT, GGACA, GGACC, AGACT) (Supplementary Fig. 4c, d ).

As some tools were constructed and tested on training data from mammalian species, they may have different performance in non-mammalian species in which the sequence preference and stoichiometry of m6A may vary. To further compare these tools in non-mammalian species, we evaluated their performance using published DRS data from the Arabidopsis 28 . As expected, the performance of machine-learning based tools was compromised when applied to the Arabidopsis dataset, particularly m6Anet and MINES which were trained on DRS data from human samples using miCLIP as the ground truth (Supplementary Fig. 5 ). We reasoned that these two tools are not trained extensively on AAACH motifs, which are the predominantly occurring m6A context in Arabidopsis rather than AGACT and GGACH motifs in human or mouse (Supplementary Table 4 ). On the other hand, most error-based tools performed better in Arabidopsis (Supplementary Fig. 5 ), likely benefiting from a higher coverage (Supplementary Table 3 ). Accordingly, given that the size of Arabidopsis genome is only 1/20 compared to that of the mouse genome, a higher coverage was achieved with similar sequencing reads (Supplementary Table 3 ), leading to more base-calling “errors” occurred in the presence of m6A and increasing the statistical power of error-based tools. However, as more than 10% of m6A were retained in vir-1 samples 28 , the performance of the compare-mode did not significantly improve (Supplementary Fig. 5 ). Similarly, the performance of the tools improved in four motifs (GGACT, GGACA, GGACC, AGACT), particularly for m6Anet and MINES (Supplementary Fig. 6 ). We observed exactly the same results from other replicate dataset from Arabidopsis (Supplementary Fig. 7 ).

In summary, we found that the performance of computational tools for detecting m6A sites varied significantly. Among those that only used a wild-type (WT) sample, m6Anet performed the best. In contrast, tools that used both WT and m6A-deficient (KO) samples exhibited a comparable performance, with the exception of xPore and Nanocompore, which omitted a number of m6A sites to be tested but had higher precision within the top sites. Tombo, ELIGOS and Epinano all had improved performance when using their compare-mode with a nearly m6A-free sample as negative control. In addition, machine-learning based tools may be affected by species-specific differences in m6A distribution and stoichiometry.

Intrinsic bias of ONT tools in m6A detection

To determine the optimal cut-off value for each tool, we calculated the F1 scores under continuous cut-offs and selected the value that corresponded to the summed maximum F1 score when applied to the validation sets from miCLIP and miCLIP2 (Supplementary Fig. 8 and Supplementary Table 5 ). We detected 6410 to 35,475 m6A sites from WT samples and compared them to the validation sets from miCLIP and miCLIP2 (Fig. 3a, b and Supplementary Fig. 9a, b ). Among the single-mode tools, Tombo, Nanom6A, ELIGOS and Epinano detected more m6A sites with higher recall rates, but at the cost of lower precision (Fig. 3a ). While m6Anet and MINES omitted many of the sites to be tested and lost recall rates due to the high coverage requirement or motifs limitation, they achieved the highest F1 score thanks to their high precision (Fig. 3a ). Among the tools in the compare-mode, two tools based on clustering principle achieved the highest precision; however, it was challenging to balance precision and recall rate for other tools, likely due to the failure of clustering (see Methods; Fig. 3b ). DiffErr, DRUMMER and ELIGOS_diff achieved almost the same performances (Fig. 3b ). Similar results were found when using the validation set from miCLIP2 (Supplementary Fig. 9a, b ). We also calculated precision, recall and F1 score with higher coverage requirement. Increasing the coverage from 5 to 20 resulted in a better F1 score for all tools, with most of them achieving a better precision and some achieving a better recall. However, increasing the coverage from 20 to 50 did not result in better performance for most tools (and even slightly worse for some). Notably, the ranking of tools was nearly the same regardless of different coverage requirements (Supplementary Fig. 9 ). The performance of Tombo, ELIGOS and Epinano was better in compare-mode than in single-mode, particularly in terms of precision (Fig. 3a, b ). This suggests that there might be an intrinsic preference for certain sites that is unrelated to m6A and can be avoided when using the compare-mode.

a , b Precision, recall and F1 scores for single-mode tools ( a ) and compare-mode tools ( b ) using miCLIP results as validation set. c Counts and percentages of recurrent m6A sites identified amongst samples. One-sided Hypergeometric tests were applied to calculate significance. WT&KO, both in WT and KO; WT&Random, both in WT and Random; KO|Random, the total number in KO or Random. d , e The m6A site counts and precision and F1 scores before and after calibration with the results from KO ( d ) or random ( e ) samples. miCLIP results were used as a validation set. WT-KO, precision and F1 score for sites calibrated using KO results. WT-random, precision and F1 score for sites calibrated with a randomly selected dataset.

As the Mettl3 KO cell we used were verified to be free of m6A (Fig. 1b ), we expected no (or few) m6A sites would be detected using the single-mode tools. Strikingly, a substantial number of m6A sites (from 983 to 30,207) were identified in the KO sample by most tools, with Tombo even detecting a similar number of m6A sites in KO and WT samples (Fig. 3c ). This suggests a systematic overestimation of m6A abundance by the tools. Sites detected in the KO sample did not show clear enrichment around the stop codons when compared to the background (Supplementary Fig. 9c ), thus implying a prevalence of false positives in the results. We found that the recurrence of m6A sites in WT samples was significantly higher in KO samples than in random datasets (see Methods; Fig. 3c ). For instance, 52.79% and 28.68% of the m6A sites detected in the WT samples were also detected in the KO samples by Tombo and Nanom6A, respectively, while only 9.21% and 4.52% of sites could be found in the random datasets (Fig. 3c ), thus suggesting a potential bias towards certain m6A-irrelevant sites. We inspected these sites that were detected in both WT and KO samples and found that single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels) near RRACH motifs could contribute to this bias, particularly for ELIGOS, which detects m6A by comparing “errors” to the background model and treat SNPs/indels as “errors” during m6A detection (Supplementary Fig. 9d, e ). In addition, homopolymer near the RRACH motif was also one of the causes of the bias (Supplementary Fig. 9d, e ), as they may lead to deletion “errors” signal and inaccurate events assignment. Despite these observations, a proportion of sites detected in both WT and KO samples could not be explained.

We next employed the KO samples as a negative control to test whether it could help to identify false positives and thereby refine the list of reported m6A sites. After subtracting the sites detected in the negative control, we found that the precision of most tools in the single-mode were significantly improved, especially for Tombo and Nanom6A (Fig. 3d and Supplementary Fig. 9h ). Despite decreased recall rates, the F1 scores of these tools were improved (Fig. 3d and Supplementary Fig. 9f, h ). In contrast, calibrating m6A sites using sites in random datasets mentioned above did not improve the precision, but decreased the recall rates (Fig. 3e and Supplementary Fig. 9g, i ). ELIGOS and Epinano performed better when using the compare-mode than directly subtracting the sites in negative control sample (Supplementary Fig. 9j, k ). Overall, the intrinsic bias of each tool may lead to pervasive false positives and using a negative control can improve their performance.

Detection capabilities varied among motifs

Most tools train separate machine learning models or build unique reference models to detect m6A sites with different sequence contexts (or motifs). Notably, m6A sites in AGACT/GGACH motifs tend to have better evaluation metrics (Supplementary Fig. 3 ). We hypothesized that detection capabilities may vary among motifs. We found that the recall rates for certain motifs, such as GGACT, GGACA, GGACC and AGACT, were significantly higher than that of other motifs. On the other hand, some motifs such as AAACC, AAACA and GAACA showed extremely low recall rates (Fig. 4a ). In addition to the highest recall rates observed in the four motifs (GGACT, GGACA, GGACC and AGACT), they also consistently showed the highest precision (Fig. 4b ), suggesting much better detection capabilities of most tools for these motifs. This finding was consistent for both current-based and error-based tools (Supplementary Table 6 ), implying a difficulty in differentiating or detecting m6A sites within some 5-mer motifs. Similar results were observed when using the miCLIP2 result as a validation set (Supplementary Fig. 10a, b and Supplementary Table 6 ).

a , b Relative recall rates ( a ) and precision ( b ) for 12 motifs bearing the consensus sequence RRACH. Relative recall rates/precision were calculated as the ratio of the recall/precision for the sites in a specific motif to the overall mean (across all sites). A m6A profile created from miCLIP results was used for validation. c Current intensity of modified and unmodified sites in GGACA or GAACA motifs. The m6A candidate sites (methylation rate of WT–KO > 0.8) were identified using m6A-REF-seq and marked as “Mod” (modified, sites in WT sample) or “UNM” (unmodified, sites in KO sample). This dataset was also used in d , e . For boxplots in c , d , the upper and lower limits represent the 75th and 25th percentiles, respectively, while the center line represents the median; upper and lower whiskers indicate ±1.5× the interquartile range (IQR). The number of reads and events aligned to the sites are indicated in parentheses. d A comparison of mismatch and deletion frequencies for modified and unmodified sites in GGACA and GAACA motifs. The number of sites is indicated in parentheses. e IGV snapshots of two specific sites in different motifs (GGACA and GAACA). Modification status (modified or unmodified) was determined by m6A-REF-seq. Sites with allele frequency ≥0.1 are indicated with corresponding colors. Coverage for each site is indicated on the left of each short sequence.

Given these findings, we suspected that electric current differentials between the modified and unmodified bases may be modest for some motifs, resulting in difficulties of distinguishing m6A from the Adenosine base (A). To test this hypothesis, we inspected 239 GGACA and 59 GAACA sites identified by m6A-REF-seq that were almost fully methylated in WT samples but non-methylated in KO samples (methylation rate of WT - KO > 0.8). Next, we analyzed ONT reads covering these m6A sites (for both WT and KO samples) to examine discrepancies in currents and base-calling “errors” profiles. Notably, current intensity differed greatly between modified and unmodified GGACA motifs but not GAACA motifs (Fig. 4c ). In addition, modified reads with GGACA motif resulted in more base-calling “errors” compared to unmodified reads, especially for deletions and mismatches; in contrast, modified reads with GAACA motif showed little effect on the frequency of base-calling “errors” (Fig. 4d, e and Supplementary Fig. 10c ). Nevertheless, the qualities of modified reads with either GGACA or GAACA motif were significantly lower than that of unmodified reads (Supplementary Fig. 10c ). We concluded that current differentials between m6A and A were relatively small at some sequence contexts, leading to poor performance of most tools encountering such motifs.

Performance of m6A quantification

Among the ten computational tools evaluated, three of the current-based tools (Tombo/Tombo_com, Nanom6A and xPore) are able to quantify methylation rates. Tombo/Tombo_com and Nanom6A detect m6A signal of single reads and thus allow for quantification of m6A at each genomic location covered by multiple reads, while xPore estimates the methylation rate by modeling a mixture of two Gaussian distributions corresponding to unmodified and modified reads. To evaluate the ability of each tool to quantify m6A sites, we compared their results to those obtained from NGS-based methods. We first calculated the enrichment score of each site using MeRIP-seq data, then divided the sites into seven categories according to their enrichment scores. When comparing the methylation rates for sites belonging to different categories, we found that sites with high enrichment scores also showed higher methylation rates (Fig. 5a ). For all three tools, methylation rates were positively correlated with the enrichment score, and xPore showed the highest correlation (Fig. 5a ). To further assess their performance quantitatively, we utilized the validation set from m6A-REF-seq results containing quantitative information for individual m6A sites in four motifs. These three tools showed largely different correlations with the validation set, among which Tombo and xPore achieved the lowest and the highest correlations, respectively (Pearson’ r = 0.21 for Tombo and Pearson’ r = 0.57 for xPore) (Fig. 5b ). Correspondingly, the results of different tools were poorly correlated, with the lowest correlation between Tombo and Nanom6A (Pearson’ r = 0.27), and the highest correlation between Nanom6A and xPore (Pearson’ r = 0.65) (Fig. 5b ). In addition, Tombo_com always showed a higher correlation with other methods than Tombo.

a The distribution of methylation rates determined by each of the three tools. m6A sites are classified into categories according to enrichment scores derived from MeRIP-seq. Violin plots show the distribution of detected methylation rates. In the boxplots, the lower and upper limits represent the 25th and 75th percentiles, respectively; the center line represents the median; and upper and lower whiskers indicate ±1.5×IQR. The number of sites occurring in each peak category are indicated in parentheses. b Correlations of methylation rates derived from m6A-REF-seq and those from each of the three tools. Pearson correlation coefficients and the number of sites for each paired comparison are indicated at the top-right, with 2D-density plots of the methylation rates at the lower-left. The four density plots on the diagonal show the distribution of methylation rates in the m6A-REF-seq results and those of each tool. c Correlation coefficients of methylation rates derived from Tombo and that from Nanom6A. m6A sites in different motifs are separately calculated. d Correlation coefficients of methylation rates derived from paired tools. m6A sites with different sequencing coverage are separately calculated.

Given that previous reports on variable detection capabilities among different motifs, we investigated whether the performance of m6A quantification also varied among different sequence contexts. Despite the low overall consistency among the three tested tools, we found that they consistently showed higher correlations in specific motifs (Fig. 5c and Supplementary Fig. 11a–f ). For instance, when comparing Tombo and Nanom6A, Pearson correlation coefficients for m6A sites within the GGACT and AGACT motifs increased to 0.7, a value significantly higher than that for all sites (0.27) (Fig. 5c ). Similar results were observed for the other five pairs of comparisons (Supplementary Fig. 11a–e ). Furthermore, increasing the coverage did not improve the quantification performance (Fig. 5d and Supplementary Fig. 11g ).

Detection capability is influenced by sequence coverage and m6A stoichiometry

In order to test whether the detection capability is affected by sequencing coverage, we compared the recall and precision across different coverage intervals for each tool. Three error-based tools (ELIGOS/ELIGOS_diff, DRUMMER and DiffErr) and Nanocompore were susceptible to low coverage, with their recall rates increasing dramatically as coverage increased (Fig. 6a ), which explains the superior performance of these tools in Arabidopsis with high sequencing depth (Supplementary Fig. 5 ). xPore and Tombo_com showed slightly lower recall rates at higher coverage interval, while other tools remained stable (Fig. 6a ). In terms of precision, Tombo_com and Epinano_delta benefited most from the high coverage as did the performance of Epinano, Nanom6A and MINES, whereas other tools were less affected by the coverage (Fig. 6b ). These results were consistent when using data from miCLIP2 as a positive control (Supplementary Fig. 12 and Supplementary Table 7 ). In general, the detection capabilities of different tools varied in their response to coverage (Fig. 6c and Supplementary Table 7 ).

a , b Relative recall rates ( a ) and precision ( b ) for m6A sites under different coverage. Relative recall rates/precision were calculated as the ratio of the recall rate/precision for m6A sites with a given coverage to the overall mean (across all sites). The validation set was derived from the miCLIP results. The same validation set is also used in c . c Recall rates and precision of ELIGOS and Tombo_com for m6A detection under different sequencing coverage. d – f Relative recall rates for m6A sites with different methylation rate/site enrichment. Relative recall rates were calculated as the ratio of the recall rate for m6A sites with a given methylation rate/site enrichment to the overall mean (across all sites). The validation sets were derived from the m6A-REF-seq ( d ), miCLIP ( e ) and miCLIP2 ( f ) results.

As most m6A sites were partially methylated, we wondered whether the algorithms employed by different tools were also effective in detecting m6A sites with low methylation level. To test this, we divided m6A sites that were detected by m6A-REF-seq into five categories according to the methylation rate and compared the recall rate between different categories for each tool. As expected, it is difficult for all the tools to detect sites with low m6A levels (0-0.2); in contrast, the recall rate is largely improved for the sites with high m6A levels (0.8–1) (Fig. 6d and Supplementary Table 8 ). To further investigate all RRACH motifs, we divided the m6A sites detected by miCLIP/miCLIP2 into six categories according to the site enrichment score quantified by MeRIP-seq. Similar results were observed for most tools, with some exception when using data from the miCLIP as positive control (Fig. 6e, f and Supplementary Table 8 ). In conclusion, m6A sites with higher methylation rates were easier to detect than those with lower methylation rates.

Integrated analyses of multiple tools

As presented in the above, it can be difficult to balance precision and recall rates for most individual tools, even under their optimal cut-off settings. For example, Epinano_delta achieved the highest recall rate (0.37) but with a very low precision (0.22), whereas m6Anet achieved the best precision (0.42) at the cost of a low recall rate (0.19) (Fig. 3a, b ). To test whether a combination of multiple tools can improve performance, we intersected the top 5000 m6A sites reported by the different tools. Though the results of ELIGIS_diff, DRUMMER and DiffErr overlapped well with each other (0.6–0.8), wherein the overlap ratio was less than 0.5 for most pairs of other tools (Fig. 7a ). A low overlap ratio may indicate many false-positives as reported by most tools; alternatively, different tools may capture different sets of authentic m6A sites and the integration of these results could improve the prediction performance. After merging the results reported by ten tools, we identified a total of 112,885 m6A sites. However, 54.98% (62,060) of the m6A sites were supported by only one tool, and only 6.66% (7523) of m6A sites were supported by more than half of the tools (Fig. 7b ). Strikingly, only 251 (0.22%) of the m6A sites were identified by all of the tools (Fig. 7b ). We then compared the precision, recall and F1 scores of the integrated predictions to that of the same number of top m6A sites from each individual tool and found that the integrated predictions achieved a better performance in terms of precision, recall rate and F1 score (Fig. 7c ). Moreover, using m6A sites supported by more than 4 or 5 tools achieved the highest F1 score (Fig. 7c ), thus highlighting the effectiveness of integrating the predictions from multiple tools.

a Recurrent sites ratio among top 5000 m6A sites detected by ONT tools. Heatmap shows the percentage of sites simultaneously detected by each pair of tools. b Percentage of sites identified by multiple tools. c Precision, recall rate and F1 scores of sites supported by more than N tools versus identical number of top sites from individual tool. The validation set was derived from the miCLIP results. d Merged sites detected across all tools and percentage of these sites found in the miCLIP/miCLIP2 results (left). Non-recalled sites were classified according to coverage (right). e Recall rates for m6A sites located in different motifs, using the merged sites detected across all tools after removing intergenic and low-coverage m6A sites. f Metagene plots of the transcriptome-wide distribution of m6A sites that could not be detected by NGS-based methods but supported by one or more ONT tools.

Considering that TGS-based methods or NGS-based methods for m6A profiling may have their own superiorities and limitations, we compared the m6A sites detected by TGS (obtained by merging the m6A sites from ten ONT tools) to those detected by NGS (obtained by merging the m6A sites from miCLIP and miCLIP2). Even though the number of m6A sites from all ONT tools totaled 112,885, only 51.20% (19,239/37,577) of these sites that were detected by the NGS methods could be recalled (Fig. 7d ). We further investigated the m6A sites detected by the NGS methods but were not recalled by any of the ONT tools and found that 15.51% were located in intergenic regions and 40.51% were covered by less than five ONT reads (Fig. 7d ). After removing the intergenic and low-coverage sites, we re-analyzed the non-recall rates for sites with various sequence motifs, and found that m6A sites within the GGACT motifs exhibited a recall rate of more than 0.9 while AAACC and AAACA motifs exhibited recall rates of 0.3 (Fig. 7e ). Since only a small proportion (19,239/112,885) of merged sites were supported by NGS methods, we investigated the characteristics of the remaining sites. These sites showed a clearer enrichment around the stop codons when supported by more ONT tools (Fig. 7f ), while the overlap ratio of the m6A sites that were detected in mESCs KO samples decreased (Supplementary Fig. 13f ). This implies that the NGS method failed to detect some of authentic m6A sites that were captured by the ONT tools.

With the development of NGS-based m6A detection methods, our understanding of the biological role of m6A has advanced considerably. However, such methods have inherent limitations. Recently, ONT DRS has emerged as a promising alternative method of detecting RNA modifications. To date, approximately ten computational tools have been developed. With the goal of comprehensively evaluating the performance of these tools, we conducted a comparative analysis of each tool’s detection capabilities, thereby providing a guide for future studies using ONT DRS. Our analysis revealed a variable performance when evaluated using the ROC and PR curves as continuous metrics, with which m6Anet and Epinano_delta performed the best in the single-mode tools and compare-mode tools, respectively. Moreover, a similar ranking was observed when using different validation datasets or the union and intersection sets. We then applied these tools to the detected m6A sites under a cut-off which correlated to the highest F1 score and found that it was still challenging to balance precision and recall rates for some tools. For example, m6Anet, xPore and Nanocompore achieved the highest precision at the cost of lowest recall, while Tombo and Tombo_com achieved the opposite. Nevertheless, integrated predictions from all tools achieved better performance than individual tools, especially when using m6A sites which were supported by multiple tools (>4). As for quantitation, methylation rates reported by Tombo/Tombo_com, Nanom6A and xPore showed positive correlation to the enrichment score determined by MeRIP-seq, but were poorly agreed across ONT tools or with the methylation rates quantified by m6A-REF-seq. Taken together, there is still room for improvement in the algorithms for the identification and quantification of RNA modifications from ONT DRS datasets.

There are two main challenges that need to be addressed. The first challenge relates to the variance in electric current intensity observed during repeated sequencing of the same 5-mer motif, which is determined by the capability of ONT platform. Given that the presence of m6A corresponds to small differences in current intensity, large amounts of background variation could render A and m6A indistinguishable 37 . The second challenge relates to the data used to train machine learning models 38 . One common method to generate positive samples involves in vitro transcription with dm6ATP instead of dATP; however, the distribution of m6A on this artificial RNA may differ greatly from that found on a natural RNA. In addition, models trained on the DRS data containing organism-specific site information would have an inherent sequence bias. We believe synthetic positive samples with m6A sites located in predetermined locations with various motifs would be more appropriate 39 .

We observed varying degrees of intrinsic bias towards some specific sites in various tools. Benefited from the use of nearly m6A-free negative control, compared-mode tools generally performed better than single-mode tools. Analogously, introducing a negative control (mESC Mettl3 KO samples) as an extra calibrator effectively improved the precision of single-mode tools, although negative control sample with residual m6A confounded the analysis. However, it is relatively complicated to knockout Mettl3 in most cell types 36 , 40 , 41 with some m6A sites having been shown to be Mettl3 -independent 42 , thereby making it virtually impossible to generate an ideal negative control from a natural sample. Recently, we used a transcriptome-wide modification-free RNA library synthesized through in vitro transcription (IVT) to systematically calibrate epitranscriptomic maps profiled by NGS based methods 16 . In the detection of RNA modifications via ONT DRS, this IVT RNA library holds great promise to serve as a gold standard negative control in the future.

Unexpectedly, all tools displayed very divergent recall rates and precision in different motifs. After excluding poorly sequenced reads and intronic or intergenic m6A sites, approximately 70% of the m6A sites detected by miCLIP/miCLIP2 were recalled from the merged m6A sites. High recall rates of up to 90% were seen in GGACT motifs, whereas lower rates of 20% were observed in AAACC motifs. Minor current differentials between modified and unmodified A’s in some motifs may induce false-negative results. To overcome this bottleneck, chemical decorations would assist in making m6A more distinguishable from unmodified A when traversing the nanopore. Similar methods have been developed to detect inosine using the nanopore sequencing 43 .

Here we represent the first comprehensive comparison of computational tools focused on detecting m6A in the ONT DRS dataset. Rational oligo models and well-established BS-seq data have already been applied to the benchmarking of 5mC detection by nanopore sequencing 44 , 45 . However, there is no gold-standard method for m6A detection with both single-base resolution and stoichiometric information on the NGS platform. Therefore, we conducted MeRIP-seq on the same samples used in the ONT sequencing and employed published m6A sets as identified by miCLIP, miCLIP2 and m6A-REF-seq. Though the value of evaluation metrics (ROC AUC, PR AUC, precision, recall, F1 score) given in this study may not reflect the performance of these tools completely and may be underestimated due to the low sequencing depth of DRS data, the comparison between tools would be informative. Moreover, most of our findings still hold when using different validation sets. In addition, we further observed that a proportion of m6A sites which were consistently captured by ONT tools were likely to be authentic, but were not detected by previous NGS-based methods. In conclusion, we provide a comprehensive comparison of computational tools commonly used in m6A mapping depending on the ONT DRS platform. Comparative information on the detection capability, intrinsic bias and motif preferences between the different tools provide valuable insights for the development of new detection strategies and algorithms in future.

Sample collection

The WT and Mettl3 KO mESCs were validated in a previous study 46 . Briefly, exons 5–7 of Mettl3 were replaced with puromycin and hygromycin resistance genes using CRISPR–Cas9 system. After selecting with 1 μg/ml puromycin (Gibco, A1113802) and 200 μg/ml hygromycin B (Sigma, V900372) for one week, diallelic KO colonies were cultured and verified by western blotting (see results in Liu et al. 46 ). After extracting total RNA from the samples using TRIzol (Thermo Fisher Scientific, 15596026), mRNA was isolated via a Dynabeads mRNA Purification Kit (Thermo Fisher Scientific, 61006) with genomic DNA removal using TuRBO DNase kit (Thermo Fisher, AM2238).

Nanopore direct RNA sequencing and data preprocessing

Nanopore direct RNA sequencing was conducted following instructions provided by Oxford Nanopore Technologies (Oxford, UK) using DRS kits (SQK-RNA002) and MinION flowcells (FLO-MIN106D, R9.4.1 pore). After live basecalling using Guppy (v4.0.11) in MinKNOW, reads that passed the quality threshold (7) were then subjected to post-run basecalling with the latest version of Guppy (v5.0.7) under default parameters. A reference transcriptome file was generated from the mouse genome reference and corresponding annotation (GENCODE.VM18) using convert2bed (v2.4.38) and bedtools (v2.26.0). The resulting FASTQ files were aligned to the mouse reference transcriptome using Minimap2 (v2.17) with parameter settings “-ax map-ont --MD”. Samtools (v1.6) was used to filter secondary and supplementary alignments and convert aligned reads to the bam format. Public datasets were preprocessed in the same way except that Arabidopsis samples were aligned to the reference transcriptome generated from TAIR10 reference genome with Araport11 reference annotation.

NGS-based validation sets

The raw MeRIP-seq data was downloaded from Zhang et al. 16 (GSE151028). After quality control, clean reads were aligned to the mouse genome reference file (GRCm38/mm10) with HISAT2 (v2.1.0) for peak calling using MACS2 (v2.1.2) (callpeak --keep-dup all -g mm --nomodel --extsize 200 -q 0.0001 --fe-cut-off 2). High confidence peaks were obtained by first merging peaks from two WT replicates and then subtracting peaks identified in both KO replicates. Only peaks containing RRACH motif were retained since most ONT tools were limited to this motif. Sites detected by ONT tools would be considered as true positives when they were found to be located in the peaks from MeRIP-seq. When using public miCLIP 34 , miCLIP2 35 and m6A-REF-seq 12 datasets, m6A sites only in the RRACH motifs were kept. m6A profiles derived from miCLIP and miCLIP2 results were two main validation sets for all analysis. MeRIP-seq results were also used to compare tools with continuous evaluation metrics. When using MeRIP-seq data as evaluation set, we roughly considered all sites falling in the peaks as true positives. m6A-REF-seq results provided reference modification rate in the quantification part. As for the validation sets for Arabidopsis , we also filtered sites not containing RRACH motifs from the public MeRIP-seq 47 and miCLIP 28 datasets.

m6A detection with multiple tools

We ran the tools with parameter settings described as follows, and the detailed command and code are available at https://github.com/zhongzhd/ont_m6a_detection 48 .

Tombo (v1.5.1) needs to re-squiggle reads to assign raw current signals to each base before further processing can be carried out. Tombo now provides two methods for m6A detection: (1) a de novo non-canonical base method with the command “detect_modifications de_novo” (represented by Tombo); and (2) a canonical sample comparison method with the command “detect_modifications model_sample_compare” (represented by Tombo_com). The first approach was used for routine analysis of single samples, while the second approach was used to combine analyses for negative control samples. Several statistics were obtained from Tombo, including the coverage and estimated fraction-modified values, and the latter were treated as scores for ranking.

MINES uses coverage and fraction-modified values calculated by Tombo as input to its random forest models (only for AGACT and GGACH motifs) and only reports significant sites. Therefore, we modified the code to output the modification probability of all sites in all RRACH motifs. The modification probabilities of sites were treated as scores for ranking. The modified version is available at https://github.com/zhongzhd/ont_m6a_detection 48 .

Nanom6A (v2.0) extracts the median, mean, standard deviation and dwell time from normalized raw signals for each read after re-squiggling with Tombo. The above features are then used as input for its XGBoost model which assigns a modification probability for each read. Reads with a modification probability >0.5 were considered modified, and the modified fraction of sites were treated as scores for ranking.

m6Anet (v1.0) requires the results generated from running the eventalign module in Nanopolish (0.13.2), which assigns raw current signal to each base (like Tombo). The results are used as input to the m6Anet’s neural-network based Multiple Instance Learning model which produces probability-modified values per site. The probability-modified values corresponding to each site were treated as scores for ranking.

Nanocompore (v1.0.0) collapses the results from the Nanopolish eventalign module to generate a file containing the median intensity and dwell time; it also makes pairwise comparisons to the collapsed results from a control sample. The Gaussian mixture model (GMM) logit p values from statistical test results of sites were treated as scores for ranking. Note that this tool outputs many sites with “nan” value due to failure in clustering, so sites with the “nan” values were assigned a p value of 1.

xPore (v2.0) also uses the results from the Nanopolish eventalign module for all samples to make a configuration file, and applies the multi-sample two-Gaussian mixture model to obtain the estimated modification rate for each sample, a test statistic ( z score) and p value on differential modification rates for each pairwise condition. The z scores from statistical test results were treated as scores for ranking (as recommended by the authors). Similar to Nanocompore, this tool omitted many sites in its result due to failure in clustering, so we assigned a z score of 0 to these sites. Sites with modified distributions of current assigned in the opposite direction, were assigned a z score of 0 according to xPore’s methods. In addition, sites with higher methylation rate in KO than WT were also assigned a z score of 0.

DiffErr (v0.2) uses alignment files with a control as input. The −log10 p values from statistical test results of sites were treated as scores for ranking, and sites filtered internally or with odds ratio (log) ≤0 were assigned a −log10 p value of 0.

DRUMMER also uses alignment files for the sample and a control as input. The p values (odds ratio) from statistical test results of sites were treated as scores for ranking, and sites filtered internally or with odds ratio ≤1 were assigned a p value of 1.

ELIGOS (v2.0.1) offers two methods for the detection of modified nucleotides: (1) the identification of RNA modifications compared to a rBEM5 + 2 model using “rna_mod” (represented by ELIGOS); and (2) the identification of RNA modifications compared to control conditions using “pair_diff_mod” (represented by ELIGOS_diff). The first method was used for routine analyses of single samples, while the second was implemented for combined analyses including negative control samples. The p values (odds ratio) from statistical test results of sites were treated as scores for ranking, and sites filtered internally or with odds ratio ≤1 were assigned a p value of 1.

Epinano (v1.2.0) offers two kinds of Support Vector Machine models for m6A detection. These models have different features: (1) the mean quality, and mismatch, insertion and deletion frequency of each base (represented by Epinano); and (2) the difference in mean quality, and in the mismatch, insertion and deletion frequency between the sample and a control for each base (represented by Epinano_delta). The first method was used by default and the second when comparing to a negative control. The probability-modified from predicted results were treated as scores for ranking.

Tools comparison using ROC and PR curves

Sites in RRACH motif with more than 5 reads (except for m6Anet that required at least 20 reads) were tested by each tool and ranked according to the significant score as described above. We then converted the transcriptome coordinates to the genomic coordinates to match with the validation sets. Sites derived from multiple transcriptomic coordinates but aligned to the same genomic coordinate were merged as follows: for Tombo/Tombo_com, MINES, Nanom6A, m6Anet and Epinano/Epinano_delta, the scores (fraction-modified/probability-modified) were weighted averaging according coverage; for Nanocompore, xPore, DiffErr, DRUMMER and ELIGOS, the best scores ( p value/ z score) were selected as representative. To objectively compare the tools, we only kept the sites that were covered by all methods and with a coverage ≥20 in the final dataset for evaluation using the ROC and PR curve. MeRIP-seq, miCLIP and miCLIP2 results mentioned above were applied as ground truth, and we also used the validation sets in the form of union and intersection of all the NGS-based methods. As for Arabidopsis , MeRIPseq and miCLIP results and their union and intersection were used as the validation sets.

m6A determination with optimal cut-off

To determine the optimal cut-off for m6A detection, we varied the cut-offs and calculated F1 scores of sites in the RRACH motif covered by at least 5 reads using the MeRIP-seq, miCLIP and miCLIP2 results as validation sets separately (note that m6Anet requires at least 20 reads and MINES only output results in 4 motifs).

As the distribution of F1 scores were inconsistent between the validation set from MeRIP-seq and miCLIP/miCLIP2, we selected the cut-off corresponded to the maximum F1 score when adding up that of miCLIP and miCLIP2 for following m6A detection process. Validation sets of miCLIP and miCLIP2 were used for the following precision and recall calculation. Note that in evaluating the performance of these m6A detection tools, we only considered sites detected by NGS methods (MeRIP-seq, miCLIP and miCLIP2) that were covered by minimal DRS reads (5, 20, 50 DRS reads requirement). When setting the requirement at 5 DRS reads, we also took into consideration sites omitted by m6Anet (<20 reads) and MINES (not in four specific motif) and treated them with the lowest score for recall calculation. m6A sites detected under 5 DRS reads requirement were used for further analysis.

Intrinsic bias assessment

Sites in the RRACH motif with minimum 5 reads (the pool) in Mettl3 KO samples were tested by each tool and determined as m6A using the optimal cut-off described above. We then randomly sampled an identical number of sites (== the number of m6A sites detected in KO samples) from the pool as the random dataset for each tool. To test whether these tools prefer some irrelevant m6A sites (intrinsic bias), we compared two intersection “WT&KO” (m6A sites detected in WT and KO samples) and “WT&Random” (m6A sites detected in WT samples and existed in random dataset). If there were certain intrinsic bias, the WT&KO intersection should be greater than WT&Random intersection; we applied one-side Hypergeometric tests to calculate the significance. To determine for the cause of the intrinsic bias, the input data of MeRIP-seq were used to call SNPs/indels compared to mm10 reference using bcftools with the key parameter “QUAL > 100”, and SNPs/indels were kept when supported by more than 2 of 4 samples (WT and KO with replicates). In addition, the flanking sequence (upstream and downstream 10 bp) of all candidate sites were extracted and inspected for homopolymer. We then compared the ratio of sites near SNPs/indels (within 5 bp) or homopolymer between the “WT&KO” intersection and sites in the RRACH motif with more than 5 reads (background) and applied one-sided Binomial tests to calculate significance.

Comparisons of current intensities and base-calling “errors” between modified and unmodified 5-mer

Nanopolish eventalign was used to assign raw current to each base in the reference genome. The results of sam2tsv pileup (v0558422) were used to calculate the frequency of mismatch, deletion and insertion and base quality. Highly-methylated m6A sites in GGACA and GAACA motifs with a methylation rate differences >0.8 between WT and KO samples as determined by m6A-REF-seq, were selected. The current intensities and base-calling “errors” of these sites extracted from WT samples were treated as current intensities and base-calling “errors” of modified 5-mer, while these extracted from KO samples were treated as current intensities and base-calling “errors” of unmodified 5-mer. We compared the current intensities and base-calling “errors” between modified and unmodified 5-mer for both GGACA and GAACA motifs.

m6A quantification

Tombo/Tombo_com and Nanom6A detect m6A at read level and support methylation rates calculation. Sites in the RRACH motif with more than 5 reads were calculated. xPore outputs methylation rates directly in its result and tests the difference of methylation rates between WT and KO samples, so we used the methylation rates of WT samples for all sites in the RRACH motif with more than 5 reads except for sites which failed to cluster. Methylation rates from the m6A-REF-seq results were used for comparison. We also calculated the enrichment score for these sites using the MeRIP-seq data (IP FPKM/Input FPKM) with featureCounts (v2.0.1) for comparison.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All data generated for this paper have been deposited in the NCBI Gene Expression Omnibus (GEO) database under accession number GSE195618 . Published DRS datasets derived from WT and Mettl3 KO mESC samples are obtained from the work of Jenjaroenpun et al. 30 and are available through NCBI Sequence Read Archive (SRA) database under accession number SRP166020 . Published Arabidopsis DRS datasets along with their KO variants are form Parker et al. 28 and are available through European Nucleotide Archive (ENA) database under accession number PRJEB32782 .

Code availability

The analysis pipeline and all custom scripts used for this paper are available at https://github.com/zhongzhd/ont_m6a_detection 48 .

Boccaletto, P. et al. MODOMICS: a database of RNA modification pathways. 2021 update. Nucleic Acids Res. 50 , D231–D235 (2022).

Article CAS PubMed Google Scholar

Roundtree, I. A., Evans, M. E., Pan, T. & He, C. Dynamic RNA modifications in gene expression regulation. Cell 169 , 1187–1200 (2017).

Article CAS PubMed PubMed Central Google Scholar

Zaccara, S., Ries, R. J. & Jaffrey, S. R. Reading, writing and erasing mRNA methylation. Nat Rev Mol Cell Biol. 20 , 608–624 (2019).

Jia, G. et al. N6-methyladenosine in nuclear RNA is a major substrate of the obesity-associated FTO. Nat. Chem. Biol. 7 , 885–887 (2011).

Yang, Y., Hsu, P. J., Chen, Y. S. & Yang, Y. G. Dynamic transcriptomic m(6)A decoration: writers, erasers, readers and functions in RNA metabolism. Cell Res. 28 , 616–624 (2018).

Yu, Q. et al. RNA demethylation increases the yield and biomass of rice and potato plants in field trials. Nat. Biotechnol. 39 , 1581–1588 (2021).

Jiang, X. et al. The role of m6A modification in the biological functions and diseases. Signal Transduct. Target. Ther. 6 , 74 (2021).

Tang, Y. et al. m6A-Atlas: a comprehensive knowledgebase for unraveling the N6-methyladenosine (m6A) epitranscriptome. Nucleic Acids Res. 49 , D134–D143 (2021).

Meyer, K. D. et al. Comprehensive analysis of mRNA methylation reveals enrichment in 3′ UTRs and near stop codons. Cell 149 , 1635–1646 (2012).

Dominissini, D. et al. Topology of the human and mouse m6A RNA methylomes revealed by m6A-seq. Nature 485 , 201–206 (2012).

Article ADS CAS PubMed Google Scholar

Garcia-Campos, M. A. et al. Deciphering the “m(6)A Code” via antibody-independent quantitative profiling. Cell. 178 , 731–747 (2019).

Zhang, Z. et al. Single-base mapping of m(6)A by an antibody-independent method. Sci. Adv. 5 , x250 (2019).

Article ADS Google Scholar

Meyer, K. D. DART-seq: an antibody-free method for global m(6)A detection. Nat. Methods 16 , 1275–1280 (2019).

Helm, M., Lyko, F. & Motorin, Y. Limited antibody specificity compromises epitranscriptomic analyses. Nat. Commun. 10 , 5669 (2019).

Article ADS PubMed PubMed Central Google Scholar

McIntyre, A. et al. Limits in the detection of m(6)A changes using MeRIP/m(6)A-seq. Sci. Rep. 10 , 6590 (2020).

Article ADS CAS PubMed PubMed Central Google Scholar

Zhang, Z. et al. Systematic calibration of epitranscriptomic maps using a synthetic modification-free RNA library. Nat. Methods 18 , 1213–1222 (2021).

Amarasinghe, S. L. et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21 , 30 (2020).

Article PubMed PubMed Central Google Scholar

Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 7 , 461–465 (2010).

Simpson, J. T. et al. Detecting DNA cytosine methylation using nanopore sequencing. Nat. Methods 14 , 407–410 (2017).

Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15 , 201–206 (2018).

Stoiber, M. et al. De novo identification of DNA modifications enabled by genome-guided nanopore signal processing. Preprint at bioRxiv https://doi.org/10.1101/094672, (2017).

Lorenz, D. A., Sathe, S., Einstein, J. M. & Yeo, G. W. Direct RNA sequencing enables m(6)A detection in endogenous transcript isoforms at base-specific resolution. RNA 26 , 19–28 (2020).

Gao, Y. et al. Quantitative profiling of N(6)-methyladenosine at single-base resolution in stem-differentiating xylem of Populus trichocarpa using Nanopore direct RNA sequencing. Genome Biol. 22 , 22 (2021).

Hendra, C. et al. Detection of m6A from direct RNA sequencing using a multiple instance learning framework. Nat. Methods 19 , 1590–1598 (2022).

Leger, A. et al. RNA modifications detection by comparative Nanopore direct RNA sequencing. Nat. Commun. 12 , 7198 (2021).

Pratanwanich, P. N. et al. Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. Nat. Biotechnol. 39 , 1394–1402 (2021).

Liu, H. et al. Accurate detection of m(6)A RNA modifications in native RNA sequences. Nat. Commun. 10 , 4079 (2019).

Parker, M. T. et al. Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m(6)A modification. ELife 9 , e49658 (2020).

Price, A. M. et al. Direct RNA sequencing reveals m(6)A modifications on adenovirus RNA are necessary for efficient splicing. Nat. Commun. 11 , 6016 (2020).

Jenjaroenpun, P. et al. Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res. 49 , e7 (2021).

Watabe, E. et al. m(6) A-mediated alternative splicing coupled with nonsense-mediated mRNA decay regulates SAM synthetase homeostasis. EMBO J. 40 , e106434 (2021).

Wang, Y. et al. Profiling of circular RNA N6‐methyladenosine in moso bamboo (Phyllostachys edulis) using nanopore‐based direct RNA sequencing. J. Integr. Plant Biol. 62 , 1823–1838 (2020).

Wang, Y., Zhao, Y., Bollas, A., Wang, Y. & Au, K. F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 39 , 1348–1365 (2021).

Ke, S. et al. m(6)A mRNA modifications are deposited in nascent pre-mRNA and are not required for splicing but do specify cytoplasmic turnover. Genes Dev. 31 , 990–1006 (2017).

Kortel, N. et al. Deep and accurate detection of m6A RNA modifications using miCLIP2 and m6Aboost machine learning. Nucleic Acids Res. 49 , e92 (2021).

Batista, P. J. et al. m(6)A RNA modification controls cell fate transition in mammalian embryonic stem cells. Cell Stem Cell. 15 , 707–719 (2014).

Begik, O. et al. Quantitative profiling of pseudouridylation dynamics in native RNAs with nanopore sequencing. Nat. Biotechnol. 39 , 1278–1291 (2021).

Alfonzo, J. D. et al. A call for direct sequencing of full-length RNAs to identify all modifications. Nat. Genet. 53 , 1113–1116 (2021).

Workman, R. E. et al. Nanopore native RNA sequencing of a human poly(A) transcriptome. Nat. Methods 16 , 1297–1305 (2019).

Poh, H. X., Mirza, A. H., Pickering, B. F. & Jaffrey, S. R. Alternative splicing of METTL3 explains apparently METTL3-independent m6A modifications in mRNA. PLoS Biol. 20 , e3001683 (2022).

Lin, Z. et al. Mettl3-/Mettl14-mediated mRNA N(6)-methyladenosine modulates murine spermatogenesis. Cell Res. 27 , 1216–1230 (2017).

Satterwhite, E. R. & Mansfield, K. D. RNA methyltransferase METTL16: targets and function. Wiley Interdiscip. Rev. RNA 13 , e1681 (2021).

PubMed PubMed Central Google Scholar

Ramasamy, S. et al. Chemical probe-based nanopore sequencing to selectively assess the RNA modifications. ACS Chem. Biol. 17 , 2704–2709 (2022).

Liu, Y. et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation. Genome Biol. 22 , 295 (2021).

Yuen, Z. W. et al. Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing. Nat. Commun. 12 , 3438 (2021).

Liu, J. et al. The RNA m(6)A reader YTHDC1 silences retrotransposons and guards ES cell identity. Nature 591 , 322–326 (2021).

Shen, L. et al. N(6)-methyladenosine RNA modification regulates shoot stem cell fate in Arabidopsis. Dev. Cell 38 , 186–200 (2016).

Zhong, Z. et al. Systematic comparison of tools used for m6A mapping from nanopore direct RNA sequencing. https://github.com/zhongzhd/ont_m6a_detection , https://doi.org/10.5281/zenodo.7651738 (2023).

Download references

Acknowledgements

This work was supported by the Ministry of Science and Technology of China (National Science and Technology Major Project, 2018YFA0109100, 2019YFA0802203, 2022YFA0912900, and 2022YFC3400400), National Natural Science Foundation of China (92253202, 32271499, 32270644, and 32100461), and Shenzhen Bay Scholars Program.

Author information

These authors contributed equally: Zhen-Dong Zhong, Ying-Yuan Xie.

Authors and Affiliations

MOE Key Laboratory of Gene Function and Regulation, Guangdong Province Key Laboratory of Pharmaceutical Functional Genes, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, 510275, Guangzhou, China

Zhen-Dong Zhong, Ying-Yuan Xie, Hong-Xuan Chen, Ye-Lin Lan, Xue-Hong Liu, Jing-Yun Ji, Fu Wu, Zhang Zhang & Guan-Zheng Luo

CAS Key Laboratory of Regenerative Biology, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, China

Lingmei Jin & Jiekai Chen

School of Biomedical Sciences, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong, China

Daniel W. Mak

You can also search for this author in PubMed Google Scholar

Contributions

G.-Z.L. conceived the project; Z.-D.Z. and Y.-Y.X. analyzed the data and wrote the manuscript; H.-X.C. conducted the experiments with the assistance from X.-H.L., J.-Y.J., Y.-L.L., F.W., L.J., and J.C.; Z.Z., D.W.M. and G.-Z.L. revised the manuscript. All authors reviewed the results and approved the final version of the manuscript.

Corresponding authors

Correspondence to Zhang Zhang or Guan-Zheng Luo .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer review file, reporting summary, rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Zhong, ZD., Xie, YY., Chen, HX. et al. Systematic comparison of tools used for m 6 A mapping from nanopore direct RNA sequencing. Nat Commun 14 , 1906 (2023). https://doi.org/10.1038/s41467-023-37596-5

Download citation

Received : 23 March 2022

Accepted : 21 March 2023

Published : 05 April 2023

DOI : https://doi.org/10.1038/s41467-023-37596-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines . If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Quick links

Explore articles by subject
Guide to authors
Editorial policies

IMAGES

Next-Generation Sequencing (NGS)- Principle, Types, Uses
Next-generation sequencing (NGS) overview
An Overview of Next-Generation Sequencing
new generation sequencing
Next Generation Sequencing Online Course
[PDF] Nextgeneration Sequencing Current Technologies And Applications

VIDEO

Next Generation Sequencing (NGS)
NGS Revolutionizing Cancer Research: Advantages & Applications #NGS #cancerresearch #biotechnology
Next-Generation Sequencing (NGS): Types, Advantages & Applications #NGS #bioinformatics #biotech
The basics of Next Generation Sequencing
Common Laboratory Errors in Next-Generation Sequencing Assays
Experiences with targeted sequencing

COMMENTS

Next-Generation Sequencing Technology: Current Trends and Advancements
Next-generation sequencing (NGS) is a powerful tool used in genomics research. NGS can sequence millions of DNA fragments at once, providing detailed information about the structure of genomes, genetic variations, gene activity, and changes in gene behavior. Recent advancements have focused on faster and more accurate sequencing, reduced costs ...
(PDF) Next Generation Sequencing
The next generation sequencing (NGS) technology refers to non-Sanger based DNA sequencing methods which have replaced conventional sequencing methods. They have been vividly used for analyses of ...
Next-generation sequencing
Next-generation sequencing refers to non-Sanger-based high-throughput DNA sequencing technologies. Millions or billions of DNA strands can be sequenced in parallel, yielding substantially more ...
Coming of age: ten years of next-generation sequencing technologies
Whole-exome and targeted sequencing 83 are also proving invaluable to sequencing research. By limiting the size of the genomic material used, more individual samples can be sequenced within a ...
Next-generation sequencing
Read the latest Research articles in Next-generation sequencing from Nature. ... The Cancer Genome Atlas Research Network reports an integrative analysis of more than 400 samples of clear cell ...
Next-generation sequencing technologies: An overview
Since the days of Sanger sequencing, next-generation sequencing technologies have significantly evolved to provide increased data output, efficiencies, and applications. These next generations of technologies can be categorized based on read length. This review provides an overview of these technologies as two paradigms: short-read, or ...
Library preparation for next generation sequencing: A review of
Nowadays, next generation sequencing (NGS) is used to sequence complete genomes and the costs of sequencing a human genome lie around the $1000-mark (National Human Genome Research Institute, 2019). As the technique has evolved, the throughput of machines has steadily increased while sequencing costs per base have decreased ( National Human ...
Next-Generation Sequencing: From Basic Research to Diagnostics
Abstract. Background: For the past 30 years, the Sanger method has been the dominant approach and gold standard for DNA sequencing. The commercial launch of the first massively parallel pyrosequencing platform in 2005 ushered in the new era of high-throughput genomic analysis now referred to as next-generation sequencing (NGS).
Next generation sequencing technology: Advances and applications
Introduction. The growing power and reducing cost sparked an enormous range of applications of Next generation sequencing (NGS) technology. Gradually, sequencing is starting to become the standard technology to apply, certainly at the first step where the main question is "what's all involved", "what's the basis".
Nanopore sequencing technology, bioinformatics and applications
Nanopore design. The concept of nanopore sequencing emerged in the 1980s and was realized through a series of technical advances in both the nanopore and the associated motor protein 1,4,5,6,7,8 ...
A Brief Review of the Next Generation Sequencing
The so-called "next-generation " sequenc-. ing (NGS) technologies allows us, in a short time, low cost and in. parallel, to sequence millions to billions of DNA nucleotides, mini-. mizing the ...
Next generation sequencing Research Papers
Next-generation sequencing is one of the strongest tools for studying genetic diseases and gene mutations. Additionally, brain-computer interface could be used in the near future to assist people with paralysis or other related disorders and physical injuries to move toward into a better way of life, restoring memory or improving the way of ...
wgbstools: A computational suite for DNA methylation sequencing data
Next-generation methylation-aware sequencing of DNA sheds light on the fundamental role of methylation in cellular function in health and disease. These data are commonly represented at a single CpG resolution, while single-molecule fragment-level analysis is often overlooked. Here, we present wgbstools, an extensive computational suite tailored for methylation sequencing data. wgbstools ...
Sequencing
The second paper, by a group of researchers at 454 Life Sciences led by Jonathan Rothberg, introduced a high-throughput sequencing platform based on ePCR and pyrosequencing.
Next-generation Sequencing Research Papers
Next-generation sequencing technology helps us to depict the pioneering microflora of any ecological niche based on metagenomic approach. Our study represents the first Illumina based metagenomic study of Deulajhari hot spring DH1, and DH2 of the cluster with temperature 65 °C to 55 °C respectively establishing a difference of 10 °C.
Frontiers
This article is part of the Research Topic Plant Microbiome: Interactions, Mechanisms of Action, and Applications, Volume III View all 7 articles. Next generation sequencing-aided screening, isolation, molecular identification, and antimicrobial potential for bacterial endophytes from the medicinal plant, Elephantorrhiza elephantina ...
De novo genome assembly for an endangered lemur using ...
While the advent of next generation sequencing technologies and the resulting reduction of associated costs have led to the proliferation of genomic data and high-quality reference genomes, global discrepancies in genomic sequencing capabilities often result in biological samples from biodiverse host countries being exported to facilities in ...
Systematic comparison of tools used for m6A mapping from ...
In the last decade, third-generation sequencing (TGS) technologies represented by two platforms, namely Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have emerged as ...