A comparative genomics multitool for scientific discovery and conservation

Zoonomia consortium.

1 Broad Institute of MIT and Harvard, Cambridge, MA USA

Associated Data

The project website is http://zoonomiaproject.org/. Details of each Zoonomia genome assembly (including NCBI GenBank 63 accession numbers) are provided in Supplementary Table 1. Sequence data and genome assemblies are available at https://www.ncbi.nlm.nih.gov/. Variant lists for each species are provided at http://broad.io/variants. Further source data for Fig. 2 are provided in the Zoonomia GitHub repository (10.5281/zenodo.3887432). The Cactus alignment is provided at https://cglgenomics.ucsc.edu/data/cactus/. A visualization of the alignments and phyloP data is available by loading our assembly hub into the UCSC browser 64 by copying the hub link https://comparative-genomics-hubs.s3-us-west-2.amazonaws.com/200m_hub.txt into the Track Hubs page. There are no restrictions on use. Source data are provided with this paper.

The DISCOVAR de novo assembly code is available at https://github.com/broadinstitute/discovar_de_novo/releases/tag/v52488 (10.5281/zenodo.3870889), the Cactus pipeline is available at https://github.com/ComparativeGenomicsToolkit/cactus (10.5281/zenodo.3873410) and code for other analyses is available at https://github.com/broadinstitute/Zoonomia/ (10.5281/zenodo.3887432).

The Zoonomia Project is investigating the genomics of shared and specialized traits in eutherian mammals. Here we provide genome assemblies for 131 species, of which all but 9 are previously uncharacterized, and describe a whole-genome alignment of 240 species of considerable phylogenetic diversity, comprising representatives from more than 80% of mammalian families. We find that regions of reduced genetic diversity are more abundant in species at a high risk of extinction, discern signals of evolutionary selection at high resolution and provide insights from individual reference genomes. By prioritizing phylogenetic diversity and making data available quickly and without restriction, the Zoonomia Project aims to support biological discovery, medical research and the conservation of biodiversity.

A whole-genome alignment of 240 phylogenetically diverse species of eutherian mammal—including 131 previously uncharacterized species—from the Zoonomia Project provides data that support biological discovery, medical research and conservation.

The genomics revolution is enabling advances not only in medical research 1 , but also in basic biology 2 and in the conservation of biodiversity, where genomic tools have helped to apprehend poachers 3 and to protect endangered populations 4 . However, we have only a limited ability to predict which genomic variants lead to changes in organism-level phenotypes, such as increased disease risk—a task that, in humans, is complicated by the sheer size of the genome (about three billion nucleotides) 5 .

Comparative genomics can address this challenge by identifying nucleotide positions that have remained unchanged across millions of years of evolution 6 (suggesting that changes at these positions will negatively affect fitness), focusing the search for disease-causing variants. In 2011, the 29 Mammals Project 7 identified 12-base-pair (bp) regions of evolutionary constraint that in total comprise 4.2% of the genome, by measuring sequence conservation in humans plus 28 other mammals. These regions proved to be more enriched for the heritability of complex diseases than any other functional mark, including coding status 8 . By expanding the number of species and making an alignment that is independent of any single reference genome, the Zoonomia Project was designed to detect evolutionary constraint in the eutherian lineage at increased resolution, and to provide genomic resources for over 130 previously uncharacterized species.

Designing a comparative-genomics multitool

When selecting species, we sought to maximize evolutionary branch length, to include at least one species from each eutherian family, and to prioritize species of medical, biological or biodiversity conservation interest. Our assemblies increase the percentage of eutherian families with a representative genome from 49% to 82%, and include 9 species that are the sole extant member of their family and 7 species that are critically endangered 9 (Fig. 1): the Mexican howler monkey ( Alouatta palliata mexicana ), hirola ( Beatragus hunteri ), Russian saiga ( Saiga tatarica tatarica ), social tuco-tuco ( Ctenomys sociabilis ), indri ( Indri indri ), northern white rhinoceros ( Ceratotherium simum cottoni ) and black rhinoceros ( Diceros bicornis ).

Fig. 1

Phylogenetic tree of the mammalian families in the Zoonomia Project alignment, including both our new assemblies and all other high-quality mammalian genomes publicly available in GenBank when we started the alignment (March 2018) (Supplementary Table 2). Tree topology is based on data from TimeTree (www.timetree.org) 47. Existing taxonomic classifications recognize a total of 127 extant families of eutherian mammal 48, including 43 families that were not previously represented in GenBank (red boxes) and 41 families with additional representative genome assemblies (pink boxes). Of the remaining families, 21 had GenBank genome assemblies but no Zoonomia Project assembly (grey boxes) and 22 had no representative genome assembly (white boxes). Parenthetical numbers indicate the number of species with genome assemblies in a given family. Image credits: fossa, Bertal/Wikimedia (CC BY-SA); Arctic fox, Michael Haferkamp/Wikimedia (CC BY-SA); hirola, JRProbert/Wikimedia (CC BY-SA); bumblebee bat, Sébastien J. Puechmaille (CC BY-SA); snowshoe hare, Denali National Park and Preserve/Wikimedia (public domain); aye-aye, Tom Junek/Wikimedia (CC BY-SA); Geoffroy’s spider monkey, Patrick Gijsbers/Wikimedia (CC BY-SA); southern three-banded armadillo, Hedwig Storch/Wikimedia (CC BY-SA); giant anteater, Graham Hughes/Wikimedia (CC BY-SA); brown-throated sloth, Dick Culbert from Gibsons, B.C., Canada/Wikimedia (CC BY).

We collaborated with 28 institutions to collect samples, nearly half (47%) of which were provided by The Frozen Zoo of San Diego Zoo Global (Supplementary Table 1 ). Since 1975, The Frozen Zoo has stored renewable cell cultures for about 10,000 vertebrate animals that represent over 1,100 taxa, including more than 200 species that are classified as vulnerable, endangered, critically endangered or extinct by the International Union for Conservation of Nature (IUCN) 10 . For 36 target species we were unable to acquire a DNA sample of sufficient quality, even though our requirements were modest (Methods), which highlights a major impediment to expanding the phylogenetic diversity of genomics.

We used two complementary approaches to generate genome assemblies (Extended Data Table 1). First, for 131 genomes we generated assemblies by performing a single lane of sequencing (2 × 250-bp reads) on PCR-free libraries and assembling with DISCOVAR de novo 11 (referred to here as ‘DISCOVAR assemblies’). This method does not require intact cells and uses less than two micrograms of medium-quality DNA (most fragments are over 5 kilobases (kb) in size), which allowed us to include species that are difficult to access (Extended Data Figs. 1, 2) while achieving ‘contiguous sequences constructed from overlapping short reads’ (contig) lengths comparable to those of existing assemblies (median contig N50 of 46.8 kb, compared to 47.9 kb for RefSeq genome assemblies).
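The contig N50 values quoted above summarize assembly contiguity: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases. The following is a minimal sketch of that standard definition; the function name and the contig lengths are hypothetical and are not taken from any Zoonomia assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, for illustration only.
example_contigs = [80_000, 70_000, 50_000, 40_000, 30_000, 20_000, 10_000]
print(n50(example_contigs))
```

The same function applies to scaffold lengths, which gives the scaffold N50 used later to define long-range assemblies.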

Extended Data Fig. 1

Sequences from species with notable phenotypes can inform human medicine, basic biology and biodiversity conservation, but sample collection can be challenging. a , The Jamaican fruit bat ( Artibeus jamaicensis ) maintains constant blood glucose across intervals of fruit-eating and fasting 66 , achieving homeostasis to a degree that is unknown in the treatment of human diabetes. b , The North American beaver ( Castor canadensis ) avoids tooth decay by incorporating iron rather than magnesium into tooth enamel, which yields an orange hue 67 . c , The thirteen-lined ground squirrel ( Ictidomys tridecemlineatus ) prepares for hibernation by rapidly increasing the thermogenic activity of brown fat 68 , a process that, in humans, is connected to improved glucose homeostasis and insulin sensitivity 69–71 . d , The tiny bumblebee bat ( Craseonycteris thonglongyai ) is among the smallest of mammals, making it a sparse source of DNA. e , The remote habitat of the very rare Amazon River dolphin ( Inia geoffrensis ) precludes collection of high-molecular-weight DNA. Image sources: Merlin D. Tuttle/Science Source ( a ); Stephen J. Krasemann/Science Source ( b ); Allyson Hindle ( c ); Sébastien J. Puechmaille (CC BY-SA) ( d ); M. Watson/Science Source ( e ).

Extended Data Fig. 2

To enable the inclusion of species from across the eutherian tree (including from the 50% of mammalian families not represented in existing genome databases), the Zoonomia Project needed sequencing and assembly methods that produce reliable data from DNA collected in remote locations, sometimes in only modest quantities and often without benefit of cold chains for transport. a , For the marine species such as the narwhal ( Monodon monoceros ), simply accessing an individual in the wild can prove challenging. For example, to sample DNA from the near-threatened narwhal, M.N. and Inuit guide D. Angnatsiak camped on the edge of an ice floe between Pond Inlet and Bylot Island, at the northeastern tip of Baffin Island. After a narwhal was collected by Inuit hunters as part of an annual hunt, hours of flensing were necessary for the collection of tissue samples. From left to right, F. McCann, H. C. Schmidt, F. Eichmiller, M.N., J. Orr (facing backward) and J. Orr (standing). b , For endangered species such as the Hispaniolan solenodon ( S. paradoxus ), sample collection must be designed to minimize stress to the individual, limiting the amount of DNA that can be collected 22 . To collect DNA from the endangered solenodon without imposing stress on individuals in the wild, N.R.C. turned to the world’s only captive solenodons, which are housed off-exhibit at ZOODOM in the Dominican Republic. With help from veterinarians at the zoo, N.R.C. collected a small amount of blood from the rugged tail of the solenodon. Narwhal photograph by G. Freund, and courtesy of M.N. Solenodon photograph courtesy of L. Emery.

Extended Data Table 1

The Zoonomia Project data includes 132 genome assemblies

These assemblies include 131 different species, with 2 narwhals (male and female), and 10 genomes upgraded to longer contiguity (including upgrade of an existing assembly for E. telfairi ).

Species of concern on the IUCN Red List are indicated as near-threatened (NT), vulnerable (V), endangered (EN) or critically endangered (CR).

*Upgraded to longer contiguity.

†Upgraded to longer contiguity using existing assembly.

For nine DISCOVAR assemblies and one pre-existing assembly (the lesser hedgehog tenrec ( Echinops telfairi )), we increased contiguity 200-fold (the median scaffold length increased from 90.5 kb to 18.5 megabases (Mb)) through proximity ligation, which uses chromatin interaction data to capture the physical relationships among genomic regions 12 . Unlike short-contiguity genomes, these assemblies capture structural changes such as chromosomal rearrangements 13 . The upgraded assemblies increase the number of eutherian orders that are represented by a long-range assembly (contig N50 > 20 kb and scaffold N50 > 10 Mb) from 12 to 18 (out of 19). We are working on upgrading the assembly of the large treeshrew ( Tupaia tana ) for the remaining order (Scandentia).

Comparative power of 240 species

The Zoonomia alignment includes 120 newly generated assemblies and 121 existing assemblies, representing a total of 240 species (the dataset includes assemblies for two different dogs) and spanning about 110 million years of mammalian evolution (Supplementary Table 2). With a total evolutionary branch length of 16.6 substitutions per site, we expect only 191 positions in the human genome (0.000006%) to be identical across the aligned species owing to chance (false positives) rather than evolutionary constraint (Extended Data Table 2). We applied this same calculation to data from The Exome Aggregation Consortium (ExAC), who analysed exomes for 60,706 humans 14, and estimated that 88% of positions would be expected to have no variation. This illustrates the potential for relatively small cross-species datasets to inform human genetic studies, even for diseases driven by high-penetrance coding mutations, for which ExAC data are optimally powered 15.

Extended Data Table 2

Power to detect constraint across datasets

The expected number of variants conserved by chance (false positives) was estimated for four genomic resources (the 29 Mammals Project 7 dataset, the human-only ExAC 14 and gnomAD v.3 65 datasets, and the Zoonomia Project dataset) by applying a Poisson model of the distribution of substitution counts in the genome. Branch length for gnomAD was estimated by dividing 526,001,545 single-nucleotide variants by 3.088 gigabases (size of the human genome). Branch length for Zoonomia was measured as the number of substitutions per site in the phyloP analysis of the Cactus alignment.
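The Poisson calculation described in this legend can be sketched in a few lines. Under a Poisson model, a site on a tree of total branch length L substitutions per site has probability exp(−L) of showing no substitution at all, so the expected number of positions that look conserved purely by chance is exp(−L) multiplied by the genome size. The sketch below assumes that the 3.088-gigabase genome size quoted for gnomAD is also the right denominator for the Zoonomia estimate; the function and variable names are illustrative, and the authors' exact procedure may differ.

```python
import math

HUMAN_GENOME_BP = 3.088e9  # genome size quoted in the table legend


def prob_unchanged(branch_length):
    """Poisson probability of zero substitutions at a single site."""
    return math.exp(-branch_length)


def expected_chance_conserved(branch_length, genome_bp=HUMAN_GENOME_BP):
    """Expected number of sites that are unchanged by chance alone."""
    return prob_unchanged(branch_length) * genome_bp


# Zoonomia: 16.6 substitutions per site across the alignment.
zoonomia_expected = expected_chance_conserved(16.6)

# gnomAD: branch length estimated as variants per base, as in the legend.
gnomad_branch_length = 526_001_545 / HUMAN_GENOME_BP
gnomad_fraction = prob_unchanged(gnomad_branch_length)

print(f"Zoonomia, expected chance-conserved sites: {zoonomia_expected:.0f}")
print(f"gnomAD branch length: {gnomad_branch_length:.3f} substitutions per site")
print(f"gnomAD fraction of sites expected to be invariant: {gnomad_fraction:.3f}")
```

With a branch length of 16.6, the sketch gives roughly 191 sites, in line with the figure quoted in the main text. The short branch length of a human-only dataset leaves most sites invariant by chance, which is why such data carry little per-site information about constraint.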

Biological insights from additional assemblies

The scope and species diversity in the Zoonomia Project supports evolutionary studies in many lineages. Previously published papers (discussed in the subsections below), and the demonstrated utility of existing comparative genomics resources 16 , 17 , illustrate the benefits of making newly generated genome assemblies and alignments accessible to all researchers without restrictions on use.

Comparing our assembly for the endangered Mexican howler monkey ( Alouatta palliata mexicana , a subspecies of the mantled howler monkey) with the Guatemalan black howler monkey ( Alouatta pigra )—which has a neighbouring range—suggests that different forms of selection shape the reproductive isolation of the two species 18 . Initial divergence in allopatry was followed by positive selection on postzygotic isolating mechanisms, which offers empirical support for a speciation process that was first outlined by Dobzhansky in 1935 19 .

Protection from cancer

Using our assembly for the capybara ( Hydrochoerus hydrochaeris ) (a giant rodent), a previous publication 20 has identified positive selection on anti-cancer pathways, echoing previous reports 21 that other large mammal species—the African and Asian elephants ( Loxodonta africana and Elephas maximus indicus , respectively) —carry extra copies (retrogenes) of the tumour-suppressor gene TP53 . This offers a possible resolution to Peto’s paradox—the observation that cancer in large mammals is rarer than expected—and could reveal anti-cancer mechanisms.

Convergent evolution of venom

A previous publication 22 has used our assembly for the Hispaniolan solenodon ( Solenodon paradoxus ) (Extended Data Fig. 2) to investigate venom production, a trait that is found in only a few eutherian lineages, including shrews and solenodons. They identified paralogous copies of a kallikrein 1 serine protease gene ( KLK1 ) that together encode solenodon venom, and showed that the KLK1 gene was independently co-opted for venom production in both solenodons and shrews, in an example of molecular convergence.

Informing biodiversity conservation strategies

A previous analysis 23 of our giant otter ( Pteronura brasiliensis ) assembly found low diversity and an elevated burden of putatively deleterious genetic variants, consistent with the recent population decline of this species through overhunting and habitat loss. The giant otter had fewer putatively deleterious variants than either the southern or northern sea otter ( Enhydra lutris nereis and E. lutris kenyoni , respectively), which suggests that it has highest potential for recovery among these species if populations are protected.

Rapid assessment of species infection risk

Using the Zoonomia alignment and public genomic data from hundreds of other vertebrates, a previous publication 24 compared the structure of ACE2—the receptor for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019 (COVID-19)—and identified 47 mammals that have a high or very high likelihood of being virus reservoirs, intermediate hosts or good model organisms for the study of COVID-19, and detected positive selection in the ACE2 receptor-binding domain that is specific to bats.

Genetic diversity and extinction risk

We next asked whether a reference genome from a single individual can help to identify populations with low genetic diversity to prioritize in efforts to conserve biodiversity. Diversity metrics reflect demographic history 25 , 26 , and heterozygosity is lower in threatened species 27 . This analysis was feasible because we used a single sequencing and assembly protocol for all DISCOVAR assemblies, which minimized variation in accuracy, completeness and contiguity due to the sequencing technology and the assembly process that would otherwise confound species comparisons.

We estimated genetic diversity for 130 of our DISCOVAR assemblies, each of which represented a different species (Supplementary Table 3). Four of these estimates failed during analysis. For the remaining 126 DISCOVAR assemblies, we calculated 2 metrics: (1) the fraction of sites at which the sequenced individual is heterozygous (overall heterozygosity); and (2) the proportion of the genome that resides in an extended region without any variation (segments of homozygosity (SoH)). The SoH measurement is designed for short-contiguity assemblies, in which scaffolds are potentially shorter than runs of homozygosity. Overall, heterozygosity and SoH values are correlated (Pearson correlation r = −0.56, P = 1.8 × 10^−9, n = 98). Although overall heterozygosity is correlated with contig N50 values (Pearson correlation r_het = −0.39, P_het = 4 × 10^−5, n_het = 105) (probably owing to the difficulty of assembling more heterozygous genomes 28), SoH values are not (Pearson correlation r_SoH = 0.09, P_SoH = 0.38, n_SoH = 98). Overall heterozygosity and SoH values are highly correlated between the lower- and high-contiguity versions of the upgraded assemblies (Pearson correlation r_het = 0.999, P_het = 5 × 10^−7, n_het = 7; r_SoH = 0.996, P_SoH = 1.4 × 10^−6, n_SoH = 7).
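To make the two metrics concrete, the sketch below computes them from a list of heterozygous positions per scaffold. It is an illustration under stated assumptions, not the pipeline used for the paper: the minimum segment length (`min_segment`) is a hypothetical parameter, every assembled base is treated as callable, and scaffold ends are treated as segment boundaries.

```python
def diversity_metrics(scaffold_lengths, het_sites, min_segment):
    """Return (overall heterozygosity, SoH fraction) for one individual.

    scaffold_lengths: dict mapping scaffold name -> length in bp
    het_sites: dict mapping scaffold name -> 1-based heterozygous positions
    min_segment: hypothetical minimum length (bp) for a variant-free
        stretch to count as a segment of homozygosity
    """
    total_bp = sum(scaffold_lengths.values())
    if total_bp == 0:
        raise ValueError("assembly has no sequence")
    n_het = 0
    soh_bp = 0
    for name, length in scaffold_lengths.items():
        sites = sorted(het_sites.get(name, []))
        n_het += len(sites)
        # Sentinels at 0 and length + 1 so that scaffold ends close a segment.
        cuts = [0] + sites + [length + 1]
        for left, right in zip(cuts, cuts[1:]):
            segment = right - left - 1  # variant-free bases between two cuts
            if segment >= min_segment:
                soh_bp += segment
    return n_het / total_bp, soh_bp / total_bp


# Toy assembly, for illustration only.
toy_lengths = {"scaffold_1": 1000, "scaffold_2": 500}
toy_het = {"scaffold_1": [100, 900]}
heterozygosity, soh = diversity_metrics(toy_lengths, toy_het, min_segment=400)
print(f"heterozygosity = {heterozygosity:.5f}, SoH = {soh:.1%}")
```

Because segments are counted within scaffolds rather than across them, a measure of this kind remains usable when scaffolds are shorter than true runs of homozygosity, which is the situation the SoH metric is designed for.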

Genomic diversity varies significantly among species in different IUCN conservation categories, as measured by overall heterozygosity (Fig. 2a) and SoH values (Fig. 2b). SoH values increase ( P = 0.024, R^2 = 0.055, n = 94) with increasing levels of conservation concern, whereas heterozygosity decreases ( P = 0.011, R^2 = 0.064, n = 101). There is no significant difference between wild and captive populations in overall heterozygosity (Fig. 2c) or SoH values (Fig. 2d).

Fig. 2

a , b , Heterozygosity declines ( a ) and SoH value increases ( b ) with the level of concern for species conservation, as assessed by IUCN conservation categories. Horizontal grey lines indicate median. c , d , Comparing individuals sampled from wild and captive populations, we saw no statistically significant difference (independent samples t -test) in overall heterozygosity ( c ) or per cent SoH ( d ), with similar means (horizontal grey lines) between types of birth population. In a – d , there was a total of 105 species, with n for each tested category indicated on the x axis. Statistical tests were two-sided. LC, least concern. e , Overall heterozygosity and SoH values for all genomes analysed (including those with high allelic balance ratio; n  = 124 species), with median SoH (17.1%, horizontal dashed line) and median overall heterozygosity (0.0026, vertical dashed line) for species categorized as least concern. Values for individuals from the seven critically endangered species are shown in red.

Unusual diversity values can suggest particular population demographics, although data from more than a single individual are needed to confirm these inferences. All seven critically endangered species have SoH values that are higher than the median for species categorized as of least concern (Fig. 2e). The genomes with the lowest heterozygosity and highest SoH values were the social tuco-tuco (heterozygosity = 0.00063 and SoH = 78.7%), which was sampled from a small laboratory colony with only 12 founders 29 , and the eastern mole ( Scalopus aquaticus ) (heterozygosity = 0.0008 and SoH = 81.3%), which was supplied by a professional mole catcher and was probably from a population that had experienced a bottleneck owing to pest control measures.

The correlation between diversity metrics and IUCN category is not explained by other species-level phenotypes. For species of least concern ( n = 75), we assessed 21 phenotypes that are catalogued in the PanTHERIA 30 database for correlation with heterozygosity or SoH values. The most significant was between SoH value and litter size, a trait that has previously been shown to predict extinction risk 31 ( P_SoH = 0.02), but none is significant after Bonferroni correction (Extended Data Table 3).

Extended Data Table 3

Diversity statistics are not correlated with other species-level phenotypes

All phenotypes in the PanTHERIA database 30 for which at least 75% of the 75 species of least concern had a value were included in the analysis. For continuous phenotypes, values were standardized to Z -scores before analysis (latitude was calculated as an absolute value) and correlation measured by fitting a linear model using the core R function lm. For categorical phenotypes with more than two categories, group means were compared using the core R function aov to fit an analysis of variance model. None was significant after Bonferroni correction for the number of traits considered (21).
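The effect of the Bonferroni correction on the strongest association (SoH versus litter size, P = 0.02) can be checked with a short calculation. The sketch assumes a family-wise significance level of 0.05, which is a conventional choice and is not stated in the text; the function name is illustrative.

```python
def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Return (is_significant, per-test threshold) under Bonferroni.

    alpha = 0.05 is an assumed family-wise significance level.
    """
    threshold = alpha / n_tests
    return p_value < threshold, threshold


# Strongest uncorrected association reported above: P = 0.02 across 21 traits.
significant, threshold = bonferroni_significant(0.02, n_tests=21)
print(f"per-test threshold = {threshold:.5f}, significant = {significant}")
```

Under that assumption, each of the 21 tests must reach P < 0.05/21 (about 0.0024), so an uncorrected P of 0.02 does not pass.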

Our inference that diversity trends lower in species at a higher risk of extinction comes from a small fraction (2.6%) of threatened mammals 9 . Whether this is a direct correlation with extinction risk or arises from an association between diversity and species-level phenotypes such as litter size, it suggests that valuable information can be gleaned from sequencing only a single individual. Should this pattern prove robust across more species, diversity metrics from a single reference genome could help to identify populations that are at risk—even when few species-level phenotypes are documented—and to prioritize species for follow-up at the population level.

Resources for biodiversity conservation

For each genome assembly, we catalogued all high-confidence variant sites ( http://broad.io/variants ) to support the design of cost-effective and accurate genetic assays that are usable even when the sample quality is low 32 ; such assays are often preferable to designing expensive custom tools, relying on tools from related species or sequencing random regions 33 . The reference genomes themselves support the development of technologies such as using gene drives to control invasive species or pursuing ‘de-extinction’ through cloning and genetic engineering 34 .

Our genomes have two notable limitations. We sequenced only a single individual for each species, which is insufficient for studying population origins, population structure and recent demographic events 35 , 36 , and the shorter contiguity of our assemblies prevented us from analysing runs of homozygosity 26 . This highlights a dilemma that faces all large-scale genomics initiatives: determining when the value of sequencing additional individuals exceeds the value of improving the reference genome itself.

Whole-genome alignment

We aligned the genomes of 240 species (our assemblies and other mammalian genomes that were released when we started the alignment) as part of a 600-way pan-amniote alignment using the Cactus alignment software 37 (Supplementary Table 2). Rather than aligning to a single anchor genome, Cactus infers an ancestral genome for each pair of assemblies (Fig. 3a). Consistent with our predictions, we have increased power to detect sequence constraint at individual bases relative to previous studies 7 , 38 . We detect 3.1% of bases in the human genome to be under purifying selection in the eutherian lineage (false-discovery rate (FDR) < 5%), without using windowing or other means to integrate contextual information across neighbouring bases. This is more than double the number from the largest previous 100-vertebrate alignment 38 (Fig. 3b), with improvements being most notable in the non-coding sequence (Fig. 3c) and in the increased resolution of individual features (Fig. 3d). This represents a substantial proportion, but not all, of the 5 to 8% of the human genome that has previously been suggested to be under purifying selection 7 , 39 .

Fig. 3

a , Cactus alignments are reference-genome-free, enabling the detection of sequence that is absent from human (red) or other clades (purple), lineage-specific innovations (orange and green) and eutherian-mammal-specific sequence (blue). b , We compared phyloP predictions of conserved positions for a widely used 100-vertebrate alignment ( n  = 100 vertebrate species) (grey) to the Zoonomia alignment ( n  = 240 eutherian species) (red). The cumulative portion of the genome expected to be covered by true- versus false-positive calls is shown, starting from the highest confidence calls (solid line) and proceeding to calls with lower confidence (dashed lines); the horizontal line indicates the point at which the confidence level drops below an expected FDR of 0.05 (two-sided). c , A higher proportion of functionally annotated bases are detected as highly conserved (FDR < 0.05) in the Zoonomia alignment (red) than the 100-vertebrate alignment (grey), most notably in non-coding regions. lncRNA, long non-coding RNA; UTR, untranslated region. d , At a putative androgen-receptor binding site, phyloP scores predict that seven bases are under purifying selection (FDR < 0.05, two-sided) in the Zoonomia alignment (dark red), peaking in positions with the most information content in the androgen receptor JASPAR 49 motif, compared to one (dark grey) in the 100-vertebrate alignment, with scores at FDR > 0.05 shown in light red (top) and light grey (bottom).

Using our alignment of 240 mammalian genomes, we are pursuing four key strategies of analysis. First, we aim to provide the largest eutherian phylogeny based on nuclear genomes by building a comprehensive phylogeny and time tree, including trees partitioned by functional annotations, mode of inheritance and long-term recombination rates. Second, we will produce a detailed map of evolutionary constraint, identifying highly conserved genomic regions, regions under accelerated evolution in particular lineages and changes that probably affect phenotype, leveraging functional data from ENCODE 40 , GTEx 41 and the Human Cell Atlas 42 . Third, we will use genotype–phenotype correlations to investigate patterns of constraint in regions associated with disease in humans, identify patterns of convergent adaptive evolution 2 and apply a forward genomics strategy to link functional elements to traits. Finally, we will explore the evolution of genome structure by mapping syntenic regions between genomes, identifying evolutionary breakpoints and characterizing the repeat landscape.

The Zoonomia Project has captured mammalian diversity at a high resolution, and is among the first of many projects that are underway to sequence, catalogue and characterize whole branches of the eukaryotic biodiversity of the Earth. On the basis of our experience, we propose the following principles for realizing the full value of large-scale comparative genomics.

First, we should prioritize sample collection. We must support field researchers who collect samples and understand species ecology and behaviour, develop strategies for sample collection that do not rely on bulky laboratory equipment or cold chains, develop technology for using non-invasive types of sampling and establish more repositories of renewable cell cultures 10 .

Second, we need accessible and scalable tools for computational analysis. Few research groups have access to the computational resources necessary for work with massive genomic datasets. We must address the shortage of skilled computational scientists, and design software and data-storage systems that make powerful computational pipelines accessible to all researchers.

Finally, we should promote rapid data-sharing. Data embargoes must not be permitted to delay analyses that directly benefit the conservation of endangered species, human health or progress in basic science. Genomic data should be shared as quickly as possible and without restrictions on use.

Numerous large-scale genome-sequencing efforts are now underway, including the Earth BioGenome Project 43 , Genome 10K 44 , the Vertebrate Genomes Project, Bat 1K 45 , Bird 10K 46 and DNA Zoo. As the number of genomes grows, so too will the usefulness of comparative genomics in disease research and the development of therapeutic strategies. Preserving, rather than merely recording, the biodiversity of the Earth must be a priority. Through global scientific collaborations, and by making genomic resources available and accessible to all research communities, we can ensure that the legacy of genomics is not a digital archive of lost species.

The number of samples (species) required to detect evolutionary conservation at a single base was estimated by applying a Poisson model of the distribution of substitution counts in the genome.

Species selection, sample shipping and regulatory approvals

Species were selected to maximize branch length across the eutherian mammal phylogeny, and to capture genomes of species from previously unrepresented eutherian families. Of 172 species initially selected for inclusion, we obtained sufficiently high-quality DNA samples for genome sequencing for 137. DNA samples from collaborating institutions were shipped to the Broad Institute ( n  = 69) or Uppsala University ( n  = 68). For samples received at the Broad Institute that were then sent to Uppsala, shipping approval was secured from the US Fish and Wildlife Service. Institutional Animal Care and Use Committee approval was not required.

Sample quality control, library construction and sequencing

DNA integrity for each sample was visualized via agarose gel (at the Broad Institute) or Agilent tape station (at Uppsala University). Samples passed quality control if the bulk of DNA fragments were greater than 5 kb. DNA concentration was then determined using the Invitrogen Qubit dsDNA HS assay kit. For each of the samples that passed quality control, 1–3 μg of DNA was fragmented on the Covaris E220 Instrument using the 400-bp standard programme (10% duty cycle, 140 PIP, 200 cycles per burst, 55 s). Fragmented samples underwent SPRI double-size selection (0.55×, 0.7 × f ) followed by PCR-free Illumina library construction following the manufacturer’s instructions (Kapa no. KK8232) using PCR-free adapters from Illumina (no. FC-121-3001). Final library fragment size distribution was determined on an Agilent 2100 Bioanalyzer with High Sensitivity DNA Chips. Paired-end libraries were pooled, and then sequenced on a single lane of the Illumina HiSeq2500, set for Version 2 chemistry and 2×250-bp reads. This yielded a mean of 375 million (s.d. = 125 million) reads per sample.

Assembly and validation

For each species, we applied DISCOVAR de novo 11 (discovardenovo-52488) (ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/) to assemble the 2×250-bp read group, using the following command: DiscovarDeNovo READS=[READFILE] OUT_DIR=[SPECIES_ID]//[SPECIES_ID].discovar_files NUM_THREADS=24 MAX_MEM_GB=200G.
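When many species are assembled, the command can be built programmatically. The helper below is a sketch: the function name and its parameters are hypothetical, and it uses only the four arguments quoted in the command above.

```python
from typing import List


def discovar_command(read_file: str, species_id: str,
                     threads: int = 24, max_mem_gb: str = "200G") -> List[str]:
    """Build the DiscovarDeNovo argument list for one species."""
    # Double slash reproduced as printed in the Methods; on POSIX systems
    # it resolves to the same path as a single slash.
    out_dir = f"{species_id}//{species_id}.discovar_files"
    return [
        "DiscovarDeNovo",
        f"READS={read_file}",
        f"OUT_DIR={out_dir}",
        f"NUM_THREADS={threads}",
        f"MAX_MEM_GB={max_mem_gb}",
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where DiscovarDeNovo is installed.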

Coverage for each genome was automatically calculated by DISCOVAR, with a mean coverage of 40.1× (s.d. ± 14×). We assessed genome assembly, gene set and transcriptome completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO), which provides quantitative measures on the basis of gene content from near-universal single-copy orthologues 50 . BUSCO was run with default parameters, using the mammalian gene model set (mammalia_odb9, n = 4,104), using the following command: python ./BUSCO.py -i [input fasta] -o [output_file] -l ./mammalia_odb9/ -m genome -c 1 -sp human.

Median contig N50 for existing RefSeq assemblies was calculated using the assembly statistics for the most recent release of 118 eutherian mammals with RefSeq assembly accession numbers. Assemblies were all classified as either reference genome or representative genome. Assembly statistics were downloaded from the NCBI on 10 April 2019.
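For reference, contig N50 is the contig length at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the total assembly length. The sketch below computes it from a list of contig lengths and then takes the median across assemblies. The function names and input layout are assumptions, since the published values came from NCBI assembly statistics rather than from code like this.

```python
from statistics import median


def n50(lengths):
    """Contig N50: length at which the running sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


def median_n50(assemblies):
    """Median of per-assembly N50 values; `assemblies` is a list of length lists."""
    return median(n50(lengths) for lengths in assemblies)
```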

Genome upgrades

We selected genomes from each eutherian order without a pre-existing long-contiguity assembly on the basis of (1) whether the underlying assembly met the minimum quality threshold needed for HiRise upgrades; and (2) whether a second sample of sufficient quality could be obtained from that individual. All upgrades were done with Dovetail Chicago libraries and assembled with HiRise 2.1, as previously described 51 .

Estimating heterozygosity

Selection of assemblies for heterozygosity analysis.

Heterozygosity statistics were calculated for all but four of our short-read assemblies ( n  = 126) as well as eight Dovetail-upgraded genomes. Four failed either because they were too fragmented to analyse ( n  = 3) or because of undetermined errors ( n  = 1). One assembly was excluded because it was a second individual from a species that was already represented.

Heterozygosity calls

We applied the standard GATK pipeline with genotype quality banding to identify the callable fraction of the genome 52 , 53 . First, we used samtools to subsample paired reads from the unmapped .bam files 54 . After removing adaptor sequences from the selected reads, we used BWA-MEM to map reads to the reference genome scaffolds of >10 kb, removing duplicates using the PicardTools MarkDuplicates utility 55 . We then called heterozygous sites using standard GATK-Haplotypecaller specifications, and with additional gVCF banding at 0, 10, 20, 30, 40, 50 and 99 qualities. We used the fraction of the genome with genotype quality >15 for subsequent analyses. For the lists of high-confidence variant sites, we include only heterozygous positions after filtering at GQ >20, maximum DP <100, minimum DP >6, as described in the README file at http://broad.io/variants .
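The thresholds for the high-confidence variant lists can be expressed as a simple predicate. This is a sketch only: the function name and the representation of a genotype as a pair of allele indices are assumptions, and the actual filtering is described in the README referenced above.

```python
def is_high_confidence_het(genotype, gq, dp, min_gq=20, min_dp=6, max_dp=100):
    """Return True for a heterozygous call with GQ > 20 and 6 < DP < 100.

    `genotype` is a pair of allele indices, e.g. (0, 1) for a 0/1 call.
    """
    first, second = genotype
    is_het = first != second
    return is_het and gq > min_gq and min_dp < dp < max_dp
```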

Inferring overall heterozygosity

To avoid confounding by sex chromosomes or complex regions, we excluded all scaffolds with less than 0.5× or greater than 2× the average sample read depth, then calculated global heterozygosity as the fraction of heterozygous calls over the whole callable genome.
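A minimal sketch of this calculation follows. The dictionary keys and function name are hypothetical; each scaffold is assumed to carry its mean read depth, its count of heterozygous sites and its count of callable bases.

```python
def global_heterozygosity(scaffolds, mean_depth, low=0.5, high=2.0):
    """Heterozygous calls per callable base, skipping depth-outlier scaffolds.

    `scaffolds` is an iterable of dicts with the (hypothetical) keys
    "depth", "het_sites" and "callable_bp".
    """
    het_sites = 0
    callable_bp = 0
    for scaffold in scaffolds:
        ratio = scaffold["depth"] / mean_depth
        if ratio < low or ratio > high:
            continue  # likely sex-linked or complex region
        het_sites += scaffold["het_sites"]
        callable_bp += scaffold["callable_bp"]
    if callable_bp == 0:
        raise ValueError("no callable bases left after depth filtering")
    return het_sites / callable_bp
```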

Calling SoH

We estimated the proportion of the genome within SoH using a metric designed for genomes with scaffold N50 shorter than the expected maximum length of runs of homozygosity (our median scaffold N50 is 62 kb). We first split all scaffolds into windows with a maximum length of 50 kb, with windows ranging from 20 to 50 kb for scaffolds <50 kb. For each window, we calculated the average number of heterozygous sites per bp. We discriminated windows with extremely low heterozygosity by using the Python 3.5.2 pomegranate package to fit a two-component Gaussian mixture model to the joint distribution of window heterozygosity, forcing the first component to be centred around the lower tail of the distribution and allowing the second to freely capture all the remaining heterozygosity variability 56 , 57 . As heterozygosity cannot be negative, and normal distributions near zero can cross into negative values, we used the normal cumulative distribution function to correct the posterior distribution by the negative excess—effectively fitting a truncated normal to the first component. The final SoH value was calculated using the posterior maximum likelihood classification between both components. We saw no significant correlation between contig N50 and SoH (Pearson correlation = 0.055, P  = 0.57, n  = 112).
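The windowing step can be sketched as follows. The text does not say how leftover pieces shorter than 20 kb are handled, so this sketch assumes they are dropped. The function names are hypothetical, and the Gaussian mixture fit itself is not shown.

```python
def split_into_windows(scaffold_len, max_win=50_000, min_win=20_000):
    """Split one scaffold into half-open (start, end) windows of 20 to 50 kb.

    Assumption: a trailing piece (or whole scaffold) shorter than
    `min_win` is discarded.
    """
    windows = []
    start = 0
    while scaffold_len - start >= min_win:
        end = min(start + max_win, scaffold_len)
        windows.append((start, end))
        start = end
    return windows


def window_heterozygosity(scaffold_len, het_positions):
    """Average number of heterozygous sites per bp in each window."""
    values = []
    for start, end in split_into_windows(scaffold_len):
        count = sum(1 for pos in het_positions if start <= pos < end)
        values.append(count / (end - start))
    return values
```

The per-window values returned here are the quantity to which the two-component mixture model would then be fitted.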

Assessing the effect of the percentage of callable genome

We assessed whether the percentage of the genome that was callable (Supplementary Table 3) was likely to affect our analysis. The callable percentage was correlated with heterozygosity ( r  = −0.80, P  < 2.2 × 10⁻¹⁶, n  = 130), and weakly with SoH values ( r  = 0.18, P  = 0.06, n  = 112). There is no significant difference in callable percentage among IUCN categories (analysis of variance P  = 0.98, n  = 122) or between captive and wild populations ( t -test P  = 0.81, n  = 120).

Analysing patterns of diversity

We excluded two genomes with exceptionally high heterozygosity (heterozygosity >0.02; >5 s.d. above the mean). Both were of non-endangered species, and thus removing them made our determination of lower heterozygosity in endangered species more conservative. Of the remaining 124 genomes, we excluded 19 with allelic balance values that were more than one s.d. above the mean (>0.36). Abnormally high allelic balance can indicate sequencing biases with potential for artefacts in estimates of heterozygosity and/or SoH. Our final dataset contains heterozygosity values for 105 genomes and SoH values for 98 genomes (Supplementary Table 3). For seven genomes, we were unable to estimate SoH because the two components of the Gaussian mixture model overlapped completely. To test for a possible directional relationship between level of IUCN concern and overall heterozygosity or SoH, we applied regression using the IUCN category as an ordinal predictor. We also examined the relationship of diversity metrics to a set of species-level phenotypes for which correlations were previously reported (Extended Data Table 3).

The alignment was generated using the progressive mode of Cactus 37 , 58 . The topology used for the guide tree of the alignment was taken from TimeTree 47 ; the branch lengths of the guide tree were generated by a least-squares fit from a distance matrix. The distance matrix was based on the UCSC 100-way phyloP fourfold-degenerate site tree 38 for those species that had corresponding entries in the 100-way tree. For species not present in the 100-way tree, distance matrix entries were more coarsely estimated using the distance estimated from Mash 59 to the closest relative included in the 100-way data.

Cactus does not attempt to fully resolve the gene tree when multiple duplications take place along a single branch, as there is an implicit restriction in Cactus that a duplication event be represented as multiple regions in the child species aligned to a single region in the parent species. This precludes representing discordance between the gene tree and species tree that could occur with incomplete lineage sorting or horizontal transfer. However, the guide tree has a minimal effect on the alignment, with little difference between alignments with different trees, even when using a tree that is purposely wrong 37 . Phenomena such as incomplete lineage sorting that affect a subset of species are unlikely to substantially affect the detection of purifying selection across the whole eutherian lineage described in Fig. 3.

The alignment was generated in several steps, on account of its large scale. First, a backbone alignment of several long contiguity assemblies was generated, using the genomes of two non-placental mammals (Tasmanian devil ( Sarcophilus harrisii ) and platypus ( Ornithorhynchus anatinus )), to inform the reconstruction of the placental root. Next, separate clade alignments were generated for each major clade in the alignment: Euarchonta, Glires, Laurasiatheria, Afrotheria and Xenarthra. The roots of these clade alignments were then aligned to the corresponding ancestral genomes from the backbone, stitching these alignments together to create the final alignment. The process of aligning a genome to an existing ancestor is complex and further described in an accompanying Article that introduces the progressive mode of Cactus 37 .

We created a neutral model for the conservation analysis using ancestral repeats detected by RepeatMasker 60 on the eutherian ancestral genome produced in the Cactus alignment (tRNA and low-complexity repeats were removed). To fit the neutral model, we used phyloFit from the PHAST 61 package, using the REV (generalized reversible) model and EM optimization method. The training input was a MAF exported on columns from the set of ancestral repeats mentioned above. Because phyloFit does not support alignment columns that contain duplicates, if a genome had more than one sequence in a single alignment block, these were replaced with a single entry representing the consensus base at each column.
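The duplicate-collapsing step can be sketched as a per-column majority vote. The text does not specify how gaps or ties are treated, so this sketch assumes that gap characters are ignored when any base is present and that ties go to the base seen first. The function name is hypothetical.

```python
from collections import Counter


def collapse_duplicates(rows):
    """Collapse several aligned rows from one genome into one consensus row.

    `rows` are equal-length strings from a single alignment block.
    Assumptions: "-" is ignored when any base is present in the column,
    and ties are resolved in favour of the base encountered first.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    consensus = []
    for column in zip(*rows):
        bases = [base for base in column if base != "-"]
        if not bases:
            consensus.append("-")
        else:
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)
```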

We extracted initial conservation scores using phyloP from the PHAST 61 package on a MAF exported using human as a reference. We converted the phyloP scores (which represent log-scaled P  values of acceleration or conservation) into P  values, then into q  values using the FDR-correction of Benjamini and Hochberg 62 . Any column with a resulting q  value less than 0.05 was deemed significantly conserved or accelerated.
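The conversion from scores to q values can be sketched as follows, assuming that each phyloP score is a signed −log10 P value (positive for conservation, negative for acceleration) and that the Benjamini and Hochberg step-up adjustment is applied across all columns at once. The function names are hypothetical.

```python
def phylop_to_p(score):
    """Convert a signed -log10(P) phyloP score to a P value."""
    return 10.0 ** (-abs(score))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


def significant_columns(scores, threshold=0.05):
    """Flag columns whose q value is below the threshold."""
    q_values = benjamini_hochberg([phylop_to_p(s) for s in scores])
    return [q < threshold for q in q_values]
```

The sign of the original score can then be used to label each significant column as conserved or accelerated.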

The alignment and conservation annotations are available at https://cglgenomics.ucsc.edu/data/cactus/ .

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this paper.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-020-2876-6.

Supplementary information

This file contains Supplementary Tables 1-3.

Acknowledgements

We thank the many individuals who provided samples and advice, including C. Adenyo, C. Avila, E. Baitchman, R. Behringer, A. Boyko, M. Breen, K. Campbell, P. Campbell, C. J. Conroy, K. Cooper, L. M. Dávalos, F. Delsuc, D. Distel, C. A. Emerling, J. Fronczek, N. Gemmel, J. Good, K. He, K. Helgen, A. Hindle, H. Hoekstra, R. Honeycutt, P. Hulva, W. Israelsen, B. Kayang, R. Kennerley, M. Korody, D. N. Lee, E. Louis, M. MacManes, A. Misuraca, A. Mitelberg, P. Morin, A. Mouton, M. Murayama, M. Nachman, A. Navarro, R. Ogden, B. Pasch, S. Puechmaille, T. J. Robinson, S. Rossiter, M. Ruedi, A. Seifert, S. Thomas, S. Turvey, G. Verbeylen and the late R. J. Baker. We also thank the Broad Institute Genomics Platform and SNP & SEQ Technology Platform (part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory) and Swedish National Infrastructure for Computing (SNIC) at Uppmax. This project was funded by NIH NHGRI R01HG008742 (E.K.K., B.B., D.P.G., R.S., J.T.-M., J.J., H.J.N., B.P. and J. Armstrong), Swedish Research Council Distinguished Professor Award (K.L.-T., V.D.M., E.M. and J.R.S.M.), Swedish Research Council grant 2018-05973 (K.L.-T.), Knut and Alice Wallenberg Foundation (K.L.-T., V.D.M., E.M. and J.R.S.M.), Uppsala University (K.L.-T., V.D.M., E.M., J.R.S.M., J.J., J. Alfoldi and L.G.), Broad Institute Next10 (L.G.), Gladstone Institutes (K.S.P.), NIH NHGRI 5R01HG002939 (A.F.A.S. and R.H.), NIH NHGRI 5U24HG010136 (A.F.A.S. and R.H.), NIH NHGRI 5R01HG010485 (B.P. and M.D.), NIH NHGRI 2U41HG007234 (B.P., M.D. and J. Armstrong), NIH NIA 5PO1AG047200 (V.N.G.), NIH NIA 1UH2AG064706 (V.N.G.), BFU2017-86471-P MINECO/FEDER, UE (T.M.-B.), Secretaria d’Universitats i Recerca and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya GRC 2017 SGR 880 (T.M.-B.), Howard Hughes International Early Career (T.M.-B.), European Research Council Horizon 2020 no. 864203 (T.M.-B.), Obra Social ‘La Caixa’ (T.M.-B.), BBSRC BBS/E/T/000PR9818, BBS/E/T/000PR9783 (W.H. and W.N.), BBSRC Core Strategic Programme Grant BB/P016774/1 (W.H., W.N. and F.D.), Sir Henry Dale Fellowship 200517/Z/16/Z jointly funded by the Wellcome Trust and the Royal Society (N.R.C.), FJCI-2016-29558 MICINN (D.J.), Prince Albert II Foundation of Monaco and Canada, Global Genome Initiative, Smithsonian Institution (M.N.), European Research Council Research Grant ERC-2012-StG311000 (E.C.T.), Irish Research Council Laureate Award (E.C.T.), UK Medical Research Council MR/P026028/1 (W.H. and W.N.), National Science Foundation DEB-1457735 (M.S.S.), National Science Foundation DEB-1753760 (W.J.M.), National Science Foundation IOS-2029774 (E.K.K. and D.P.G.), Robert and Rosabel Osborne Endowment (H.A.L. and J.D.), Swedish Research Council, FORMAS 221-2012-1531 (J.R.S.M.), NSF RoL: FELS: EAGER: DEB 1838283 (D.A.R.) and Academy of Finland grant to Center of Excellence in Tumor Genetics Research no. 312042 (T.K. and J.T.).

Author contributions.

K.L.-T. conceived the project. J.J., V.D.M., E.M., N.R.C., L.G.C., J.D., V.N.G., M.L.H., K.-P.K., J.R.S.M., W.J.M., M.N., D.A.R., R.S., E.C.T., J. Alfoldi, O.A.R., H.A.L., K.L.-T. and E.K.K. contributed to the acquisition of the samples. J.J., V.D.M., E.M., J.D., L.G., K.-P.K., H.J.N., C.C.S., R.S., J.T.-M., J. Alfoldi, O.A.R., H.A.L., K.L.-T. and E.K.K. contributed to the production of the genome assemblies. D.P.G., A.S., J. Armstrong, J.J., D.J., I.T.F., L.F.K.K., H.A.L., T.M.-B., K.L.-T. and E.K.K. contributed to the data analysis. D.P.G., J.J., V.D.M., G.B., F.D.P., M.D., I.T.F., M.G., V.N.G., W.H., R.H., T.K., E.S.L., J.R.S.M., A.R.P., K.S.P., A.F.A.S., M.S.S., J.T., J. Alfoldi, B.B., O.A.R., H.A.L., B.P., T.M.-B., K.L.-T. and E.K.K. contributed to the design and conduct of the project. D.P.G., E.S.L., W.N., B.S., O.A.R., K.L.-T. and E.K.K. wrote the manuscript, with input from all other authors.

Competing interests

L.G. is a co-founder of, equity owner in and chief technical officer at Fauna Bio Incorporated.

Peer review information Nature thanks Chris Ponting, Steven Salzberg, Guojie Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A list of authors and their affiliations appears at the end of the paper

These authors contributed equally: Kerstin Lindblad-Toh, Elinor K. Karlsson

Contributor Information

Diane P. Genereux

Aitor Serres

2 Institute of Evolutionary Biology (UPF-CSIC), PRBB, Barcelona, Spain

Joel Armstrong

3 UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA USA

Jeremy Johnson

Voichita D. Marinescu

4 Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden

Eva Murén

Gill Bejerano

5 Department of Biomedical Data Science, Stanford University, Stanford, CA USA

6 Department of Computer Science, Stanford University, Stanford, CA USA

7 Department of Developmental Biology, Stanford University, Stanford, CA USA

8 Department of Pediatrics, Stanford University, Stanford, CA USA

Nicholas R. Casewell

9 Centre for Snakebite Research and Interventions, Liverpool School of Tropical Medicine, Liverpool, UK

Leona G. Chemnick

10 San Diego Zoo Global, Beckman Center for Conservation Research, San Diego, CA USA

Joana Damas

11 The UC Davis Genome Center, University of California, Davis, Davis, CA USA

Federica Di Palma

12 Department of Biological Sciences, University of East Anglia, Norwich, UK

13 Earlham Institute, Norwich, UK

Mark Diekhans

Ian T. Fiddes

Manuel Garber

14 Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA USA

Vadim N. Gladyshev

15 Brigham and Women’s Hospital, Harvard Medical School, Boston, MA USA

Linda Goodman

16 Fauna Bio Incorporated, Emeryville, CA USA

Wilfried Haerty

Marlys L. Houck

Robert Hubley

17 Institute for Systems Biology, Seattle, WA USA

Teemu Kivioja

18 Department of Biochemistry, University of Cambridge, Cambridge, UK

19 Applied Tumor Genomics Research Program, University of Helsinki, Helsinki, Finland

Klaus-Peter Koepfli

20 Smithsonian-Mason School of Conservation, Front Royal, VA USA

Lukas F. K. Kuderna

Eric S. Lander

21 Department of Biology, MIT, Cambridge, MA USA

22 Department of Systems Biology, Harvard Medical School, Boston, MA USA

Jennifer R. S. Meadows

William J. Murphy

23 Veterinary Integrative Biosciences, Texas A&M University, College Station, TX USA

Hyun Ji Noh

Martin Nweeia

24 Marine Mammal Program, Smithsonian Institution, Washington, DC USA

25 Restorative Dentistry and Biomaterials Sciences, Harvard School of Dental Medicine, Boston, MA USA

26 School of Dental Medicine, Case Western Reserve University, Cleveland, OH USA

Andreas R. Pfenning

27 Carnegie Mellon University, School of Computer Science, Department of Computational Biology, Pittsburgh, PA USA

Katherine S. Pollard

28 Chan Zuckerberg Biohub, San Francisco, CA USA

29 Gladstone Institutes, San Francisco, CA USA

30 Department of Epidemiology and Biostatistics, Institute for Computational Health Sciences and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA USA

David A. Ray

31 Department of Biological Sciences, Texas Tech University, Lubbock, TX USA

Beth Shapiro

32 Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA USA

33 Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA USA

Arian F. A. Smit

Mark S. Springer

34 Department of Evolution, Ecology and Organismal Biology, University of California, Riverside, Riverside, CA USA

Cynthia C. Steiner

Ross Swofford

Jussi Taipale

35 Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden

Emma C. Teeling

36 School of Biology and Environmental Science, University College Dublin, Dublin, Ireland

Jason Turner-Maier

Jessica Alfoldi

Bruce Birren

Oliver A. Ryder

37 Department of Evolution, Behavior, and Ecology, Division of Biology, University of California, San Diego, La Jolla, CA USA

Harris A. Lewin

38 Department of Evolution and Ecology, University of California, Davis, Davis, CA USA

Benedict Paten

Tomas Marques-Bonet

39 Catalan Institution of Research and Advanced Studies (ICREA), Barcelona, Spain

40 Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Barcelona, Spain

41 CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, Spain

Kerstin Lindblad-Toh

Elinor K. Karlsson

42 Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA USA

Elinor K. Karlsson, Email: elinor@broadinstitute.org.

Extended data is available for this paper at 10.1038/s41586-020-2876-6.

Comparative Genomics

Digging for Data

Matthew B. Avison

Part of the book series: Methods in Molecular Biology™ ((MIMB,volume 266))

Comparative genomics is a science in its infancy. It has been driven by a huge increase in freely available genome-sequence data, and the development of computer techniques to allow whole-genome sequence analyses. Other approaches, which use hybridization as a method for comparing the gene content of related organisms, are rising alongside these more bioinformatic methods. All these approaches have been pioneered using bacterial genomes because of their simplicity and the large number of complete genome sequences available. The aim of bacterial comparative genomics is to determine what genotypic differences are important for the expression of particular traits (e.g., antibiotic resistance, virulence, or host preference). The benefits of such studies will be a deeper understanding of these phenomena; the possibility of exposing novel drug targets, including those for antivirulence drugs; and the development of molecular techniques that reveal patients who are infected with virulent organisms so that health care resources can be allocated appropriately. With more and more genome sequences becoming available, the rise of comparative genomics continues apace.

Author information

Authors and affiliations.

Department of Biochemistry, University of Bristol, School of Medical Sciences, Bristol, UK

Matthew B. Avison

Editor information

Editors and affiliations.

Antibiotic Resistance Monitoring and Reference Laboratory, Specialist and Reference Microbiology Division, Health Protection Agency-Colindale, London, UK

Neil Woodford & Alan P. Johnson

Copyright information

© 2004 Humana Press Inc.

About this protocol

Avison, M.B. (2004). Comparative Genomics. In: Woodford, N., Johnson, A.P. (eds) Genomics, Proteomics, and Clinical Bacteriology. Methods in Molecular Biology™, vol 266. Humana Press. https://doi.org/10.1385/1-59259-763-7:047

DOI : https://doi.org/10.1385/1-59259-763-7:047

Publisher Name : Humana Press

Print ISBN : 978-1-58829-218-6

Online ISBN : 978-1-59259-763-5

eBook Packages : Springer Protocols

Springer Nature Experiments

Comparative Genome Annotation

Series: Methods In Molecular Biology > Book: Comparative Genomics

Overview | DOI: 10.1007/978-1-4939-7463-4_6

  • Institut für Mathematik und Informatik, Ernst Moritz Arndt Universität Greifswald, Greifswald, Germany

Newly sequenced genomes are being added to the tree of life at an unprecedented pace. Increasingly, such new genomes are phylogenetically close to previously sequenced and annotated genomes. In other cases, whole clades of closely related species or strains ought to be annotated simultaneously. Subsequent studies often focus on the differences between the closely related species or strains, even though shared gene structures prevail. We here review methods for comparative structural genome annotation. The reviewed methods include classical approaches such as the alignment of protein sequences or protein profiles against the genome and comparative gene prediction methods that exploit a genome alignment to annotate a target genome. Newer approaches such as the simultaneous annotation of multiple genomes are also reviewed. We discuss how the methods depend on the phylogenetic placement of genomes, give advice on the choice of methods, and examine the consistency between gene structure annotations in an example. Further, we provide practical advice on genome annotation in general.

ORIGINAL RESEARCH article

A comparative genomics study of the microbiome and freshwater resistome in Southern Pantanal

André R. de Oliveira

  • 1 Laboratório de Biologia Computacional e Sistemas, Instituto Oswaldo Cruz, Rio de Janeiro, Brazil
  • 2 Universidade Federal do Mato Grosso do Sul, Campo Grande, Brazil
  • 3 Universidade Federal do Mato Grosso do Sul, Aquidauana, Brazil

This study explores the resistome and bacterial diversity of two small lakes in the Southern Pantanal, one in Aquidauana sub-region, close to a farm, and one in Abobral sub-region, an environmentally preserved area. Shotgun metagenomic sequencing data from water column samples collected near and far from the floating macrophyte Eichhornia crassipes were used. The Abobral small lake exhibited the highest diversity and abundance of antibiotic resistance genes (ARGs), antibiotic resistance classes (ARGCs), phylum, and genus. RPOB2 and its resistance class, multidrug resistance, were the most abundant ARG and ARGC, respectively. Pseudomonadota was the dominant phylum across all sites, and Streptomyces was the most abundant genus considering all sites.

1 Introduction

Antibiotic resistance is emerging as a significant global public health issue due to the swift rise of resistant bacteria and the concurrent decline in new drugs entering the market ( Serwecińska, 2020 ). While resistance is a natural phenomenon in microbial communities, where it serves as a form of competition ( Frost et al., 2018 ), its effects can be amplified in environments with high antibiotic concentrations. These environments include livestock farms ( Qian et al., 2018 ), aquaculture facilities ( Preena et al., 2020 ), hospital effluents ( Hassoun-Kheir et al., 2020 ), and wastewater treatment plants ( Raza et al., 2022 ). Bacteria possess horizontal gene transfer mechanisms (integrons, mobile genetic elements (MGEs), and plasmids) that facilitate the spread of antibiotic resistance genes (ARGs) within the community ( Sun et al., 2019 ). Consequently, a pathogenic species can develop resistance to a specific antibiotic without direct exposure to it.

Although antibiotic resistance in pathogenic bacteria is well-studied, these bacteria represent only a small fraction of the microbial community ​( Doron and Gorbach, 2008 )​​. Therefore, our understanding of the antibiotic-resistance genes in non-pathogenic bacteria, particularly those inhabiting rivers, lakes, soils, and oceans, remains limited. This gap in knowledge is due to the difficulty in cultivating these bacteria using current protocols. Metagenomics, which allows for the analysis of a microbial community through sequencing of environmental genetic material without the need for cultivation ​( Doron and Gorbach, 2008 )​​, has emerged as a promising solution.

Given the central role these microorganisms play in biogeochemical processes, studies on this topic have increased. Understanding the resistome of non-culturable bacterial communities is crucial for identifying potential gene reservoirs that could contribute to the evolution and spread of antibiotic resistance ​( Fresia et al., 2019 )​.​

The significance of biotic antibiotic removal mechanisms, often carried out by microorganisms, is highlighted in this context. These mechanisms are intertwined with various plant-based processes, a phenomenon known as phytoremediation. Certain plants have demonstrated the ability to eliminate and tolerate high levels of antibiotics without experiencing toxic effects ​( Kurade et al., 2021 ; Polińska et al., 2021 )​​. As a result, the microbial community present is shaped by the presence of plants and their impact on the rhizosphere, which in turn affects antibiotic resistance.

Our study is centered on the Pantanal Sul Matogrossense, one of the world’s largest wetland ecosystems ​​( Assine, 2015 )​​. This biome covers parts of Brazil (78%), Bolivia (18%), and Paraguay (4%). Despite its recognition, it has been significantly impacted by livestock activities, which often involve the use of large quantities of antibiotics ​​( Ferrante and Fearnside, 2022 )​. This region, abundant in water, serves as an efficient medium for the spread of Mobile Genetic Elements (MGE). The vast water systems in the Pantanal enhance the potential for ARGs to disseminate widely ​( Aminov and Mackie, 2007 ; Baquero et al., 2008 )​​. Hence, the examination of this region’s resistome is of paramount importance from both a public health and academic perspective.

This study explores the resistome and taxonomy of filtered water samples from two small lakes in the Pantanal. One small lake is in a farm area and the other in a reserve; in each, samples were taken both near and far from the macrophyte Eichhornia crassipes (Mart.) Solms. We analyze the sequencing data to determine the identity and diversity of the antibiotic resistance genes (ARGs) and classify the samples taxonomically. This is the first time a metagenomic approach has been used to study the bacterial resistome of this area.

2.1 Sampling

The sampling sites comprised two small lakes located in the Pantanal Sul Matogrossense: one within an environmental reserve unit in the Pantanal de Abobral subregion ( Figure 1A ), municipality of Corumbá (19°34′35″S 57°00′46″W), and the other in a farm region in the Pantanal de Aquidauana subregion ( Figure 1B ), municipality of Aquidauana (20°12′30″S 55°46′29″W). The two lakes are 147 km apart ( Figure 2 ). In each small lake, water column samples were collected near the floating macrophyte ( Eichhornia crassipes ) and at a distance of 10 m from it. Access to genetic samples was registered at the Brazilian SisGen under the code AB3AE33. The sequencing files were submitted to the NCBI BioProject database under the code PRJNA1078255.

Figure 1 . Map of the sample collection location. The small lakes are highlighted by a red circle. Image (A) is from Abobral and (B) is from Aquidauana.

Figure 2 . Map displaying the distance between the small lakes.

For clarity and objectivity, the sample from Abobral near the macrophyte will be referred to as “Site 1”, Abobral distant from the macrophyte as “Site 2”, Aquidauana near the macrophyte as “Site 3”, and Aquidauana distant from the macrophyte as “Site 4”. Each sample consisted of 10 L of water, collected in autoclaved bottles and subsequently stored at 4°C.

In the laboratory, the water samples were filtered through 1.2 µm, 0.8 µm, and 0.45 µm membranes, but only the data from the 0.45 µm membrane were used in this work. After DNA extraction with the QIAGEN DNeasy PowerWater Kit, the genetic material extracted from each 10 L water sample was sent to the Sequencing Platform of the Histocompatibility and Cryopreservation Laboratory of the State University of Rio de Janeiro (UERJ) for shotgun sequencing on the Illumina HiSeq 2500 platform. The sequenced data were then stored on the servers of the Laboratory of Computational Biology and Systems at Oswaldo Cruz Institute/Fiocruz and submitted to NCBI SRA (PRJNA1078255).

2.2 Data analysis

The sequencing data were processed using two sequence analysis tools: the MetaWRAP pipeline (version 1.3) ( Uritskiy et al., 2018 ) and DeepARG (version 2.0) ( Arango-Argoty et al., 2018 ). The MetaWRAP pipeline was used for the following steps: i) quality verification of the sequences using FastQC, ii) sequence cleaning with Trimmomatic, iii) taxonomic inference using Kraken, and iv) taxonomy visualization with Krona. The reads cleaned by Trimmomatic were then analyzed with DeepARG to predict Antibiotic Resistance Genes (ARGs) and their corresponding Antibiotic Resistance Classes (ARGCs), following the classification scheme provided by the tool.

All statistical analyses were performed using the R programming language. The descriptive statistics, including all diversity indices, were computed using the vegan package (version 2.15–1) ( Dixon, 2003 ). The graphics were generated with the ggplot2 package (version 3.4.4).

Due to the large volume of data, a selection criterion was established for the ARGs, ARGCs, phylum, and genus to be included in the graphical analysis. The criteria for inclusion were a relative abundance of 5% or higher for ARGs, ARGCs, and phylum in at least one sample site. For genus, the threshold was set at a relative abundance of 1% or higher. This adjustment was necessary as only two genera had a relative abundance higher than 5%. This approach ensured a manageable and representative subset of data for graphical analysis.
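
The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R code: the function names, the `toy_counts` dictionary, and all read counts in it are hypothetical and were invented for the example.

```python
def relative_abundance(counts):
    """Convert raw counts {feature: count} into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


def select_features(site_counts, threshold):
    """Keep features whose relative abundance reaches `threshold` in at least one site."""
    selected = set()
    for counts in site_counts.values():
        for name, share in relative_abundance(counts).items():
            if share >= threshold:
                selected.add(name)
    return sorted(selected)


# Hypothetical read counts per site (illustration only, not the study's data).
toy_counts = {
    "Site 1": {"RPOB2": 60, "BACA": 10, "UGD": 10, "GENE_X": 4, "GENE_Y": 16},
    "Site 3": {"RPOB2": 74, "BACA": 4, "UGD": 2, "GENE_X": 2, "GENE_Y": 18},
}

# 5% threshold as used for ARGs, ARGCs, and phyla; 1% as used for genera.
plotted_at_5_percent = select_features(toy_counts, 0.05)
plotted_at_1_percent = select_features(toy_counts, 0.01)
```

Because a feature only needs to pass the threshold at one site, a gene such as the hypothetical BACA is retained even though it falls below 5% at the second site.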

The inferential statistics involved testing for normality using the Shapiro-Wilk test, Lilliefors test, and QQ plot. Based on these normality tests, only non-parametric tests were appropriate for our data. Therefore, group comparisons were made using the Kruskal–Wallis test, followed by a post hoc Dunn’s test with Bonferroni correction.
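
To make the group comparison concrete, the sketch below computes the Kruskal-Wallis H statistic (with the usual tie correction) and applies a Bonferroni adjustment to a list of p-values. It is an illustrative pure-Python version under stated assumptions: the function names are hypothetical, the p-value lookup and Dunn's pairwise statistic are deliberately left out, and the study itself performed these tests in R.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples, corrected for ties."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    if len(groups) < 2 or n < 2:
        raise ValueError("need at least two groups and two observations")
    ranks = average_ranks(pooled)
    weighted_rank_sums = 0.0
    offset = 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        offset += len(group)
    h = 12.0 / (n * (n + 1)) * weighted_rank_sums - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n) over groups of tied values.
    tie_sizes = {}
    for value in pooled:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_sizes.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Under the null hypothesis, H is compared against a chi-square distribution with (number of groups − 1) degrees of freedom; the Bonferroni step is what keeps the family-wise error rate at the chosen level when several pairwise comparisons follow.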

These comparisons were performed on alpha and beta diversity indices for ARGs, ARGCs, and the diversity of phyla and genera. Given that the Abobral and Aquidauana small lakes do not have direct contact, and their communities are therefore isolated from each other, beta diversity was analyzed only among locations that establish a habitat gradient ( Whittaker, 1972 ), that is, within each isolated small lake, taking into account only the presence and absence of the macrophyte. Bootstrap resampling (R = 100) was used to enhance the reliability of the results. All tests were considered significant at p < 0.05.
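
One way to picture the bootstrap step is shown below. The text does not state what exactly was resampled, so this sketch assumes that individual reads are redrawn with replacement from the observed counts and that a diversity index is recomputed for each of the 100 replicates. The helper names and the toy counts are hypothetical.

```python
import random


def gini_simpson(counts):
    """Gini-Simpson diversity, 1 - sum(p_i^2), from a list of counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def bootstrap_index(counts, index, replicates=100, seed=0):
    """Resample reads with replacement and recompute `index` for each replicate.

    Assumption: each read is drawn independently, with probability
    proportional to the observed count of its category.
    """
    rng = random.Random(seed)
    labels = list(range(len(counts)))
    total = sum(counts)
    values = []
    for _ in range(replicates):
        draws = rng.choices(labels, weights=counts, k=total)
        resampled = [0] * len(counts)
        for label in draws:
            resampled[label] += 1
        values.append(index(resampled))
    return values


# Hypothetical counts for four categories at one site (illustration only).
toy_site = [50, 30, 15, 5]
replicate_values = bootstrap_index(toy_site, gini_simpson, replicates=100, seed=1)
```

The resulting list of replicate values is the kind of input that boxplots and rank-based tests can then be applied to; fixing the seed makes the resampling reproducible.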

3.1 Diversity of ARGs and ARGCs

Summed across all sites, 232 ARG and 66 ARGC occurrences were found, corresponding to 103 unique ARGs and 21 unique Antibiotic Resistance Gene Classes (ARGCs). Comparing all sites, Site 1 stood out with the most ARGs (74), ARGCs (19), and the highest number of reads (1,685). On the other hand, Site 3 had the least, with 48 ARGs, 14 ARGCs, and 874 reads ( Table 1 ).

Table 1 . Number of ARGs, ARGCs, reads, phylum and genus found in each Site.

A comparative analysis between the small lakes revealed that Abobral’s small lake (Sites 1 and 2) had a greater diversity of unique ARGs (84) and ARGCs (21) than Aquidauana’s small lake (Sites 3 and 4), which had 69 ARGs and 17 ARGCs.

When considering the presence (Sites 1 and 3) and absence (Sites 2 and 4) of the macrophyte, its presence seemed to slightly increase the number of different ARGs (82 vs. 79) but decrease the number of ARGCs (19 vs. 21). The co-occurrence of ARGs and ARGCs, whether together or isolated, is illustrated in Supplementary Figures S1 and S2 . At least 24 ARGs and 11 known ARGCs occurred in all sites.

The most abundant ARG was RPOB2, accounting for more than half of the total reads in all samples, with Site 3 having the highest percentage (73.8%). Other ARGs that had high percentages compared to the others were BACA and UGD in Sites 1 and 2, both accounting for approximately 10% of the total reads in each site ( Figure 3 ).

Figure 3 . Antibiotic Resistance Genes (ARGs) which had relative abundance equal to or greater than 5% in at least one site. The y -axis represents the percentage of relative abundance (0.75 = 75%).

Given that RPOB2 was the most abundant ARG across all samples, its corresponding antibiotic resistance class, Multidrug Resistance, was consequently the most prevalent. The Bacitracin and Peptide classes followed, reflecting the proportions of the genes BACA and UGD ( Figure 4 ). The richness and abundance of all ARGs and ARGCs present in each site can be found in Supplementary Figures S2 and S4 .

Figure 4 . Antibiotic Resistance Classes (ARGCs) which had relative abundance equal to or greater than 5% in at least one site. The y -axis represents the percentage of relative abundance.

Summed across all sites, 98 phylum and 1,779 genus occurrences were found, corresponding to 31 unique phyla and 917 unique genera. Comparing all sites, Sites 1 and 4 stood out with the most unique phyla (26) and Site 2 with the most genera (485). On the other hand, Site 3 had the least, with 22 phyla and 384 genera ( Table 1 ).

Similarly to the comparative analysis of ARGs and ARGCs, Abobral’s small lake (Sites 1 and 2) had a greater diversity of unique phyla (31) and genera (687) than Aquidauana’s small lake (Sites 3 and 4), which had 27 phyla and 576 genera. When considering the presence (Sites 1 and 3) and absence (Sites 2 and 4) of the macrophyte, its presence also seemed to slightly increase the number of different phyla (32 vs. 29) but decrease the number of genera (653 vs. 700).

Pseudomonadota is the dominant phylum across all sites, with relative abundance varying from 0.40 to 0.43. Actinobacteria follows as the second most abundant, with a slight increase in proportion at Site 4 (0.28) compared to the range of 0.22–0.25 at the other sites. Bacteroidota and Firmicutes exhibit similar distributions, but Bacteroidota shows a notable decrease at Site 3 and Site 4 (around 0.06) compared to Site 1 and Site 2 (around 0.13). Cyanobacteriota, although the least abundant, shows a significant increase at Site 3 (0.08) compared to Site 2 (0.02) ( Figure 5 ). More information about all the phyla found in all sites can be seen in Supplementary Material ( Supplementary Figure S5 ).

Figure 5 . Phyla which had relative abundance equal to or greater than 5% in at least one site. The y -axis represents the percentage of relative abundance.

At the genus level, Streptomyces and Pseudomonas were relatively abundant across all sites and displayed a slight increase at Site 4 and Site 3, respectively. Mycobacterium and Mycolicibacterium , while exhibiting similar distributions, showed a significant increase at Site 4. Synechococcus , which was absent at Site 1, manifested a substantial increase at Site 3 and Site 4. Polynucleobacter , on the other hand, showed a significant decrease at Site 3 and Site 4. Methylobacterium , which already had a low relative abundance at Site 1 and Site 2, was completely absent at Site 3 and appeared in even lower proportions at Site 4 ( Figure 6 ). More information about all the genera found in all sites can be seen in Supplementary Material ( Supplementary Figure S6 ).

Figure 6 . Genera which had relative abundance equal to or greater than 1% in at least one site.

3.2 Diversity indexes

Both Simpson and Shannon indices were employed to assess the diversity for the following categories: ARGs, ARGCs, phylum, and genus. The results are listed in Tables 2 , 3 .

Table 2 . Simpson Indexes for ARGs, ARGCs, phylum and genus.

Table 3 . Shannon Indexes for ARGs, ARGCs, phylum and genus.
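
For reference, the two alpha diversity indices can be written out directly. The sketch below assumes the natural-log form of the Shannon index and the Gini-Simpson form (1 − Σp²) of the Simpson index, which should correspond to the default behaviour of the `diversity` function in vegan; the function names and the example counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i); zero counts are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson index in its Gini-Simpson form, 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical communities (illustration only): one even, one dominated by a single member.
even_community = [25, 25, 25, 25]
dominated_community = [97, 1, 1, 1]

even_scores = (shannon_index(even_community), simpson_index(even_community))
dominated_scores = (shannon_index(dominated_community), simpson_index(dominated_community))
```

Both indices fall when a single member dominates, which is the situation the abundance figures describe for RPOB2. Shannon is more sensitive to rare members and Simpson to dominant ones, so the two can flag different site pairs.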

Further statistical analysis revealed distinct differences in diversity across various sites. The Simpson index revealed key differences in ARGC diversity between Sites 1 and 3 ( Figure 7A ), phylum diversity between Sites 1 and 3 ( Figure 7B ), and genus diversity between Sites 2 and 3, and Sites 2 and 4 ( Figure 7C ).

Figure 7 . From left to right, bootstrap boxplots from the Simpson’s Index of ARGC (A) , phylum (B) , and genus (C) diversity of each site. Letters inside the graph represent Dunn’s test results.

Building on this, the Shannon index identified additional disparities. In ARGC diversity, it showed differences between Sites 1, 2, and 3 ( Figure 8A ). For phylum diversity, it highlighted differences between Sites 2, 3, and 4 ( Figure 8B ). For genus diversity, it confirmed differences between Sites 1, 2, and 4 ( Figure 8C ). No significant differences were observed for ARGs in either index.

Figure 8 . From left to right, bootstrap boxplots from the Shannon-Wiener Index of ARGC (A) , phylum (B) , and genus (C) diversity of each site. Letters inside the graph represent Dunn’s test results.

The results of the beta diversity analysis, based on the Bray-Curtis and Sørensen dissimilarity indices, are listed in Table 4 .

Table 4 . Beta diversity indexes.
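
The two dissimilarity measures differ in what they use: Bray-Curtis works on abundances, whereas Sørensen uses only presence and absence. The following sketch shows both for a pair of sites. It assumes raw counts as input (the text does not say whether counts or relative abundances were used), and the function names, taxon labels, and numbers are hypothetical.

```python
def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity between two abundance dictionaries {taxon: count}."""
    taxa = set(counts_a) | set(counts_b)
    difference = sum(abs(counts_a.get(t, 0) - counts_b.get(t, 0)) for t in taxa)
    total = sum(counts_a.get(t, 0) + counts_b.get(t, 0) for t in taxa)
    return difference / total


def sorensen(counts_a, counts_b):
    """Sørensen dissimilarity from presence/absence: (b + c) / (2a + b + c)."""
    present_a = {t for t, n in counts_a.items() if n > 0}
    present_b = {t for t, n in counts_b.items() if n > 0}
    shared = len(present_a & present_b)    # a: taxa found at both sites
    only_a = len(present_a - present_b)    # b: taxa found only at the first site
    only_b = len(present_b - present_a)    # c: taxa found only at the second site
    return (only_a + only_b) / (2 * shared + only_a + only_b)


# Hypothetical counts near and far from the macrophyte (illustration only).
near_macrophyte = {"A": 10, "B": 5, "C": 0}
far_from_macrophyte = {"A": 5, "B": 5, "D": 5}

bc_value = bray_curtis(near_macrophyte, far_from_macrophyte)
sor_value = sorensen(near_macrophyte, far_from_macrophyte)
```

Both values range from 0 (identical sites) to 1 (nothing shared). A pair of sites can have a low Sørensen value but a higher Bray-Curtis value when they share most taxa in very different proportions.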

Further decomposition of the Sørensen index into turnover and nestedness components showed that turnover was the larger component. Between Sites 1 and 2, turnover was highest for genus (0.4295) and lowest for phylum (0.2083). Between Sites 3 and 4, turnover was pronounced for ARGs (0.2917) but low for phylum (0.0455). Nestedness was observed to a lesser extent, emphasizing the distinct ecological compositions between the small lake sites ( Figure 9 ).

Figure 9 . Sørensen Dissimilarity decomposition into Turnover and Nestedness for ARGs, ARGCs, genus and phylum between sites close and far from the macrophyte for each small lake (Site 1 vs. Site 2 and Site 3 vs. Site 4).
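
The decomposition can be illustrated with the widely used partition in which turnover is the Simpson dissimilarity and nestedness is the remainder of the Sørensen dissimilarity. Whether the analysis followed exactly this scheme is an assumption here, and the function name and taxon sets are hypothetical.

```python
def sorensen_partition(taxa_a, taxa_b):
    """Split Sørensen dissimilarity into turnover and nestedness components.

    Assumed scheme: turnover = min(b, c) / (a + min(b, c)) and
    nestedness = Sørensen - turnover, with a = shared taxa and b, c = taxa
    unique to each site. At least one taxon must be present overall.
    """
    shared = len(taxa_a & taxa_b)
    only_a = len(taxa_a - taxa_b)
    only_b = len(taxa_b - taxa_a)
    fewer_unique = min(only_a, only_b)
    total = (only_a + only_b) / (2 * shared + only_a + only_b)
    if shared + fewer_unique == 0:
        turnover = 0.0
    else:
        turnover = fewer_unique / (shared + fewer_unique)
    return {"sorensen": total, "turnover": turnover, "nestedness": total - turnover}


# Hypothetical taxon sets (illustration only).
replaced = sorensen_partition({"A", "B", "C", "D"}, {"A", "B", "E"})  # some taxa swapped
nested = sorensen_partition({"A", "B", "C", "D"}, {"A", "B"})          # pure subset
```

In the first pair, part of the difference comes from one taxon being replaced by another (turnover); in the second, the poorer site is a strict subset of the richer one, so turnover is zero and the whole dissimilarity is nestedness.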

The inferential statistics were conducted to evaluate whether the diversity difference caused by the presence of macrophytes was significantly greater for one group compared to the others. In Abobral, only ARGC and phylum were not different from one another ( Figure 10A ). A similar pattern was noted in Aquidauana, where only ARGs and phylum showed no significant difference ( Figure 10B ).

Figure 10 . Bray-Curtis Dissimilarity for ARGs, ARGCs, genus, and phylum between sites close and far from the macrophyte for each small lake. (A) Site 1 vs Site 2, (B) Site 3 vs Site 4.

4 Discussion

4.1 ARGs, ARGCs and read numbers

Antibiotics and antibiotic resistance genes (ARGs) in freshwater environments are influenced by a myriad of factors, including soil type, macrophyte species, type of macrophyte, antibiotic concentration, water flow, microbiota, community dynamics, nutrients, pH, oxygenation, temperature, and the method of antibiotic introduction ​( Overton et al., 2023 )​. While this study provides valuable insights into the resistome and taxonomy of the sampled sites, it is important to acknowledge that these biotic and abiotic parameters were not measured during the sample collection and could potentially explain the observed results.

In the present study, a higher number of ARGs, ARGCs, and reads was found in a conserved area with no human activity than in a farm region, suggesting that human activity may not always lead to an increase in ARGs in nearby bacterial communities. This could potentially be explained by the well-established fact that conserved areas have higher biodiversity than anthropized areas ( McDonald et al., 2020 ; Glidden et al., 2021 ).

The significant presence of the RPOB2 gene in both Abobral and Aquidauana suggests the existence of a natural reservoir for this gene within the Pantanal region. The geographical extent of this reservoir warrants further investigation. This finding holds considerable interest for both public health and academic research.

The observed shift in the proportions of the ARGs RPOB2, BACA, and UGD and their respective ARGCs within the Aquidauana small lake may be attributed to a complex interplay of factors. While the small lake’s proximity to an agricultural region suggests that farming practices could have influenced these shifts, this assumption should not be accepted without question. The use of antibiotics, heavy metals, and other agrochemicals, which can infiltrate local water bodies through various means, could have contributed to the loss of genetic diversity in this region (Holt, 2000; Schmitt et al., 2015).

However, it is crucial to conduct further studies to determine whether the observed decrease in biodiversity is a permanent or transient phenomenon, possibly due to a recent change in environmental conditions or farming practices. A study demonstrated that the introduction of an antibiotic to the microbiological community was initially harmful. However, from the second to the fifth week, the levels of bacterial activity returned to levels similar to those found in communities not exposed to the antibiotic ​( Weber et al., 2011 )​. This observation further underscores the need for ongoing monitoring and research to fully understand the dynamics at play.

Some ARGs and ARGCs were found exclusively in certain locations, and these never appeared in isolation but always occurred together, suggesting a possible relationship or co-dependence (Supplementary Figure S1). Moreover, the occurrence of certain ARGs and ARGCs appeared to be influenced more by the presence of the macrophyte than by the small lake environment.

One clinically important group of genes found in all samples was the mcr-N genes. These genes are significant because they confer resistance to colistin, an antibiotic used as a last resort against super-resistant bacteria. Although they have already been found in remote places, such as Antarctica, any occurrence of this group of genes should be reported (Cuadrat et al., 2020).

4.2 Diversity indexes

In this study, we used two different indices to measure alpha diversity: the Shannon index and the Simpson index. Overall, the Shannon index identified the Abobral small lake as having higher biodiversity than Aquidauana. For the Simpson index, however, the opposite was true. This might seem contradictory at first, but it reflects the different aspects of biodiversity that these indices measure.

The Shannon index is sensitive to species richness, meaning it increases with the number of different species present. Conversely, the Simpson index gives more weight to the most abundant species and therefore reflects dominance and evenness: in its dominance form, it increases when a few species are significantly more prevalent than others (Guiaşu and Guiaşu, 2003).
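The contrast between the two indices can be illustrated with a minimal sketch. It assumes the natural-log Shannon index and the dominance form of the Simpson index (the sum of squared proportions); the form actually used in the analysis is not restated here, and the abundance vectors below are hypothetical rather than data from this study.

```python
import math


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over categories with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_dominance(counts):
    """Simpson dominance D = sum(p_i ** 2); larger when a few categories dominate."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


# Hypothetical abundance vectors for illustration only (not data from this study).
rich_but_uneven = [50] + [1] * 50  # 51 categories, one holding half of all counts
poor_but_even = [20] * 5           # 5 categories with equal counts

for name, counts in [
    ("rich_but_uneven", rich_but_uneven),
    ("poor_but_even", poor_but_even),
]:
    h = shannon_index(counts)
    d = simpson_dominance(counts)
    print(f"{name}: Shannon={h:.3f}, Simpson D={d:.3f}, Gini-Simpson 1-D={1 - d:.3f}")
```

In this toy case the richer community has the higher Shannon value but also the higher Simpson dominance, so the two indices rank the communities in opposite order, which is the kind of apparent contradiction described above.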

A study on the effects of simulated nitrogen deposition on soil microbial community diversity in a coastal wetland found that with increasing levels of nitrogen deposition, alpha diversity (Shannon and Simpson indices) decreased significantly (Lu et al., 2021). This decrease in diversity may be due to soil acidification resulting from long-term nitrogen deposition in the supersaturated state. Long-term nitrogen deposition may also decrease the organic matter available to soil microorganisms, reducing microbial activity and diversity. Another study found that the presence of certain pollutants, such as chromium, can also reduce alpha diversity (Wei et al., 2023). Although we do not have data regarding nutrient and contaminant concentrations, bodies of water near farms often have high concentrations of nutrients and heavy metals due to agricultural practices.

In contrast, studies conducted in a wastewater treatment plant and in different types of soil found that alpha diversity was higher in areas with more pollution and human intervention (Ndlovu et al., 2016; Geng et al., 2020). The addition of Fe²⁺ was also found to increase microbial diversity (Song et al., 2016). These findings suggest that while certain conditions and substances can decrease alpha diversity, others can lead to an increase, highlighting the complex interplay of factors that influence microbial diversity.

The beta diversity results indicate that the influence of macrophytes on the diversity of the small lake ecosystem varies across categories. The genus category showed the highest dissimilarity values across both indices and all sites, indicating that genus-level diversity is the most affected by proximity to the macrophyte. This is expected given that genus was the category with the highest diversity of all, allowing for greater variation.

In contrast, the ARGCs and phylum categories showed relatively lower dissimilarity values. This suggests that these categories are less influenced by the proximity to the macrophyte compared to the genus and ARGs categories.

The dissimilarity values were generally higher for the Abobral small lake (Site 1 vs. Site 2) compared to Aquidauana’s (Site 3 vs. Site 4) across all categories. This is also expected since the former exhibited greater diversity, allowing for more variation.
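For readers unfamiliar with the abundance-based index discussed above, the following is a minimal sketch of the standard Bray-Curtis dissimilarity between two samples. The function name, the dictionary layout, and the genus labels and counts are hypothetical and serve only as an illustration; they are not the pipeline or the data used in this study.

```python
def bray_curtis(abund_x, abund_y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i).

    abund_x and abund_y map a category (e.g., a genus) to its abundance.
    Returns 0.0 for identical samples and 1.0 for samples sharing no category.
    """
    keys = set(abund_x) | set(abund_y)
    numerator = sum(abs(abund_x.get(k, 0) - abund_y.get(k, 0)) for k in keys)
    denominator = sum(abund_x.get(k, 0) + abund_y.get(k, 0) for k in keys)
    if denominator == 0:
        raise ValueError("Both samples are empty; dissimilarity is undefined.")
    return numerator / denominator


# Hypothetical counts for a site close to and a site far from the macrophyte.
near_macrophyte = {"genus_a": 10, "genus_b": 5}
far_from_macrophyte = {"genus_a": 5, "genus_b": 5, "genus_c": 5}

print(round(bray_curtis(near_macrophyte, far_from_macrophyte), 4))
```

Because the index is driven by differences in abundance per category, a group with many categories, such as genus, has more room to differ between sites than a group with few categories, such as phylum.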

The decomposition of the Sørensen index into turnover and nestedness components provided a more nuanced understanding of the differences in microbial diversity between the sites. The results suggest that turnover is the dominant component of dissimilarity for all categories at both sites, indicating that the differences in diversity between the sites are primarily due to the replacement of species, rather than the presence of a subset of species at one site. In contrast, the nestedness component was relatively low for all categories, suggesting that the sites do not contain many species that are subsets of the species at other sites.
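The distinction between turnover and nestedness can be made concrete with a small sketch. It assumes the commonly used pairwise partition in which turnover is the Simpson dissimilarity and nestedness is the remainder of the Sørensen dissimilarity; whether this exact partition was applied here is an assumption, and the genus sets below are hypothetical.

```python
def sorensen_partition(site_x, site_y):
    """Split pairwise Sorensen dissimilarity into turnover and nestedness.

    a = categories shared by both sites, b and c = categories unique to each.
    sorensen   = (b + c) / (2a + b + c)
    turnover   = min(b, c) / (a + min(b, c))   (Simpson dissimilarity)
    nestedness = sorensen - turnover
    """
    x, y = set(site_x), set(site_y)
    a = len(x & y)
    b = len(x - y)
    c = len(y - x)
    if a + b + c == 0:
        raise ValueError("Both sites are empty; dissimilarity is undefined.")
    sorensen = (b + c) / (2 * a + b + c)
    smaller = min(b, c)
    # If one site is empty, turnover is undefined; it is set to 0 here by convention.
    turnover = smaller / (a + smaller) if (a + smaller) > 0 else 0.0
    return sorensen, turnover, sorensen - turnover


# Hypothetical presence lists: one pair differs by replacement, the other by loss.
replaced = sorensen_partition({"g1", "g2", "g3"}, {"g1", "g2", "g4"})
nested = sorensen_partition({"g1", "g2", "g3", "g4"}, {"g1", "g2"})

print("replacement:", replaced)  # all dissimilarity assigned to turnover
print("subset:", nested)         # all dissimilarity assigned to nestedness
```

Both toy pairs have the same Sørensen dissimilarity, yet the partition attributes it entirely to turnover in the first case and entirely to nestedness in the second, which is why the decomposition is more informative than the total alone.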

These findings emphasize the importance of considering multiple indexes when assessing biodiversity. Each index provides a different lens through which to view diversity, and together they offer a more comprehensive picture of the ecological structure of the sites.

4.3 Taxonomy

The abundance results found for the diversity of phyla and genera are consistent with the genetic diversity of the small lakes, with the Abobral small lake exhibiting greater diversity than Aquidauana.

The Pseudomonadota phylum (previously Proteobacteria), known for its high abundance in the majority of microbial communities, was indeed the most prevalent across all samples. This group plays a significant role in removing various pollutants, making it a crucial component of the microbial community. Its dominance in most systems can be attributed to its diverse metabolic capabilities and adaptability to different environmental conditions, which allow it to thrive in both polluted and unpolluted freshwater environments (Holt et al., 1984; Balows et al., 1992; Bergey, 1994; Topley, 2005).

In addition to its prevalence, this phylum plays a significant role in various environmental processes. For instance, the removal of antibiotics is largely attributed to functional microorganisms within this phylum (Alexandrino et al., 2017; Huang et al., 2017; Li et al., 2019; Shan et al., 2020). Additionally, the class Deltaproteobacteria, which is part of the Pseudomonadota phylum, contains most of the sulfate-reducing bacteria essential for heavy metal removal (Chen et al., 2021a; 2021b). The phylum is also involved in the removal of phosphorus (Si et al., 2018; 2019; Huang et al., 2020). Lastly, the role of the Pseudomonadota phylum in nitrogen removal from various wastewaters is well-documented, with genera such as Nitrosomonas, Nitrobacter, and Nitrosospira being associated with nitrification (Aguilar et al., 2019; Ajibade et al., 2021).

The Pseudomonas genus, a member of the Pseudomonadota phylum, is renowned for its adaptability and metabolic flexibility ​( He et al., 2004 ; Høiby et al., 2010 )​. This genus is found in various environments ​( Özen and Ussery, 2012 )​, demonstrating its resilience and ability to thrive even in polluted areas. Its metabolic versatility allows it to play a crucial role in the removal of different pollutants. Pseudomonas exhibits a remarkable capability for environmental remediation. It effectively absorbs phosphorus from wastewater, storing it internally as polyphosphate ​( Tian et al., 2017 ; Huang et al., 2020 )​. This process not only aids in the purification of wastewater but also contributes to the recycling of this essential nutrient.

In addition to phosphorus absorption, Pseudomonas shows resistance to heavy metals, aiding in their extraction from the environment. This is achieved through the synthesis of extracellular substances that bind to these metals, thereby preventing their spread within the biofilm and offering protection to the cells from stress ​( Teitzel and Parsek, 2003 ; Giovanella et al., 2017 )​.

Pseudomonas can also metabolize glucose and mitigate sulfonamides through the co-metabolism of organic matter and sulfamethoxazole, contributing to antibiotic removal (Zheng et al., 2021). This metabolic capability further underscores the importance of Pseudomonas in environmental remediation and pollutant removal.

A slight increase in the abundance of Pseudomonas was observed in Site 3, which has plants nearby. Given that most pathogenic members of this genus are related to plants ​( Özen and Ussery, 2012 )​, this relationship could be a pivotal factor in its distribution. The presence of plants and the specific environmental conditions at this site might have provided a competitive advantage for Pseudomonas , leading to its increased abundance.

On the other hand, the genus Polynucleobacter showed high abundance in the Abobral small lake and low abundance in the Aquidauana small lake, suggesting a higher susceptibility to pollution from the farm. Our results differ from a study that found a high abundance of Polynucleobacter in a river influenced by effluents from backyard aquacultures ​( Nakayama et al., 2017 )​. It has also been found in polluted rivers and is known to live as chemoorganotrophs by utilizing low-molecular-weight substrates derived from the photooxidation of humic substances ​( Hahn et al., 2012 ; Ma et al., 2016 )​.

Contrary to expectations, there was an increase in Bacteroidetes in the Aquidauana small lake. Although this phylum is better known for adapting well to, or even preferring, polluted environments (Da Silva et al., 2015; Tai et al., 2020), pollution-sensitive Bacteroidetes have been observed to function as bioindicators (Wolińska et al., 2017). This suggests that the environment studied might have conditions that are not unfavorable for this group.

The Actinobacteria phylum, which does not seem to be significantly influenced by the environmental conditions of the two regions studied, was found to be consistently present. Its genus, Streptomyces , was notably abundant across all four sites as well. This prevalence could be attributed to the fact that Streptomyces bacteria are the source of most antibiotics used in medicine, veterinary practice, and agriculture ​( Chater et al., 2010 )​, making them efficient competitors in natural environments.

Site 3 showed a minor increase in Streptomyces abundance. This could be due to their close relationship with plants, as they are common in the rhizosphere and are frequent endophytes. Their ecological function in the natural decomposition of plant and fungi cell walls, which are globally abundant, could also contribute to this increase. Streptomyces are known for their significant role in breaking down plant, fungi, and insect cell walls or surface components ​( Chater et al., 2010 )​.

On the other hand, the genera Mycobacterium and Mycolicibacterium, which belong to this phylum, showed a different pattern. While their abundance remained relatively stable and low in Sites 1, 2, and 3, a significant peak was observed in Site 4. This pattern can be attributed to the natural resistance of Mycobacterium species to most antimicrobial agents currently available (Nguyen and Pieters, 2009).

The absence of plants in Site 4, which could help eliminate antibiotics, might have exposed this site to more antibiotics, favoring the growth of these genera. In particular, the genus Mycolicibacterium , which was recently separated from Mycobacterium , indicating its close evolutionary proximity, may share similar characteristics, such as a higher resistance to antibiotics ​( Gupta et al., 2019 )​.

Cyanobacteria, a crucial component of many aquatic ecosystems, were found in significant amounts in all samples, demonstrating their resilience and adaptability in diverse environments. Despite being common in aquatic environments, they were found in greater relative abundance in Aquidauana compared to Abobral, likely due to the higher availability of nutrients such as phosphorus and nitrogen.

Synechococcus , a genus of cyanobacteria, exhibited a unique distribution pattern. Its abundance was low in the Abobral small lake, with no instances at Site 1. Conversely, in the Aquidauana small lake, there was a marked increase in its abundance. According to our data, this genus seems to be more sensitive to nutrient availability variations, as indicated by its differing abundance in Aquidauana and Abobral.

The presence of Synechococcus has been linked to total nitrogen, dissolved nitrogen, dissolved organic carbon, and dissolved phosphorus ​( Le et al., 2022 )​. Furthermore, a study by Pishbin et al. (2021) found that under mixotrophic conditions, Synechococcus elongatus could remove up to 85.1% of phosphorus and 87.4% of nitrogen. This nutrient removal efficiency, coupled with the likely high levels of phosphorus and nitrogen in the Aquidauana small lake due to its location in a farming region, could account for the observed distribution pattern of Synechococcus , and cyanobacteria in general.

5 Conclusion

In this study, we collected samples from two different small lakes. The first small lake is located in the Abobral sub-region, which is in a protected reserve area, and the second one is in the Aquidauana sub-region, characterized by farming activities. For each of these locations, we collected samples from areas close to and far from the floating macrophyte Eichhornia crassipes, giving us a total of four samples.

In characterizing the resistome of these two areas, we identified the primary antibiotic resistance genes and taxa, and how these vary depending on the level of pollution in the area and the presence or absence of plants nearby. We also identified a potential natural reservoir of the RPOB2 gene, as it occurred in high abundance in both areas. This finding is of academic, economic, and public health interest as it could influence decisions regarding the use of antibiotics in the area.

Our analysis of the collected data revealed a significant loss of both genetic and taxonomic biodiversity in the sample from the farm in Aquidauana sub-region when compared to the sample from the reserve in Abobral sub-region. This finding supports the widely accepted view that human activities can lead to a decrease in biodiversity.

However, our study also brought to light an interesting observation that contradicts a well-studied phenomenon. We found that human activity does not always result in an increase in the number of antibiotic resistance genes in the nearby bacterial community. This is one of the first studies to report such a finding, highlighting the need for further research in this area.

While the impact of human activity on the loss of genetic and taxonomic biodiversity is well-documented, our understanding of its effect on the diversity of resistance genes is still limited. More research is needed to investigate if human activity is causing a loss in diversity of resistance genes, and if so, whether certain genes are being favored over others. This will help us gain a more comprehensive understanding of the complex interactions between human activity and microbial communities.

Furthermore, we may have identified a potential natural reservoir for the RPOB2 gene, given its significant presence in both Aquidauana and Abobral. This discovery could facilitate informed decision-making regarding the use of antibiotics and public health.

Data availability statement

The raw sequences have been deposited in the NCBI Sequence Read Archive under the code PRJNA1078255.

Author contributions

AO: Investigation, Methodology, Writing–original draft, Writing–review and editing. BR: Investigation, Methodology, Writing–original draft. RJ: Investigation, Methodology, Writing–original draft. NP: Investigation, Methodology, Writing–original draft. AB: Investigation, Methodology, Writing–original draft. RP: Investigation, Methodology, Writing–original draft. NA: Writing–original draft. AD: Conceptualization, Methodology, Supervision, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. POM-IOC/FIOCRUZ funding for LBCS. PIBIC/CNPq fellowship for AO.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2024.1352801/full#supplementary-material

Aguilar, L., Gallegos, Á., Arias, C. A., Ferrera, I., Sánchez, O., Rubio, R., et al. (2019). Microbial nitrate removal efficiency in groundwater polluted from agricultural activities with hybrid cork treatment wetlands. Sci. Total Environ. 653, 723–734. doi:10.1016/j.scitotenv.2018.10.426

Ajibade, F. O., Wang, H.-C., Guadie, A., Ajibade, T. F., Fang, Y.-K., Sharif, H. M. A., et al. (2021). Total nitrogen removal in biochar amended non-aerated vertical flow constructed wetlands for secondary wastewater effluent with low C/N ratio: microbial community structure and dissolved organic carbon release conditions. Bioresour. Technol. 322, 124430. doi:10.1016/j.biortech.2020.124430

Alexandrino, D. A. M., Mucha, A. P., Almeida, C. M. R., Gao, W., Jia, Z., and Carvalho, M. F. (2017). Biodegradation of the veterinary antibiotics enrofloxacin and ceftiofur and associated microbial community dynamics. Sci. Total Environ. 581, 359–368. doi:10.1016/j.scitotenv.2016.12.141

Aminov, R. I., and Mackie, R. I. (2007). Evolution and ecology of antibiotic resistance genes. FEMS Microbiol. Lett. 271, 147–161. doi:10.1111/j.1574-6968.2007.00757.x

Arango-Argoty, G., Garner, E., Pruden, A., Heath, L. S., Vikesland, P., and Zhang, L. (2018). DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome 6, 23–15. doi:10.1186/s40168-018-0401-z

Assine, M. L. (2015). “Brazilian Pantanal: a large pristine tropical wetland,” in Landscapes and landforms of Brazil, 135–146.

Balows, A., Trüper, H. G., Dworkin, M., Harder, W., and Schleifer, K.-H. (1992). The prokaryotes: a handbook on the biology of bacteria: ecophysiology, isolation, identification, applications. Cham: Springer.

Baquero, F., Martínez, J.-L., and Cantón, R. (2008). Antibiotics and antibiotic resistance in water environments. Curr. Opin. Biotechnol. 19, 260–265. doi:10.1016/j.copbio.2008.05.006

Bergey, D. H. (1994). Bergey’s manual of determinative bacteriology. Pennsylvania, United States: Lippincott Williams & Wilkins.

Chater, K. F., Biró, S., Lee, K. J., Palmer, T., and Schrempf, H. (2010). The complex extracellular biology of Streptomyces. FEMS Microbiol. Rev. 34, 171–198. doi:10.1111/j.1574-6976.2009.00206.x

Chen, J., Deng, S., Jia, W., Li, X., and Chang, J. (2021a). Removal of multiple heavy metals from mining-impacted water by biochar-filled constructed wetlands: adsorption and biotic removal routes. Bioresour. Technol. 331, 125061. doi:10.1016/j.biortech.2021.125061

Chen, J., Li, X., Jia, W., Shen, S., Deng, S., Ji, B., et al. (2021b). Promotion of bioremediation performance in constructed wetland microcosms for acid mine drainage treatment by using organic substrates and supplementing domestic wastewater and plant litter broth. J. Hazard Mater 404, 124125. doi:10.1016/j.jhazmat.2020.124125

Cuadrat, R. R. C., Sorokina, M., Andrade, B. G., Goris, T., and Davila, A. M. R. (2020). Global ocean resistome revealed: exploring antibiotic resistance gene abundance and distribution in TARA Oceans samples. Gigascience 9, giaa046. doi:10.1093/gigascience/giaa046

Da Silva, M. L. B., Cantão, M. E., Mezzari, M. P., Ma, J., and Nossa, C. W. (2015). Assessment of bacterial and archaeal community structure in swine wastewater treatment processes. Microb. Ecol. 70, 77–87. doi:10.1007/s00248-014-0537-8

Dixon, P. (2003). VEGAN, a package of R functions for community ecology. J. Veg. Sci. 14, 927–930. doi:10.1111/j.1654-1103.2003.tb02228.x

Doron, S., and Gorbach, S. L. (2008). Bacterial infections: overview. Int. Encycl. Public Health 273, 273–282. doi:10.1016/b978-012373960-5.00596-7

Ferrante, L., and Fearnside, P. M. (2022). Brazil’s Pantanal threatened by livestock. Science 377, 720–721. doi:10.1126/science.ade0656

Fresia, P., Antelo, V., Salazar, C., Giménez, M., D’Alessandro, B., Afshinnekoo, E., et al. (2019). Urban metagenomics uncover antibiotic resistance reservoirs in coastal beach and sewage waters. Microbiome 7, 35–39. doi:10.1186/s40168-019-0648-z

Frost, I., Smith, W. P. J., Mitri, S., Millan, A. S., Davit, Y., Osborne, J. M., et al. (2018). Cooperation, competition and antibiotic resistance in bacterial colonies. ISME J. 12, 1582–1593. doi:10.1038/s41396-018-0090-4

Geng, X. D., Zhou, Y., Wang, C. Z., Yu, M. H., and Qian, J. L. (2020). Bacterial community structure and diversity in the soil of three different land use types in a coastal wetland. Appl. Ecol. Environ. Res. 18, 8131–8144. doi:10.15666/aeer/1806_81318144

Giovanella, P., Cabral, L., Costa, A. P., de Oliveira Camargo, F. A., Gianello, C., and Bento, F. M. (2017). Metal resistance mechanisms in Gram-negative bacteria and their potential to remove Hg in the presence of other metals. Ecotoxicol. Environ. Saf. 140, 162–169. doi:10.1016/j.ecoenv.2017.02.010

Glidden, C. K., Nova, N., Kain, M. P., Lagerstrom, K. M., Skinner, E. B., Mandle, L., et al. (2021). Human-mediated impacts on biodiversity and the consequences for zoonotic disease spillover. Curr. Biol. 31, R1342–R1361. doi:10.1016/j.cub.2021.08.070

Guiaşu, R. C., and Guiaşu, S. (2003). Conditional and weighted measures of ecological diversity. Int. J. Uncertain. Fuzziness Knowledge-Based Syst. 11, 283–300. doi:10.1142/s0218488503002089

Gupta, R. S., Lo, B., and Son, J. (2019). Corrigendum: phylogenomics and comparative genomic studies robustly support division of the genus Mycobacterium into an emended genus Mycobacterium and four novel genera. Front. Microbiol. 10, 714. doi:10.3389/fmicb.2019.00714

Hahn, M. W., Scheuerl, T., Jezberová, J., Koll, U., Jezbera, J., Šimek, K., et al. (2012). The passive yet successful way of planktonic life: genomic and experimental analysis of the ecology of a free-living Polynucleobacter population. PLoS One 7, e32772. doi:10.1371/journal.pone.0032772

Hassoun-Kheir, N., Stabholz, Y., Kreft, J.-U., De La Cruz, R., Romalde, J. L., Nesme, J., et al. (2020). Comparison of antibiotic-resistant bacteria and antibiotic resistance genes abundance in hospital and community wastewater: a systematic review. Sci. Total Environ. 743, 140804. doi:10.1016/j.scitotenv.2020.140804

He, J., Baldini, R. L., Déziel, E., Saucier, M., Zhang, Q., Liberati, N. T., et al. (2004). The broad host range pathogen Pseudomonas aeruginosa strain PA14 carries two pathogenicity islands harboring plant and animal virulence genes. Proc. Natl. Acad. Sci. 101, 2530–2535. doi:10.1073/pnas.0304622101

Høiby, N., Ciofu, O., and Bjarnsholt, T. (2010). Pseudomonas aeruginosa biofilms in cystic fibrosis. Future Microbiol. 5, 1663–1674. doi:10.2217/fmb.10.125

Holt, J. G., Krieg, N. R., Sneath, P. H. A., Staley, J. T., and Williams, S. T. (1984). Bergey’s manual of systematic bacteriology, vol. 1. Baltimore: The Williams and Wilkins Co, 1–1388.

Holt, M. S. (2000). Sources of chemical contaminants and routes into the freshwater environment. Food Chem. Toxicol. 38, S21–S27. doi:10.1016/s0278-6915(99)00136-2

Huang, J., Xiao, J., Guo, Y., Guan, W., Cao, C., Yan, C., et al. (2020). Long-term effects of silver nanoparticles on performance of phosphorus removal in a laboratory-scale vertical flow constructed wetland. J. Environ. Sci. 87, 319–330. doi:10.1016/j.jes.2019.07.012

Huang, X., Zheng, J., Liu, C., Liu, L., Liu, Y., Fan, H., et al. (2017). Performance and bacterial community dynamics of vertical flow constructed wetlands during the treatment of antibiotics-enriched swine wastewater. Chem. Eng. J. 316, 727–735. doi:10.1016/j.cej.2017.02.029

Kurade, M. B., Ha, Y.-H., Xiong, J.-Q., Govindwar, S. P., Jang, M., and Jeon, B.-H. (2021). Phytoremediation as a green biotechnology tool for emerging environmental pollution: a step forward towards sustainable rehabilitation of the environment. Chem. Eng. J. 415, 129040. doi:10.1016/j.cej.2021.129040

Le, K. T. N., Maldonado, J. F. G., Goitom, E., Trigui, H., Terrat, Y., Nguyen, T. L., et al. (2022). Shotgun metagenomic sequencing to assess cyanobacterial community composition following coagulation of cyanobacterial blooms. Toxins (Basel) 14, 688. doi:10.3390/toxins14100688

Li, H., Liu, F., Luo, P., Chen, X., Chen, J., Huang, Z., et al. (2019). Stimulation of optimized influent C: N ratios on nitrogen removal in surface flow constructed wetlands: performance and microbial mechanisms. Sci. Total Environ. 694, 133575. doi:10.1016/j.scitotenv.2019.07.381

Lu, G., Xie, B., Cagle, G. A., Wang, X., Han, G., Wang, X., et al. (2021). Effects of simulated nitrogen deposition on soil microbial community diversity in coastal wetland of the Yellow River Delta. Sci. Total Environ. 757, 143825. doi:10.1016/j.scitotenv.2020.143825

Ma, L., Mao, G., Liu, J., Gao, G., Zou, C., Bartlam, M. G., et al. (2016). Spatial-temporal changes of bacterioplankton community along an exhorheic river. Front. Microbiol. 7, 250. doi:10.3389/fmicb.2016.00250

McDonald, R. I., Mansur, A. V., Ascensão, F., Colbert, M., Crossman, K., Elmqvist, T., et al. (2020). Research gaps in knowledge of the impact of urban growth on biodiversity. Nat. Sustain 3, 16–24. doi:10.1038/s41893-019-0436-6

Nakayama, T., Hoa, T. T. T., Harada, K., Warisaya, M., Asayama, M., Hinenoya, A., et al. (2017). Water metagenomic analysis reveals low bacterial diversity and the presence of antimicrobial residues and resistance genes in a river containing wastewater from backyard aquacultures in the Mekong Delta, Vietnam. Environ. Pollut. 222, 294–306. doi:10.1016/j.envpol.2016.12.041

Ndlovu, T., Khan, S., and Khan, W. (2016). Distribution and diversity of biosurfactant-producing bacteria in a wastewater treatment plant. Environ. Sci. Pollut. Res. 23, 9993–10004. doi:10.1007/s11356-016-6249-5

Nguyen, L., and Pieters, J. (2009). Mycobacterial subversion of chemotherapeutic reagents and host defense tactics: challenges in tuberculosis drug development. Annu. Rev. Pharmacol. Toxicol. 49, 427–453. doi:10.1146/annurev-pharmtox-061008-103123

Overton, O. C., Olson, L. H., Majumder, S. D., Shwiyyat, H., Foltz, M. E., and Nairn, R. W. (2023). Wetland removal mechanisms for emerging contaminants. Land (Basel) 12, 472. doi:10.3390/land12020472

Özen, A. I., and Ussery, D. W. (2012). Defining the Pseudomonas Genus: where do we draw the line with Azotobacter? Microb. Ecol. 63, 239–248. doi:10.1007/s00248-011-9914-8

Polińska, W., Kotowska, U., Kiejza, D., and Karpińska, J. (2021). Insights into the use of phytoremediation processes for the removal of organic micropollutants from water and wastewater; a review. Water (Basel) 13, 2065. doi:10.3390/w13152065

Preena, P. G., Swaminathan, T. R., Kumar, V. J. R., and Singh, I. S. B. (2020). Antimicrobial resistance in aquaculture: a crisis for concern. Biol. Bratisl. 75, 1497–1517. doi:10.2478/s11756-020-00456-4

Qian, X., Gu, J., Sun, W., Wang, X.-J., Su, J.-Q., and Stedfeld, R. (2018). Diversity, abundance, and persistence of antibiotic resistance genes in various types of animal manure following industrial composting. J. Hazard Mater 344, 716–722. doi:10.1016/j.jhazmat.2017.11.020

Raza, S., Shin, H., Hur, H.-G., and Unno, T. (2022). Higher abundance of core antimicrobial resistant genes in effluent from wastewater treatment plants. Water Res. 208, 117882. doi:10.1016/j.watres.2021.117882

Schmitt, N., Wanko, A., Laurent, J., Bois, P., Molle, P., and Mosé, R. (2015). Constructed wetlands treating stormwater from separate sewer networks in a residential Strasbourg urban catchment area: micropollutant removal and fate. J. Environ. Chem. Eng. 3, 2816–2824. doi:10.1016/j.jece.2015.10.008

Serwecińska, L. (2020). Antimicrobials and antibiotic-resistant bacteria: a risk to the environment and to public health. Water (Basel) 12, 3313. doi:10.3390/w12123313

Shan, A., Wang, W., Kang, K. J., Hou, D., Luo, J., Wang, G., et al. (2020). The removal of antibiotics in relation to a microbial community in an integrated constructed wetland for tail water decontamination. Wetlands 40, 993–1004. doi:10.1007/s13157-019-01262-8

Si, Z., Song, X., Wang, Y., Cao, X., Zhao, Y., Wang, B., et al. (2018). Intensified heterotrophic denitrification in constructed wetlands using four solid carbon sources: denitrification efficiency and bacterial community structure. Bioresour. Technol. 267, 416–425. doi:10.1016/j.biortech.2018.07.029

Si, Z., Wang, Y., Song, X., Cao, X., Zhang, X., and Sand, W. (2019). Mechanism and performance of trace metal removal by continuous-flow constructed wetlands coupled with a micro-electric field. Water Res. 164, 114937. doi:10.1016/j.watres.2019.114937

Song, X., Wang, S., Wang, Y., Zhao, Z., and Yan, D. (2016). Addition of Fe²⁺ increase nitrate removal in vertical subsurface flow constructed wetlands. Ecol. Eng. 91, 487–494. doi:10.1016/j.ecoleng.2016.03.013

Sun, D., Jeannot, K., Xiao, Y., and Knapp, C. W. (2019). Editorial: horizontal gene transfer mediated bacterial antibiotic resistance. Front. Microbiol. 10, 1933. doi:10.3389/fmicb.2019.01933

Tai, X., Li, R., Zhang, B., Yu, H., Kong, X., Bai, Z., et al. (2020). Pollution gradients altered the bacterial community composition and stochastic process of rural polluted ponds. Microorganisms 8, 311. doi:10.3390/microorganisms8020311

Teitzel, G. M., and Parsek, M. R. (2003). Heavy metal resistance of biofilm and planktonic Pseudomonas aeruginosa. Appl. Environ. Microbiol. 69, 2313–2320. doi:10.1128/aem.69.4.2313-2320.2003

Tian, J., Yu, C., Liu, J., Ye, C., Zhou, X., and Chen, L. (2017). Performance of an ultraviolet Mutagenetic polyphosphate-accumulating bacterium PZ2 and its application for wastewater treatment in a newly designed constructed wetland. Appl. Biochem. Biotechnol. 181, 735–747. doi:10.1007/s12010-016-2245-y

Topley, W. W. C. (2005). Topley and Wilson’s microbiology and microbial infections . London: Hodder Arnold .

Uritskiy, G. V., DiRuggiero, J., and Taylor, J. (2018). MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome 6, 1–13. doi:10.1186/s40168-018-0541-1

Weber, K. P., Mitzel, M. R., Slawson, R. M., and Legge, R. L. (2011). Effect of ciprofloxacin on microbiological development in wetland mesocosms. Water Res. 45, 3185–3196. doi:10.1016/j.watres.2011.03.042

Wei, Z., Sixi, Z., Baojing, G., Xiuqing, Y., Guodong, X., and Baichun, W. (2023). Effects of Cr stress on bacterial community structure and composition in rhizosphere soil of Iris tectorum under different cultivation modes. Microbiol. Res. (Pavia) 14, 243–261. doi:10.3390/microbiolres14010020

Whittaker, R. H. (1972). Evolution and measurement of species diversity. Taxon 21, 213–251. doi:10.2307/1218190

Wolińska, A., Kuźniar, A., Zielenkiewicz, U., Izak, D., Szafranek-Nakonieczna, A., Banach, A., et al. (2017). Bacteroidetes as a sensitive biological indicator of agricultural soil usage revealed by a culture-independent approach. Appl. Soil Ecol. 119, 128–137. doi:10.1016/j.apsoil.2017.06.009

Zheng, Y., Liu, Y., Qu, M., Hao, M., Yang, D., Yang, Q., et al. (2021). Fate of an antibiotic and its effects on nitrogen transformation functional bacteria in integrated vertical flow constructed wetlands. Chem. Eng. J. 417, 129272. doi:10.1016/j.cej.2021.129272

Keywords: resistome, metagenomics, microbiome, pantanal, freshwater

Citation: de Oliveira AR, de Toledo Rós B, Jardim R, Kotowski N, de Barros A, Pereira RHG, Almeida NF and Dávila AMR (2024) A comparative genomics study of the microbiome and freshwater resistome in Southern Pantanal. Front. Genet. 15:1352801. doi: 10.3389/fgene.2024.1352801

Received: 08 December 2023; Accepted: 01 April 2024; Published: 18 April 2024.

Copyright © 2024 de Oliveira, de Toledo Rós, Jardim, Kotowski, de Barros, Pereira, Almeida and Dávila. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Alberto M. R. Dávila

Comparative Genomics Fact Sheet

Comparative genomics is a field of biological research in which researchers use a variety of tools to compare the complete genome sequences of different species. By carefully comparing characteristics that define various organisms, researchers can pinpoint regions of similarity and difference.

What are the benefits of comparative genomics?

Identifying DNA sequences that have been "conserved" - that is, preserved in many different organisms over millions of years - is an important step toward understanding the genome itself. It pinpoints genes that are essential to life and highlights genomic signals that control gene function across many species. It helps us to further understand what genes relate to various biological systems, which in turn may translate into innovative approaches for treating human disease and improving human health.

Comparative genomics also provides a powerful tool for studying evolution. By analyzing the evolutionary relationships between species and the corresponding differences in their DNA, scientists can better understand how the appearance, behavior and biology of living things have changed over time.

As DNA sequencing technology becomes more powerful and less expensive, comparative genomics is finding wider applications in agriculture, biotechnology and zoology as a tool to tease apart the often subtle differences among animal species. Such efforts have led to new insights into some branches on the evolutionary tree, as well as improving the health of domesticated animals and pointing to new strategies for conserving rare and endangered species.

What is a genome made of?

The genomes of almost all living creatures, both plants and animals, consist of DNA (deoxyribonucleic acid), the chemical chain that includes the genes that code for different proteins and the regulatory sequences that turn those genes on and off. Precisely which protein is produced by any given gene is determined by the sequence in which four building blocks - adenine (A), thymine (T), cytosine (C) and guanine (G) - are laid out along DNA's twisted, double-helix structure.

What results has the field of comparative genomics produced?

Comparative genomics has yielded dramatic results. Investigators are increasingly using comparative genomics to explore areas ranging from human development and behavior to metabolism and susceptibility to disease. These studies are uncovering new behavioral, neurological and developmental pathways and genes that are shared or related among species. Some researchers are using comparative genomics to reveal the genomic underpinnings of disease in animals with the hope of gaining new insights into disease development in humans.

Among the results so far are the following:

A study discovered that about 60 percent of genes are conserved between fruit flies and humans, meaning that the two organisms appear to share a core set of genes. Two-thirds of human genes known to be involved in cancer have counterparts in the fruit fly.

A comparative genomics analysis of six species of yeast prompted scientists to significantly revise their initial catalog of yeast genes and to predict a new set of functional elements that play a role in regulating genome activity, not just in yeast but across many species.

Researchers studying milk production have mapped genes that increase the yield of high-fat milk in cows, resulting in higher production levels and potentially a significant economic impact. This is one of many studies aimed at increasing food production.

Scientists have found genes that increase muscling in cattle twofold; they found the same genes in racing dogs, and such results may foster human performance studies.

Comparisons of nearly 50 bird species' genomes revealed a gene network that underlies singing in birds and that may have an important role in human speech and language. The bird researchers also found gene networks responsible for traits such as feathers and beaks.

In recent years, researchers in the National Human Genome Research Institute (NHGRI) intramural program have also studied the genomics of various diseases in dogs, including common cancers, to try to develop new insights into the human forms of these conditions. In some cases, they have mapped genes contributing to these disorders.

In other studies, NHGRI researchers are comparing how genes affect body shape and size in dogs to better understand growth and development. Studies of dogs with sleep problems have revealed genes and pathways - and potential drug targets - to treat sleep problems.

What other genomes have been sequenced?

Researchers have sequenced the complete genomes of hundreds of animals and plants (more than 250 animal species and 50 species of birds alone), and the list continues to grow almost daily.

In addition to the sequencing of the human genome, which was completed in 2003, scientists involved in the Human Genome Project sequenced the genomes of a number of model organisms that are commonly used as surrogates in studying human biology. These include the rat, puffer fish, fruit fly, sea squirt, roundworm, and the bacterium Escherichia coli. For some organisms NHGRI has sequenced many varieties, providing critical data for understanding genetic variation.

DNA sequencing centers supported by NHGRI also have sequenced genomes of the chicken, dog, honey bee, gorilla, chimpanzee, sea urchin, fungi and many other organisms.

How is NHGRI involved in the growth of this new field of research?

NHGRI pioneered the development of DNA sequencing methods and technologies - including informatics - and has funded research to study the genomes of a wide range of species. The National Institutes of Health (NIH) Intramural Sequencing Center has been instrumental in the sequencing of many organisms.

NHGRI programs such as ENCODE (Encyclopedia of DNA Elements) and modENCODE (model organism Encyclopedia of DNA Elements) have compared and contrasted the inner workings of animal and human genomes to try to better understand how genomes function.

In modENCODE, researchers found shared patterns of gene activity and regulation among fly, worm and human genomes. The mouse ENCODE Consortium demonstrated that, in general, the systems that are used to control gene activity have many similarities in mice and humans.

Last updated: August 15, 2020

Volume 16 Supplement 11

Proceedings of the 5th Symposium on Biological Data Visualization: Part 1

  • Open access
  • Published: 13 August 2015

BactoGeNIE: a large-scale comparative genome visualization for big displays

  • Jillian Aurisano 1,
  • Khairi Reda 2, 3,
  • Andrew Johnson 1,
  • Elisabeta G Marai 1 &
  • Jason Leigh 3

BMC Bioinformatics volume 16, Article number: S6 (2015)

The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics, which aim to enable analysis tasks across collections of genomes, suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments and enable the effective visual analysis of large genomics datasets.

In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process.

Conclusions

BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.

Introduction

Bacterial genomes, the complete set of genes or genetic material present in bacteria, play an important role in several fields, from the study of microbiomes to drug development. Bacterial genome sequencing, which determines the complete nucleotide sequence in a bacterial strain's DNA, is increasing at rates that exceed Moore's Law [ 1 ], particularly since these genomes are relatively small and inexpensive to sequence. These emerging large collections of complete genome sequence data from bacterial strains are changing the landscape of comparative bacterial genomics.

Comparative genomics is broadly concerned with comparing genomic features across several genomes, to address questions pertaining to evolution and explain variations in different organisms. In particular, comparative gene neighborhood analysis involves comparisons across large collections of bacterial genomes to identify variations in neighborhoods around genes of interest. This comparison helps the domain experts generate hypotheses regarding gene function, which is particularly helpful when studying novel or uncharacterized genes [ 2 ]. In addition, such comparisons are valuable when identifying the source of differences between related bacterial strains or studying bacterial strain evolution.

Automated methods play a central role in comparative genomics. However, neighborhood-based outliers and common features are difficult to identify through automated methods alone. Visualization can help in this direction.

A variety of visualization applications and techniques exist for genomic data, including tools to support comparative analysis [ 3 ]. However, existing techniques are largely not designed to accommodate comparative tasks across large collections of complete genome sequences. In particular, no visual tools exist for comparing gene neighborhoods that scale beyond small stretches of genes in 2-9 genomes. Even if the approaches in existing tools could be scaled to larger collections of genomes, our domain experts found that the fundamental designs did not scale visually or perceptually to allow for large-scale comparative tasks. This scalability issue limits analysis through visualization in comparative bacterial genomics, in particular in the case of comparative tasks across large collections (dozens to thousands) of bacterial genome sequences.

At the same time, novel high-resolution displays have become increasingly adopted in the genomics community, both in the form of personal workspaces featuring multiple monitors, and in the form of collaborative tiled-display walls. While an increase in resolution and display size has the potential to address some scalability challenges, novel visual abstractions are necessary to take advantage of the unique properties of these environments, while avoiding visual clutter.

Overview and contributions: In this work, we introduce a novel visualization approach and application called BactoGeNIE (Figure 1), which stands for Bacterial Gene Neighborhood Investigation Environment. This environment is specifically designed for researchers investigating gene neighborhoods around target genes of interest in large collections of complete bacterial genome sequences, on high-resolution and large display environments. BactoGeNIE was developed over a two-year close collaboration with a team of genomics researchers in an industrial research lab setting.

Figure 1. BactoGeNIE enables comparisons across large collections of gene neighborhoods on large, high-resolution environments. The visual encodings and interactions are designed to enable data exploration and browsing, so that users can locate and compare neighborhoods of interest and identify features and outliers in content, order and context within these regions. This image shows the neighborhood around a hypothetical protein in all draft Escherichia coli genomes from the PubMed database.

Our contributions include: 1) a data and task analysis for the domain of comparative bacterial genomics; 2) a description of the design process of BactoGeNIE, including a discussion of perceptual issues as well as opportunities and design limitations arising from environments at human scale; 3) an implementation of this design and application to large displays; 4) an evaluation of BactoGeNIE on a case study and with domain experts; and 5) a discussion of the lessons learned through this project.

To the best of our knowledge, this is the first interactive, large-scale comparative gene neighborhood visualization for big displays. This work addresses a specific domain topic, comparative gene neighborhood analysis. However, as sequence volumes exceed the capacity of visualization designs for other research problems, our design decisions may inform comparative genomics visualization development more broadly.

Related work

Many powerful genome visualization approaches have been developed since the first complete genomes were published in the early 2000s. However, few of these approaches discuss visual scalability in presenting their work, likely because genome sequencing data volumes did not require it. For instance, Nielsen et al. review genome visualization approaches broadly [ 3 ], but do not explicitly discuss visual scalability in popular comparative visualization designs.

In our work with genomics researchers we have evaluated existing approaches, and found that many of the common design decisions in existing tools would not scale visually to accommodate contemporary scales of complete genome sequences. In the following section we describe these prior designs and discuss challenges in scaling these approaches to larger data volumes or higher-resolution and large-scale displays.

Visualizations for comparing gene neighborhoods

Orthology-line techniques. One common design for gene neighborhood comparisons uses a comparative track between two parallel genomes, each of which is laid out on an independent coordinate system, with lines connecting similar, or orthologous, genes in different genomes [ 4 – 8 ].

While these applications permit comparisons across several gene neighborhoods, they are not designed to address visual scalability and comparisons over large collections of neighborhoods. The visual design could be 'scaled-up' to accommodate larger data volumes, but orthology lines at this scale become difficult to trace, and visual clutter from many crossing orthology lines limits analysis. In addition, as display size increases, the user needs to trace lines over larger areas, which adds to their cognitive burden.

Color and layout techniques. In an alternative approach, similarity or 'orthology' between genes in different genomes is encoded with color, spatial positioning, or a combination of the two [ 9 , 10 ]. However, these approaches only accommodate comparisons between a few genomes (3-4 in published examples).

SequenceSurveyor is an overview visualization for large-scale genome alignment data that uses color and positioning to encode comparative information about coding sequences within genomes [ 11 ]. However, this application is designed to provide overviews of comparative data, and does not emphasize identification of specific comparative features within individual genomes, which is the goal of our work.

Large, high-resolution displays

Evidence suggests that users take advantage of increased resolution on large displays, and are able to scale up their perceptual processing to perform visual queries over larger volumes of data [ 12 ], with potential benefits for insight formation and discovery in visual exploration [ 13 ]. The display scale and resolution permit users to explore a large dataset through physical navigation instead of virtual navigation, allowing visualization designers to exploit embodied cognition, such as spatial memory, in complex data analysis scenarios [ 14 – 17 ]. [ 18 ] describe applications of such environments to visualizing large, heterogeneous scientific datasets.

These ideas have been applied to comparative genomics by Ruddle et al., who scaled genomics visualizations to large displays. However, many of the designs that were 'scaled-up' to big displays did not require adaptation, because they did not seek to enable the performance of comparative tasks across more genomes. Ruddle et al. also implemented a custom application for large and high-resolution displays, Orchestral, which uses color and alignment to visualize copy-number variations across one hundred genomes [ 19 ]. Orchestral was designed to enable a different comparative task than BactoGeNIE, and thus features distinct visual encodings and design decisions.

BactoGeNIE was developed through a two-year close collaboration with a group of eight comparative genomics researchers, in an industry research lab setting. Several of the researchers specialize in bacterial genomics, and the senior lead of the group has over 20 years of research experience in this field. All researchers in the group have several years of experience conducting research and developing approaches for analyzing genome sequence data. The main focus of the research team was on using computational methods on 'omics' datasets to support research in agriculture.

Over the two-year period we conducted weekly meetings. In addition, the lead author was embedded with the team for a period of several weeks, which allowed for daily meetings and observation of the research process with bacterial genomics data.

To gather requirements, we utilized ethnographic observation, interviews and focus groups. As part of this process, we conducted brainstorming sessions and also asked the researchers to describe and critique existing visualization tools, as well as discuss the limitations of automated analysis approaches. We utilized a combination of iterative and parallel design approaches, guided by regular feedback. The design stage employed whiteboard sketches, as well as paper prototypes and interactive prototypes ranging from lightweight to fully developed.

Data analysis

Genomics and bacterial terminology. Comparative genomics involves the analysis of similarity and dissimilarity in sequenced genomes. This similarity analysis can occur at several levels of detail, from whole genome comparison to gene sequence comparisons.

Our approach focused on genome sequence data collected from closely related bacterial species. Bacteria are small, single-celled organisms whose genome is on a single circular chromosome, and potentially several plasmids. A genome is the complete genetic material for an organism, and is composed of a linear sequence of subunits called nucleotides. This data is regularly stored in 'fasta' format files. Genomic data also includes annotations, such as information related to the coding sequences for genes. This data is stored in annotation files, often in genome feature file (gff) format.

A bacterial strain is a genetic variant, which may be described as related to other variants when it possesses similar properties and has a similar genome sequence.

Specific data entities. In this project, we focus on comparisons in gene neighborhoods around genes of interest. The researchers we worked with consider a gene neighborhood to consist of the 10-30 genes closest to a gene of interest. In particular, the researchers sought to identify orthologs, or genes with highly similar sequences within and across distinct genomes. The process of identifying orthologs involves sequence comparison algorithms. For this analysis, two genes are orthologous if they are found to have highly similar sequences through these sequence comparison algorithms. A set of genes with highly similar sequences is referred to as an ortholog cluster.
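As a minimal sketch of these data entities, the following Python snippet selects a neighborhood around a gene of interest and maps its genes to ortholog clusters. The function name, the `flank` parameter, the gene identifiers and the cluster lookup are all hypothetical and are not taken from BactoGeNIE; the sketch also assumes genes are already sorted by position on a single contig and that ortholog clusters have been computed elsewhere.

```python
def gene_neighborhood(ordered_genes, target_index, flank=10):
    """Return the target gene plus up to `flank` genes on either side.

    `ordered_genes` is assumed to be sorted by position on a single contig.
    The window is clipped at the contig ends.
    """
    start = max(0, target_index - flank)
    stop = min(len(ordered_genes), target_index + flank + 1)
    return ordered_genes[start:stop]


# Hypothetical lookup from gene identifier to ortholog cluster identifier,
# standing in for the output of a sequence comparison step.
ORTHOLOG_CLUSTER = {"g1": "A", "g2": "B", "g3": "C", "g4": "D", "g5": "E"}

contig = ["g1", "g2", "g3", "g4", "g5"]
window = gene_neighborhood(contig, target_index=2, flank=1)
clusters = [ORTHOLOG_CLUSTER[g] for g in window]
print(window, clusters)
```

With `flank=10` the window holds at most 21 genes, which falls inside the 10-30 gene range the researchers described; the exact window size is a choice for the analyst.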

Due to known sequencing limitations, drafts of complete genome sequences often are broken into pieces that are difficult to assemble, or stitch together. These pieces are referred to as contigs. Determining how to assemble these contigs and resolving breaks in the complete sequence is a significant challenge. In addition, genome annotations are sometimes missing or incomplete when genomic data is in a draft state. Identifying and resolving such annotation errors is a priority when working with newly sequenced genomes.

The coding sequence for a gene can be found in the organism's genome. Like a genome sequence, a gene's coding sequence consists of a sequence of nucleotides which determines the sequence and structure of the protein encoded by the gene. To locate a coding sequence for a gene in the genome, researchers refer to gff annotation files, which list a start and stop nucleotide position within the contig, as well as which of the two DNA strands the gene is encoded on. Often the annotation includes common names for the gene or descriptions that cover known functions for the protein encoded by the gene.

Task analysis

Overall, the domain experts sought support for hypothesis generation pertaining to gene function, particularly when examining previously uncharacterized genes, and bacterial strain behavior and evolution. To this end, the researchers were interested in comparison across large collections of neighborhoods; and in locating variations within one genome, or a set of genomes, in the context of many genomes.

Researchers noted that the value of this analysis for generating hypotheses depends on comparisons across large collections of neighborhoods, since large scale analysis increases statistical significance of features and outliers in ortholog content within neighborhoods. In addition, large collections of neighborhoods allow researchers to assess the weight of evolutionary selection for the content of a particular neighborhood, and make inferences about differences in bacterial strain behavior under different conditions.

Functional requirements

Broadly, experts sought to compare the neighborhoods around ortholog clusters of interest in order to identify outliers and common features. These features and outliers would be derived from sets of coding-sequence data attributes, such as gene position, size and orientation, across large collections of neighborhoods. Specifically, the researchers wished to compare and identify features and outliers that fall into three broad categories: 1) gene content in a neighborhood; 2) ortholog order and orientation in a neighborhood; and 3) context for addressing errors in the data.

The domain experts indicated that their workflow needed to transition between locating, browsing and exploring tasks, for example:

Unstructured exploration of a pre-filtered set of contigs,

Browsing in a location of interest, or

Locating specific genes or ortholog clusters of interest within a complete set of draft bacterial sequence strains.

The researchers further noted that visualization could particularly enable them to move smoothly between locating, browsing and exploring tasks, where automated approaches tended to require a predetermined target and/or location with little room for exploration.

Below we identify specific configurations of coding sequence data attributes within the neighborhood of an ortholog cluster of interest. These configurations inform the design of BactoGeNIE. The figures provided in this section are derived from paper prototypes we developed in collaboration with the domain experts.

Gene content in a neighborhood. Variations in the content of a neighborhood around an ortholog cluster of interest can include the presence or absence of an ortholog in a neighborhood or set of neighborhoods. These variations arise from insertion or deletion events in bacterial strain evolution. Since bacterial genomes often cluster genes involved in similar functions within the genome, changes in neighbors around a gene of interest may signal changes in function, such as an alteration in a biochemical pathway.

In addition, a variation in a gene's size, in terms of its length in nucleotides, indicates a potential truncated gene. Since a gene's sequence encodes a protein sequence, whose three-dimensional structure and function depend on this sequence, a significant change in length indicates a significant change in gene-product function. These variations are depicted in Figure 2.

Figure 2. Content variations: insertion, deletion and truncation. In this prototype visual encoding, each horizontal line represents a portion of a neighborhood around orthologs, in 3 bacterial strains. Orthologs have the same color and label. The insertion illustration shows a yellow gene in strain 2 whose ortholog is not present in strains 1 and 3. The deletion illustration has gene 'B' missing from strain 2, while it is present in strains 1 and 3. The truncation illustration shows gene B with smaller length in pixels in strain 2, corresponding to a smaller length in nucleotides compared to its orthologs in strains 1 and 3.
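The content variations described above can be illustrated with a simple automated check, sketched here under stated assumptions. Each neighborhood is represented as a mapping from ortholog cluster to gene length in nucleotides; the majority rule for insertions and deletions and the `min_length_ratio` threshold for truncations are illustrative choices, not the method used by the domain experts or by BactoGeNIE, which presents these patterns visually.

```python
from statistics import median


def content_variations(neighborhoods, min_length_ratio=0.8):
    """Flag insertions, deletions and truncations per strain.

    neighborhoods: {strain: {cluster_id: gene_length_in_nucleotides}}
    Returns a list of (strain, cluster_id, kind) tuples.
    """
    n = len(neighborhoods)
    clusters = set().union(*neighborhoods.values())
    findings = []
    for cluster in sorted(clusters):
        lengths = {
            s: genes[cluster]
            for s, genes in neighborhoods.items()
            if cluster in genes
        }
        present = len(lengths)
        typical = median(lengths.values())
        for strain in sorted(neighborhoods):
            if strain in lengths and present * 2 < n:
                # Present here but absent from most strains.
                findings.append((strain, cluster, "insertion"))
            elif strain not in lengths and present * 2 > n:
                # Absent here but present in most strains.
                findings.append((strain, cluster, "deletion"))
            elif strain in lengths and lengths[strain] < min_length_ratio * typical:
                # Much shorter than the typical ortholog length.
                findings.append((strain, cluster, "truncation"))
    return findings


# Made-up lengths combining the three cases of Figure 2 in one example.
example = {
    "strain1": {"A": 900, "B": 1200, "C": 600},
    "strain2": {"A": 900, "B": 500, "X": 700},
    "strain3": {"A": 900, "B": 1200, "C": 600},
}
result = content_variations(example)
print(result)
```

With only three strains a majority vote is fragile; as the paper notes, the strength of such an inference grows with the number of neighborhoods compared.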

Ortholog order and orientation. The domain experts wished to identify trends and outliers in ortholog order within large collections of neighborhoods around ortholog clusters of interest. These neighborhoods may have identical gene content, but significant differences in the order, number or orientation of orthologs, for instance from rearrangement, duplication or inversion events, depicted in Figure 3. Such differences are significant because variations in order may impact gene expression, since sets of gene neighbors are often transcribed in tandem in bacteria.

Figure 3. Order variations: inversion, rearrangement and duplication. In this prototype visual encoding, each horizontal line represents a portion of a neighborhood around orthologs, in 3 bacterial strains. Orthologs have the same color and label. The inversion illustration shows orthologs in strain 2 with a different orientation, when compared to strains 1 and 3. The rearrangement illustration shows orthologs in strain 2 in a different order, compared to strains 1 and 3. The duplication illustration shows two copies of gene 'B' in strain 2, and two copies of genes 'B' and 'B' in strain 1.
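A rough sketch of how the order variations above could be detected between two neighborhoods follows. Each neighborhood is a list of (ortholog cluster, strand) pairs in genomic order; the function name and the classification rules are simplifying assumptions made for illustration and do not describe BactoGeNIE's implementation.

```python
from collections import Counter


def order_variations(reference, other):
    """Compare two neighborhoods given as lists of (cluster_id, strand).

    Returns a set containing any of "duplication", "inversion",
    "rearrangement" observed in `other` relative to `reference`.
    """
    ref_ids = [c for c, _ in reference]
    oth_ids = [c for c, _ in other]
    ref_counts, oth_counts = Counter(ref_ids), Counter(oth_ids)
    findings = set()

    # Duplication: a reference cluster occurs more often in the other strain.
    if any(oth_counts[c] > ref_counts[c] for c in ref_counts):
        findings.add("duplication")

    # Restrict orientation and order checks to shared single-copy clusters.
    single = [c for c in ref_ids if ref_counts[c] == 1 and oth_counts[c] == 1]
    single_set = set(single)
    ref_strand, oth_strand = dict(reference), dict(other)

    # Inversion: the same cluster lies on the opposite strand.
    if any(ref_strand[c] != oth_strand[c] for c in single):
        findings.add("inversion")

    # Rearrangement: shared clusters occur in a different relative order.
    if [c for c in oth_ids if c in single_set] != single:
        findings.add("rearrangement")
    return findings


strain1 = [("A", "+"), ("B", "+"), ("C", "+")]
inverted = [("A", "+"), ("B", "-"), ("C", "+")]
rearranged = [("A", "+"), ("C", "+"), ("B", "+")]
duplicated = [("A", "+"), ("B", "+"), ("B", "+"), ("C", "+")]

print(order_variations(strain1, inverted))
print(order_variations(strain1, rearranged))
print(order_variations(strain1, duplicated))
```

Note that inverting a multi-gene segment reverses both strand and order, so this simplified check would report it as an inversion and a rearrangement together.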

Context for addressing errors in the data. In addition to presenting attributes pertaining to content, order and orientation, researchers introduced data verification tasks that also required coding sequence attributes related to genomic context. This type of task was noted by researchers as a high priority, because data verification could be significantly strengthened by visualization.

A variety of errors can arise in the process of generating complete genome sequences. Errors in annotation, which will appear as gaps between otherwise tightly clustered genes, and breaks in assembly, resulting in collections of contigs that would otherwise form a continuous sequence, complicate automated analysis of commonly recurring sequences around ortholog clusters of interest. These features are depicted in Figure 4 .

Figure 4. Context variations: gaps and breaks in assembly. In this prototype visual encoding, each horizontal line represents a portion of a neighborhood around orthologs, in 3 bacterial strains. Orthologs have the same color and label. These context variations in neighborhoods indicate potential errors in data generation. The first illustration shows a gap in strain 2, highlighted with a pink box, when compared to strains 1 and 3, indicating a potential error in genome data generation. The second illustrates breaks in genome assembly, and shows how comparative views may help users resolve such breaks.
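The gap pattern above, a stretch without annotated genes between otherwise tightly clustered genes, can be sketched as a simple scan over gene coordinates on one contig. The function name, the coordinate tuples and the `max_gap` threshold of 500 nucleotides are assumptions chosen for illustration; an appropriate threshold would depend on the genomes being studied.

```python
def find_gaps(genes, max_gap=500):
    """Return unannotated intervals longer than `max_gap` nucleotides.

    genes: list of (start, end) tuples on one contig, 1-based inclusive.
    Each result is the (first, last) position of a gap between two genes.
    """
    ordered = sorted(genes)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap_length = next_start - prev_end - 1
        if gap_length > max_gap:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


# Made-up coordinates: two adjacent genes, then a long unannotated stretch.
example_genes = [(100, 900), (950, 1800), (4000, 5200)]
print(find_gaps(example_genes))
```

A flagged interval is only a candidate: comparing the same region across many strains, as the researchers proposed, is what indicates whether the gap reflects a missing annotation or a genuine difference.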

By presenting neighborhoods around ortholog clusters of interest, the researchers believed they could identify potential errors in data generation and form hypotheses that may aid in fixing such errors. The strength of the hypothesis depends on the consistency of a feature across many neighborhoods.

Operating requirements: big data and high resolutions

In addition to the task-related requirements identified above, we also derived several requirements regarding the operating environment. The domain experts had acquired a multi-panel tiled display with four 46-inch panels at 2048 by 1536 pixels, with the explicit goal of enabling high-resolution data visualization and multiuser collaboration. The BactoGeNIE software was also tested on an 18-panel tiled display wall measuring 21.9 by 6.6 feet at 6144 by 2304 pixels. The domain experts also had personal workspaces with two monitors each.

Many tasks performed by users are at the level of a single neighborhood: the researchers are often considering a few dozen genes around a gene of interest. However, the analysis also takes place across large collections of neighborhoods, where it benefits from an increase in vertical pixels: the large display scale allows more neighborhoods to be displayed and compared. There are also instances where researchers noted benefits of an increase in horizontal pixels, for example in the initial stages of browsing and exploring long contigs, or in cases where a subset of the strains have a large sequence inserted within a neighborhood, or where a section of a neighborhood has been duplicated; such sections can appear at potentially considerable distances within a contig, and benefit from a larger horizontal area.

High-density encoding and layout. Many visualizations for comparing gene neighborhoods were noted to be ill-suited to supporting large-scale comparison tasks. These visual approaches were found to be fundamentally low-density, in that the visualization is not equipped to show information at high density even when given the pixels to do so. For some encodings, in particular the ones that encoded orthology by drawing lines between coding sequences or showed text labels by default for all genes on display, we estimated that an increase in the number of neighborhoods on display would produce an increase in visual clutter and impede the performance of analytic tasks. In other cases, the layout adopted to arrange neighborhoods, such as a circular layout, limits the number of contigs that can be shown without distortion. The identified requirement was a high-density visualization that accommodates spatial compression in order to show data across a high-resolution display without visual clutter.

Perceptual scalability. In working with domain experts on tiled display walls, as well as on personal multi-monitor display systems, we noted that many designs for comparing across gene neighborhoods do not scale up spatially across a big display, because an increase in display size seemed to hamper the perception of data and relationships. For instance, scaling up some representations to a large display moved related entities to opposite ends of the display, limiting direct visual comparison through eye movements. As a result, users appear to need to hold targets in working memory and perform a time-intensive visual search across the display, limiting the use of cognitive resources for higher-level operations. In addition, encodings of orthology that cannot be easily perceived at a distance require users to step up to the display, making it difficult to perceive patterns in the larger dataset.

We noted these problems with orthology-line-based comparative gene neighborhood visualizations, but also with prototypes that did not provide the opportunity to spatially cluster related neighborhoods .

Visual encodings and interaction design

In this section, we relate the data, tasks and requirements derived above to the BactoGeNIE visual encodings. Our top level design consists of a single global view, with details on demand, brushing and linking, and support for grouping and aligning collections of genomes. Since the task requirements captured through interviews, focus groups and discussions with the domain experts focused on identifying features and outliers within a single collection of sequenced genomes, we concentrated on supporting these tasks within a single view.

In this view, each row (6-10 pixels high) in the visualization shows one contig from one strain of bacteria; the bacterial strain is indicated by a text label on the left. The row height was determined empirically, based on the display size and user feedback. Within each contig, a sequence of arrow-glyphs depicts coding sequences; each contig may contain from one to hundreds of coding sequences. For each arrow-glyph, the position, size, and orientation of the glyph correspond to the position, size and strand of the coding sequence it represents. The default color applied to all coding-sequence glyphs is a neutral desaturated blue. Below we describe our mappings in detail.
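As a rough illustration of this mapping, the sketch below converts a coding sequence's start, end and strand into the vertices of an arrow polygon within a row. BactoGeNIE itself is implemented in C++ with Qt, so this Python version is not the tool's code; the function name, the `bp_per_pixel` scale and the arrowhead length are all assumptions made for the example.

```python
def arrow_glyph(start_bp, end_bp, strand, row_top, row_height=8,
                bp_per_pixel=100.0, head_px=4.0):
    """Return polygon vertices (x, y) for one coding-sequence arrow.

    start_bp, end_bp: position within the contig in nucleotides (start < end).
    strand: '+' draws an arrow pointing right, '-' one pointing left.
    All names and default values here are hypothetical.
    """
    x0 = start_bp / bp_per_pixel
    x1 = end_bp / bp_per_pixel
    head = min(head_px, x1 - x0)  # very short genes become a bare arrowhead
    top = row_top
    mid = row_top + row_height / 2
    bottom = row_top + row_height
    if strand == '+':
        neck = x1 - head
        return [(x0, top), (neck, top), (x1, mid), (neck, bottom), (x0, bottom)]
    neck = x0 + head
    return [(x1, top), (neck, top), (x0, mid), (neck, bottom), (x1, bottom)]
```

Position and size are carried by the horizontal extent of the polygon, and strand by the side on which the tip (the third vertex) falls, so the three attributes remain readable even at a row height of a few pixels.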

High-density encoding of genes and contigs

Based on the operating requirements identified during the domain analysis and early prototypes, we sought a high-density design for encoding contigs, coding sequences, and orthology clusters.

To create a high-density view, we considered several data abstractions for coding sequences. We considered abstractions that only encoded position relative to other coding sequences in the contig, for instance by just representing the order of coding sequences. However, this would not capture gaps between coding sequences, which are important for the data verification tasks described previously. In addition, we considered abstractions that did not encode coding sequence length in nucleotides. However, doing so did not allow researchers to identify truncations. We also considered abstractions that did not capture the strand which encodes the protein. However, this information is vital for the identification of inversions, as described in the previous section.

For coding sequences, we agreed with the experts that it was essential to encode position, strand and size, as well as to show clearly a sequence's contig and bacterial strain. We represent coding sequences via the arrow-glyph described earlier. The direction of the arrow encodes both the strand and direction of expression, allowing researchers to identify inversions and potential groups of coding sequences that are 'expressed' in tandem. Alternative representations of the coding sequence strand that we have explored include offset layouts, where coding sequences on one strand are positioned slightly higher than coding sequences on a different strand within the boundary of the contig. However, spatial positioning within a glyph has been shown to be less effective when depicting many entities on a high resolution display [ 15 ]. We also considered alternative glyphs, such as perpendicular lines or lines with a dot; however, the domain experts judged the arrow glyph to be more effective.

Orthology is primarily encoded through color. Instead of showing orthology by drawing lines between orthologous coding sequences, which limits information density and increases visual clutter, we apply a color across all orthologs through a set of selection and brushing and linking interactions. Given the large number of genes on display, we could not apply unique colors to all genes. Instead, the user selections drive the application of color across orthologs. There are several options for the application of color. Brushing and linking highlights in yellow the orthologs on screen. Users can then brush to apply a persistent color of their choice to an ortholog cluster, effectively 'tagging' the orthologs. Alternatively, the user can use an ortholog-cluster neighborhood targeting function , which groups contigs containing a selected ortholog cluster, aligns the entire contig to the cluster and applies a color gradient to the orthologs in the neighborhood.

We did not show textual identification information for coding sequences by default, since doing so seemed to 1) limit the number of genomes we could present, 2) increase visual clutter and 3) not enhance the performance of the tasks described previously. Instead, gene identities and other textual data that consume valuable screen space are available on demand. Contig and bacterial strain labels are shown by default on the left side of the display, connected to the contig by a line. These high-density design decisions are illustrated in Figure 5.

figure 5

This diagram illustrates an iterative adaptation of an existing 'low-density' design to the high-density encoding adopted by BactoGeNIE. At the top, we show existing 'low-density' orthology-line and text-label encodings. The first transformation, shown by the arrow, reduces the number of pixels for each genome and the gap between genomes, which results in visual clutter. Visual clutter is reduced by replacing lines with color to encode orthology. Finally, by removing text labels BactoGeNIE produces a high-density encoding suitable for large-scale comparative tasks.

Encoding for perception over big displays

We designed the BactoGeNIE encodings and interactions so that they enable smooth transitions between exploration, browsing and locating tasks, in the context of comparing and identifying the features and outliers described in the earlier section.

Genome sorting and ortholog cluster alignment . Our design adopts layouts for contigs and gene neighborhoods that enable juxtaposition and direct visual comparison, limiting the need for visual search over large numbers of entities on large physical display spaces.

To this end, we adopted spatial clustering techniques by implementing two grouping operations. First, we enable genome grouping, which moves contigs containing an ortholog cluster of interest to the top of the display. Second, we enable ortholog cluster alignment to line up orthologs in distinct genomes against a target gene, producing a vertical stack of related neighborhoods. In addition, these contigs are oriented to the direction of transcription of the selected ortholog, to make inversion events clearer. Grouping and aligning also have the effect of using spatial positioning to encode meaningful information, capitalizing on embodied cognition.
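The two operations can be sketched as a single pass over the contigs. The example below is a simplified illustration under stated assumptions, not the BactoGeNIE implementation: the data layout, the function name, and the choice to mirror a contig about its origin when the target lies on the reverse strand are all hypothetical, and only the first copy of the target cluster in each contig is considered.

```python
def group_and_align(contigs, target_cluster, center_px, bp_per_pixel=100.0):
    """Group contigs containing the target cluster and align them on it.

    contigs: list of {"name": str,
                      "genes": [(cluster_id, start_bp, end_bp, strand), ...]}
    Returns (name, x_offset_px, flipped) tuples in display order. Contigs
    lacking the target follow the hits, unshifted and unflipped.
    """
    hits, misses = [], []
    for contig in contigs:
        target = next((g for g in contig["genes"] if g[0] == target_cluster), None)
        if target is None:
            misses.append((contig["name"], 0.0, False))
            continue
        _, start, end, strand = target
        flipped = strand == '-'  # orient every target in the same direction
        mid_px = (start + end) / 2 / bp_per_pixel
        if flipped:
            mid_px = -mid_px     # a mirrored contig puts the gene at -x
        # Shift the contig so the target's midpoint lands on center_px.
        hits.append((contig["name"], center_px - mid_px, flipped))
    return hits + misses


contigs = [
    {"name": "s1", "genes": [("c1", 0, 1000, '+'), ("c2", 2000, 3000, '+')]},
    {"name": "s2", "genes": [("c3", 0, 500, '+')]},
    {"name": "s3", "genes": [("c2", 4000, 5000, '-')]},
]
layout = group_and_align(contigs, "c2", center_px=500)
```

Here `s1` and `s3` move to the top with their `c2` genes stacked at the same horizontal position, and `s3` is flipped so that both copies of the target point the same way.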

Ortholog-cluster neighborhood targeting function. To further enable rapid and immediate comparison and identification of the features and outliers described in the previous subsection across large collections of neighborhoods, we developed an ortholog-cluster neighborhood targeting function, the end result of which is depicted in the prototype illustration in Figure 6, as well as in the application in Figure 1. This function combines spatial clustering, through contig grouping and alignment against a target gene or ortholog cluster in one contig, with coloring on a gradient.

figure 6

This prototype visual encoding shows the view following application of the ortholog-cluster neighborhood targeting function to a neighborhood . Each horizontal line represents a portion of a neighborhood around orthologs, in 3 bacterial strains. Orthologs have the same color. Features and outliers of interest are highlighted in the diagram.

Given a target--an ortholog cluster or gene--the ortholog-cluster neighborhood targeting function groups the contigs in order to bring together spatially the neighborhoods that contain the target cluster. We position the target cluster in the center of the display. We then orient the cluster to show the same direction of transcription for the cluster across contigs. Finally, we apply a directional color gradient to the upstream and downstream sides of the target cluster. Through the gradient coloring, up to 30 adjacent genes on that contig (up to 15 on either side of the target gene) are given a color on a gradient, ranging from yellow, for genes closest to the target gene, to either green or blue, for genes farthest from the target gene. Blue genes lie in the direction of transcription, and green genes in the direction opposite to transcription. In this manner, the gradient highlights the order and distance of genes relative to the selected ortholog cluster target.
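The gradient step can be illustrated with a simple linear interpolation over gene rank. This is a minimal sketch rather than the tool's actual color computation: the text does not state the exact RGB end points or the interpolation used, so the color constants, the linear blend, and the decision to color the target itself yellow are assumptions.

```python
# Hypothetical RGB end points for the gradient.
YELLOW, GREEN, BLUE = (255, 255, 0), (0, 128, 0), (0, 0, 255)


def gradient_color(offset, max_genes=15):
    """Color for the gene `offset` positions away from the target gene.

    offset > 0: gene lies in the direction of transcription (toward blue).
    offset < 0: gene lies opposite to transcription (toward green).
    offset == 0: the target itself, colored yellow in this sketch.
    Returns None for genes more than max_genes away (left uncolored).
    """
    distance = abs(offset)
    if distance > max_genes:
        return None
    far = BLUE if offset > 0 else GREEN
    t = distance / max_genes  # 0.0 next to the target, 1.0 at the far end
    return tuple(round(a + (b - a) * t) for a, b in zip(YELLOW, far))
```

Because the color depends on rank rather than nucleotide distance, an insertion or deletion in one strain shows up as a shift in hue relative to the rows above and below it, which is what makes such variants stand out in a vertical stack.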

From Figure 6 , we can see that the end result highlights particular features and variants across sets of neighborhoods, suggesting that this design will make it easier for researchers to identify such variants across large collections of neighborhoods.

Details on demand: brushing and linking. Hovering over a gene opens a menu which shows detailed information about that gene and provides options to target that gene for further analysis, including applying a color to the gene and its orthologs, as well as sorting and aligning contigs and targeting the gene's ortholog cluster. The ability to apply selected colors to an ortholog cluster of interest effectively allows users to tag genes for comparison and identification of features and outliers within the related neighborhoods.

Users are able to navigate through the scene by clicking and dragging with a mouse. Vertical movements bring different subsets of contigs onto the display. Lateral mouse movements shift these contigs right and left, showing different subsets of genes. This type of interaction is important because even with a scalable design not all genes and genomes in a given dataset will be visible. Users can vary the contig height to control the density of the data display. In addition, we provide an 'accordion' action, where the pixel height of a contig is expanded in response to a hover event. The accordion action was requested by users, in order to better enable interaction with the contigs of interest in a high-density layout.

Implementation

In this project, we used both public and proprietary data. The public genome sequence data was obtained from the PubMed database of bacterial genomic data in draft, or unfinished, states. BactoGeNIE was implemented in C++ using the Qt API for graphics and user interface elements. BactoGeNIE requires input of at least two file types: genome feature files, which contain positions of genes and lengths of contigs, and FASTA sequence files, which contain raw sequences. The cd-hit algorithm is widely used for comparing protein or nucleotide sequences because it is fast and able to handle very large datasets [ 20 ]. After processing, data is stored in a MySQL database, with multiple threads handling data upload and data entry into the database. BactoGeNIE runs both on traditional display environments and tiled-display walls driven by a single machine. The displays used by our biological collaborators were not touch-enabled, and so we designed our system to accept input from a mouse placed close to the display. The researchers used the mouse to interact with the display, and stepped up to the display to discuss points of interest with collaborators. Our design is, however, largely compatible with touch; the only action that would require remapping to touch gestures is the mouse hover event.

Evaluation and results

Our evaluation approach includes a quantitative analysis of pure scalability in terms of pixels, a qualitative analysis with an in-depth case study, and a qualitative analysis with a group of domain experts.

Visual scalability evaluation

While the gene neighborhood approaches discussed in the related work section do not explicitly accommodate the upload and visualization of more than a few genomes in one view, we performed a synthetic analysis to measure the number of genomes that each approach could present at various display resolutions. The number of genomes for each tool at varied display resolutions in Figure 7 was computed by estimating the number of vertical pixels used to encode two neighborhoods and the orthology between genes in those neighborhoods; we then divided the vertical screen resolution for different displays by this estimate. The result is a quantitative analysis of pure scalability in terms of pixels, given the estimated density of an encoding, and provides a rationale for why many of the approaches in previous work do not scale well to large displays, when compared to BactoGeNIE.

figure 7

Large Display Scalability . Estimated number of contigs that can fit large displays of varied resolutions for related tools. BactoGeNIE is capable of displaying more gene neighborhoods simultaneously than other approaches, and will scale more effectively to large displays.
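The arithmetic behind this kind of estimate is a single division, sketched below. The per-pair pixel costs are illustrative assumptions: the dense values follow from the 6-10 pixel row height reported earlier (ignoring any spacing between rows), the 80-pixel cost for a line-based encoding is purely hypothetical, and counting two neighborhoods per pair is one interpretation of the procedure rather than something the text states. None of these values reproduce the measurements behind Figure 7.

```python
def neighborhoods_per_display(vertical_px, px_per_pair):
    """Estimate how many neighborhoods fit on a display vertically.

    vertical_px: vertical resolution of the display in pixels.
    px_per_pair: estimated vertical pixels needed to draw two neighborhoods
                 plus the orthology encoding between them.
    """
    return 2 * (vertical_px // px_per_pair)


WALL_VERTICAL_PX = 2304  # vertical resolution of the 18-panel wall

estimates = {
    "dense rows, 10 px each": neighborhoods_per_display(WALL_VERTICAL_PX, 20),
    "dense rows, 6 px each": neighborhoods_per_display(WALL_VERTICAL_PX, 12),
    "hypothetical line-based encoding": neighborhoods_per_display(WALL_VERTICAL_PX, 80),
}
for label, count in estimates.items():
    print(f"{label}: {count}")
```

Under these assumed costs, the dense encoding fits several times as many neighborhoods on the same wall, which is the qualitative pattern the figure reports.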

Case Study: 673 strain E.coli analysis

While BactoGeNIE has been adopted as a research tool by the domain experts [ 21 ], and has been and is currently employed to investigate genomic data, these data are sensitive and proprietary. The case study reported here reflects a similar usage scenario based on public data; the scenario has been developed in collaboration with the biologists.

In this case study, a biology researcher and co-author of this paper was interested in understanding the function of Escherichia coli ( E.coli ) hypothetical proteins. E.coli is a common bacterium, which is sometimes pathogenic, and which can also be used in drug design. Hypothetical proteins of E.coli have the characteristics of a gene; however, it is not known whether the hypothetical protein is expressed and translated into a protein, or whether this protein product performs a function in the bacteria. Hypothetical proteins are expensive to study experimentally. Early research in the field [ 2 ] has indicated, however, that genomic analysis may allow experts to generate hypotheses about the function and importance of such a protein, based on observations about its neighbors and the degree of conservation in its neighborhood.

The biology researcher performed her analysis on a large, high-resolution display (21.9 by 6.6 feet and 6144 by 2304 pixels) running BactoGeNIE. She loaded 'draft' or unfinished genomic data from 673 E.coli strains from the PubMed database. After examining their visual encoding, the biology researcher performed a set of exploring, browsing and locating tasks. The biology researcher first explored genes in this dataset using brushing and linking. Through these operations she was able to locate a coding sequence whose product was a hypothetical protein, and she selected this sequence for further analysis. Next, the biology researcher used the ortholog-cluster neighborhood targeting function to further locate all genes which were orthologous to the selected hypothetical protein, across the entire collection of E.coli complete genome sequences. Examining the result of the filtering operation, the researcher noted that this hypothetical protein had more than 100 potential orthologs. She then browsed through those coding sequences and examined their annotation labels, which presented basic gene information such as a protein name. She noted that this coding sequence was designated as a hypothetical protein within all strains of E.coli in the dataset.

The biology researcher then browsed the neighborhoods around these genes, to characterize common features within the neighborhood. In particular, she examined conservation--trying to determine whether the neighborhood around the hypothetical protein was resilient to changes across the genome collection. She identified multiple insertions, deletions and truncations , as well as inversions , as shown in Figure 8 . She further browsed these variations, applying colors outside the gradient to the variations of interest, and noted outliers as well as the strains in which these outliers arose. In addition, she identified a set of common breaks in assembly as well as gaps , which point to potential errors in assembly and annotation in the raw data, as described in the Task Analysis section. Several of the neighbors to the hypothetical protein were used in a subsequent round of ortholog-cluster targeting, to study their neighborhood.

figure 8

673 Strain E.coli Analysis: Applying the ortholog-cluster neighborhood targeting function to close to 700 strains of E.coli produced a view that enabled the identification of features and outliers within the neighborhood of a hypothetical protein, including insertions, inversions and other variations . In addition, breaks in assembly and gaps between genes indicate potential errors in data generation.

At this stage, the domain experts working in a similar manner on proprietary data developed hypotheses regarding the function of the examined proteins, based on the annotations associated with common gene neighborhoods and variant features. These hypotheses typically focused on one of two areas: 1) proposing potential processes or pathways in which the target gene may participate or 2) explaining the genetic basis of phenotypic variations within strains with unusual characteristics.

The biology researcher explained that, in the absence of BactoGeNIE, the E.coli analysis would have required complex data mining and time-consuming analysis software. The output of these algorithms could be long and difficult to decipher. In addition, the approach would require additional analysis to allow the researcher to identify truncation or inversion events, or relatively uncommon variants. While existing visual tools could help with analysis tasks performed within a single neighborhood in a few genomes, comparative genomics research takes place across large collections of neighborhoods, where it benefits from an increase in display size. Visual encodings and interactions alternative to BactoGeNIE did not scale well and were not attempted. While both small and large displays have been used on these data, the domain experts stated that they found small displays useful only for analyzing smaller, pre-filtered subsets of the data. Overall, in the absence of BactoGeNIE and its high information density encodings, it would have been difficult for genomics researchers to explore or browse this type of large scale, comparative data.

Domain expert feedback

The domain experts have been given access to both lightweight and fully developed prototypes at regular intervals during the design and development of BactoGeNIE. The eight experts were first shown demonstrations of application usage, and then given the opportunity to use each functional prototype for real analysis scenarios, in the context of their research. We used a "think-aloud" technique when observing the use of the application by the genomics researchers, and also allowed the researchers to use BactoGeNIE independently. Each evaluation session was followed by interviews and discussions to gather feedback. We also observed the use of BactoGeNIE in the context of group meetings, where we noted collaborative hypothesis generation and discussion of observed features and outliers. At each session we collected informal feedback, which was used to drive subsequent development.

This system has been adopted by the domain experts and is now used regularly by the team for analysis of large-scale bacterial genome sequences. Experts noted that BactoGeNIE fits their hypothesis generation workflow effectively, and that it helps them to identify candidates for further computational and automated analysis. The grouping and alignment functions were described as highly useful for enabling comparisons of neighborhoods of interest, especially with large data volumes. The domain experts have further indicated that the visual encodings we used to indicate orthology, both through brushing and linking and the ortholog-cluster neighborhood targeting function, were effective and allowed for rapid identification of variants across large collections of neighborhoods.

Several genomics researchers with years of experience in comparative genomics noted that our approach was unique and that none of the tools they had previously encountered enabled large-scale comparative analysis of gene neighborhoods. One genomics researcher commented that the absence of large-scale comparative visualizations in genomics was a significant technological gap, and that BactoGeNIE helped fill that gap. Other researchers noted that they wished to see this approach extended to large-scale comparative genomics problems for non-bacterial species, suggesting extensibility of the design.

Discussion and conclusion

The contributions of this work follow the Nested Design Model [ 22 ], and span visual encodings, a domain characterization and abstraction into visualization terminology, a description of the design space we explored, an evaluation with domain experts, and a discussion of the merits of our approach.

The domain expert feedback shows that BactoGeNIE effectively implements a design for comparing neighborhoods around ortholog clusters of interest in large collections of bacterial genomes. Further evaluation shows that the environment supports the identification of biologically significant variations and hypothesis generation around gene function and bacterial strain evolution. Overall, the tool allowed the users to incorporate expertise in their data analysis: for instance, users were able to test whether particular strains possess common variants.

As evidenced by the case study, BactoGeNIE accommodates analysis of significantly larger volumes of data, for instance close to 700 strains of E.coli, as compared to 9 strains in an application like Mauve. The case study and user feedback suggest that the design is also more visually scalable than other approaches, which generally rely on drawing lines between orthologs, creating a significant potential for visual clutter.

Our visual encodings and interactions appear to be particularly effective in supporting exploration and browsing for unexpected features and for outliers. For example, gene truncations do not factor in automated approaches which mine for common subsequences, and inversions may be lost unless the mining algorithm takes DNA strands into account. In contrast, truncation and inversion features are easily identifiable in our visualization tool. Furthermore, interactive brushing and linking enabled the domain experts to perform queries on large collections of data.

BactoGeNIE is particularly well suited to large and high resolution environments. Nevertheless, the domain experts also provided feedback which suggested effective use of this tool on personal workspaces and smaller displays, albeit displaying fewer visible neighborhoods at once. Essentially, our design appears to 'scale-down' to smaller displays more effectively than existing designs 'scale-up', suggesting that developing for such environments can also benefit users without access to large display technology. The benefits of embodied cognition in large display environments motivated the design of our alignment, grouping and ortholog cluster targeting encodings. However, embodied cognition was not explicitly investigated in the case study or reflected directly in the user feedback, and constitutes a direction of future work.

In terms of limitations, BactoGeNIE does not take into account the case where coding sequences lie on different strands within the same location. This situation occurs rarely in the datasets evaluated by our collaborators, but would need to be considered in future work. Second, horizontal scalability remains an issue despite the large size of the display. For instance, BactoGeNIE does not provide specific accommodation for contigs with more than one copy of a targeted ortholog cluster. There are circumstances where such genes will be offscreen and not noted by researchers. Finally, the most frequent request from users was to see additional grouping options for contigs following sort and the ortholog-cluster neighborhood targeting function, for instance ones that would group common variants in content, order and context, or apply an ordering that reflects evolutionary distance from phylogenetic trees.

As described in the Nested Design Model [ 22 ], at the visual encoding and interaction design level, "the threat is that the chosen design is not effective at communicating the desired abstraction to the person using the system". While our case study does not take the form of a formal user study to validate the visual encoding--partly because of the impracticality of achieving statistically significant results on such a small user group--we nevertheless followed the Nested Model guidelines in validating our encoding approach with respect to known perceptual and cognitive principles, and we further discussed, in context, our encoding choices (see Methods section).

In conclusion, in this work we introduced BactoGeNIE, a novel visualization design and application that enables comparisons across large collections of gene neighborhoods from complete bacterial genome sequences. BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type will be increasingly required in comparative genomics research, and we believe the design decisions and guiding principles enumerated here may inform such future work.

Wetterstrand KA: DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Accessed May 7, 2014, [ http://www.genome.gov/sequencingcosts ]

Overbeek R, Fonstein M, D'souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proceedings of the National Academy of Sciences. 1999, 96 (6): 2896-2901. 10.1073/pnas.96.6.2896.

Nielsen CB, Cantor M, Dubchak I, Gordon D, Wang T: Visualizing genomes: techniques and challenges. Nature methods. 2010, 7 (3 Suppl): S5-S15.

McKay S: Using the generic synteny browser. Plant and Animal Genome XX Conference (January 14-18, 2012). 2012, Plant and Animal Genome

Wang H, Su Y, Mackey AJ, Kraemer ET, Kissinger JC: Synview: a gbrowse-compatible approach to visualizing comparative genome data. Bioinformatics. 2006, 22 (18): 2308-2309. 10.1093/bioinformatics/btl389.

Pan X, Stein L, Brendel V: Synbrowse: a synteny browser for comparative sequence analysis. Bioinformatics. 2005, 21 (17): 3461-3468. 10.1093/bioinformatics/bti555.

Meyer M, Munzner T, Pfister H: Mizbee: a multiscale synteny browser. Visualization and Computer Graphics, IEEE Transactions. 2009, 15 (6): 897-904.

Darling AC, Mau B, Blattner FR, Perna NT: Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome research. 2004, 14 (7): 1394-1403. 10.1101/gr.2289704.

Price A, Kosara R, Gibas C: Gene-rivit: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes. Biological Data Visualization (BioVis), 2012 IEEE Symposium. 2012, IEEE, 57-62.

Fong C, Rohmer L, Radey M, Wasnick M, Brittnacher MJ: Psat: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes. BMC bioinformatics. 2008, 9 (1): 170-10.1186/1471-2105-9-170.

Albers D, Dewey C, Gleicher M: Sequence surveyor: Leveraging overview for scalable genomic alignment visualization. Visualization and Computer Graphics, IEEE Transactions. 2011, 17 (12): 2392-2401.

Yost B, Haciahmetoglu Y, North C: Beyond visual acuity: the perceptual scalability of information visualizations for large displays. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2007, ACM, 101-110.

Reda K, Johnson AE, Papka ME, Leigh J: Effects of display size and resolution on user behavior and insight acquisition in visual exploration. Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 2015, ACM, 2759-2768.

Andrews C, Endert A, Yost B, North C: Information visualization on large, high-resolution displays: Issues, challenges, and opportunities. Information Visualization. 2011, 10 (4): 341-355. 10.1177/1473871611415997.

Andrews C, Endert A, North C: Space to think: large high-resolution displays for sensemaking. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2010, ACM, 55-64.

Isenberg P, Dragicevic P, Willett W, Bezerianos A, Fekete JD: Hybrid-image visualization for large viewing environments. IEEE Transactions on Visualization & Computer Graphics. 2013, 19 (12): 2346-2355.

Endert A, Andrews C, Lee YH, North C: Visual encodings that support physical navigation on large displays. Proceedings of Graphics Interface 2011. 2011, Canadian Human-Computer Communications Society, 103-110.

Reda K, Febretti A, Knoll A, Aurisano J, Leigh J, Johnson AE, Papka ME, Hereld M: Visualizing large, heterogeneous data in hybrid-reality environments. IEEE Computer Graphics and Applications. 2013, 33 (4): 38-48. 10.1109/MCG.2013.37.

Ruddle RA, Fateen W, Treanor D, Sondergeld P, Quirke P: Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation. Biological Data Visualization (BioVis), 2013 IEEE Symposium. 2013, IEEE, 89-96. 10.1109/BioVis.2013.6664351.

Huang Y, Niu B, Gao Y, Fu L, Li W: Cd-hit suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093/bioinformatics/btq003.

Monsanto. Accessed: 2015-04-20, [ http://www.monsanto.com ]

Munzner T: A nested model for visualization design and validation. Visualization and Computer Graphics, IEEE Transactions. 2009, 15 (6): 921-928.

Declarations

This work and its publication were supported by grants NSF CAREER IIS-1541277, CNS-0959053 (CAVE2) and OCI-0943559 (SAGE). Special thanks to the Computational Biology team at Monsanto for providing the motivation and drive for this project, as well as for their help in evaluating BactoGeNIE. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agencies and companies.

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 11, 2015: Proceedings of the 5th Symposium on Biological Data Visualization: Part 1. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S11

Author information

Authors and affiliations.

Electronic Visualization Laboratory, University of Illinois at Chicago, 60607, Chicago, IL, USA

Jillian Aurisano, Andrew Johnson & Elisabeta G Marai

Argonne National Laboratory, 60439, Lemont, IL, USA

Khairi Reda

University of Hawai'i at Manoa, 96822, Honolulu, HI, USA

Khairi Reda & Jason Leigh

Corresponding author

Correspondence to Jillian Aurisano.

Additional information

Competing interests.

The authors declare that they have no competing interests.

Authors' contributions

JA conceived and directed the design, implementation, and evaluation of BactoGeNIE. KR assisted with design and research on perception and big display visualization design. AJ and JL provided direction for capturing user requirements, designing for big display environments and evaluating visual scalability. GEM helped characterize the domain in terms of data and tasks, articulate the design decisions, structure the case study and the discussion of results. JA and GEM contributed to and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article.

Aurisano, J., Reda, K., Johnson, A. et al. BactoGeNIE: a large-scale comparative genome visualization for big displays. BMC Bioinformatics 16 (Suppl 11), S6 (2015). https://doi.org/10.1186/1471-2105-16-S11-S6

Published: 13 August 2015

DOI: https://doi.org/10.1186/1471-2105-16-S11-S6

  • Comparative Genomics
  • Large Displays
  • Visualization

BMC Bioinformatics

ISSN: 1471-2105

Mitochondrial comparative genomics and phylogenetic signal assessment of mtDNA among arbuscular mycorrhizal fungi

Affiliations.

  • 1 Institut de Recherche en Biologie Végétale, Département de Sciences Biologiques, Université de Montréal, 4101 Rue Sherbrooke Est, Montréal (Québec) H1X 2B2, Canada.
  • 2 Institut de Recherche en Biologie Végétale, Département de Sciences Biologiques, Université de Montréal, 4101 Rue Sherbrooke Est, Montréal (Québec) H1X 2B2, Canada. Electronic address: [email protected].
  • PMID: 26868331
  • DOI: 10.1016/j.ympev.2016.01.009

Mitochondrial (mt) genes, such as cytochrome c oxidase genes (cox), have been widely used for barcoding in many groups of organisms, although this approach has been less powerful in the fungal kingdom due to the rapid evolution of their mt genomes. The use of mt genes in phylogenetic studies of Dikarya has been met with success, while early diverging fungal lineages remain less studied, particularly the arbuscular mycorrhizal fungi (AMF). Advances in next-generation sequencing have substantially increased the number of publicly available mtDNA sequences for the Glomeromycota. As a result, comparison of mtDNA across key AMF taxa can now be applied to assess the phylogenetic signal of individual mt coding genes, as well as concatenated subsets of coding genes. Here we show comparative analyses of publicly available mt genomes of Glomeromycota, augmented with two mt genomes that were newly sequenced for this study (Rhizophagus irregularis DAOM240159 and Glomus aggregatum DAOM240163), resulting in 16 complete mtDNA datasets. R. irregularis isolate DAOM240159 and G. aggregatum isolate DAOM240163 showed mt genomes measuring 72,293 bp and 69,505 bp with G+C contents of 37.1% and 37.3%, respectively. We assessed the phylogenies inferred from single mt genes and complete sets of coding genes, which are referred to as "supergenes" (16 concatenated coding genes), using Shimodaira-Hasegawa tests, in order to identify genes that best described AMF phylogeny. We found that the rnl, nad5, cox1, and nad2 genes, as well as a concatenated subset of these genes, provided phylogenies that were similar to the supergene set. This mitochondrial genomic analysis was also combined with principal coordinate and partitioning analyses, which helped to unravel certain evolutionary relationships in the Rhizophagus genus and for G. aggregatum within the Glomeromycota. We showed evidence to support the position of G. aggregatum within the R. irregularis 'species complex'.

Keywords: Arbuscular mycorrhizal fungi; Comparative mitochondrial genomics; Fungi; Genome evolution; Phylogenetic analysis.

Copyright © 2016 Elsevier Inc. All rights reserved.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • DNA, Mitochondrial / genetics*
  • Evolution, Molecular
  • Genes, Mitochondrial / genetics
  • Genome, Mitochondrial / genetics*
  • Glomeromycota / classification
  • Glomeromycota / genetics*
  • High-Throughput Nucleotide Sequencing
  • Mitochondria / genetics*
  • Mycorrhizae / classification
  • Mycorrhizae / genetics*

Substances

  • DNA, Mitochondrial
Comparative population genomics reveals convergent and divergent selection in the apricot-peach-plum-mei complex.

Xuanwen Yang, Ying Su and Siyang Huang contributed equally to this work.

Xuanwen Yang, Ying Su, Siyang Huang, Qiandong Hou, Pengcheng Wei, Yani Hao, Jiaqi Huang, Hua Xiao, Zhiyao Ma, Xiaodong Xu, Xu Wang, Shuo Cao, Xuejing Cao, Mengyan Zhang, Xiaopeng Wen, Yuhua Ma, Yanling Peng, Yongfeng Zhou, Ke Cao, Guang Qiao, Comparative population genomics reveals convergent and divergent selection in the Apricot-Peach-Plum-Mei Complex, Horticulture Research, 2024, uhae109, https://doi.org/10.1093/hr/uhae109

The economically significant genus Prunus includes fruit and nut crops that have been domesticated for shared and specific agronomic traits; however, the genomic signals of convergent and divergent selection have not been elucidated. In this study, we aim to detect genomic signatures of convergent and divergent selection by conducting comparative population genomic analyses of the Apricot-Peach-Plum-Mei (APPM) complex, utilizing a haplotype-resolved telomere-to-telomere (T2T) genome assembly and population resequencing data. The haplotype-resolved T2T reference genome for the plum cultivar was assembled from HiFi and Hi-C reads, resulting in two haplotypes of 251.25 Mb and 251.29 Mb, respectively. Comparative genomics reveals a chromosomal translocation of approximately 1.17 Mb in the apricot genomes compared to peach, plum, and mei. Notably, the translocation involves the D locus, significantly impacting acidity (TA), pH, and sugar content. Population genetic analysis detected substantial gene flow between plum and apricot, with introgression regions enriched in post-embryonic development and pollen germination processes. Comparative population genetic analyses revealed convergent selection for stresses, flower development, and fruit ripening, along with divergent selection shaping crop-specific genes, such as somatic embryogenesis in plum, pollen germination in mei, and hormone regulation in peach. Notably, selective sweeps on chromosome 7 coincide with a chromosomal co-linearity from the comparative genomics, impacting key fruit-softening genes such as PG, regulated by ERF and RMA1H1. Overall, this study provides insights into the genetic diversity, evolutionary history, and domestication of the APPM complex, offering valuable implications for genetic studies and breeding programs of Prunus crops.

  • Online ISSN 2052-7276
  • Print ISSN 2662-6810
  • Copyright © 2024 Nanjing Agricultural University

Dissertations / Theses on the topic 'Comparative genomics'

Consult the top 50 dissertations / theses for your research on the topic 'Comparative genomics.'

Loman, Nicholas James. "Comparative bacterial genomics." Thesis, University of Birmingham, 2012. http://etheses.bham.ac.uk//id/eprint/2839/.

Axelsson, Erik. "Comparative Genomics in Birds." Doctoral thesis, Uppsala : Acta Universitatis Upsaliensis, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-7432.

Eriksen, Niklas. "Combinatorial methods in comparative genomics." Doctoral thesis, KTH, Mathematics, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3508.

Manee, Manee. "Comparative genomics of noncoding DNA." Thesis, University of Manchester, 2016. https://www.research.manchester.ac.uk/portal/en/theses/comparative-genomics-of-noncoding-dna(d16aa46c-b8a2-4e6c-b825-d4246d3775fa).html.

Mikkelsen, Tarjei Sigurd 1978. "Mammalian comparative genomics and epigenomics." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/52808.

Ryder, Carol D. "Comparative genomics of Brassica oleracea." Thesis, University of Warwick, 2012. http://wrap.warwick.ac.uk/51651/.

Dong, Xin. "Comparative genomics of rickettsia species." Thesis, Aix-Marseille, 2012. http://www.theses.fr/2012AIXM5054/document.

Sentausa, Erwin. "Intraspecies comparative genomics of Rickettsia." Thesis, Aix-Marseille, 2013. http://www.theses.fr/2013AIXM5082/document.

Benevides, Leandro. "Comparative Genomics of Faecalibacterium spp." Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLS129.

St, Jean Andrew Louis. "Haloarchaeal comparative genomics and the local context model of genomic evolution." Thesis, University of Ottawa (Canada), 1996. http://hdl.handle.net/10393/10308.

Prakash, Amol. "Algorithms for comparative sequence analysis and comparative proteomics /." Thesis, Connect to this title online; UW restricted, 2006. http://hdl.handle.net/1773/6904.

Gaiarsa, S. "EVOLUTION, COMPARATIVE GENOMICS AND GENOMIC EPIDEMIOLOGY OF BACTERIA OF PUBLIC HEALTH IMPORTANCE." Doctoral thesis, Università degli Studi di Milano, 2017. http://hdl.handle.net/2434/525881.

Dessimoz, Christophe. "Comparative genomics using pairwise evolutionary distances /." Zürich : ETH, 2009. http://e-collection.ethbib.ethz.ch/show?type=diss&nr=18177.

Ulrich, Luke. "Comparative Genomics of Microbial Signal Transduction." Diss., Georgia Institute of Technology, 2005. http://hdl.handle.net/1853/7632.

Golenetskaya, Natalia. "Adressing scaling challenges in comparative genomics." Phd thesis, Université Sciences et Technologies - Bordeaux I, 2013. http://tel.archives-ouvertes.fr/tel-00865840.

Buchan, Daniel William Alexander. "Protein domain evolution by comparative genomics." Thesis, University College London (University of London), 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.407753.

Kamvysselis, Manolis 1977. "Computational comparative genomics : genes, regulation, evolution." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/7999.

Coates-Brown, Rosanna. "Comparative genomics of the skin staphylococci." Thesis, University of Liverpool, 2015. http://livrepository.liverpool.ac.uk/2046659/.

Moore, Matthew Phillip. "Comparative genomics of Pseudomonas aeruginosa populations." Thesis, University of Liverpool, 2017. http://livrepository.liverpool.ac.uk/3021281/.

Mularoni, Loris. "Comparative genomics of amino acid tandem repeats." Doctoral thesis, Universitat Pompeu Fabra, 2009. http://hdl.handle.net/10803/7187.

Wuichet, Kristin. "Comparative Genomics of the Microbial Chemotaxis System." Diss., Georgia Institute of Technology, 2007. http://hdl.handle.net/1853/16193.

Fuxelius, Hans-Henrik. "Methods and Applications in Comparative Bacterial Genomics." Doctoral thesis, Uppsala universitet, Molekylär evolution, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8398.

Garikipati, Dilip Kumar. "The comparative genomics and physiology of myostatin." Online access for everyone, 2007. http://www.dissertations.wsu.edu/Dissertations/Summer2007/D_Garikipati_070807.pdf.

Mostowy, Serge. "Comparative genomics of the Mycobacterium tuberculosis complex." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=111834.

Li, Yang. "Understanding lineage-specific biology through comparative genomics." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:23398cc7-8bbe-4f5a-8cd9-1104591400cc.

Park, Gyoungju Nah. "Comparative Genomics in Two Dicot Model Systems." Diss., The University of Arizona, 2008. http://hdl.handle.net/10150/194279.

Syme, Robert Andrew. "Comparative Genomics of Parastagonospora and Pyrenophora species." Thesis, Curtin University, 2015. http://hdl.handle.net/20.500.11937/54044.

Dang, Ha Xuan. "Mold Allergomics: Comparative and Machine Learning Approaches." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/64205.

Seibert, Sara Rose. "Host-parasite interactions: comparative analyses of population genomics, disease-associated genomic regions, and host use." Wright State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=wright1590585260282244.

Kang, Lin. "Comparative Genomics Insights into Speciation and Evolution of Hawaiian Drosophila." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/85467.

Epamino, George Willian Condomitti. "Alinhamento múltiplo de genomas de eucariotos com montagens altamente fragmentadas." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/95/95131/tde-31102017-102826/.

Steward, Karen Frances. "Comparative genomics of Streptococcus equi and Streptococcus zooepidemicus." Thesis, University of Cambridge, 2015. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.708551.

Migeon, Pierre. "Comparative genomics of repetitive elements between maize inbred lines B73 and Mo17." Thesis, Kansas State University, 2017. http://hdl.handle.net/2097/35377.

Åkerborg, Örjan. "Taking advantage of phylogenetic trees in comparative genomics." Doctoral thesis, KTH, Beräkningsbiologi, CB, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4757.

Åkerborg, Örjan. "Taking advantage of phylogenetic trees in comparative genomics /." Stockholm : School of Computer Science and Communication, Kungliga tekniska högskolan, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4757.

Abbas, Ali Hadi. "Comparative structural genomics and phylogenomics of African trypanosomes." Thesis, University of Liverpool, 2018. http://livrepository.liverpool.ac.uk/3022845/.

Xia, Ai. "Comparative genomics of chromosomal rearrangements in malaria mosquitoes." Diss., Virginia Tech, 2010. http://hdl.handle.net/10919/37335.

Lyman, Cole Andrew. "Comparative Genomics Using the Colored de Bruijn Graph." BYU ScholarsArchive, 2020. https://scholarsarchive.byu.edu/etd/8441.

Pryszcz, Leszek Piotr 1985. "Comparative genomics to unravel virulence mechanisms in fungal human pathogens." Doctoral thesis, Universitat Pompeu Fabra, 2014. http://hdl.handle.net/10803/301437.

Perrin, Amandine. "Tools for massive bacterial comparative genomics : Development and Applications." Thesis, Sorbonne université, 2022. https://tel.archives-ouvertes.fr/tel-03789655.

Bate, Rachael. "Mapping and gene identification within the Ids to Dmd region of the mouse X chromosome." Thesis, Oxford Brookes University, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.247810.

Karanam, Suresh Kumar. "Automation of comparative genomic promoter analysis of DNA microarray datasets." Thesis, Available online, Georgia Institute of Technology, 2004:, 2003. http://etd.gatech.edu/theses/available/etd-04062004-164658/unrestricted/karanam%5Fsuresh%5Fk%5F200312%5Fms.pdf.

Cheung, Hiu Tung (Tom). "Understanding mammalian transcriptional regulation using comparative and functional genomics." Diss., Connect to online resource, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3207751.

Saski, Christopher A. "Chloroplast comparative genomics implications for phylogeny, evolution and biotechnology /." Connect to this title online, 2007. http://etd.lib.clemson.edu/documents/1193080368/.

Wagner, Darlene Darlington. "Comparative genomics reveal ecophysiological adaptations of organohalide-respiring bacteria." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/45916.

Ellegaard, Kirsten Maren, Lisa Klasson, Kristina Näslund, Kostas Bourtzis, and Siv G. E. Andersson. "Comparative Genomics of Wolbachia and the Bacterial Species Concept." Uppsala universitet, Molekylär evolution, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-200821.

Nakjang, Sirintra. "Comparative genomics for studying the proteomes of mucosal microorganisms." Thesis, University of Newcastle Upon Tyne, 2011. http://hdl.handle.net/10443/1265.

Muller, Carolin Anne. "Comparative genomics of chromosome replication in sensu stricto yeasts." Thesis, University of Nottingham, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.603592.

Godinez, Ricardo. "Comparative Genomics of the Major Histocompatibility Complex in Amniotes." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10685.

Edwards, Martin Tavis. "Comparative prokaryotic genomics : conservation of functional and spatial context." Thesis, Birkbeck (University of London), 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.428024.

COMMENTS

  1. PDF COMPUTATIONAL COMPARATIVE GENOMICS: GENES, REGULATION, EVOLUTION by

    to coordinate the expression of these genes. Comparative genome analysis of related species provides a general approach for identifying these functional elements, by virtue of their stronger conservation across evolutionary time. In this thesis we address key issues in the comparative analysis of multiple species.

  2. PDF Comparative Genomics Identifies Conserved Factors that Regulate

    A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy ... Figure 2.1: Comparative genomics identifies candidate genes as putative stemness regulators. Figure 2.2: Functional analysis of candidate stemness regulators in planarians.

  3. A comparative genomics multitool for scientific discovery and

    Designing a comparative-genomics multitool. When selecting species, we sought to maximize evolutionary branch length, to include at least one species from each eutherian family, and to prioritize ...

  4. The NIH Comparative Genomics Resource: addressing the promises and

    This review provides examples of significant biological phenomena informed by comparative genomics that impact human health (see Fig. 2), presents challenges to this rapidly developing field, and indicates how CGR can meet those challenges. The expanding connection to sequenced organisms is integral to researching many of the traits of interest to human health (e.g., vision, metabolism) that ...

  5. Comparative Genomics, from the Annotated Genome to Valuable Biological

    Comparative Genomics, from the Annotated Genome to Valuable Biological Information: A Case Study. Authors: Sabina Zoledowska, Agata Motyka-Pomagruk, Agnieszka ... PhD thesis. University of Gdańsk. Golanowska M, Potrykus M, Motyka-Pomagruk A et al ...

  6. Application of comparative genomics for detection of genomic features

    In this thesis, I discussed comparative genomics approaches to these questions. First, I described genome features of eight recently sequenced plant species, with standards for comparison provided by the well established model species for dicots (Arabidopsis) and monocots (rice). Secondly, I discussed software and statistical models for ...

  7. Comparative Genomics, from the Annotated Genome to Valuable ...

    PhD thesis. University of Gdańsk. Golanowska M, Potrykus M, Motyka-Pomagruk A et al (2018) Comparison of ... The sequencing and comparative genomics analyses were funded from National Science Centre in Poland via 2014/14/M/NZ8/00501 granted to EL. National Science Centre in Poland via grant 2016/21/N/NZ1/02783 is currently ...

  8. Comparative genomics

    Comparative genomics is the direct comparison of complete genetic material of one organism against that of another to gain a better understanding of how species evolved and to determine the function of genes and noncoding regions in genomes. It includes a comparison of gene number, gene content, and gene location, the length and number of ...

  9. A comparative genomics multitool for scientific discovery and

    Designing a comparative-genomics multitool. When selecting species, we sought to maximize evolutionary branch length, to include at least one species from each eutherian family, and to prioritize species of medical, biological or biodiversity conservation interest. Our assemblies increase the percentage of eutherian families with a ...

  10. Computational algorithms for comparative genomics

    To this end, computational comparative genomics is an essential task for studying the organization, topology and conservation of genes and strings of genes that lends to a better biological understanding of gene function and annotation. ... This thesis presents new algorithms for gene matching to identify gene relationships across genomes (or ...

  11. Comparative Genomics

    Abstract. Comparative genomics is a science in its infancy. It has been driven by a huge increase in freely available genome-sequence data, and the development of computer techniques to allow whole-genome sequence analyses. Other approaches, which use hybridization as a method for comparing the gene content of related organisms, are rising ...

  12. PDF Comparative genomics of muskmelon reveals a potential role for ...

    a combination of comparative genomics and comprehensive transcriptome analysis, we suggest that retrotransposons played a role in the modification of gene expression as well as evolution of fruit-

  13. PDF Comparative Genomic Analysis of Catfish Linkage Group 27 With Teleost

    Therefore, the genomic research of catfish has been extensively studied by many scientific research methods including molecular genetics tools and comparative genomics (He et al. 2003, Serapion et al. 2004, Somridhivej et al. 2008, Xu et al. 2006, Kucuktas et al. 2009). Among these methods, comparative genomics is utilized as an

  14. PDF Advancing systems biology of yeast through machine learning ...

    To address these knowledge gaps, this thesis aims to leverage the large amounts of data available for yeast species and use state-of-the-art machine learning techniques and comparative genomic analysis to gain a deeper insight into yeast traits and metabolism. In this thesis, machine learning was applied to various unresolved biological problems on

  15. PDF Comparative Gene Identification in Mammalian, Fly, and Fungal Genomes

    In this thesis, I propose a methodology for systematically identifying protein-coding genes by comparative analysis of several related genomes. Although protein-coding genes are only one category of functional genomic elements, they have distinctive and well-studied properties that

  16. Everything at once: Comparative analysis of the genomes of bacterial

    1. Introduction. This review focuses on the comparative genomics of both human and animal bacterial pathogens. Comparative genomics has found a growing use in, (1) the definition of clusters of genetically and phenotypically related organisms that previously were not captured by taxonomic schemes based on a limited number of attributes such as biotype, serotype, phage type, etc., (2) molecular ...

  17. Comparative Genomics Guides Elucidation of Vitamin B

    Akkermansia muciniphila is a mucin-degrading bacterium found in the gut of most humans and is considered a "next-generation probiotic." However, knowledge of the genomic and physiological diversity of human-associated Akkermansia sp. strains is limited. Here, we reconstructed 35 metagenome-assembled genomes and combined them with 40 publicly available genomes for comparative genomic analysis.

  18. Comparative Genome Annotation

    Lin MF, Jungreis I, Kellis M (2011) PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27(13):i275-i282 Ulitsky I, Bartel DP (2013) lincRNAs: genomics, evolution, and mechanisms. Cell 154(1):26-46

  19. Frontiers

    A comparative genomics study of the microbiome and freshwater resistome in Southern Pantanal. ... A comparative analysis between the small lakes revealed that Abobral's small lake (Sites 1 and 2) had a greater diversity of unique ARGs (84) and ARGCs (21) than Aquidauana's small lake (Sites 3 and 4), which had 69 ARGs and 17 ARGCs. ...

  20. Comparative Genomics Fact Sheet

    Comparative Genomics Fact Sheet. Comparative genomics is a field of biological research in which researchers use a variety of tools to compare the complete genome sequences of different species. By carefully comparing characteristics that define various organisms, researchers can pinpoint regions of similarity and difference.

  21. BactoGeNIE: a large-scale comparative genome visualization for big

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches ...

  22. Comparative genomic analysis reveals a high prevalence of inter-species

    The genomic content and plasmid backbones of the class B metallo-β-lactamase are heterogeneous. HGT was very likely in the three patients with class B β-lactamases. The alignment of the IncN2 bla VIM-1 plasmids from all three isolates of P3 support in vivo HGT of these plasmids (Fig. 2D). Similarly, both isolates from P7 showed high ...

  23. Mitochondrial comparative genomics and phylogenetic signal ...

    This mitochondrial genomic analysis was also combined with principal coordinate and partitioning analyses, which helped to unravel certain evolutionary relationships in the Rhizophagus genus and for G. aggregatum within the Glomeromycota. We showed evidence to support the position of G. aggregatum within the R. irregularis 'species complex'.

  24. Comparative population genomics reveals convergent and divergent

    Comparative genomics reveals a chromosomal translocation of approximately 1.17 Mb in the apricot genomes compared to peach, plum, and mei. Notably, the translocation involves the D locus, significantly impacting acidity (TA), pH, and sugar content. Population genetic analysis detected substantial gene flow between plum and apricot, with ...

  25. Dissertations / Theses: 'Comparative genomics'

    Comparative genomic analysis carried out on T. brucei large chromosomes and the new PacBio de novo assemblies in this thesis T. congolense and T. vivax uncovered putative large structural chromosomal rearrangements between the African trypanosomes.