Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences


Abstract

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate this by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied data sets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
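The abstract names the approach but does not list its steps, so the following is only a minimal sketch of how a subsampled open-reference workflow might be organized. It assumes four stages: closed-reference assignment against a reference set, de novo clustering of a random subsample of the failures to obtain new reference centroids, a second closed-reference pass against those centroids, and de novo clustering of whatever still fails. The function names, the toy `identity` measure (a hypothetical stand-in for uclust), and the default `threshold` and `fraction` values are illustrative assumptions, not QIIME's API or its defaults.

```python
import random


def identity(a, b):
    """Toy similarity: fraction of matching positions (hypothetical stand-in for uclust)."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / n


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference at or above threshold.

    Reads are independent of one another here, which is why this stage
    could be split across workers. Returns (otus, failures).
    """
    otus, failures = {}, []
    for read in reads:
        hit = next((r for r in refs if identity(read, r) >= threshold), None)
        if hit is None:
            failures.append(read)
        else:
            otus.setdefault(hit, []).append(read)
    return otus, failures


def de_novo(reads, threshold):
    """Greedy centroid clustering: a read seeds a new cluster if it matches no centroid.

    Each read depends on the centroids created before it, so this stage is serial.
    """
    otus = {}
    for read in reads:
        hit = next((c for c in otus if identity(read, c) >= threshold), None)
        otus.setdefault(hit if hit is not None else read, []).append(read)
    return otus


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    """Sketch of the four assumed stages; threshold and fraction are illustrative."""
    rng = random.Random(seed)
    # Stage 1: closed-reference pass against the supplied references.
    otus, failures = closed_reference(reads, refs, threshold)
    # Stage 2: de novo clustering of a small random subsample of the failures;
    # the resulting centroids become a new reference set.
    k = max(1, round(len(failures) * fraction)) if failures else 0
    new_refs = list(de_novo(rng.sample(failures, k), threshold))
    # Stage 3: closed-reference pass of all failures against the new centroids.
    step3, remaining = closed_reference(failures, new_refs, threshold)
    # Stage 4: de novo clustering of reads that still match nothing.
    step4 = de_novo(remaining, threshold)
    for part in (step3, step4):
        for centroid, members in part.items():
            otus.setdefault(centroid, []).extend(members)
    return otus


if __name__ == "__main__":
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG", "GGGGGGGGGG"]
    result = subsampled_open_reference(reads, ["AAAAAAAAAA"], threshold=0.9, fraction=0.34)
    for centroid, members in result.items():
        print(centroid, members)
```

In this sketch only the two de novo stages are serial, and the first of them sees just a subsample, which is one way to read the abstract's claim that more of the work can run in parallel than in “classic” open-reference picking. Every read ends up in exactly one OTU, as in any open-reference approach.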

Author and article information

Contributors

J. Gregory Caporaso

Journal

Journal ID (nlm-ta): PeerJ

Journal ID (iso-abbrev): PeerJ

Journal ID (pmc): PeerJ

Journal ID (publisher-id): PeerJ

Title: PeerJ

Publisher: PeerJ Inc. (San Francisco, USA)

ISSN (Electronic): 2167-8359

Publication date (Electronic): 21 August 2014

Publication date Collection: 2014

Volume: 2

Electronic Location Identifier: e545

Affiliations

[1] Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA

[2] Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

[3] State Key Laboratory of Organ Failure Prevention, and Department of Environmental Health, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China

[4] Department of Computer Science, University of Colorado Boulder, Boulder, CO, USA

[5] Department of Molecular, Cellular, and Developmental Biology, University of Colorado at Boulder, Boulder, CO, USA

[6] Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, CO, USA

[7] Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL, USA

[8] Department of Biological Sciences, Northern Arizona University, AZ, USA

[9] BioFrontiers Institute, University of Colorado at Boulder, Boulder, CO, USA

[10] Institute for Genomics and Systems Biology, Argonne National Laboratory, Lemont, IL, USA

[11] Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA

[12] Department of Pathology and Laboratory Science, Warren Alpert Medical School, Brown University, Providence, RI, USA

[13] Howard Hughes Medical Institute, Boulder, CO, USA

Article

Publisher ID: 545

DOI: 10.7717/peerj.545

PMC ID: 4145071

PubMed ID: 25177538

SO-VID: 12a9ae4c-1f15-4cde-8197-7cfabca25e69

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

History

Date received: 10 June 2014

Date accepted: 6 August 2014

Funding

Funded by: EPA STAR Graduate Fellowship

Funded by: NSF IGERT

Award ID: 1144807

Funded by: Arizona’s Technology and Research Initiative Fund

Funded by: Alfred P. Sloan Foundation

Award ID: 2012-5-42 MBRP

SMG was supported by an EPA STAR Graduate Fellowship. DM was supported in part by NSF IGERT (award number: 1144807). This work was partially supported by a grant from Arizona’s Technology and Research Initiative Fund to JGC, and by a grant from the Alfred P. Sloan Foundation to JGC and RK (award number: 2012-5-42 MBRP). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
