      Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement


          Abstract

          Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired-end data from two Illumina libraries with small (e.g., 180 bp) and large (e.g., 3–5 Kb) inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants, including duplications, and to resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.
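
          As a rough illustration of the two-library setup the abstract describes, the sketch below assembles a Pilon command line in Python without running it. The flags (`--genome`, `--frags`, `--jumps`, `--output`, `--changes`, `--vcf`) follow Pilon's command-line help as best we know it and should be checked against the installed version. The jar path, file names, and the helper `build_pilon_command` are hypothetical. The BAM files are assumed to be sorted, indexed alignments of each library's reads to the draft assembly.

```python
from typing import List, Optional


def build_pilon_command(
    genome_fasta: str,
    frags_bam: str,
    jumps_bam: Optional[str] = None,
    output_prefix: str = "pilon",
    jar_path: str = "pilon.jar",  # hypothetical location of the Pilon jar
) -> List[str]:
    """Return an argument list for a Pilon run; nothing is executed here.

    frags_bam: alignments from the small-insert (fragment) library.
    jumps_bam: alignments from the large-insert (jumping) library, if any.
    """
    cmd = ["java", "-jar", jar_path, "--genome", genome_fasta, "--frags", frags_bam]
    if jumps_bam is not None:
        # The large-insert library is optional; add it only when supplied.
        cmd += ["--jumps", jumps_bam]
    # Ask for a list of applied changes and a VCF of variant calls.
    cmd += ["--output", output_prefix, "--changes", "--vcf"]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names for a draft assembly and its two libraries.
    print(" ".join(build_pilon_command("draft.fasta", "frags_180bp.bam", "jumps_3kb.bam")))
```

          Returning a list rather than a shell string lets the result be passed directly to `subprocess.run` without quoting issues.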


          Most cited references (26)


          Fast algorithms for large-scale genome alignment and comparison.

          We describe a suffix-tree algorithm that can align the entire genome sequences of eukaryotic and prokaryotic organisms with minimal use of computer time and memory. The new system, MUMmer 2, runs three times faster while using one-third as much memory as the original MUMmer system. It has been used successfully to align the entire human and mouse genomes to each other, and to align numerous smaller eukaryotic and prokaryotic genomes. A new module permits the alignment of multiple DNA sequence fragments, which has proven valuable in the comparison of incomplete genome sequences. We also describe a method to align more distantly related genomes by detecting protein sequence homology. This extension to MUMmer aligns two genomes after translating the sequence in all six reading frames, extracts all matching protein sequences and then clusters together matches. This method has been applied to both incomplete and complete genome sequences in order to detect regions of conserved synteny, in which multiple proteins from one organism are found in the same order and orientation in another. The system code is being made freely available by the authors.

            BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences.

            'BLAST 2 Sequences', a new BLAST-based tool for aligning two protein or nucleotide sequences, is described. While the standard BLAST program is widely used to search for homologous sequences in nucleotide and protein databases, one often needs to compare only two sequences that are already known to be homologous, coming from related species or, e.g. different isolates of the same virus. In such cases searching the entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes the BLAST algorithm for pairwise DNA-DNA or protein-protein sequence comparison. A World Wide Web version of the program can be used interactively at the NCBI WWW site (http://www.ncbi.nlm.nih.gov/gorf/bl2.html). The resulting alignments are presented in both graphical and text form. The variants of the program for PC (Windows), Mac and several UNIX-based platforms can be downloaded from the NCBI FTP site (ftp://ncbi.nlm.nih.gov).

              REAPR: a universal tool for genome assembly evaluation

              Methods to reliably assess the accuracy of genome sequence data are lacking. Currently completeness is only described qualitatively and mis-assemblies are overlooked. Here we present REAPR, a tool that precisely identifies errors in genome assemblies without the need for a reference sequence. We have validated REAPR on complete genomes or de novo assemblies from bacteria, malaria and Caenorhabditis elegans, and demonstrate that 86% and 82% of the human and mouse reference genomes are error-free, respectively. When applied to an ongoing genome project, REAPR provides corrected assembly statistics allowing the quantitative comparison of multiple assemblies. REAPR is available at http://www.sanger.ac.uk/resources/software/reapr/.

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                Public Library of Science (San Francisco, USA )
                1932-6203
                2014
                19 November 2014
                Volume: 9
                Issue: 11
                Article number: e112963
                Affiliations
                [1 ]Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
                [2 ]VIB Department of Plant Systems Biology, Ghent University, Ghent, Belgium
                The University of Hong Kong, Hong Kong
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: BJW TA CAC QZ JW SKY AME. Performed the experiments: BJW TA TS SKY. Analyzed the data: BJW TA TS MP AA SS CAC QZ JW SKY AME. Wrote the paper: BJW TA TS MP AA SS CAC SKY AME.

                [¤]

                Current address: Applied Minds, LLC, Boston, Massachusetts, United States of America

                Article
                Manuscript number: PONE-D-14-38252
                DOI: 10.1371/journal.pone.0112963
                PMCID: PMC4237348
                PMID: 25409509
                c0865889-283a-4b8c-b4c3-7a9126e2cdf6
                Copyright © 2014

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 25 August 2014
                Accepted: 16 October 2014
                Page count
                Pages: 14
                Funding
                This project has been funded in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272200900018C. This project has also been funded in part with Federal funds from the National Human Genome Research Institute, National Institutes of Health, Department of Health and Human Services, under grant U54HG003067. TA is a postdoctoral fellow of the Research Foundation-Flanders. The funders played no role in the collection, analysis, and interpretation of data; in the writing of the manuscript; or in the decision to submit the manuscript for publication.
                Categories
                Research Article
                Biology and Life Sciences
                Computational Biology
                Genome Analysis
                Genetics
                Genomics
                Microbial Genomics
                Microbiology
                Computer and Information Sciences
                Computer Software
                Open Source Software
                Custom metadata
                The authors confirm that all data underlying the findings are fully available without restriction. All sequence data files are available from the Sequence Read Archive database (accession numbers SRX347313, SRX347312, SRX105400, SRX110130, SRX347317 and SRX347316).
