      VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses

      research-article

          Abstract

          Background

          Viruses are significant players in many biosphere and human ecosystems, but most viral signals remain “hidden” in metagenomic/metatranscriptomic sequence datasets because universal gene markers and database representatives are lacking and identification tools are insufficiently advanced.

          Results

          Here, we introduce VirSorter2, a DNA and RNA virus identification tool that leverages genome-informed database advances across a collection of customized automatic classifiers to improve the accuracy and range of virus sequence detection. When benchmarked against genomes from both isolated and uncultivated viruses, VirSorter2 uniquely performed consistently with high accuracy (F1-score > 0.8) across viral diversity, while all other tools under-detected viruses outside of the group most represented in reference databases (i.e., those in the order Caudovirales). Among the tools evaluated, VirSorter2 was also uniquely able to minimize errors associated with atypical cellular sequences including eukaryotic genomes and plasmids. Finally, as the virosphere exploration unravels novel viral sequences, VirSorter2’s modular design makes it inherently able to expand to new types of viruses via the design of new classifiers to maintain maximal sensitivity and specificity.
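
The F1-score threshold quoted above combines precision and recall into a single number. As a minimal sketch (not the authors' benchmarking code), the standard definition can be computed as follows; the function name `f1_score` and the counts in the usage line are hypothetical and chosen only for illustration.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall)."""
    # Precision: fraction of sequences called viral that really are viral.
    called = true_pos + false_pos
    precision = true_pos / called if called else 0.0
    # Recall: fraction of truly viral sequences that were detected.
    actual = true_pos + false_neg
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 viral sequences detected, 10 false calls, 30 missed.
example = f1_score(true_pos=90, false_pos=10, false_neg=30)
print(f"F1 = {example:.3f}")
```

With these illustrative counts, precision is 0.90 and recall is 0.75, so the F1-score lands just above the 0.8 threshold mentioned in the abstract.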

          Conclusion

          With its multi-classifier and modular design, VirSorter2 demonstrates higher overall accuracy across major viral groups and will advance our knowledge of virus evolution, diversity, and virus-microbe interactions in various ecosystems. Source code of VirSorter2 is freely available (https://bitbucket.org/MAVERICLab/virsorter2), and VirSorter2 is also available on bioconda and as an iVirus app on CyVerse (https://de.cyverse.org/de).

          Supplementary Information

          The online version contains supplementary material available at 10.1186/s40168-020-00990-y.

                Author and article information

                Contributors
                sullivan.948@osu.edu
                sroux@lbl.gov
                Journal
                Microbiome
                BioMed Central (London)
                ISSN: 2049-2618
                Published: 1 February 2021
                Volume: 9
                Article number: 37
                Affiliations
                [1] GRID grid.261331.4, ISNI 0000 0001 2285 7943, Department of Microbiology, Ohio State University, Columbus, OH 43210, USA
                [2] GRID grid.215654.1, ISNI 0000 0001 2151 2636, The Biodesign Center for Fundamental and Applied Microbiomics, Center for Evolution and Medicine, School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA
                [3] GRID grid.7836.a, ISNI 0000 0004 1937 1151, Structural Biology Research Unit, Department of Integrative Biomedical Sciences, University of Cape Town, Observatory, Cape Town, 7701, South Africa
                [4] GRID grid.460789.4, ISNI 0000 0004 4910 6535, Génomique Métabolique, Genoscope, Institut François Jacob, CEA, CNRS, Univ Evry, Université Paris-Saclay, 91057 Evry, France
                [5] Viromica, 7870582 Santiago, Chile
                [6] GRID grid.261331.4, ISNI 0000 0001 2285 7943, Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, OH 43210, USA
                [7] GRID grid.261331.4, ISNI 0000 0001 2285 7943, Center of Microbiome Science, Ohio State University, Columbus, OH 43210, USA
                [8] GRID grid.184769.5, ISNI 0000 0001 2231 4551, DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
                Author information
                http://orcid.org/0000-0001-8398-8234
                Article
                Article ID: 990
                DOI: 10.1186/s40168-020-00990-y
                PMCID: PMC7852108
                PMID: 33522966
                4b03ed67-b0fe-44ec-bf66-849ebe928fb7
                © The Author(s) 2021

                Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
                Received: 4 July 2020
                Accepted: 29 December 2020
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/100000001, National Science Foundation;
                Award ID: OCE1829831
                Award ID: ABI1758974
                Funded by: FundRef http://dx.doi.org/10.13039/100000015, U.S. Department of Energy;
                Award ID: DE-SC0020173
                Award ID: DE-AC02-05CH11231
                Funded by: Gordon and Betty Moore Foundation (US)
                Award ID: #3790
                Categories
                Software Article