      DeepLoc 2.0: multi-label subcellular localization prediction using protein language models

      research-article

          Abstract

          The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning and enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage that it uses sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.
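The two ideas in the abstract, an attention output along the sequence and multi-label prediction, can be illustrated with a minimal sketch. This is not the DeepLoc 2.0 implementation: the three-label subset, the toy per-residue embeddings, the classifier weights, and the 0.5 thresholds are all assumptions chosen for illustration. Attention scores are normalized with a softmax so they can be read as per-residue importance, and each location gets an independent sigmoid probability so that a protein can receive more than one label.

```python
import math

# Illustrative label subset (an assumption; the abstract does not list the classes).
LABELS = ["Cytoplasm", "Nucleus", "Mitochondrion"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_pool(embeddings, attn_scores):
    """Attention-weighted average of per-residue vectors.

    Returns the pooled vector and the per-residue attention weights,
    which can be inspected along the sequence.
    """
    weights = softmax(attn_scores)
    dim = len(embeddings[0])
    pooled = [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
    return pooled, weights


def predict_labels(pooled, class_weights, class_bias, thresholds):
    """Independent sigmoid per class; every class above its threshold is kept."""
    probs = []
    for w_row, b in zip(class_weights, class_bias):
        logit = sum(w * x for w, x in zip(w_row, pooled)) + b
        probs.append(sigmoid(logit))
    chosen = [lab for lab, p, t in zip(LABELS, probs, thresholds) if p >= t]
    return chosen, probs


# Toy input: three residues with 2-dimensional embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_scores = [0.0, 0.0, 2.0]  # the third residue receives the most attention

pooled, weights = attention_pool(embeddings, attn_scores)

# Hypothetical linear classifier: one weight row and one bias per label.
class_weights = [[2.0, 0.0], [0.0, 2.0], [-3.0, -3.0]]
class_bias = [0.0, 0.0, 0.0]
thresholds = [0.5, 0.5, 0.5]

chosen, probs = predict_labels(pooled, class_weights, class_bias, thresholds)
print(chosen)
print([round(w, 3) for w in weights])
```

In this toy case two labels pass their thresholds at once, which is the multi-label behaviour described above, and the weight list peaks at the third residue, mirroring how an attention profile could be compared against the position of a sorting signal.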

          Graphical Abstract


          DeepLoc 2.0 uses a transformer-based protein language model to predict multi-label subcellular localization and provides interpretability through its attention output and sorting signal prediction.


                Author and article information

                Contributors
                Journal: Nucleic Acids Research (Nucleic Acids Res)
                Publisher: Oxford University Press
                ISSN: 0305-1048, 1362-4962
                Dates: 05 July 2022; 30 April 2022
                Volume: 50
                Issue: W1
                Pages: W228-W234
                Affiliations
                Indian Institute of Technology Madras, Chennai 600036, India
                Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
                Department of Genetics, Stanford University School of Medicine, Stanford 94305, CA, USA
                Department of Computer Science, Stanford University, Stanford 94305, CA, USA
                Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby 2800, Denmark
                Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen 2100, Denmark
                Department of Biology, Bioinformatics Centre, University of Copenhagen, Copenhagen 2200, Denmark
                Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
                Author notes
                To whom correspondence should be addressed. Email: henni@dtu.dk

                The authors wish it to be known that, in their opinion, these authors should be regarded as Joint First Authors.

                The authors wish it to be known that, in their opinion, these authors should be regarded as Joint Last Authors.

                Author information
                https://orcid.org/0000-0002-9412-9643
                Article
                Publisher ID: gkac278
                DOI: 10.1093/nar/gkac278
                PMC ID: 9252801
                PubMed ID: 35489069
                © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 05 February 2022
                Revised: 07 April 2022
                Accepted: 19 April 2022
                Page count
                Pages: 7
                Funding
                Funded by: Novo Nordisk Fonden, DOI 10.13039/501100009708;
                Award ID: NNF20OC0062606
                Funded by: Danish National Research Foundation, DOI 10.13039/501100001732;
                Award ID: P1
                Categories
                AcademicSubjects/SCI00010
                Web Server Issue

                Genetics
