
      MolData, a molecular benchmark for disease and target based machine learning


          Abstract

Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, offering the ability both to learn abstract features and to discover complex molecular patterns from molecular data. Since biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions, which can hinder their use by the machine learning community. In this work, a comprehensive disease- and target-based dataset is collected from PubChem in order to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing for multiple diseases, including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.
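To make the idea of correlation analysis between bioassays concrete, the following is a minimal sketch, not the authors' actual procedure. It assumes each assay can be represented as a mapping from molecule ID to a binary activity label, and that two assays are compared only over the molecules screened in both. The toy data, the molecule IDs, and the function name `assay_correlation` are hypothetical and do not come from the MolData files.

```python
from math import sqrt
from typing import Dict, Optional

# Hypothetical toy data: molecule ID -> binary activity (1 = active, 0 = inactive).
# A molecule absent from a dict is assumed not to have been screened in that assay.
assay_a = {"m1": 1, "m2": 0, "m3": 1, "m4": 0, "m5": 1}
assay_b = {"m1": 1, "m2": 0, "m3": 1, "m4": 1, "m6": 0}


def assay_correlation(a: Dict[str, int], b: Dict[str, int]) -> Optional[float]:
    """Pearson correlation of two assays over the molecules screened in both."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return None  # too little overlap to compare the assays
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one assay has constant labels
    return cov / sqrt(var_x * var_y)


r = assay_correlation(assay_a, assay_b)
print(r)
```

Restricting the comparison to shared molecules matters because screening collections rarely test every molecule in every assay; treating untested molecules as inactive would bias the correlation.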

          Supplementary Information

          The online version contains supplementary material available at 10.1186/s13321-022-00590-y.


                Author and article information

                Contributors
                arashka@knights.ucf.edu
                Journal
J Cheminform
Journal of Cheminformatics
Springer International Publishing (Cham)
                1758-2946
                7 March 2022
                2022
Volume: 14
Article number: 10
                Affiliations
[1] Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, USA (GRID grid.170430.1; ISNI 0000 0001 2159 2859)
[2] Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL, USA (GRID grid.170430.1; ISNI 0000 0001 2159 2859)
[3] Department of Chemistry, University of Illinois at Urbana-Champaign, Champaign, IL, USA (GRID grid.35403.31; ISNI 0000 0004 1936 9991)
                Author information
                http://orcid.org/0000-0003-4050-0897
                Article
                590
                10.1186/s13321-022-00590-y
                8899453
                8fa1a422-cfa4-4a4d-8d81-9e82e190cb66
                © The Author(s) 2022

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

                History
Received: 26 October 2021
Accepted: 13 February 2022
                Categories
                Research Article

                Chemoinformatics
artificial intelligence, benchmark, biological assays, big data, database, drug discovery, machine learning, PubChem
