
      A new semi-automated workflow for chemical data retrieval and quality checking for modeling applications

      research-article


          Abstract

          The quality of the data used to derive a QSAR model is extremely important, as it strongly affects the final robustness and predictive power of the model. Ambiguous or wrong structures need to be carefully checked, because they cause errors in the calculation of descriptors and hence lead to meaningless results. The increasing amount of data, however, often makes it impractical to check very large databases manually. In light of this, we designed and implemented a semi-automated workflow that integrates structural data retrieval from several web-based databases, automated comparison of these data, chemical structure cleaning, and selection and standardization of data into a consistent, ready-to-use format that can be employed for modeling. The workflow integrates best practices for data curation that have been suggested in the recent literature. It has been implemented with the freely available KNIME software and is freely available to the cheminformatics community for improvement and application to a broad range of chemical datasets.
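
          The "automated comparison" step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' KNIME workflow: the function name `consensus_structure`, the source names, the `min_agree` threshold, and the majority-vote rule are all assumptions made for illustration. It takes the identifier (here an InChIKey) that each database returned for one chemical and accepts it only if enough sources agree; anything else is flagged for manual review.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Known InChIKeys, used only as example data.
ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
CAFFEINE = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"


def consensus_structure(
    records: Dict[str, Optional[str]], min_agree: int = 2
) -> Tuple[Optional[str], str]:
    """Pick the identifier that most sources agree on (hypothetical rule).

    records maps a source name to the identifier it returned, or None/""
    when the source had no hit. Returns (identifier, status), where status
    is "consensus", "ambiguous" (needs manual check) or "missing".
    """
    hits = [value.strip() for value in records.values() if value and value.strip()]
    if not hits:
        return None, "missing"

    ranked = Counter(hits).most_common()
    top, votes = ranked[0]
    # A tie for first place means the sources disagree without a majority.
    tied = len(ranked) > 1 and ranked[1][1] == votes
    if votes >= min_agree and not tied:
        return top, "consensus"
    return None, "ambiguous"


if __name__ == "__main__":
    # Source names below are placeholders, not the databases used in the paper.
    retrieved = {"source_a": ASPIRIN, "source_b": ASPIRIN, "source_c": CAFFEINE}
    print(consensus_structure(retrieved))
```

          Records returned as "ambiguous" or "missing" are the ones a semi-automated process would hand back to a human curator instead of resolving silently.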

          Electronic supplementary material

          The online version of this article (10.1186/s13321-018-0315-6) contains supplementary material, which is available to authorized users.


          Most cited references (19)


          Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research.


            InChI, the IUPAC International Chemical Identifier

            This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.

              ToxCast Chemical Landscape: Paving the Road to 21st Century Toxicology

              The U.S. Environmental Protection Agency's (EPA) ToxCast program is testing a large library of Agency-relevant chemicals using in vitro high-throughput screening (HTS) approaches to support the development of improved toxicity prediction models. Launched in 2007, Phase I of the program screened 310 chemicals, mostly pesticides, across hundreds of ToxCast assay end points. In Phase II, the ToxCast library was expanded to 1878 chemicals, culminating in the public release of screening data at the end of 2013. Subsequent expansion in Phase III has resulted in more than 3800 chemicals actively undergoing ToxCast screening, 96% of which are also being screened in the multi-Agency Tox21 project. The chemical library underpinning these efforts plays a central role in defining the scope and potential application of ToxCast HTS results. The history of the phased construction of EPA's ToxCast library is reviewed, followed by a survey of the library contents from several different vantage points. CAS Registry Numbers are used to assess ToxCast library coverage of important toxicity, regulatory, and exposure inventories. Structure-based representations of ToxCast chemicals are then used to compute physicochemical properties, substructural features, and structural alerts for toxicity and biotransformation. Cheminformatics approaches using these varied representations are applied to defining the boundaries of HTS testability, evaluating chemical diversity, and comparing the ToxCast library to potential target application inventories, such as used in EPA's Endocrine Disruption Screening Program (EDSP). Through several examples, the ToxCast chemical library is demonstrated to provide comprehensive coverage of the knowledge domains and target inventories of potential interest to EPA. Furthermore, the varied representations and approaches presented here define local chemistry domains potentially worthy of further investigation (e.g., not currently covered in the testing library or defined by toxicity "alerts") to strategically support data mining and predictive toxicology modeling moving forward.

                Author and article information

                Contributors
                domenico.gadaleta@marionegri.it
                Journal
                J Cheminform
                Journal of Cheminformatics
                Springer International Publishing (Cham)
                1758-2946
                10 December 2018
                2018
                Volume: 10
                Article number: 60
                Affiliations
                ISNI 0000000106678902, GRID grid.4527.4, Laboratory of Environmental Chemistry and Toxicology, Department of Environmental Health Sciences, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Via la Masa 19, 20156 Milan, Italy
                Article
                315
                10.1186/s13321-018-0315-6
                6503381
                30536051
                0cb851af-1034-498d-a49f-0f88de722d3d
                © The Author(s) 2018

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 26 September 2018
                Accepted: 1 December 2018
                Funding
                Funded by: EUToxRisk
                Award ID: 681002
                Funded by: LIFE-COMBASE
                Award ID: LIFE15 ENV/ES/000416
                Categories
                Research Article
                Chemoinformatics
                qsar, data curation, data cleaning, semi-automated, workflow
