      Identifying and classifying goals for scientific knowledge

      research-article


          Abstract

          Motivation

          Science progresses by posing good questions, yet biomedical text mining has paid them little attention. We propose a novel task for biomedical natural language processing: identifying and characterizing the questions stated in the biomedical literature. Formally, the task is to identify and characterize statements of ignorance, that is, statements where scientific knowledge is missing or incomplete. Such technology could have many significant impacts, from training PhD students to ranking publications and prioritizing funding based on particular questions of interest. The work presented here is intended as a first step towards these goals.

          Results

          We present a novel ignorance taxonomy driven by the role that statements of ignorance play in research, identifying specific goals for future scientific knowledge. Using this taxonomy and reliable annotation guidelines (inter-annotator agreement above 80%), we created a gold standard ignorance corpus of 60 full-text documents from the prenatal nutrition literature with over 10 000 annotations, and used it to train classifiers that achieved F1 scores above 0.80.
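          For readers unfamiliar with the F1 metric reported above, the following is a minimal sketch of how precision, recall and F1 can be computed when comparing predicted annotations against a gold standard. It is not taken from the authors' repository: the tuple representation of an annotation, the exact-match criterion, the function name `precision_recall_f1` and the placeholder category labels are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical annotation representation (an assumption, not the corpus format):
# (document id, start offset, end offset, category label)
Annotation = Tuple[str, int, int, str]


def precision_recall_f1(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact matching of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data with placeholder labels; these are not categories from the taxonomy.
gold = {
    ("doc1", 0, 12, "label_a"),
    ("doc1", 40, 55, "label_b"),
    ("doc1", 90, 101, "label_a"),
    ("doc2", 5, 20, "label_c"),
    ("doc2", 33, 47, "label_b"),
}
predicted = {
    ("doc1", 0, 12, "label_a"),
    ("doc1", 40, 55, "label_b"),
    ("doc1", 90, 101, "label_a"),
    ("doc2", 5, 20, "label_c"),
    ("doc2", 60, 72, "label_b"),  # spurious span; the gold span at 33-47 is missed
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

          The same function could be applied to two annotators' annotation sets to obtain a pairwise agreement figure, although whether the reported inter-annotator agreement was computed this way is not stated in the abstract.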

          Availability and implementation

          The corpus and source code are freely available for download at https://github.com/UCDenver-ccp/Ignorance-Question-Work. The source code is implemented in Python.


                Author and article information

                Contributors
                Role: Associate Editor
                Journal
                Bioinformatics Advances (Bioinform Adv)
                Oxford University Press
                ISSN: 2635-0041
                Published: 28 July 2021
                Volume 1, Issue 1, vbab012
                Affiliations
                [1] Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
                [2] Health Informatics Program, College of Health Solutions at Arizona State University, Phoenix, AZ 85004, USA
                [3] Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
                Author notes
                To whom correspondence should be addressed. Mayla.Boguslav@CUAnschutz.edu
                Author information
                https://orcid.org/0000-0003-1455-3370
                Article
                Article ID: vbab012
                DOI: 10.1093/bioadv/vbab012
                PMCID: PMC8508177
                PMID: 34661112
                3e71b9ba-2192-4185-a699-6bbfe5263211
                © The Author(s) 2021. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 07 May 2021
                : 17 June 2021
                : 22 June 2021
                Page count
                Pages: 11
                Funding
                Funded by: NIH, DOI 10.13039/100000002;
                Award ID: T15LM009451
                Funded by: NIH, DOI 10.13039/100000002;
                Award ID: R01LM013400
                Funded by: NIH, DOI 10.13039/100000002;
                Award ID: R01LM008111
                Award ID: R01ES025722
                Award ID: R01HL136681-01
                Categories
                Original Paper
                AcademicSubjects/SCI01060
