COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature

Abstract

Background

Species occurrence records are essential in the biodiversity domain. Several available corpora contain annotations of species names, or of habitats and geographical locations, but no consolidated corpus covers all the entity types needed to extract species occurrences from biodiversity literature. To address this gap, we constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Results

Two annotators manually annotated the corpus with five categories of entities: taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is an F-score of approximately 81.86%. Amongst the five categories, agreement was lowest on habitat entities, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library, with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus achieved an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and Global Name Recognition and Discovery. More than 1,600 binary relations of the types Taxon-Habitat, Taxon-Person, Taxon-Geographical location and Taxon-Temporal expression were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.
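To make the agreement figure concrete, the following is a minimal sketch of how an inter-annotator F-score over entity annotations can be computed. It assumes exact matching on document, span boundaries and category, with one annotator treated as the reference. The abstract does not state which matching criterion the authors used, and the `Entity` type, the category labels and the offsets below are hypothetical.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """One annotated entity (hypothetical representation, not the corpus format)."""
    doc_id: str
    start: int      # character offset where the span begins
    end: int        # character offset where the span ends (exclusive)
    category: str   # e.g. "Taxon", "Habitat" (illustrative labels)


def agreement_f_score(annotator_a: Iterable[Entity],
                      annotator_b: Iterable[Entity]) -> float:
    """F-score of B against A under exact span-and-category matching (an assumption)."""
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # two empty annotation sets agree trivially
    matched = len(a & b)
    precision = matched / len(b) if b else 0.0
    recall = matched / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example: the annotators disagree on the boundary of one habitat span.
a = [Entity("d1", 0, 12, "Taxon"), Entity("d1", 20, 28, "GeographicalLocation"),
     Entity("d1", 35, 47, "Habitat"), Entity("d1", 50, 54, "TemporalExpression")]
b = [Entity("d1", 0, 12, "Taxon"), Entity("d1", 20, 28, "GeographicalLocation"),
     Entity("d1", 35, 41, "Habitat"), Entity("d1", 50, 54, "TemporalExpression")]
score = agreement_f_score(a, b)
print(f"{score:.2%}")
```

Under exact matching, a single boundary disagreement counts as both a miss and a spurious entity, which is one reason categories with long or loosely bounded spans tend to show lower agreement.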

Conclusion

The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in the literature, a useful task for monitoring species distribution and preserving biodiversity.
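As an illustration of the relation extraction step mentioned in the abstract, here is a toy sketch of one pattern-based rule applied to a sentence with gold entity annotations. It is not the authors' system: the `TRIGGER` regular expression, the `Mention` type, the category labels and the example sentence are all invented for this sketch. The rule links a taxon to a later entity in the same sentence when the text between them contains a trigger verb and ends with a preposition.

```python
import re
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """A gold entity mention inside one sentence (hypothetical representation)."""
    start: int
    end: int
    category: str


# Hypothetical pattern: a trigger verb, then a preposition right before the entity.
TRIGGER = re.compile(
    r"\b(found|collected|recorded|occurs?|observed)\b.*\b(in|at|on|from|by)\s*$",
    re.IGNORECASE,
)


def mention(sentence: str, text: str, category: str) -> Mention:
    """Locate the first occurrence of `text` and wrap it as a Mention."""
    start = sentence.index(text)
    return Mention(start, start + len(text), category)


def extract_relations(sentence: str,
                      mentions: List[Mention]) -> List[Tuple[str, str, str]]:
    """Return (taxon, relation type, argument) triples matched by the toy rule."""
    taxa = [m for m in mentions if m.category == "Taxon"]
    others = [m for m in mentions if m.category != "Taxon"]
    relations = []
    for t in taxa:
        for o in others:
            if o.start < t.end:
                continue  # this rule only covers "Taxon ... trigger ... entity"
            if TRIGGER.search(sentence[t.end:o.start]):
                relations.append((sentence[t.start:t.end],
                                  f"Taxon-{o.category}",
                                  sentence[o.start:o.end]))
    return relations


# Invented sentence, for illustration only.
sentence = "Shorea contorta was collected in Mount Makiling in 1915."
mentions = [
    mention(sentence, "Shorea contorta", "Taxon"),
    mention(sentence, "Mount Makiling", "GeographicalLocation"),
    mention(sentence, "1915", "TemporalExpression"),
]
relations = extract_relations(sentence, mentions)
for r in relations:
    print(r)
```

Each extracted triple pairs a taxon with a place, time, habitat or person, which is the raw material for a species occurrence record. A practical system would need many more patterns and would handle entities that precede the taxon.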

Author and article information

Contributors

Sophia Ananiadou

Journal

Journal ID (nlm-ta): Biodivers Data J

Journal ID (iso-abbrev): Biodivers Data J

Journal ID (pmc): Biodiversity Data Journal

Journal ID (publisher-id): Biodiversity Data Journal

Title: Biodiversity Data Journal

Publisher: Pensoft Publishers

ISSN (Electronic): 1314-2828

Publication date (Collection): 2019

Publication date (Electronic): 22 January 2019

Issue: 7

Electronic Location Identifier: e29626

Affiliations

[1] The National Centre for Text Mining, University of Manchester, Manchester, United Kingdom

[2] University of the Philippines Diliman, Quezon City, Philippines

[3] University of the Philippines Los Baños, Los Baños, Philippines

Author notes

Corresponding author: Sophia Ananiadou (sophia.ananiadou@manchester.ac.uk).

Academic editor: Anne Thessen

Author information

Nhung T.H. Nguyen https://orcid.org/0000-0001-5935-9235

Article

Publisher ID: Biodiversity Data Journal

Other ID: 10040

DOI: 10.3897/BDJ.7.e29626

PMC ID: 6351503

SO-VID: c692bf0c-9185-4f1c-bc49-35687bd6d9cc

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 08 September 2018

Date accepted: 03 January 2019

Page count

Figures: 3, Tables: 6, References: 49
