
      The language of proteins: NLP, machine learning & protein sequences

      Review article


          Abstract

          Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have made rapid progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
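          The bag-of-words and k-mer/n-gram ideas mentioned in the abstract can be illustrated in a few lines. This is a minimal sketch, not code from the article; the sequence and function name are hypothetical:

```python
from collections import Counter

def kmer_bag_of_words(seq: str, k: int = 3) -> Counter:
    """Represent a protein as a bag of overlapping k-mers ("protein words")."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# A short, made-up amino-acid sequence: a 10-letter string yields 8 overlapping 3-mers.
bag = kmer_bag_of_words("MKTAYIAKQR", k=3)
```

          Such k-mer counts can serve directly as sparse feature vectors for classical machine-learning models, before any neural embedding is involved.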


          Most cited references (66)


          Long Short-Term Memory

          Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
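          The gating mechanism described above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration, not the authors' implementation; it assumes a combined weight matrix W (mapping the concatenated input and hidden state to the four stacked gate pre-activations) and a bias vector b:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM time step; W maps [x; h] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c + i * g   # additive cell update: the "constant error carousel"
    h = o * np.tanh(c)
    return h, c
```

          The additive update of the cell state c is what lets gradients flow across long time lags without vanishing, while the multiplicative gates learn when to write to, forget, and read from it.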

            The Protein Data Bank.

            The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
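            Structural data deposited in the PDB is commonly distributed in a fixed-column text format. As a minimal sketch (the ATOM lines and coordinate values below are illustrative, and the column slices follow the legacy PDB format):

```python
# Two ATOM records in the legacy fixed-column PDB format (illustrative values).
pdb_lines = [
    "ATOM      1  N   MET A   1      17.047  14.099   3.625  1.00 13.79           N",
    "ATOM      2  CA  MET A   1      16.967  12.784   4.338  1.00 10.80           C",
]

def parse_atoms(lines):
    """Extract atom name, residue name and coordinates from ATOM records."""
    atoms = []
    for line in lines:
        if line.startswith("ATOM"):
            atoms.append({
                "name": line[12:16].strip(),     # atom name (columns 13-16)
                "resname": line[17:20].strip(),  # residue name (columns 18-20)
                "x": float(line[30:38]),         # orthogonal coordinates in angstroms
                "y": float(line[38:46]),
                "z": float(line[46:54]),
            })
    return atoms

atoms = parse_atoms(pdb_lines)
```

            In practice, established parsers (e.g. those in Biopython) handle the many edge cases of the format; the slice-based version above only shows why the format is machine-friendly.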

              ImageNet classification with deep convolutional neural networks


                Author and article information

                Journal: Computational and Structural Biotechnology Journal (Comput Struct Biotechnol J)
                Publisher: Research Network of Computational and Structural Biotechnology
                ISSN: 2001-0370
                Published: 25 March 2021
                Volume: 19, pages 1750-1758
                Affiliations
                [a ]Medtronic, Inc, Israel
                [b ]The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
                [c ]Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
                Author notes
                [*] Corresponding author: nadav.brandes@mail.huji.ac.il
                Article
                PII: S2001-0370(21)00094-5
                DOI: 10.1016/j.csbj.2021.03.022
                PMCID: PMC8050421
                PMID: 33897979
                © 2021 The Author(s)

                This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

                History
                Received: 28 January 2021
                Revised: 19 March 2021
                Accepted: 19 March 2021
                Categories
                Review Article

                Keywords: natural language processing, deep learning, language models, BERT, bag of words, tokenization, word embedding, contextualized embedding, transformer, artificial neural networks, word2vec, bioinformatics
