Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

Abstract

In practice, it is common to find oneself with far too little text data to train a deep neural network. This "Big Data Wall" is a challenge for minority language communities on the Internet and for the organizations, laboratories and companies that compete with the GAFAM (Google, Amazon, Facebook, Apple, Microsoft). Most research on text data augmentation pursues the long-term goal of end-to-end learning solutions, which amounts to "using neural networks to feed neural networks". This engineering work focuses instead on practical, robust, scalable and easy-to-implement data augmentation pre-processing techniques similar to those that are successful in computer vision. Several text augmentation techniques were evaluated. Some existing ones, such as noise injection and the use of regular expressions, were tested for comparison purposes. Others, like lexical replacement, are modified or improved versions of existing techniques. Finally, more innovative ones, such as the generation of paraphrases by back-translation or by the transformation of syntactic trees, rely on robust, scalable, and easy-to-use NLP Cloud APIs. On a standardized text polarity prediction task, all the text augmentation techniques studied, with an amplification factor of only 5, increased accuracy by 4.3% to 21.6%, with significant statistical fluctuations. Several standard deep neural network architectures were tested: the multilayer perceptron (MLP), the long short-term memory recurrent network (LSTM) and the bidirectional LSTM (biLSTM). The classical XGBoost algorithm was also tested, with improvements of up to 2.5%.
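To make the idea of an amplification factor concrete, here is a minimal sketch of character-level noise injection, the simplest of the techniques named in the abstract. It is not the authors' implementation: the function names `inject_noise` and `augment`, the three edit operations, the `rate` parameter, and the reading of "amplification factor 5" as "each original plus four noisy copies" are all assumptions made for illustration.

```python
import random
import string


def inject_noise(text, rate, rng):
    """Return a copy of `text` with random character-level edits.

    Hypothetical helper: each alphabetic character is, with probability
    `rate`, deleted, duplicated, or replaced by a random lowercase letter.
    """
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "duplicate", "substitute"))
            if op == "delete":
                continue
            if op == "duplicate":
                out.append(ch)
                out.append(ch)
            else:
                out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def augment(texts, labels, factor=5, rate=0.05, seed=0):
    """Amplify a labeled corpus by `factor`.

    Assumption: the output holds each original text followed by
    (factor - 1) noisy copies, all sharing the original label.
    """
    rng = random.Random(seed)
    aug_texts, aug_labels = [], []
    for text, label in zip(texts, labels):
        aug_texts.append(text)
        aug_labels.append(label)
        for _ in range(factor - 1):
            aug_texts.append(inject_noise(text, rate, rng))
            aug_labels.append(label)
    return aug_texts, aug_labels


# Toy polarity data (made up for this sketch): 1 = positive, 0 = negative.
texts = ["the movie was great", "a dull and lifeless plot"]
labels = [1, 0]
aug_texts, aug_labels = augment(texts, labels)
```

The augmented lists can then be fed to any classifier in place of the original corpus. Labels are copied unchanged, which presumes that light character noise does not flip the polarity of a text.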

Author and article information

Publication date Created: 04 December 2018

Article

arXiv ID: 1812.04718

SO-VID: 2136d018-505f-4afb-9740-70af8bee3369

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments 33 pages, 25 figures

Categories cs.CL cs.LG

ScienceOpen disciplines: Theoretical computer science, Artificial intelligence
