Open Access
      Predicting tongue motion in unlabeled ultrasound videos using convolutional LSTM neural network

      Preprint

          Abstract

A challenge in speech production research is to predict future tongue movements based on a short period of past tongue movements. This study tackles the speaker-dependent tongue motion prediction problem in unlabeled ultrasound videos with convolutional long short-term memory (ConvLSTM) networks. The model has been tested on two different ultrasound corpora. ConvLSTM outperforms a 3-dimensional convolutional neural network (3DCNN) in predicting the 9th frame from the 8 preceding frames, and also demonstrates good capacity to predict only the tongue contours in future frames. Further tests reveal that ConvLSTM can also learn to predict tongue movements in more distant frames beyond the immediately following frame. Our code is available at: https://github.com/shuiliwanwu/ConvLstm-ultrasound-videos.
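The core idea of a ConvLSTM is that the dense matrix products inside a standard LSTM cell are replaced by 2D convolutions, so the hidden and cell states keep the spatial layout of the video frames. A minimal single-step sketch in NumPy follows; the shapes, 3x3 filter size, and the `conv2d_same` and `convlstm_step` helpers are illustrative assumptions, not the authors' implementation (which is in their repository above).

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padded 2D convolution (cross-correlation, as in
    deep-learning frameworks).

    x: (C_in, H, W) input feature maps
    w: (C_out, C_in, k, k) filters, k odd
    returns: (C_out, H, W)
    """
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + H, dj:dj + W]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, Wx, Wh, b):
    """One ConvLSTM step: the input (i), forget (f), output (o) and
    candidate (g) gates are computed with convolutions, so h and c
    retain their (C_hid, H, W) spatial structure.

    x:  (C_in, H, W)   current frame (or feature map)
    h:  (C_hid, H, W)  previous hidden state
    c:  (C_hid, H, W)  previous cell state
    Wx: (4*C_hid, C_in, k, k)   input-to-state filters
    Wh: (4*C_hid, C_hid, k, k)  state-to-state filters
    b:  (4*C_hid, 1, 1)         gate biases
    """
    gates = conv2d_same(x, Wx) + conv2d_same(h, Wh) + b
    n = gates.shape[0] // 4
    i = sigmoid(gates[0 * n:1 * n])
    f = sigmoid(gates[1 * n:2 * n])
    o = sigmoid(gates[2 * n:3 * n])
    g = np.tanh(gates[3 * n:4 * n])
    c_new = f * c + i * g          # gated update of the cell state
    h_new = o * np.tanh(c_new)     # spatial hidden state
    return h_new, c_new
```

In a next-frame prediction setup of the kind the abstract describes, the 8 context frames would be fed through this step one at a time, and the final hidden state would be mapped to the predicted 9th frame (e.g. by a 1x1 convolutional readout).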

          Related collections

Most cited references (5)


          Learning Spatiotemporal Features Using 3DCNN and Convolutional LSTM for Gesture Recognition


            Convolutional neural network-based automatic classification of midsagittal tongue gestural targets using B-mode ultrasound images.

Tongue gestural target classification is of great interest to researchers in the speech production field. Recently, deep convolutional neural networks (CNNs) have shown superiority to standard feature extraction techniques in a variety of domains. In this letter, both CNN-based speaker-dependent and speaker-independent tongue gestural target classification experiments are conducted to classify tongue gestures during natural speech production. The CNN-based method achieves state-of-the-art performance, even though no pre-training of the CNN was carried out, apart from a data augmentation preprocessing step.

              A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization.

The feasibility of automatic re-initialization of contour tracking is explored using an image similarity-based method on ultrasound tongue sequences. The re-initialization method was incorporated into current state-of-the-art tongue tracking algorithms, and a quantitative comparison was made between the algorithms by computing mean sum of distances errors. The results demonstrate that with automatic re-initialization, the tracking error can be reduced from an average of 5-6 pixels to about 4 pixels; this result was obtained by using a large number of hand-labeled frames and similarity measurements to extract the contours, yielding improved performance.
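The mean sum of distances (MSD) error used for comparison here is commonly defined symmetrically: each point on one contour is matched to its nearest point on the other contour, and the distances are averaged over both directions. A short sketch of that common definition (the function name and array layout are assumptions, not taken from the paper):

```python
import numpy as np

def mean_sum_of_distances(u, v):
    """Symmetric mean sum of distances between two contours.

    u: (N, 2) array of (x, y) points on the first contour
    v: (M, 2) array of (x, y) points on the second contour
    For each point, take the Euclidean distance to the nearest point
    on the other contour; average over all points in both directions.
    """
    # Pairwise distance matrix via broadcasting: d[i, j] = |u_i - v_j|
    d = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(u) + len(v))
```

For two identical contours the metric is 0; for two single points at distance 5 it is 5, which matches the pixel-valued errors quoted in the abstract.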

Author and article information

Journal
Date: 19 February 2019
arXiv ID: 1902.06927
Record ID: a335e17e-a5df-4b65-85c3-e441cdc7630d
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata
Accepted by ICASSP 2019
Categories: cs.CV, cs.LG, cs.MM

Keywords: Computer vision & pattern recognition, Artificial intelligence, Graphics & multimedia design
