
      Creation of annotated country-level dialectal Arabic resources: An unsupervised approach

      Natural Language Engineering
      Cambridge University Press (CUP)


          Abstract

          The wide usage of multiple spoken Arabic dialects on social networking sites has stimulated increasing interest in Natural Language Processing (NLP) for dialectal Arabic (DA). Arabic dialects represent true linguistic diversity and differ from Modern Standard Arabic (MSA). In fact, the complexity and variety of these dialects mean that a single NLP system cannot adequately serve all of them. In comparison with MSA, the available datasets for the various dialects are generally limited in size, genre and scope. In this article, we present a novel approach that automatically develops an annotated country-level dialectal Arabic corpus and builds lists of words that encompass 15 Arabic dialects. The algorithm uses an iterative procedure consisting of two main components: automatic creation of lists of dialectal words and automatic creation of an annotated Arabic dialect identification corpus. To our knowledge, our study is the first of its kind to examine and analyse the poor performance of the MSA part-of-speech tagger on dialectal Arabic content and to exploit that performance gap in order to extract the dialectal words. The pointwise mutual information association measure and the geographical frequency of word occurrence online are used to classify dialectal words. The annotated dialectal Arabic corpus (Twt15DA), built using our algorithm, is collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. We randomly selected a sample of 75 tweets per country, 1,125 tweets in total, and had native speakers perform a manual dialect identification task. The results show an average inter-annotator agreement score of 64%, which reflects satisfactory agreement considering the overlapping features of the 15 Arabic dialects.
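          To make the pointwise mutual information (PMI) step mentioned in the abstract more concrete, the following is a minimal sketch of scoring word-country association from token counts. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`pmi_by_country`, `best_country`), the placeholder tokens and the two country codes are all hypothetical, and the geographical-frequency signal and iterative procedure described in the abstract are not modelled here.

```python
import math
from collections import Counter


def pmi_by_country(docs):
    """Return {(word, country): PMI in bits} from (country, tokens) pairs.

    PMI(w, c) = log2( p(w, c) / (p(w) * p(c)) ), estimated from token counts.
    """
    pair_counts = Counter()
    word_counts = Counter()
    country_counts = Counter()
    total = 0
    for country, tokens in docs:
        for tok in tokens:
            pair_counts[(tok, country)] += 1
            word_counts[tok] += 1
            country_counts[country] += 1
            total += 1

    scores = {}
    for (tok, country), n in pair_counts.items():
        p_joint = n / total
        p_word = word_counts[tok] / total
        p_country = country_counts[country] / total
        scores[(tok, country)] = math.log2(p_joint / (p_word * p_country))
    return scores


def best_country(scores, word):
    """Pick the country with the highest PMI for a word, or None if unseen."""
    candidates = {c: s for (w, c), s in scores.items() if w == word}
    return max(candidates, key=candidates.get) if candidates else None


# Toy data (hypothetical): placeholder tokens stand in for real tweet tokens.
toy_docs = [
    ("EG", ["word_a", "shared", "word_a"]),
    ("EG", ["shared", "word_a"]),
    ("IQ", ["word_b", "shared"]),
    ("IQ", ["word_b", "word_b", "shared"]),
]
scores = pmi_by_country(toy_docs)
```

          In this toy data, a token that appears in only one country's tweets receives a positive PMI for that country, while a token spread evenly across both countries scores near zero, which is the intuition behind using PMI to separate dialect-specific words from shared vocabulary.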


                Author and article information

                Journal: Natural Language Engineering (Nat. Lang. Eng.)
                Publisher: Cambridge University Press (CUP)
                ISSN: 1351-3249, 1469-8110
                September 2022
                August 09 2021
                September 2022
                Volume: 28
                Issue: 5
                Pages: 607-648
                Article
                DOI: 10.1017/S135132492100019X
                © 2022

                https://www.cambridge.org/core/terms
