Fumbling in Babel: An Investigation into ChatGPT's Language Identification Ability

Abstract

Recently, ChatGPT has emerged as a powerful NLP tool that can carry out a variety of tasks. However, the range of languages ChatGPT can handle remains largely a mystery. In this work, we investigate ChatGPT's language identification abilities. For this purpose, we compile Babel-670, a benchmark comprising 670 languages representing 23 language families. Languages in Babel-670 run the gamut from the very high-resource to the very low-resource and are spoken on five continents. We then study the ability of ChatGPT (both GPT-3.5 and GPT-4) to (i) identify both language names and language codes, (ii) under both zero- and few-shot conditions, and (iii) with and without provision of a label set. When compared to smaller finetuned language identification tools, we find that ChatGPT lags behind. Our empirical analysis shows that ChatGPT still needs improvement before it can sufficiently serve diverse communities.

Author and article information

Publication date: 16 November 2023 (created)

Article

arXiv ID: 2311.09696

SO-VID: 767f92a0-a002-4447-83c5-961cb28495ce

License:

http://creativecommons.org/licenses/by/4.0/

Custom metadata

Comments: 15 pages, 5 figures

Categories: cs.CL

ScienceOpen disciplines: Theoretical computer science
