Exploratory Analysis of the News in Easy Language (NiEL) Corpus to Identify Characteristic Patterns for Natural Language Processing (Conference Paper)



  • While comprehensive corpora in various domains are available for resource-rich languages such as English and can be made usable for natural language processing (NLP) applications, this is not the case for resource-poor languages. For these, parallel or monolingual corpora must first be created and adequately processed before they can be used in NLP applications. In the past, selected variants of a standard language have increasingly been identified as resource-poor languages, and corresponding resources have been created. German Easy Language, a highly simplified variant of Standard German, can likewise be regarded as a resource-poor language, since hardly any NLP-suitable corpora are available for it. In this paper, we present the News in Easy Language (NiEL) corpus, a monolingual text resource for German Easy Language. Through exploratory analysis with selected NLP tools, characteristic patterns of Easy Language can be derived at both the word and the sentence level. These patterns can prospectively be compared with patterns from standard-language texts. Our results show that multiple NLP tools are suitable for German Easy Language as well as for Standard German. Features such as word variance, sentence depth, and average word and sentence length can be distinguished. The features extracted in this way are suitable for model development, and initial implications for the natural language processing of Easy Language can be derived from them. The results form an important basis for further research on Easy Language. Because Easy Language is a low-resource language that has so far been analyzed primarily by intellectual means, our work adds further value through the implications for the natural language processing of plain language that follow from the exploratory analysis of the corpus.
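To make the word-level and sentence-level features named in the abstract more concrete, the following is a minimal sketch of how average word length, average sentence length, and a simple lexical-variety measure could be computed. It is not the authors' pipeline: the function name `surface_features` is hypothetical, the regex-based tokenization is a deliberate simplification, and the type-token ratio is only an assumed stand-in for what the abstract calls word variance. Sentence depth is left out because it would require a syntactic parser.

```python
import re


def surface_features(text: str) -> dict:
    """Compute simple surface features for a text (illustrative sketch only)."""
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    # Words are runs of letters (umlauts included); digits and underscores are
    # excluded. Note: hyphenated compounds are split into their parts here.
    words = re.findall(r"[^\W\d_]+", text)
    n_words = len(words)

    # Distinct lowercased word forms, used for the type-token ratio
    # (assumed proxy for "word variance", not confirmed by the source).
    types = {w.lower() for w in words}

    return {
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(types) / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    sample = "Das ist gut. Das ist neu."
    for name, value in surface_features(sample).items():
        print(f"{name}: {value:.2f}")
```

Running the same function over an Easy Language text and a standard-language text would give a first, rough comparison of the two varieties. A real analysis would likely replace the regex tokenizer with a dedicated German NLP tool.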


