
Erla Hallsteinsdóttir

High quality word lists as a resource for multiple purposes

Research output: Contribution to journal/Conference contribution in journal/Contribution to newspaper › Conference article › Research › peer-review

  • Uwe Quasthoff, Leipzig University
  • Dirk Goldhahn, Leipzig University
  • Thomas Eckart, Universität Leipzig
  • Erla Hallsteinsdóttir
  • Sabine Fiedler, Leipzig University

Since 2011, the comprehensive, electronically available sources of the Leipzig Corpora Collection have been used consistently for the compilation of high quality word lists. The underlying corpora include newspaper texts, Wikipedia articles and other randomly collected Web texts. For many of the languages featured in this collection, it is the first comprehensive compilation to use a large-scale empirical base. The word lists have been used to compile dictionaries with comparable frequency data in the Frequency Dictionaries series. This includes frequency data of up to 1,000,000 word forms presented in alphabetical order. This article provides an introductory description of the data and the methodological approach used. In addition, language-specific statistical information is provided with regard to letters, word structure and structural changes. Such high quality word lists also provide the opportunity to explore comparative linguistic topics and monolingual issues such as word formation and frequency-based examinations of lexical areas for use in dictionaries or language teaching. The results presented here can provide initial suggestions for subsequent work in several areas of research.

Original language: English
Journal: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Pages (from-to): 2816-2819
Number of pages: 4
Publication status: Published - 1 Jan 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: 26 May 2014 to 31 May 2014


Conference: 9th International Conference on Language Resources and Evaluation, LREC 2014
Sponsor: European Media Laboratory GmbH (EML), Holmes Semantic Solutions, IMMI, KDictionaries, VoiceBox Technologies

    Research areas

  • Corpora, Frequency data, Word lists


ID: 174177097