Erla Hallsteinsdóttir

A 500 million word POS-tagged Icelandic corpus

Research output: Contribution to journal/Conference contribution in journal/Contribution to newspaper; Conference article; Research; peer-review

  • Thomas Eckart, Leipzig University
  • Erla Hallsteinsdóttir
  • Sigrún Helgadóttir, Árni Magnússon Institute
  • Uwe Quasthoff, Leipzig University
  • Dirk Goldhahn, Leipzig University

The new POS-tagged Icelandic corpus of the Leipzig Corpora Collection is an extensive resource for the analysis of the Icelandic language. As it contains a large share of all Web documents hosted under the .is top-level domain, it is especially valuable for investigations of modern Icelandic and non-standard language varieties. The corpus is accessible via a dedicated web portal, and large shares are available for download. This paper focuses on the tagging process and on the evaluation of statistical properties such as word form frequencies and part-of-speech tag distributions. The latter are compared in particular with values from the Icelandic Frequency Dictionary (IFD) Corpus.

Original language: English
Journal: Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014
Pages (from-to): 2398-2402
Number of pages: 5
Publication status: Published - 1 Jan 2014
Event: 9th International Conference on Language Resources and Evaluation, LREC 2014 - Reykjavik, Iceland
Duration: 26 May 2014 to 31 May 2014

Conference

Conference: 9th International Conference on Language Resources and Evaluation, LREC 2014
Country: Iceland
City: Reykjavik
Period: 26/05/2014 to 31/05/2014
Sponsor: European Media Laboratory GmbH (EML), Holmes Semantic Solutions, IMMI, KDictionaries, VoiceBox Technologies

    Research areas

  • Corpus creation, Grammar and syntax, Part-of-speech tagging

ID: 174176959