The Danish Gigaword Project

Leon Strømberg-Derczynski, Rebekah Baglini, Morten H. Christiansen, Manuel R. Ciosici, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Malte Lau Petersen, Jonathan Hvithamar Rystrøm, Daniel Varab

Publication: Working paper/Preprint; Working paper; Research


Abstract

Danish is a North Germanic/Scandinavian language spoken primarily in Denmark, a country with a tradition of technological and scientific innovation. However, from a technological perspective, the Danish language has received relatively little attention and, as a result, Danish language technology is hard to develop, in part due to a lack of large or broad-coverage Danish corpora. This paper describes the Danish Gigaword project, which aims to construct a freely available one-billion-word corpus of Danish text that represents the breadth of the written language.

Original language: English
Publisher: arXiv
Number of pages: 6
Status: Published - May 2020

Keywords

  • cs.CL
