A real-world data resource of complex sensitive sentences based on documents from the Monsanto trial

Research output: Contribution to book/anthology/report/proceeding; Article in proceedings; Research; peer-review

In this work we present a corpus for evaluating sensitive information detection approaches, addressing the need for real-world sensitive information in empirical studies. Our sentence corpus contains different notions of complex sensitive information that correspond to different aspects of concern in a current trial of the Monsanto company. This paper describes the annotation process, in which we both employ human annotators and create automatically inferred labels regarding technical, legal and informal communication within and with employees of Monsanto, drawing on a classification of documents by lawyers involved in the Monsanto court case. We release a corpus of high-quality sentences and parse trees with these two types of labels at the sentence level. We characterize the sensitive information via several representative sensitive information detection models, in particular both keyword-based (n-gram) approaches and recent deep learning models, namely recurrent neural networks (LSTM) and recursive neural networks (RecNN). Data and code are made publicly available.
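
As a rough illustration of what a keyword-based (n-gram) detector of the kind mentioned in the abstract could look like, the sketch below collects n-grams that recur in sensitive training sentences and never appear in non-sensitive ones, then flags any sentence containing one of them. This is not the authors' released code: the function names, the `min_count` threshold, whitespace tokenization, and the toy sentences are all assumptions made for the example, and none of the sentences come from the corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_keywords(sentences, labels, n=2, min_count=2):
    """Pick n-grams seen in at least `min_count` sensitive sentences
    and in no non-sensitive sentence (a deliberately simple heuristic)."""
    pos, neg = Counter(), Counter()
    for sentence, label in zip(sentences, labels):
        # Count each n-gram once per sentence (document frequency).
        grams = set(ngrams(sentence.lower().split(), n))
        (pos if label else neg).update(grams)
    return {g for g, c in pos.items() if c >= min_count and neg[g] == 0}


def is_sensitive(sentence, keywords, n=2):
    """Flag a sentence if it contains any keyword n-gram."""
    return any(g in keywords for g in ngrams(sentence.lower().split(), n))


# Hypothetical toy data, invented for this sketch.
train_sentences = [
    "please keep this study internal for now",
    "do not share this study internal draft",
    "the meeting is moved to friday",
    "please share the agenda for friday",
]
train_labels = [True, True, False, False]

keywords = extract_keywords(train_sentences, train_labels)
flag_a = is_sensitive("this study internal memo", keywords)
flag_b = is_sensitive("the agenda for friday", keywords)
```

A detector like this only captures surface patterns, which is one reason the abstract contrasts it with LSTM and RecNN models that can use sentence structure.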

Original language: English
Title of host publication: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Number of pages: 10
Publisher: European Language Resources Association
Publication year: Jan 2020
ISBN (Electronic): 9791095546344
Publication status: Published - Jan 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 11 May 2020 to 16 May 2020


Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Sponsor: Amazon AWS, Bertin, Lenovo, Ontotex, Vecsys, Vocapia

    Research areas

  • Corpus (Creation, Annotation, etc.), Document Classification, Statistical and Machine Learning Methods, Text categorisation

ID: 201989767