A real-world data resource of complex sensitive topics based on expert document labels of the Monsanto trial

Research output: Contribution to book/anthology/report/proceeding; Article in proceedings; Research; peer-review

There is a lack of real-world datasets that contain complex sensitive information and that provide high-quality ground truth labels. In this work we describe a new dataset for research and evaluation studies for the NLP and ML communities. We base our contribution on documents released as part of a current trial of the Monsanto company, which were classified into complex topics by lawyers involved in the trial. We process these documents to extract high-quality sentences, sentential phrases, and parse trees over these phrases. With this work we release four ready-to-process classification tasks from real-world human interactions, featuring complex types of information such as technical, legal and informal communication.

We characterize the complexity of the classification tasks by evaluating the performance of simple and complex state-of-the-art models and by discussing how their results differ across tasks. We show that the complexity of the topics is challenging for traditional NLP approaches such as n-grams and inference systems. More complex models such as deep recurrent neural networks (LSTM) and recursive neural networks (TreeNN) capture more complex aspects of the data, but also leave open interesting and relevant real-world challenges for the research community. We suggest some research directions that could further advance the state of the art using the data and code that we publish.
Original language: English
Title of host publication: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)
Number of pages: 10
Publisher: European Language Resources Association
Publication year: 2020
Pages: 1258–1267
Publication status: Published - 2020
Event: 12th Conference on Language Resources and Evaluation: LREC 2020 - Marseille, France
Duration: 11 May 2020 – 16 May 2020

Conference

Conference: 12th Conference on Language Resources and Evaluation
Country: France
City: Marseille
Period: 11/05/2020 – 16/05/2020

ID: 130411947