A real-world data resource of complex sensitive topics based on expert document labels of the Monsanto trial

Publication: Contribution to book/anthology/report/proceeding - Conference contribution in proceedings - Research - Peer-reviewed

Real-world datasets that contain complex sensitive information and provide high-quality ground-truth labels are scarce. In this work we describe a new dataset for research and evaluation studies in the NLP and ML communities. We base our contribution on documents released as part of an ongoing trial of the Monsanto company, which lawyers involved in the trial classified into complex topics. We process these documents to extract high-quality sentences, sentential phrases, and parse trees over these phrases. With this work we release four ready-to-process classification tasks drawn from real-world human interactions, featuring complex types of information such as technical, legal, and informal communication.

We characterize the complexity of the classification tasks by evaluating the performance of simple and complex state-of-the-art models and discussing how their results differ across the tasks. We show that the topics are challenging for traditional NLP approaches such as n-grams and inference systems. More complex models such as deep recurrent neural networks (LSTM) and recursive neural networks (TreeNN) capture more complex aspects of the data, but still leave open interesting and relevant real-world challenges for the research community. We suggest research directions that could further advance the state of the art using the data and code that we publish.
Title: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)
Number of pages: 10
Publisher: European Language Resources Association
Status: Published - 2020
Event: 12th Conference on Language Resources and Evaluation: LREC 2020 - Marseille, France
Duration: 11 May 2020 to 16 May 2020


Conference: 12th Conference on Language Resources and Evaluation


ID: 130411947