A real-world data resource of complex sensitive topics based on expert document labels of the Monsanto trial

Research output: Contribution to book/anthology/report/proceeding › Article in proceedings › Research › peer-review

  • Jan Neerbek
  • Morten Eskildsen, Datalogisk Institut, Aarhus Universitet, Denmark
  • Peter Dolog, Aalborg Universitet, Institut for Datalogi, Denmark
  • Ira Assent
There is a lack of real-world datasets that both contain complex sensitive information and provide high-quality ground-truth labels. In this work we describe a new dataset for research and evaluation studies for the NLP and ML communities. We base our contribution on documents released as part of a current trial of the Monsanto company, which were classified into complex topics by lawyers involved in the trial. We process these documents to extract high-quality sentences, sentential phrases, and parse trees over these phrases. With this work we release four ready-to-process classification tasks from real-world human interactions, featuring complex types of information such as technical, legal and informal communication.

We characterize the complexity of the classification tasks by evaluating the performance of simple and complex state-of-the-art models and discussing how they differ across the tasks. We show that the complexity of the topics is challenging for traditional NLP approaches such as n-grams and inference systems. More complex models such as deep recurrent neural networks (LSTM) and recursive neural networks (TreeNN) capture more complex aspects of the data, but also leave open interesting and relevant real-world challenges for the research community. We suggest some research directions that could further advance the state of the art using the data and code that we publish.
Original language: English
Title of host publication: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: EMNLP '18
Publication status: Submitted - 23 Jan 2019

ID: 130411947