Sensitive Information Detection: Recursive Neural Networks for Encoding Context

Publication: Book/anthology/thesis/report › PhD thesis

Standard

Sensitive Information Detection: Recursive Neural Networks for Encoding Context. / Neerbek, Jan.

2020. 156 pp.

Publication: Book/anthology/thesis/report › PhD thesis


Bibtex

@phdthesis{b3519e58de9e4ffaadc6606f53a0f0c6,
title = "Sensitive Information Detection: Recursive Neural Networks for Encoding Context",
abstract = "The amount of data for processing and categorization grows at an ever increas-ing rate. At the same time the demand for collaboration and transparencyin organizations, government and businesses, drives the release of data frominternal repositories to the public or 3rd party domain. This in turn increasethe potential of sharing sensitive information. The leak of sensitive informa-tion can potentially be very costly, both financially for organizations, but alsofor individuals. In this work we address the important problem of sensitiveinformation detection. Specially we focus on detection in unstructured textdocuments.We show that simplistic, brittle rule sets for detecting sensitive informationonly find a small fraction of the actual sensitive information. Furthermore weshow that previous state-of-the-art approaches have been implicitly tailoredto such simplistic scenarios and thus fail to detect actual sensitive content.We develop a novel family of sensitive information detection approacheswhich only assumes access to labeled examples, rather than unrealistic as-sumptions such as access to a set of generating rules or descriptive topicalseed words. Our approaches are inspired by the current state-of-the-art forparaphrase detection and we adapt deep learning approaches over recursiveneural networks to the problem of sensitive information detection. We showthat our context-based approaches significantly outperforms the family of pre-vious state-of-the-art approaches for sensitive information detection, so-calledkeyword-based approaches, on real-world data and with human labeled exam-ples of sensitive and non-sensitive documents.A key challenge in the field of sensitive information detection is the lackof publicly available real-world datasets on which to train and/or benchmarkon. This is due to the inherent sensitive nature of the data in question.We address this issue in this work by releasing publicly labeled examples ofsensitive and non-sensitive content. We release a total of 8 different typesof sensitive information over 2 distinct sets of documents. We utilize effortsby human domain experts in labeling both datasets for 4 complex types ofinformational content for each set of documents. This release totals 750, 000labeled sentences",
author = "Jan Neerbek",
year = "2020",
language = "English",

}

RIS

TY - BOOK

T1 - Sensitive Information Detection: Recursive Neural Networks for Encoding Context

AU - Neerbek, Jan

PY - 2020

Y1 - 2020

N2 - The amount of data for processing and categorization grows at an ever increasing rate. At the same time, the demand for collaboration and transparency in organizations, government, and businesses drives the release of data from internal repositories to the public or third-party domain. This in turn increases the potential of sharing sensitive information. The leak of sensitive information can potentially be very costly, both financially for organizations and for individuals. In this work we address the important problem of sensitive information detection. Specifically, we focus on detection in unstructured text documents. We show that simplistic, brittle rule sets for detecting sensitive information only find a small fraction of the actual sensitive information. Furthermore, we show that previous state-of-the-art approaches have been implicitly tailored to such simplistic scenarios and thus fail to detect actual sensitive content. We develop a novel family of sensitive information detection approaches which only assumes access to labeled examples, rather than unrealistic assumptions such as access to a set of generating rules or descriptive topical seed words. Our approaches are inspired by the current state of the art for paraphrase detection, and we adapt deep learning approaches over recursive neural networks to the problem of sensitive information detection. We show that our context-based approaches significantly outperform the family of previous state-of-the-art approaches for sensitive information detection, so-called keyword-based approaches, on real-world data and with human-labeled examples of sensitive and non-sensitive documents. A key challenge in the field of sensitive information detection is the lack of publicly available real-world datasets on which to train and/or benchmark. This is due to the inherent sensitive nature of the data in question. We address this issue in this work by publicly releasing labeled examples of sensitive and non-sensitive content. We release a total of 8 different types of sensitive information over 2 distinct sets of documents. We utilize efforts by human domain experts in labeling both datasets for 4 complex types of informational content for each set of documents. This release totals 750,000 labeled sentences.

AB - The amount of data for processing and categorization grows at an ever increasing rate. At the same time, the demand for collaboration and transparency in organizations, government, and businesses drives the release of data from internal repositories to the public or third-party domain. This in turn increases the potential of sharing sensitive information. The leak of sensitive information can potentially be very costly, both financially for organizations and for individuals. In this work we address the important problem of sensitive information detection. Specifically, we focus on detection in unstructured text documents. We show that simplistic, brittle rule sets for detecting sensitive information only find a small fraction of the actual sensitive information. Furthermore, we show that previous state-of-the-art approaches have been implicitly tailored to such simplistic scenarios and thus fail to detect actual sensitive content. We develop a novel family of sensitive information detection approaches which only assumes access to labeled examples, rather than unrealistic assumptions such as access to a set of generating rules or descriptive topical seed words. Our approaches are inspired by the current state of the art for paraphrase detection, and we adapt deep learning approaches over recursive neural networks to the problem of sensitive information detection. We show that our context-based approaches significantly outperform the family of previous state-of-the-art approaches for sensitive information detection, so-called keyword-based approaches, on real-world data and with human-labeled examples of sensitive and non-sensitive documents. A key challenge in the field of sensitive information detection is the lack of publicly available real-world datasets on which to train and/or benchmark. This is due to the inherent sensitive nature of the data in question. We address this issue in this work by publicly releasing labeled examples of sensitive and non-sensitive content. We release a total of 8 different types of sensitive information over 2 distinct sets of documents. We utilize efforts by human domain experts in labeling both datasets for 4 complex types of informational content for each set of documents. This release totals 750,000 labeled sentences.

M3 - Ph.D. thesis

BT - Sensitive Information Detection: Recursive Neural Networks for Encoding Context

ER -
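
Note: the abstract above describes encoding sentence context with recursive neural networks and classifying sentences as sensitive or non-sensitive. The sketch below is a rough, general illustration of that family of models, not the thesis implementation: it composes word vectors bottom-up along a binary parse tree and scores the root with a logistic classifier. The dimensions, tanh composition, classifier, and toy vocabulary are all assumptions made for this example.

    # Illustrative sketch only (not the thesis code): a minimal recursive
    # neural network that builds a sentence vector bottom-up over a binary
    # parse tree and scores it as sensitive or non-sensitive.
    import numpy as np

    rng = np.random.default_rng(0)
    DIM = 50                                   # embedding size (assumed)
    W = rng.normal(0, 0.1, (DIM, 2 * DIM))     # composition weights (assumed shape)
    b = np.zeros(DIM)
    w_cls = rng.normal(0, 0.1, DIM)            # sensitive-vs-not classifier weights

    def encode(node, embeddings):
        """Return the vector for a parse-tree node.
        A leaf is a word string; an internal node is a (left, right) pair."""
        if isinstance(node, str):              # leaf: look up the word vector
            return embeddings[node]
        left, right = node
        child = np.concatenate([encode(left, embeddings), encode(right, embeddings)])
        return np.tanh(W @ child + b)          # compose children into a phrase vector

    def sensitive_score(tree, embeddings):
        """Probability that the sentence encoded by `tree` is sensitive."""
        root = encode(tree, embeddings)
        return 1.0 / (1.0 + np.exp(-w_cls @ root))

    # Toy usage with random word vectors and a hand-built parse tree.
    vocab = {w: rng.normal(0, 0.1, DIM) for w in ["the", "merger", "is", "confidential"]}
    tree = (("the", "merger"), ("is", "confidential"))
    print(sensitive_score(tree, vocab))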