Publication: Book/anthology/thesis/report › Ph.D. thesis
Outlier Detection and Explanation for Domain Experts. / Micenková, Barbora.
Department of Computer Science, University of Aarhus, 2015. 100 p.
TY - BOOK
T1 - Outlier Detection and Explanation for Domain Experts
AU - Micenková, Barbora
PY - 2015
Y1 - 2015
N2 - In many data exploratory tasks, extraordinary and rarely occurring patterns called outliers are more interesting than the prevalent ones. For example, they could represent frauds in insurance, intrusions in network and system monitoring, or motion in video surveillance. Decades of research have produced various outlier detection algorithms. It is commonly known that these algorithms are difficult to apply and interpret in practice for a variety of reasons. In this thesis we propose novel algorithms that provide robust performance, support for validation and interpretability for outlier detection in practice and we empirically evaluate them on synthetic and real world data sets. First, we tackle the problem that most algorithms leave the end user without any explanation of how or why the identified outliers deviate. Such knowledge is important for domain experts in order to be able to validate the output of outlier detection algorithms and perhaps then take necessary actions. To this end we develop an algorithm that outputs an outlierness score and an accompanying explanation in the form of relevancy feature weights to each data point. We further present a general explanation technique that given a query point on input, outputs its outlier explanation in the form of the attribute subset where the point is the most separable from the other data. In the second part we address the problem that unsupervised outlier detection algorithms require a lot of user input for model selection which leads to poor overall performance. Furthermore, in many applications some labeled examples of outliers are available but not sufficient enough in number as training data for standard supervised learning methods. As such, this valuable information is typically ignored. We introduce a new paradigm for outlier detection where supervised and unsupervised information are combined to improve the performance while reducing the sensitivity to parameters of individual outlier detection algorithms. We do this by learning a new representation using the outliers from outputs of unsupervised outlier detectors as input to a supervised classifier. The resulting method is robust to parameters and as such it can be easily applied to data by non-experts in data mining. We also consider the case where computational resources at test time are limited and introduce a feature selection technique that respects a computational budget while retaining good predictive performance.
M3 - Ph.D. thesis
BT - Outlier Detection and Explanation for Domain Experts
PB - Department of Computer Science, University of Aarhus
ER -