An Empirical Analysis of Synthetic-Data-Based Anomaly Detection

Majlinda Llugiqi, Rudolf Mayer

Publication: Chapter in book/Conference proceeding › Contribution to conference proceedings

Abstract

Data is increasingly collected on practically every area of human life, from health care to financial and work-related aspects, and from many different sources. As the amount of data gathered grows, efforts to leverage it have intensified. Many organizations are interested in analysing or sharing the data they collect, as it may be used to provide critical services and support much-needed research. However, this often conflicts with data protection regulations. Thus, there is a need to share, analyse and work with such sensitive data while preserving the privacy of the individuals represented in it. Synthetic data generation is one method increasingly used to achieve this goal. Synthetic data would also be useful for anomaly detection tasks, which often involve highly sensitive data.
While synthetic data generation aims to capture the most relevant statistical properties of a dataset in order to create a dataset with similar characteristics, it is less explored whether this method can also capture the properties of anomalous data, which is generally a minority class with potentially very few samples, and can thus reproduce meaningful anomaly instances. In this paper, we perform an extensive study of several anomaly detection techniques (supervised, unsupervised and semi-supervised) on credit card fraud and medical (annthyroid) data, and evaluate the utility of corresponding synthetically generated datasets obtained by various synthetisation methods. Moreover, for supervised methods, we also investigate various sampling methods; sampling on average improves the results, and we show that this also transfers to detectors learned on synthetic data. Overall, our evaluation shows that models trained on synthetic data can achieve performance that renders them a viable alternative to models trained on real data, sometimes even outperforming them. Based on the evaluation, we provide guidelines on which synthesizer to use for which anomaly detection setting.
Original language: English
Title of host publication: International Cross-Domain Conference for Machine Learning and Knowledge Extraction
Place of Publication: Cham
Publisher: Springer Cham
Pages: 306-327
Number of pages: 22
ISBN (Electronic): 978-3-031-14463-9
ISBN (Print): 978-3-031-14462-2
Publication status: Published - Aug 2022
Externally published: Yes

Publication series

Series: Lecture Notes in Computer Science
Number: 13480
ISSN: 0302-9743
