TY - JOUR
T1 - Incremental Density-based Clustering on Multicore Processors
AU - Thai Son, Mai
AU - Jacobsen, Jon
AU - Amer-Yahia, Sihem
AU - Spence, Ivor
AU - Tran, Phuong
AU - Assent, Ira
AU - Viet Hung Nguyen, Quoc
PY - 2022/3
Y1 - 2022/3
N2 - The density-based clustering algorithm is a fundamental data clustering technique with many real-world applications. However, when the database is frequently changed, how to effectively update clustering results rather than reclustering from scratch remains a challenging task. In this work, we introduce IncAnyDBC, a unique parallel incremental data clustering approach to deal with this problem. First, IncAnyDBC can process changes in bulks rather than batches like state-of-the-art methods for reducing update overheads. Second, it keeps an underlying cluster structure called the object node graph during the clustering process and uses it as a basis for incrementally updating clusters wrt. inserted or deleted objects in the database by propagating changes around affected nodes only. In additional, IncAnyDBC actively and iteratively examines the graph and chooses only a small set of most meaningful objects to produce exact clustering results of DBSCAN or to approximate results under arbitrary time constraints. This makes it more efficient than other existing methods. Third, by processing objects in blocks, IncAnyDBC can be efficiently parallelized on multicore CPUs, thus creating a work-efficient method. It runs much faster than existing techniques using one thread while still scaling well with multiple threads. Experiments are conducted on various large real datasets for demonstrating the performance of IncAnyDBC.
AB - The density-based clustering algorithm is a fundamental data clustering technique with many real-world applications. However, when the database is frequently changed, how to effectively update clustering results rather than reclustering from scratch remains a challenging task. In this work, we introduce IncAnyDBC, a unique parallel incremental data clustering approach to deal with this problem. First, IncAnyDBC can process changes in bulks rather than batches like state-of-the-art methods for reducing update overheads. Second, it keeps an underlying cluster structure called the object node graph during the clustering process and uses it as a basis for incrementally updating clusters wrt. inserted or deleted objects in the database by propagating changes around affected nodes only. In additional, IncAnyDBC actively and iteratively examines the graph and chooses only a small set of most meaningful objects to produce exact clustering results of DBSCAN or to approximate results under arbitrary time constraints. This makes it more efficient than other existing methods. Third, by processing objects in blocks, IncAnyDBC can be efficiently parallelized on multicore CPUs, thus creating a work-efficient method. It runs much faster than existing techniques using one thread while still scaling well with multiple threads. Experiments are conducted on various large real datasets for demonstrating the performance of IncAnyDBC.
KW - Density-based clustering
KW - active clustering
KW - anytime clustering
KW - incremental clustering
KW - multicore CPUs
UR - http://www.scopus.com/inward/record.url?scp=85111278133&partnerID=8YFLogxK
U2 - 10.1109/TPAMI.2020.3023125
DO - 10.1109/TPAMI.2020.3023125
M3 - Journal article
C2 - 32915725
SN - 0162-8828
VL - 44
SP - 1338
EP - 1356
JO - I E E E Transactions on Pattern Analysis and Machine Intelligence
JF - I E E E Transactions on Pattern Analysis and Machine Intelligence
IS - 3
ER -