TY - GEN
T1 - EGG-SynC: Exact GPU-parallelized Grid-based Clustering by Synchronization
AU - Jørgensen, Jakob Rødsgaard
AU - Assent, Ira
PY - 2023
Y1 - 2023
N2 - Clustering by synchronization (SynC) is a clustering method that is motivated by the natural phenomena of synchronization and is based on the Kuramoto model. The idea is to iteratively drag similar objects closer to each other until they have synchronized. SynC has been adapted to solve several well-known data mining tasks such as subspace clustering, hierarchical clustering, and streaming clustering. This shows that the SynC model is very versatile. Sadly, SynC has an O(T × n2 × d) complexity, which makes it impractical for larger datasets. E.g., Chen et al. [8] show runtimes of more than 10 hours for just n = 70, 000 data points, but improve this to just above one hour by using R-Trees in their method FSynC. Both are still impractical in real-life scenarios. Furthermore, SynC uses a termination criterion that brings no guarantees that the points have synchronized but instead just stops when most points are close to synchronizing. In this paper, our contributions are manifold. We propose a new termination criterion that guarantees that all points have synchronized. To achieve a much-needed reduction in runtime, we propose a strategy to summarize partitions of the data into a grid structure, a GPU-friendly grid structure to support this and neighborhood queries, and a GPU-parallelized algorithm for clustering by synchronization (EGG-SynC) that utilize these ideas. Furthermore, we provide an extensive evaluation against state-of-the-art showing 2 to 3 orders of magnitude speedup compared to SynC and FSynC.
AB - Clustering by synchronization (SynC) is a clustering method that is motivated by the natural phenomena of synchronization and is based on the Kuramoto model. The idea is to iteratively drag similar objects closer to each other until they have synchronized. SynC has been adapted to solve several well-known data mining tasks such as subspace clustering, hierarchical clustering, and streaming clustering. This shows that the SynC model is very versatile. Sadly, SynC has an O(T × n2 × d) complexity, which makes it impractical for larger datasets. E.g., Chen et al. [8] show runtimes of more than 10 hours for just n = 70, 000 data points, but improve this to just above one hour by using R-Trees in their method FSynC. Both are still impractical in real-life scenarios. Furthermore, SynC uses a termination criterion that brings no guarantees that the points have synchronized but instead just stops when most points are close to synchronizing. In this paper, our contributions are manifold. We propose a new termination criterion that guarantees that all points have synchronized. To achieve a much-needed reduction in runtime, we propose a strategy to summarize partitions of the data into a grid structure, a GPU-friendly grid structure to support this and neighborhood queries, and a GPU-parallelized algorithm for clustering by synchronization (EGG-SynC) that utilize these ideas. Furthermore, we provide an extensive evaluation against state-of-the-art showing 2 to 3 orders of magnitude speedup compared to SynC and FSynC.
UR - http://www.scopus.com/inward/record.url?scp=85137571229&partnerID=8YFLogxK
U2 - 10.48786/edbt.2023.16
DO - 10.48786/edbt.2023.16
M3 - Article in proceedings
T3 - Advances in Database Technology
SP - 195
EP - 207
BT - Proceedings 26th International Conference on Extending Database Technology ( EDBT 2023 )
PB - openproceedings.org
T2 - EDBT 2023: 26th International Conference on Extending Database Technology
Y2 - 28 March 2023 through 31 March 2023
ER -