Clustering by synchronization (SynC) is a clustering method that is motivated by the natural phenomenon of synchronization and is based on the Kuramoto model. The idea is to iteratively drag similar objects closer to each other until they have synchronized. SynC has been adapted to solve several well-known data mining tasks such as subspace clustering, hierarchical clustering, and streaming clustering, which shows that the SynC model is very versatile. Unfortunately, SynC has O(T * n^2 * d) complexity, which makes it impractical for larger datasets. For example, Chen et al. report runtimes of more than 10 hours for just n = 70,000 data points and reduce this to just over one hour by using R-trees in their method FSynC. Both are still impractical in real-life scenarios. Furthermore, SynC uses a termination criterion that offers no guarantee that the points have synchronized; it simply stops when most points are close to synchronizing.
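To make the iterative "dragging" concrete, the following is a minimal CPU sketch of a SynC-style loop, not the implementation evaluated in the paper. It assumes the commonly cited Kuramoto-inspired update, in which each coordinate of a point moves by the mean sine of its offsets to all points within a radius `eps`, and it stops once an averaged order parameter reaches a threshold `lam`. The function names, the threshold value, and the exact form of the order parameter are assumptions made for illustration. The all-pairs neighbor search is what produces the quadratic cost per iteration.

```python
import math

def neighbors(points, i, eps):
    """Indices of all points within Euclidean distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def sync_step(points, eps):
    """One synchronous update. Returns (new_points, order_parameter)."""
    new_points, r = [], 0.0
    for i, p in enumerate(points):
        nb = neighbors(points, i, eps)  # O(n * d) per point, so O(n^2 * d) per step
        # Kuramoto-style pull: shift each coordinate by the mean sine of the offsets.
        new_points.append(tuple(
            p[k] + sum(math.sin(points[j][k] - p[k]) for j in nb) / len(nb)
            for k in range(len(p))
        ))
        # Assumed order parameter: mean of exp(-distance) over the neighborhood,
        # measured on the positions before this update.
        r += sum(math.exp(-math.dist(p, points[j])) for j in nb) / len(nb)
    return new_points, r / len(points)

def sync_cluster(points, eps, lam=0.999, max_iter=100):
    """Iterate until the order parameter reaches lam (hypothetical threshold)."""
    points = [tuple(float(x) for x in p) for p in points]
    iterations = 0
    for iterations in range(1, max_iter + 1):
        points, r = sync_step(points, eps)
        if r >= lam:  # "close to synchronized", with no guarantee of exact synchronization
            break
    return points, iterations
```

Calling `sync_cluster(data, eps=0.5)` on a list of coordinate tuples returns the moved points and the number of iterations used; points that end up at (nearly) the same location can then be read off as one cluster. The threshold-based stop illustrates the weakness described above: the loop ends when the average is high enough, not when every point has provably synchronized.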
In this paper, we make several contributions. We propose a new termination criterion that guarantees that all points have synchronized. To achieve a much-needed reduction in runtime, we propose a strategy to summarize partitions of the data into a grid structure, a GPU-friendly grid structure to support this and neighborhood queries, and a GPU-parallelized algorithm for clustering by synchronization (EGG-SynC) that utilizes these ideas. Furthermore, we provide an extensive evaluation against the state of the art, showing a speedup of 2 to 3 orders of magnitude compared to SynC and FSynC.
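As a rough illustration of why a grid helps with neighborhood queries, the sketch below builds a generic uniform grid with cell width `eps` and answers a radius query by scanning only the cells adjacent to the query point. This is a textbook CPU technique written in Python, not the GPU-friendly structure or the summarization strategy proposed in the paper; the function names and the choice of cell width are assumptions made for this example.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(p, eps):
    """Integer cell coordinates of point p in a grid with cell width eps."""
    return tuple(math.floor(x / eps) for x in p)

def build_grid(points, eps):
    """Map each occupied cell to the indices of the points that fall inside it."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[cell_of(p, eps)].append(i)
    return grid

def grid_neighbors(points, grid, i, eps):
    """Indices within eps of points[i], checking only the 3^d surrounding cells."""
    p = points[i]
    cell = cell_of(p, eps)
    found = []
    for offset in product((-1, 0, 1), repeat=len(p)):
        key = tuple(c + o for c, o in zip(cell, offset))
        # Any point within eps differs by at most one cell per dimension.
        for j in grid.get(key, ()):
            if math.dist(p, points[j]) <= eps:
                found.append(j)
    return sorted(found)
```

With the grid built once per iteration, each query touches only nearby points instead of all n. The number of cells scanned grows as 3^d, so this plain variant is only attractive in low dimensions.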
Original language
English
Publication year
2023
Status
Published - 2023
Event
EDBT 2023: 26th International Conference on Extending Database Technology - Duration: 28 Mar 2023 → 31 Mar 2023
Conference
EDBT 2023: 26th International Conference on Extending Database Technology