Aarhus University Seal

Towards optimal lower bounds for k-median and k-means coresets

Research output: Contribution to book/anthology/report/proceedingArticle in proceedingsResearchpeer-review

DOI

The (k,z)-clustering problem consists of finding a set of k points called centers, such that the sum of distances raised to the power of z of every data point to its closest center is minimized. Among the most commonly encountered special cases are k-median problem (z=1) and k-means problem (z=2). The k-median and k-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative (1± ϵ) factor, hence reducing from large to small scale the input to the problem. While there has been an intensive effort to understand what is the best coreset size possible for both problems in various metric spaces, there is still a significant gap between the state-of-the-art upper and lower bounds. In this paper, we make progress on both upper and lower bounds, obtaining tight bounds for several cases, namely: •In finite n point general metrics, any coreset must consist of ω(k logn / ϵ2) points. This improves on the ω(k logn /ϵ) lower bound of Braverman, Jiang, Krauthgamer, and Wu [ICML'19] and matches the upper bounds proposed for k-median by Feldman and Langberg [STOC'11] and k-means by Cohen-Addad, Saulpic and Schwiegelshohn [STOC'21] up to polylog factors. •For doubling metrics with doubling constant D, any coreset must consist of ω(k D/ϵ2) points. This matches the k-median and k-means upper bounds by Cohen-Addad, Saulpic, and Schwiegelshohn [STOC'21] up to polylog factors. •In d-dimensional Euclidean space, any coreset for (k,z) clustering requires ω(k/ϵ2) points. This improves on the ω(k/ϵ) lower bound of Baker, Braverman, Huang, Jiang, Krauthgamer, and Wu [ICML'20] for k-median and complements the ω(kmin(d,2z/20)) lower bound of Huang and Vishnoi [STOC'20]. We complement our lower bound for d-dimensional Euclidean space with the construction of a coreset of size Õ(k/ϵ2· min(ϵ-z,k)). This improves over the Õ(k2 ϵ-4) upper bound for general power of z proposed by Braverman Jiang, Krauthgamer, and Wu [SODA'21] and over the Õ(k/ϵ4) upper bound for k-median by Huang and Vishnoi [STOC'20]. In fact, ours is the first construction breaking through the ϵ-2· min(d,ϵ-2) barrier inherent in all previous coreset constructions. To do this, we employ a novel chaining based analysis that may be of independent interest. Together our upper and lower bounds for k-median in Euclidean spaces are tight up to a factor O(ϵ-1 polylog k/ϵ).

Original languageEnglish
Title of host publicationSTOC 2022 - Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing
EditorsStefano Leonardi, Anupam Gupta
Number of pages14
PublisherAssociation for Computing Machinery
Publication yearSept 2022
Pages1038-1051
ISBN (Electronic)9781450392648
DOIs
Publication statusPublished - Sept 2022
Event54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022 - Rome, Italy
Duration: 20 Jun 202224 Jun 2022

Conference

Conference54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022
LandItaly
ByRome
Periode20/06/202224/06/2022
SponsorACM Special Interest Group on Algorithms and Computation Theory (SIGACT), Algorand, Amazon, Apple, EATCS, et al

    Research areas

  • Clustering, coreset, k-means, k-median, lower-bound, sketch

See relations at Aarhus University Citationformats

ID: 281259975