Heavy Hitters via Cluster-Preserving Clustering

Research output: Contribution to journal/Conference contribution in journal/Contribution to newspaperJournal articleResearchpeer-review

DOI

  • Kasper Green Larsen
  • Jelani Nelson, Harvard Univ, Harvard University, Sch Med
  • ,
  • Huy L Nguyen, Northeastern Univ, Northeastern University, Ctr Marine Sci, Sch Publ Policy & Urban Affairs
  • ,
  • Mikkel Thorup, Københavns Universitet, Denmark

We develop a new algorithm for the turnstile heavy hitters problem in general turnstile streams, the EXPANDERSKETCH, which finds the approximate top-k items in a universe of size n using the same asymptotic O(k log n)words of memory and O(log n) update time as the COUNTMIN and COUNTSKETCH, but requiring only O(k poly(log n)) time to answer queries instead of the O(n log n) time of the other two. The notion of "approximation" is the same l(2) sense as the COUNTSKETCH, which given known lower bounds is the strongest guarantee one can achieve in sublinear memory.

Our main innovation is an efficient reduction from the heavy hitters problem to a clustering problem in which each heavy hitter is encoded as some form of noisy spectral cluster in a graph, and the goal is to identify every cluster. Since every heavy hitter must be found, correctness requires that every cluster be found. We thus need a "cluster-preserving clustering" algorithm that partitions the graph into pieces while finding every cluster. To do this we first apply standard spectral graph partitioning, and then we use some novel local search techniques to modify the cuts obtained so as to make sure that the original clusters are sufficiently preserved. Our clustering algorithm may be of broader interest beyond heavy hitters and streaming algorithms.

Original languageEnglish
JournalCommunications of the ACM
Volume62
Issue8
Pages (from-to)95-100
Number of pages6
ISSN0001-0782
DOIs
Publication statusPublished - Aug 2019

    Research areas

  • FINDING FREQUENT, DATA STREAM

See relations at Aarhus University Citationformats

ID: 161790678