Aarhus University Seal

A lower bound for jumbled indexing

Research output: Contribution to book/anthology/report/proceedingArticle in proceedingsResearchpeer-review

  • Peyman Afshani
  • Ingo van Duijn, Aalborg University
  • ,
  • Rasmus Killmann
  • ,
  • Jesper Sindahl Nielsen

In this paper we study lower bounds for a variant of jumbled-indexing problem: given an input string S on an alphabet Σ = {σ1, . . ., σλ}, store it in a data structure such that given frequencies f1, · · ·, fλ, one can find all the substrings S0 of S where the frequency of the character σi is fi, for 1 ≤ i ≤ λ. This is a very interesting and challenging text indexing problem and it has received significant attention lately, both in the string-indexing community as well as in the theory community. It is known that when λ = 2, one can use a linear-size data structure to decide whether there exists a substring that matches the query in constant time [10]. It is also known [1] that for constant λ ≥ 3, then there exists a constant αλ such that it is not possible to achieve both O(n2−αλ) preprocessing time and O(n1−αλ) query time. We study a variant of the problem where the goal is to report all the substrings of S that match a given jumbled-indexing query. Assuming the data structure operates in the pointer machine model, we prove unconditional space lower bounds that also apply to binary alphabets: we show that if the data structure has the query time of O(n0.5−o(1) + k), where k is the output size of the query, then it must consume Ω(n2−o(1)) space. The o(1) term in our lower bound is at most log(10)1n, where log(·) notation denotes the iterative log function (i.e., log(10) n = log log log log log log log log log log n). This result has a number of interesting consequences. First, it shows that reporting all the matches is significantly harder than deciding whether a match exists, at least for the binary alphabet. This was surprising for us since we believed that in order for jumbled-indexing to be difficult, the alphabet size must be at least 3. Second, from a technical point of view, we make connections between this problem and the area of additive combinatorics as well as random walks. Previously, Chan and Lewenstein [6] had shown the connections between this problem and the area of additive combinatorics from an upper bound point of view, however, we use fundamental theorems from additive combinatorics but these tools are quite different from the theorems used in [6]. In our opinion, the fact that additive combinatorics appears in our lower bound approach as well as in the upper bound approach of Chan and Lewenstein [6] is noteworthy and demands further investigation. We believe our results open up a few new venues for research. For example, we believe it is interesting to study whether it is possible to build a data structure that uses O(n) space s.t., it can report all the matches to a jumble-indexing query in O(n2/3 + k) time; this is even interesting for a binary alphabet.

Original languageEnglish
Title of host publicationProceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms
EditorsShuchi Chawla
Number of pages15
PublisherAssociation for Computing Machinery
Publication year2020
Pages592-606
ISBN (Electronic)9781611975994
DOIs
Publication statusPublished - 2020
Event31st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2020 - Salt Lake City, United States
Duration: 5 Jan 20208 Jan 2020

Conference

Conference31st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2020
LandUnited States
BySalt Lake City
Periode05/01/202008/01/2020
SponsorACM Special Interest Group on Algorithms and Computation Theory (SIGACT), SIAM Activity Group on Discrete Mathematics

See relations at Aarhus University Citationformats

ID: 187165154