A lower bound for jumbled indexing

Publikation: Bidrag til bog/antologi/rapport/proceedingKonferencebidrag i proceedingsForskningpeer review


In this paper we study lower bounds for a variant of jumbled-indexing problem: given an input string S on an alphabet Σ = {σ1, . . ., σλ}, store it in a data structure such that given frequencies f1, · · ·, fλ, one can find all the substrings S0 of S where the frequency of the character σi is fi, for 1 ≤ i ≤ λ. This is a very interesting and challenging text indexing problem and it has received significant attention lately, both in the string-indexing community as well as in the theory community. It is known that when λ = 2, one can use a linear-size data structure to decide whether there exists a substring that matches the query in constant time [10]. It is also known [1] that for constant λ ≥ 3, then there exists a constant αλ such that it is not possible to achieve both O(n2−αλ) preprocessing time and O(n1−αλ) query time. We study a variant of the problem where the goal is to report all the substrings of S that match a given jumbled-indexing query. Assuming the data structure operates in the pointer machine model, we prove unconditional space lower bounds that also apply to binary alphabets: we show that if the data structure has the query time of O(n0.5−o(1) + k), where k is the output size of the query, then it must consume Ω(n2−o(1)) space. The o(1) term in our lower bound is at most log(10)1n, where log(·) notation denotes the iterative log function (i.e., log(10) n = log log log log log log log log log log n). This result has a number of interesting consequences. First, it shows that reporting all the matches is significantly harder than deciding whether a match exists, at least for the binary alphabet. This was surprising for us since we believed that in order for jumbled-indexing to be difficult, the alphabet size must be at least 3. Second, from a technical point of view, we make connections between this problem and the area of additive combinatorics as well as random walks. Previously, Chan and Lewenstein [6] had shown the connections between this problem and the area of additive combinatorics from an upper bound point of view, however, we use fundamental theorems from additive combinatorics but these tools are quite different from the theorems used in [6]. In our opinion, the fact that additive combinatorics appears in our lower bound approach as well as in the upper bound approach of Chan and Lewenstein [6] is noteworthy and demands further investigation. We believe our results open up a few new venues for research. For example, we believe it is interesting to study whether it is possible to build a data structure that uses O(n) space s.t., it can report all the matches to a jumble-indexing query in O(n2/3 + k) time; this is even interesting for a binary alphabet.

TitelProceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms
RedaktørerShuchi Chawla
Antal sider15
ForlagAssociation for Computing Machinery
ISBN (Elektronisk)9781611975994
StatusUdgivet - 2020
Begivenhed31st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2020 - Salt Lake City, USA
Varighed: 5 jan. 20208 jan. 2020


Konference31st Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2020
BySalt Lake City
SponsorACM Special Interest Group on Algorithms and Computation Theory (SIGACT), SIAM Activity Group on Discrete Mathematics

Se relationer på Aarhus Universitet Citationsformater

ID: 187165154