Our world increasingly relies on information and communications technologies. This digitalization promises better decision making, productivity increases, improved resource management, greater quality of life, and much more. Realizing these promises requires colossal amounts of data, and these volumes expand rapidly with each advance in digital technology. As data volumes multiply, it becomes increasingly important to ensure that our digital infrastructure is scalable. In this regard, data compression is an enabling technique. Representing the data more concisely means that fewer bits need to be transmitted and stored, without impacting the utility of the data. Accordingly, greater volumes of data become manageable.
In this Ph.D. thesis, the generalized deduplication framework for data compression is presented. It extends the successful concept of deduplication by transforming the data before redundant parts are eliminated. This transformation provides an opportunity to eliminate a greater number of data chunks, thereby realizing a more concise data representation. With proper configuration, desirable properties such as scalability and local decodability may even be inherited from classic deduplication.
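To make the idea concrete, the following is a minimal sketch rather than the method developed in the thesis. It assumes a simple transformation that is not specified in this abstract: each fixed-size chunk is split into its high-order bits (a "base") and its low-order bits (a "deviation"). Only the bases are deduplicated, so chunks that are merely similar can still share storage. The class name `GDStore` and the parameters `chunk_bits` and `deviation_bits` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class GDStore:
    """Toy store that deduplicates chunk bases (illustrative assumption only)."""
    chunk_bits: int = 16       # hypothetical chunk size in bits
    deviation_bits: int = 4    # hypothetical number of low-order bits kept per chunk
    bases: list = field(default_factory=list)       # each unique base, stored once
    base_index: dict = field(default_factory=dict)  # base value -> position in `bases`
    records: list = field(default_factory=list)     # one (base id, deviation) per chunk

    def append(self, chunk: int) -> None:
        if not 0 <= chunk < (1 << self.chunk_bits):
            raise ValueError("chunk does not fit in chunk_bits")
        # Assumed transformation: high-order bits form the base,
        # low-order bits form the deviation.
        base = chunk >> self.deviation_bits
        deviation = chunk & ((1 << self.deviation_bits) - 1)
        # Deduplicate on the base only.
        base_id = self.base_index.get(base)
        if base_id is None:
            base_id = len(self.bases)
            self.bases.append(base)
            self.base_index[base] = base_id
        self.records.append((base_id, deviation))

    def get(self, i: int) -> int:
        # Local decoding: rebuild chunk i without touching any other chunk.
        base_id, deviation = self.records[i]
        return (self.bases[base_id] << self.deviation_bits) | deviation


# Five distinct readings, so classic deduplication would find no duplicates.
readings = [0x1230, 0x1234, 0x123F, 0x4560, 0x1231]
store = GDStore()
for reading in readings:
    store.append(reading)
```

In this toy input all five chunks are distinct, yet they map to only two bases, which is the kind of additional redundancy the transformation is meant to expose. Each chunk can also be recovered individually through `get`, mirroring the local decodability mentioned above.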
Generalized deduplication is a generic framework that offers many opportunities. Much of this thesis is dedicated to showing how Internet of Things applications may benefit from it. Importantly, it is shown that the compression achieved with generalized deduplication can be competitive with a wide range of state-of-the-art compressors, while providing additional advantages. Moreover, it is flexible and may be deployed throughout the Internet of Things ecosystem, from constrained battery-operated devices to powerful cloud storage servers.
Attention is also dedicated to the challenge of randomly accessing parts of compressed data without decoding the entire data set. Deduplication natively makes this possible, and this property carries over to the generalization, where any part of the compressed data can be decoded locally and efficiently. Finally, the impact of adding similar local decodability constraints to universal compression methods is investigated, and observations on how to do so properly are presented.