Abstract
Data compression is critical to limiting the growth of required storage space and supporting infrastructure in an era of exponential data growth. However, compressing individual files, or even deduplicating data within individual servers, is not sufficient to address these challenges. This paper proposes generalized deduplication, a concept where similar data is systematically deduplicated by first transforming chunks of each file into two parts: a basis and a deviation. This increases the potential for compression, as more chunks can share a common basis that the system can deduplicate. The deviation is kept small and stored together with an identifier to its chunk, e.g., a hash of the chunk, in order to recover the original data without errors or distortions. This paper characterizes the performance of generalized deduplication using Golomb-Rice codes as a suitable data transform function to discover similarities across all files stored in the system. Considering different synthetic data distributions, we show in theory and simulations that generalized deduplication can result in compression factors of 300 (high compression), i.e., 300 times less storage space, and that this compression is achieved with 60,000 times fewer data chunks inserted into the system compared to classic deduplication (compression gains start earlier). Finally, we show that the table/registry used to recognize similar chunks is 10,000 times smaller for generalized deduplication than the table in classic deduplication techniques, which will result in less RAM usage in the storage system.
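The basis/deviation split described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: each chunk is a non-negative integer, the transform is a Rice-style split into high-order bits (basis) and `k` low-order bits (deviation), and each basis is identified by a SHA-256 hash. The names `transform`, `inverse`, and `GeneralizedDedupStore` are hypothetical, and the paper's actual transform may differ.

```python
import hashlib

K = 4  # assumed Rice-style parameter: number of low-order bits kept as the deviation


def transform(chunk: int, k: int = K) -> tuple:
    """Split a non-negative integer chunk into (basis, deviation)."""
    basis = chunk >> k                   # high-order bits, shared by similar chunks
    deviation = chunk & ((1 << k) - 1)   # k low-order bits, small and chunk-specific
    return basis, deviation


def inverse(basis: int, deviation: int, k: int = K) -> int:
    """Recombine a basis and a deviation into the original chunk (lossless)."""
    return (basis << k) | deviation


class GeneralizedDedupStore:
    """Toy store: each distinct basis is kept once; every chunk keeps its deviation."""

    def __init__(self, k: int = K):
        self.k = k
        self.bases = {}    # basis identifier -> basis (the deduplicated part)
        self.entries = []  # one (basis identifier, deviation) pair per inserted chunk

    def insert(self, chunk: int) -> None:
        basis, deviation = transform(chunk, self.k)
        # Assumption: a hash of the basis serves as its identifier.
        basis_id = hashlib.sha256(str(basis).encode()).hexdigest()
        self.bases.setdefault(basis_id, basis)
        self.entries.append((basis_id, deviation))

    def read(self, index: int) -> int:
        basis_id, deviation = self.entries[index]
        return inverse(self.bases[basis_id], deviation, self.k)


store = GeneralizedDedupStore(k=4)
chunks = [1000, 1003, 1007, 1001, 2048]
for c in chunks:
    store.insert(c)
```

In this toy run the five chunks are all distinct, so classic deduplication would find nothing to share, whereas the four values near 1000 map to the same basis and only two bases are stored. Reading every entry back returns the original chunks exactly.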
Original language | English |
---|---|
Title of host publication | Proceedings of the 2019 IEEE 8th International Conference on Cloud Networking, CloudNet 2019 |
Publisher | IEEE |
Publication date | 2019 |
Article number | 9064140 |
ISBN (Electronic) | 9781728148328 |
DOIs | |
Publication status | Published - 2019 |
Event | International Conference on Cloud Networking (conference number 8), 4 Nov 2019 → 6 Nov 2019 |
Conference
Conference | International Conference on Cloud Networking |
---|---|
Number | 8 |
Period | 04/11/2019 → 06/11/2019 |
Keywords
- data deduplication
- generalized deduplication
- geometric distribution
- Golomb-Rice
Title
Generalized Deduplication: Lossless Compression by Clustering Similar Data