Deduplication of Textual Data by NLP Approaches

Kiana Ghassabi, Peyman Pahlevani, Daniel Enrique Lucani Rötter

Research output: Contribution to book/anthology/report/proceeding › Article in proceedings › Research › peer-review

1 Citation (Scopus)

Abstract

With the growing volume of digital data, data deduplication has become an increasingly popular method for reducing data in large-scale storage systems. Generalized deduplication is an alternative technique that reduces the cost of data storage by identifying similar data chunks. This paper proposes TL-GD, a method for improving cloud storage efficiency by applying generalized deduplication to textual datasets. The core idea is an efficient deduplication system that combines an alternative technique for splitting data into smaller pieces with a new approach for transforming those pieces into bases and deviations. The system's performance is validated on two real-world datasets and compared against state-of-the-art deduplication methods. The evaluation shows that TL-GD achieves nearly 67% lossless compression on textual navigation-instruction datasets, a 25% improvement on average over existing deduplication techniques.
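
The base/deviation idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the TL-GD method: the abstract does not specify the chunking or the transform, so the choices here (splitting on spaces, taking the ASCII-lowercased chunk as the base and the positions of uppercase letters as the deviation) and all function names are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Deviation = Tuple[int, ...]


def to_base_and_deviation(chunk: str) -> Tuple[str, Deviation]:
    # Assumed transform (not from the paper): the base is the chunk with
    # ASCII uppercase letters lowercased; the deviation records where they were.
    deviation = tuple(i for i, c in enumerate(chunk) if "A" <= c <= "Z")
    base = "".join(c.lower() if "A" <= c <= "Z" else c for c in chunk)
    return base, deviation


def from_base_and_deviation(base: str, deviation: Deviation) -> str:
    # Invert the transform: restore uppercase at the recorded positions.
    upper_at = set(deviation)
    return "".join(c.upper() if i in upper_at else c for i, c in enumerate(base))


def encode(text: str) -> Tuple[List[str], List[Tuple[int, Deviation]]]:
    # Store each distinct base once; each chunk keeps only a base id
    # plus its (typically small) deviation.
    base_ids: Dict[str, int] = {}
    bases: List[str] = []
    records: List[Tuple[int, Deviation]] = []
    for chunk in text.split(" "):  # assumed chunking: split on single spaces
        base, deviation = to_base_and_deviation(chunk)
        if base not in base_ids:
            base_ids[base] = len(bases)
            bases.append(base)
        records.append((base_ids[base], deviation))
    return bases, records


def decode(bases: List[str], records: List[Tuple[int, Deviation]]) -> str:
    # Lossless reconstruction from the deduplicated bases and the deviations.
    return " ".join(from_base_and_deviation(bases[i], d) for i, d in records)


# Made-up sample text in the spirit of navigation instructions.
text = "Turn left. turn LEFT. Turn right."
bases, records = encode(text)
restored = decode(bases, records)
```

In this sketch, "Turn", "turn" and "left.", "LEFT." are similar but not identical chunks; they share a base, so only the distinct bases are stored once while the deviations keep the reconstruction lossless.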
Original language: English
Title of host publication: 2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring)
Publisher: IEEE
Publication date: Aug 2023
ISBN (Electronic): 979-8-3503-1114-3, 979-8-3503-1115-0
Publication status: Published - Aug 2023
Series: IEEE VTS Vehicular Technology Conference. Proceedings
ISSN: 1550-2252

Keywords

  • CSP
  • Generalized Deduplication
  • Storage

  • Scale-IoT

    Lucani Rötter, D. E. (Participant)

    01/01/2018 → 31/12/2022

    Project: Research
