Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems in the past 15 years or so due to its high redundancy detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they declare the chunk cut-points by computing and judging the rolling hashes of the data stream byte by byte.

Data deduplication, an efficient approach to data reduction, has gained increasing attention and popularity in large-scale storage systems due to the explosive growth of digital data. It eliminates redundant data at the file or subfile level and identifies duplicate content by its cryptographically secure hash signature (i.e., collision-resistant fingerprint), which is shown to be much more computationally efficient than traditional compression approaches in large-scale storage systems. In this paper, we first review the background and key features of data deduplication, then summarize and classify the state-of-the-art research in data deduplication according to the key workflow of the data deduplication process. The summary and taxonomy of the state of the art on deduplication help identify and understand the most important design considerations for data deduplication systems. In addition, we discuss the main applications and industry trends of data deduplication, and provide a list of the publicly available sources for deduplication research and studies. Finally, we outline the open problems and future research directions facing deduplication-based storage systems.

Delta sync (synchronization) is a key bandwidth-saving technique for cloud storage services. The representative delta sync utility, rsync, matches data chunks by sliding a search window byte by byte to maximize redundancy detection for bandwidth efficiency. However, this process struggles to serve the forthcoming high-bandwidth cloud storage services, which require a lightweight delta sync that supports large files well. Moreover, rsync employs invariant chunking and compression methods during the sync process, so it cannot cater to services operating across varied network environments, which require the sync approach to perform well under different network conditions. Inspired by the Content-Defined Chunking (CDC) technique used in data deduplication, we propose NetSync, a network-adaptive and CDC-based lightweight delta sync approach with lower computing and protocol (metadata) overheads than the state-of-the-art delta sync approaches. The key idea of NetSync is (1) to simplify chunk matching by proposing a fast weak hash called FastFP that is piggybacked on the rolling hashes from CDC, and by redesigning the delta sync protocol to exploit deduplication locality and weak/strong hash properties; and (2) to minimize sync time by adaptively choosing chunking parameters and compression methods according to the current network conditions. NetSync can thus choose appropriate compression and chunking strategies for different network conditions. Our evaluation results, driven by both benchmark and real-world datasets, suggest that NetSync performs 2×-10× faster and supports 30%-80% more clients than the state-of-the-art rsync-based WebR2sync+ and deduplication-based approaches.
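To make the chunking-and-fingerprinting workflow described above concrete, here is a minimal sketch of Gear-style content-defined chunking followed by fingerprint-based deduplication. The gear table, the 13-bit mask (roughly 8 KB average chunks), and the size bounds are illustrative assumptions, not parameters taken from any of the papers above.

```python
# Minimal sketch: CDC via a Gear-style rolling hash, then dedup by
# SHA-256 fingerprint. All parameters below are illustrative assumptions.
import hashlib
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random values
MASK = (1 << 13) - 1                                 # ~8 KB average chunk size
MIN_SIZE, MAX_SIZE = 2048, 65536

def cdc_chunks(data: bytes):
    """Yield chunks whose boundaries depend only on local content."""
    start, fp = 0, 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF  # rolling update
        if i - start + 1 < MIN_SIZE:
            continue                                  # skip judging tiny chunks
        if (fp & MASK) == 0 or i - start + 1 >= MAX_SIZE:   # cut-point test
            yield data[start:i + 1]
            start, fp = i + 1, 0
    if start < len(data):
        yield data[start:]

def dedup(data: bytes):
    """Index chunks by SHA-256 fingerprint; store each unique content once."""
    store, recipe = {}, []
    for chunk in cdc_chunks(data):
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)
        recipe.append(fp)
    return store, recipe

if __name__ == "__main__":
    blob = bytes(random.getrandbits(8) for _ in range(200_000)) * 2  # half is duplicate
    store, recipe = dedup(blob)
    stored = sum(len(c) for c in store.values())
    print(f"{len(recipe)} chunks, {stored} bytes stored of {len(blob)} input")
```

The cut-point test depends only on the bytes inside the current chunk, which is why an insertion early in the stream does not shift every later chunk boundary the way fixed-size chunking would; it is also why the per-byte rolling-hash computation is the CPU hot spot the first abstract above complains about.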
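The network-adaptive side of NetSync can be illustrated in the same spirit: on a fast link, the CPU cost of chunking and compression dominates, so it pays to chunk larger and compress lightly; on a slow link, every byte on the wire matters. The thresholds and parameter choices below are illustrative assumptions, not NetSync's actual policy.

```python
# Minimal sketch of network-adaptive sync parameters; the thresholds and
# choices are assumptions for illustration, not NetSync's published policy.
import zlib

def choose_sync_params(bandwidth_mbps: float) -> dict:
    if bandwidth_mbps >= 100:   # high bandwidth: minimize CPU time
        return {"avg_chunk_bytes": 16384, "compress_level": 0}
    if bandwidth_mbps >= 10:    # moderate: balance CPU and bytes on the wire
        return {"avg_chunk_bytes": 8192, "compress_level": 1}
    return {"avg_chunk_bytes": 4096, "compress_level": 6}  # slow: save bytes

def encode_delta(literal_bytes: bytes, compress_level: int) -> bytes:
    """Compress the literal (unmatched) data only when the policy asks for it."""
    if compress_level == 0:
        return literal_bytes
    return zlib.compress(literal_bytes, compress_level)

if __name__ == "__main__":
    for bw in (500, 50, 2):
        p = choose_sync_params(bw)
        payload = encode_delta(b"example literal data" * 100, p["compress_level"])
        print(f"{bw} Mbps -> {p}, {len(payload)} bytes on wire")
```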
Delta synchronization (sync) is a key bandwidth-saving technique for cloud storage services. The representative delta sync utility, rsync, matches data chunks by sliding a search window byte by byte to maximize redundancy detection for bandwidth efficiency. This process, however, is hard to adapt to the demands of forthcoming high-bandwidth cloud storage services, which require a lightweight delta sync that supports large files well. Inspired by the Content-Defined Chunking (CDC) technique used in data deduplication, we propose Dsync, a CDC-based lightweight delta sync approach with substantially lower computation and protocol (metadata) overheads than the state-of-the-art delta sync approaches. The key idea of Dsync is to simplify chunk matching by (1) proposing a novel and fast weak hash called FastFp that is piggybacked on the rolling hashes from CDC and (2) redesigning the delta sync protocol to exploit deduplication locality and weak/strong hash properties. Our evaluation results, driven by both benchmark and real-world datasets, suggest that Dsync performs 2×-8× faster and supports 30%-50% more clients than the state-of-the-art rsync-based WebR2sync+ and deduplication-based approaches.
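As a rough illustration of the weak/strong-hash matching that Dsync's protocol builds on, the sketch below has the client send, per chunk, a weak fingerprint (standing in for a FastFp-like hash piggybacked on the CDC rolling hash) together with a strong hash, and the server request only the chunks it cannot confirm. FastFp's real construction and Dsync's wire format are not given in the abstract, so the hash folding and the message shapes here are assumptions.

```python
# Minimal sketch of one weak-filter/strong-confirm sync round. The weak
# hash below is a stand-in for a FastFp-like fingerprint, not the real one.
import hashlib

def weak_fp(chunk: bytes) -> int:
    # Cheap, collision-prone filter; in Dsync this work would piggyback on
    # the rolling hash already computed during CDC (assumption here).
    h = 0
    for b in chunk:
        h = ((h << 1) ^ (h >> 63) ^ (b * 0x9E3779B97F4A7C15)) & 0xFFFFFFFFFFFFFFFF
    return h

def strong_fp(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def client_request(chunks):
    # One (weak, strong) pair per chunk: one possible message shape.
    return [(weak_fp(c), strong_fp(c)) for c in chunks]

def server_missing(pairs, server_chunks):
    # Weak hash narrows candidates fast; strong hash confirms real matches.
    index = {}
    for c in server_chunks:
        index.setdefault(weak_fp(c), set()).add(strong_fp(c))
    return [i for i, (w, s) in enumerate(pairs)
            if w not in index or s not in index[w]]

if __name__ == "__main__":
    old = [b"alpha" * 1000, b"beta" * 1000, b"gamma" * 1000]
    new = [b"alpha" * 1000, b"BETA!" * 1000, b"gamma" * 1000]
    need = server_missing(client_request(new), old)
    print("server asks for chunk indices:", need)             # expect [1]
    print("bytes uploaded:", sum(len(new[i]) for i in need))  # only the changed chunk
```

Sending the strong hash alongside the weak one trades a little extra metadata for one fewer round trip; a protocol that also exploits deduplication locality, as Dsync's redesign does, can cut that metadata further.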