InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization

Published: 11 January 2023

Abstract

Inline deduplication removes redundant data in real time as data is sent to the storage system. However, it causes data fragmentation: after deduplication, logically consecutive chunks are physically scattered across various containers. Many rewrite algorithms aim to alleviate the performance degradation caused by fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms decide whether a chunk is fragmented using a simple preset fixed value, ignoring the variance in data characteristics between data segments. As a result, they often fail to select an appropriate set of old containers for rewriting, so the containers retrieved during a restore hold a substantial number of invalid chunks.
To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicate chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewriting. We design a rewrite algorithm, F-greedy, that detects valid container utilization and rewrites low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving the restore speed. To take full advantage of these features, we further propose a second rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting trends in VCRC changes in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.
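
The idea of keeping references only to well-utilized old containers within a segment can be sketched as follows. This is a minimal illustration and not the paper's F-greedy algorithm: the function name `select_containers`, the input layout, and the selection rule (keep containers whose per-segment reference count is at least the segment's mean count) are all assumptions made for the example, with the per-segment reference count standing in for VCRC.

```python
from collections import Counter

def select_containers(segment):
    """Pick which old containers a segment keeps referencing.

    `segment` is a list with one entry per chunk: the ID of the old
    container holding a duplicate chunk, or None for a unique chunk.
    Returns (kept_container_ids, indices_of_chunks_to_rewrite).

    Hypothetical rule (not from the paper): keep a container only if
    its reference count in this segment is at least the mean count,
    so the cut-off adapts to each segment instead of being fixed.
    """
    # Count how many chunks of this segment each old container serves.
    counts = Counter(cid for cid in segment if cid is not None)
    if not counts:
        return set(), []

    mean = sum(counts.values()) / len(counts)
    kept = {cid for cid, n in counts.items() if n >= mean}

    # Duplicate chunks that live in dropped containers get rewritten
    # as if they were unique chunks.
    rewrite = [i for i, cid in enumerate(segment)
               if cid is not None and cid not in kept]
    return kept, rewrite

kept, rewrite = select_containers([1, 1, 1, 1, 2, 2, 2, 3, None, 4])
print(kept, rewrite)
```

In this example, containers 1 and 2 serve four and three chunks respectively, while containers 3 and 4 serve one chunk each, so only the chunks at positions 7 and 9 are marked for rewriting. A fixed threshold would treat every segment the same; the mean-based cut-off is one simple way to let the cut-off follow the segment's own distribution.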

Published In

ACM Transactions on Storage, Volume 19, Issue 1
February 2023
259 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3578369

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 January 2023
Online AM: 19 November 2022
Accepted: 25 July 2022
Revised: 08 July 2022
Received: 11 January 2022
Published in TOS Volume 19, Issue 1

Author Tags

  1. Data deduplication
  2. restore performance
  3. storage system

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Guangdong Basic and Applied Basic Research Foundation
  • International Cooperation Project of Guangdong Province
  • Science and Technology Planning Project of Guangzhou
  • Open Project Program of Wuhan National Laboratory for Optoelectronics
  • Industry-University-Research Collaboration Project of Zhuhai

Article Metrics

  • Downloads (Last 12 months): 180
  • Downloads (Last 6 weeks): 19
Reflects downloads up to 23 Oct 2024

Cited By

  • (2024) The Design of Fast Delta Encoding for Delta Compression Based Storage Systems. ACM Transactions on Storage 20:4 (1-30). DOI: 10.1145/3664817. Online publication date: 14-May-2024
  • (2024) TFSemantic: A Time–Frequency Semantic GAN Framework for Imbalanced Classification Using Radio Signals. ACM Transactions on Sensor Networks 20:4 (1-22). DOI: 10.1145/3614096. Online publication date: 11-May-2024
  • (2024) Access-Based Carving of Data for Efficient Reproducibility of Containers. 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (557-566). DOI: 10.1109/CCGrid59990.2024.00068. Online publication date: 6-May-2024
  • (2024) Fog-assisted de-duplicated data exchange in distributed edge computing networks. Scientific Reports 14:1. DOI: 10.1038/s41598-024-71682-y. Online publication date: 4-Sep-2024
  • (2024) Multi-scale pooling learning for camouflaged instance segmentation. Applied Intelligence 54:5 (4062-4076). DOI: 10.1007/s10489-024-05369-2. Online publication date: 19-Mar-2024
  • (2023) APRG: A Fair Information Granule Model Based on Adaptive Probability Replacement Resampling. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS) (1413-1420). DOI: 10.1109/ICPADS60453.2023.00201. Online publication date: 17-Dec-2023
  • (2023) IOSPReD: I/O Specialized Packaging of Reduced Datasets and Data-Intensive Applications for Efficient Reproducibility. IEEE Access 11 (1718-1731). DOI: 10.1109/ACCESS.2022.3233787. Online publication date: 2023
  • (2023) Research on Global BloomFilter-Based Data Routing Strategy of Deduplication in Cloud Environment. IETE Journal of Research 70:3 (2705-2715). DOI: 10.1080/03772063.2023.2194260. Online publication date: 10-Apr-2023
