Data Deduplication Approaches: Concepts, Strategies, and Challenges

Length: 404 pages
Edition: 1
Language: English
Publisher: Academic Press
Publication Date: 2020-12-11
ISBN-10: 0128233958
ISBN-13: 9780128233955

In the age of data science, the rapidly increasing volume of data is a major concern for computing and data storage applications, and duplicated or redundant data is one of the main challenges in data science research. Data Deduplication Approaches: Concepts, Strategies, and Challenges shows readers the methods that can be used to eliminate multiple copies of the same files, as well as duplicated segments or chunks of data within those files. As data duplication keeps growing, deduplication has become an especially useful field of research for storage environments, in particular persistent data storage. The book gives readers an overview of the concepts and background of data deduplication approaches, then demonstrates in technical detail the strategies and challenges of real-time implementations for big data, data science, and data backup and recovery. It also covers future research directions, case studies, and real-world applications of data deduplication, focusing on reduced storage, backup, recovery, and reliability.

Cover image
Title page
Table of Contents
Copyright
Dedication
List of contributors
About the editors
Preface
Acknowledgement
1. Introduction to data deduplication approaches
    Abstract
    1.1 Introduction
    1.2 Methods of data deduplication
    1.3 Classic research and classification of methods
    1.4 File chunking and metadata
    1.5 Implementation strategies
    1.6 Performance evaluation and concluding remarks
    References
2. Data deduplication concepts
    Abstract
    2.1 History
    2.2 Need of data deduplication
    2.3 Techniques for data redundancy removal
    2.4 Problems with existing techniques
    2.5 Redundant arrays of independent disks
    2.6 Direct attached storage
    2.7 Storage area network
    2.8 Network attached storage
    2.9 Comparison between direct attached storage, network attached storage, and storage area network
    2.10 Data deduplication techniques
    2.11 Benefits of data deduplication
    2.12 How data deduplication operates
    2.13 Hashing
    2.14 Deduplication taxonomy
    2.15 Deduplication versus compression
    2.16 Challenges in data deduplication
    References
3. Concepts, strategies, and challenges of data deduplication
    Abstract
    3.1 Deduplication approaches
    3.2 Required components for data deduplication approaches
    3.3 Centered on granularity for elimination of data duplication
    3.4 Centered on location for elimination of data duplication
    3.5 Centered on time for elimination of data duplication
    3.6 Comparative discussion on different studied and prevailing data deduplication approaches and its challenges
    3.7 Summary
    References
4. Existing mechanisms for data deduplication
    Abstract
    4.1 Introduction
    4.2 Classification of data deduplication techniques
    4.3 Data deduplication in the cloud
    4.4 Deduplication ratio
    4.5 Importance of data deduplication
    4.6 Deduplication for big data
    4.7 Conclusion
    References
5. Classification criteria for data deduplication methods
    Abstract
    5.1 Introduction
    5.2 Granularity
    5.3 Technique to handle duplicates
    5.4 Locality assumptions for efficiency
    5.5 Place
    5.6 Time
    5.7 Data format awareness
    5.8 Indexing and techniques to find duplicates
    5.9 Scope
    5.10 Data type
    5.11 Storage type
    5.12 Conclusion
    References
6. File chunking approaches
    Abstract
    6.1 Introduction
    6.2 Materials and methods
    6.3 File-level chunking
    6.4 Implementation of file chunking
    6.5 Case study: Deduplicator
    6.6 Case study: Duplicates Cleaner
    6.7 Conclusion
    6.8 Bibliographic note
    6.9 Supporting GitHub repositories and blogs
    References
7. Study of data deduplication for file chunking approaches
    Abstract
    7.1 Introduction
    7.2 Related literature
    7.3 Conclusion
    References
8. Essentials of data deduplication using open-source toolkit
    Abstract
    8.1 Introduction
    8.2 Basic deduplication structure
    8.3 Implementation using Python
    8.4 Record linkage toolkit
    8.5 Summary
    References
9. Efficient data deduplication scheme for scale-out distributed storage
    Abstract
    9.1 Introduction
    9.2 Distributed storage system
    9.3 Related work
    9.4 Overview of capacity optimization for scale-out distributed storage
    9.5 Bloom filter array–based data deduplication scheme for scale-out distributed storage
    9.6 Ensuring reliability in deduplication data by erasure-coded replication
    9.7 Summary
    References
10. Identification of duplicate bug reports in software bug repositories: a systematic review, challenges, and future scope
    Abstract
    10.1 Introduction
    10.2 Motivation
    10.3 Duplicate bug detection
    10.4 Systematic review
    10.5 Conclusion, challenges, and future scope
    References
11. A survey and critical analysis on energy generation from datacenter
    Abstract
    11.1 Introduction
    11.2 Datacenter framework
    11.3 Power supply among different components of datacenter
    11.4 Power distribution among different components of datacenter
    11.5 Significance of efficient energy consumption models
    11.6 Energy consumption reduction approaches
    11.7 Conclusion
    References
12. Review of MODIS EVI and NDVI data for data mining applications
    Abstract
    12.1 Introduction
    12.2 MODIS vegetation indices
    12.3 MODIS sinusoidal tiling system
    12.4 MODIS file naming convention
    12.5 Data conversion
    12.6 Quality assurance
    12.7 Techniques to prepare EVI time series data set
    12.8 Data mining–based land cover change detection
    12.9 Summary
    References
13. Performance modeling for secure migration processes of legacy systems to the cloud computing
    Abstract
    13.1 Data migration in cloud computing
    13.2 Literature review
    13.3 Proposed work
    13.4 Proposed encryption approach
    13.5 Result and conclusion
    References
14. DedupCloud: an optimized efficient virtual machine deduplication algorithm in cloud computing environment
    Abstract
    14.1 Introduction
    14.2 Motivation
    14.3 Literature review
    14.4 Data deduplication on cloud storage systems
    14.5 DedupCloud: proposed methodology for data deduplication in cloud
    14.6 Conclusion
    References
15. Data deduplication for cloud storage
    Abstract
    15.1 Introduction
    15.2 Cloud storage
    15.3 Data deduplication for cloud storage
    15.4 Conclusion
    References
16. Data duplication using Amazon Web Services cloud storage
    Abstract
    16.1 Introduction
    16.2 The workflow of data deduplication
    16.3 Deduplication in Amazon Web Services
    16.4 How to deduplicate
    16.5 Integrate and deduplicate datasets using AWS Lake Formation FindMatches
    16.6 Additional services and benefits
    16.7 Comparison of Cloud backup services with AWS, GCP, Azure
    16.8 Key terms and definitions
    References
17. Game-theoretic analysis of encrypted cloud data deduplication
    Abstract
    17.1 Introduction
    17.2 Related work review and open research problems
    17.3 Preliminaries and notations
    17.4 Game-theoretic analysis of server-controlled deduplication
    17.5 Game-theoretic analysis of client-controlled deduplication
    17.6 Conclusion and future work
    Acknowledgment
    References
18. Data deduplication applications in cognitive science and computer vision research
    Abstract
    18.1 Introduction
    18.2 Redundancy and dimensionality reduction
    18.3 Interactive deduplication
    18.4 Image-specific data deduplication
    18.5 Cognitive science load and dimensionality problem
    18.6 Conclusion
    References
Index