Simplifying Data Engineering and Analytics with Delta: Create analytics-ready data that fuels artificial intelligence and business intelligence
- Length: 334 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-07-29
- ISBN-10: 1801814864
- ISBN-13: 9781801814867
- Sales Rank: #498301
Explore how Delta brings reliability, performance, and governance to your data lake and all the AI and BI use cases built on top of it
Key Features
- Learn Delta’s core concepts and features as well as what makes it a perfect match for data engineering and analysis
- Solve business challenges of different industry verticals using a scenario-based approach
- Make optimal choices by understanding the various tradeoffs provided by Delta
Book Description
Delta helps you generate reliable insights at scale and simplifies the architecture around data pipelines, allowing you to focus on refining your use cases rather than on infrastructure. This is especially important because existing architecture is frequently reused for new use cases.
In this book, you’ll learn about the principles of distributed computing, data modeling techniques, and big data design patterns and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You’ll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. After that, you’ll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products.
By the end of this Delta book, you’ll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
What you will learn
- Explore the key challenges of traditional data lakes
- Appreciate the unique features of Delta that come out of the box
- Address reliability, performance, and governance concerns using Delta
- Analyze the open data format for an extensible and pluggable architecture
- Handle multiple use cases to support BI, AI, streaming, and data discovery
- Discover how common data and machine learning design patterns are executed on Delta
- Build and deploy data and machine learning pipelines at scale using Delta
Who this book is for
Data engineers, data scientists, ML practitioners, BI analysts, or anyone in the data domain working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.
Table of Contents
Foreword
Contributors
- About the author
- About the reviewer
Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Share Your Thoughts
Section 1 – Introduction to Delta Lake and Data Engineering Principles
Chapter 1: Introduction to Data Engineering
- The motivation behind data engineering
- Use cases
- How big is big data?
- But isn't ML and AI all the rage today?
- Understanding the role of data personas
- Big data ecosystem
- What characterizes big data?
- Classifying data
- Reaping value from data
- Top challenges of big data systems
- Evolution of data systems
- Rise of cloud data platforms
- SQL and NoSQL systems
- OLTP and OLAP systems
- Data platform service models
- Distributed computing
- SMP and MPP computing
- Parallel and distributed computing
- Hadoop
- Spark
- Hadoop versus Spark
- Business justification for tech spending
- Strategy for business transformation to use data as an asset
- Big data trends and best practices
- Summary
Chapter 2: Data Modeling and ETL
- Technical requirements
- What is data modeling and why should you care?
- Advantages of a data modeling exercise
- Stages of data modeling
- Data modeling approaches for different data stores
- Relational data modeling
- Non-relational data modeling
- Understanding metadata – data about data
- Data catalog
- Types of metadata
- Why is metadata management the nerve center of data?
- Moving and transforming data using ETL
- Scenarios to consider for building ETL pipelines
- Periodic and continuous ingestion
- Bulk data migration
- Change data capture
- Slowly changing dimensions
- Job orchestration
- How to choose the right data format
- Text format versus binary format
- Row versus column formats
- When to use which format
- Leveraging data compression
- Common big data design patterns
- Ingestion
- Unified API
- Speed layer
- Transformations
- Handling schema changes
- ACID transactions
- Multihop pipeline
- Persist
- Separation of compute from storage
- Multiple destinations
- Denormalization
- In-stream analytics
- Best practices
- Summary
- Further reading
Chapter 3: Delta – The Foundation Block for Big Data
- Technical requirements
- Motivation for Delta
- A case of too many is too little
- Data silos to data swamps
- Characteristics of curated data lakes
- DDL commands
- CREATE
- DML commands
- APPEND
- UPDATE
- DELETE
- MERGE
- Demystifying Delta
- Format layout on disk
- The main features of Delta
- ACID transaction support
- Schema evolution
- Unifying batch and streaming workloads
- Time travel
- Performance
- Data skipping
- Z-Order clustering
- Delta cache
- Life with and without Delta
- Lakehouse
- Characteristics of a Lakehouse
- Summary
Section 2 – End-to-End Process of Building Delta Pipelines
Chapter 4: Unifying Batch and Streaming with Delta
- Technical requirements
- Moving toward real-time systems
- Streaming concepts
- Lambda versus Kappa architectures
- Streaming ETL
- Extract – file-based versus event-based streaming
- Transforming – stream processing
- Loading – persisting the stream
- Handling streaming scenarios
- Joining with other static and dynamic datasets
- Recovering from failures
- Handling late-arriving data
- Stateless and stateful in-stream operations
- Trade-offs in designing streaming architectures
- Cost trade-offs
- Handling latency trade-offs
- Data reprocessing
- Multi-tenancy
- De-duplication
- Streaming best practices
- Summary
Chapter 5: Data Consolidation in Delta Lake
- Technical requirements
- Why consolidate disparate data types?
- Delta unifies all types of data
- Structured data
- Semi-structured data
- Unstructured data
- Avoiding patches of data darkness
- Addressing problems in flight status using Delta
- Augmenting domain knowledge constraints to quality
- Continuous quality monitoring
- Curating data in stages for analytics
- RDD, DataFrames, and datasets
- Spark transformations and actions
- Spark APIs and UDFs
- Ease of extending to existing and new use cases
- Delta Lake connectors
- Specialized Delta Lakes by industry
- Healthcare and life sciences Delta Lake
- Industry 4.0 manufacturing Delta Lake
- Financial services Delta Lake
- Retail Delta Lake
- Data governance
- GDPR and CCPA compliance
- Role-based data access
- Summary
Chapter 6: Solving Common Data Pattern Scenarios with Delta
- Technical requirements
- Understanding use case requirements
- Minimizing data movement with Delta time travel
- Delta cloning
- Handling CDC
- CDC
- Change Data Feed (CDF)
- Handling Slowly Changing Dimensions (SCD)
- SCD Type 1
- SCD Type 2
- Summary
Chapter 7: Delta for Data Warehouse Use Cases
- Technical requirements
- Choosing the right architecture
- Understanding what a data warehouse really solves
- Lacunas of data warehouses
- Discovering when a data lake does not suffice
- Addressing concurrency and latency requirements with Delta
- Visualizing data using BI reporting
- Can cubes be constructed with Delta?
- Analyzing tradeoffs in a push versus pull data flow
- Why is being open such a big deal?
- Considerations around data governance
- The rise of the lakehouse category
- Summary
Chapter 8: Handling Atypical Data Scenarios with Delta
- Technical requirements
- Emphasizing the importance of exploratory data analysis (EDA)
- From big data to good data
- Data profiling
- Statistical analysis
- Applying sampling techniques to address class imbalance
- How to detect and address imbalance
- Synthetic data generation to deal with data imbalance
- Addressing data skew
- Providing data anonymity
- Handling bias and variance in data
- Bias versus variance
- How do we detect bias and variance?
- How do we fix bias and variance?
- Compensating for missing and out-of-range data
- Monitoring data drift
- Summary
Chapter 9: Delta for Reproducible Machine Learning Pipelines
- Technical requirements
- Data science versus machine learning
- Challenges of ML development
- Formalizing the ML development process
- What is a model?
- What is MLOps?
- Aspirations of a modern ML platform
- The role of Delta in an ML pipeline
- Delta-backed feature store
- Delta-backed model training
- Delta-backed model inferencing
- Model monitoring with Delta
- From business problem to insight generation
- Summary
Chapter 10: Delta for Data Products and Services
- Technical requirements
- DaaS
- The need for data democratization
- Delta for unstructured data
- NLP data (text and audio)
- Image and video data
- Data mashups using Delta
- Data blending
- Data harmonization
- Federated query
- Facilitating data sharing with Delta
- Setting up Delta sharing
- Benefits of Delta sharing
- Data clean room
- Summary
Section 3 – Operationalizing and Productionalizing Delta Pipelines
Chapter 11: Operationalizing Data and ML Pipelines
- Technical requirements
- Why operationalize?
- Understanding and monitoring SLAs
- Scaling and high availability
- Planning for DR
- How to decide on the correct DR strategy
- How Delta helps with DR
- Guaranteeing data quality
- Automation of CI/CD pipelines
- Code under version control
- Infrastructure as Code (IaC)
- Unit and integration testing
- Data as code – An intelligent pipeline
- Summary
Chapter 12: Optimizing Cost and Performance with Delta
- Technical requirements
- Improving performance with common strategies
- Where to look and what to look for
- Optimizing with Delta
- Changing the data layout in storage
- Other platform optimizations
- Automation
- Is cost always inversely proportional to performance?
- Best practices for managing performance
- Summary
Chapter 13: Managing Your Data Journey
- Provisioning a multi-tenant infrastructure
- Data democratization via policies and processes
- Capacity planning
- Managing and monitoring
- Data sharing
- Data migration
- COE best practices
- Summary
Why subscribe?
Other Books You May Enjoy
- Packt is searching for authors like you
- Share Your Thoughts
How to download the source code
1. Go to https://github.com/PacktPublishing.
2. In the Find a repository… box, search for the book title: Simplifying Data Engineering and Analytics with Delta: Create analytics-ready data that fuels artificial intelligence and business intelligence. If no results appear, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.