Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Length: 480 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2021-10-22
ISBN-10: 1801077746
ISBN-13: 9781801077743
Sales Rank: #28021

Understand the complexities of modern-day data engineering platforms and explore strategies to deal with them with the help of use case scenarios led by an industry expert in big data

Key Features

Become well-versed with the core concepts of Apache Spark and Delta Lake for building data platforms
Learn how to ingest, process, and analyze data that can be later used for training machine learning models
Understand how to operationalize data models in production using curated data

Book Description

In the world of ever-changing data and schemas, it is important to build data pipelines that can auto-adjust to changes. This book will help you build scalable data platforms that managers, data scientists, and data analysts can rely on.

Starting with an introduction to data engineering, along with its key concepts and architectures, this book will show you how to use Microsoft Azure Cloud services effectively for data engineering. You’ll cover data lake design patterns and the different stages through which the data needs to flow in a typical data lake. Once you’ve explored the main features of Delta Lake to build data lakes with fast performance and governance in mind, you’ll advance to implementing the lambda architecture using Delta Lake. Packed with practical examples and code snippets, this book takes you through real-world examples based on production scenarios faced by the author in his 10 years of experience working with big data. Finally, you’ll cover data lake deployment strategies that play an important role in provisioning the cloud resources and deploying the data pipelines in a repeatable and continuous way.

By the end of this data engineering book, you’ll know how to effectively deal with ever-changing data and create scalable data pipelines to streamline data science, ML, and artificial intelligence (AI) tasks.

What you will learn

Discover the challenges you may face in the data engineering world
Add ACID transactions to Apache Spark using Delta Lake
Understand effective design strategies to build enterprise-grade data lakes
Explore architectural and design patterns for building efficient data ingestion pipelines
Orchestrate a data pipeline for preprocessing data using Apache Spark and Delta Lake APIs
Automate deployment and monitoring of data pipelines in production
Get to grips with securing, monitoring, and managing data pipelines efficiently

Who this book is for

This book is for aspiring data engineers and data analysts who are new to the world of data engineering and are looking for a practical guide to building scalable data platforms. If you already work with PySpark and want to use Delta Lake for data engineering, you’ll find this book useful. Basic knowledge of Python, Spark, and SQL is expected.

The Story of Data Engineering and Analytics
Discovering Storage and Compute Data Lake Architectures
Data Engineering on Microsoft Azure
Understanding Data Pipelines
Data Collection Stage – The Bronze Layer
Understanding Delta Lake
Data Curation Stage – The Silver Layer
Data Aggregation Stage – The Gold Layer
Deploying and Monitoring Pipelines in Production
Solving Data Engineering Challenges
Infrastructure Provisioning
Continuous Integration and Deployment (CI/CD) of Data Pipelines

Data Engineering with Apache Spark, Delta Lake, and Lakehouse
Foreword
Contributors
About the author
About the reviewers
Preface
    Who this book is for
    What this book covers
    Download the example code files
    Download the color images
    Conventions used
    Get in touch
    Share Your Thoughts
Section 1: Modern Data Engineering and Tools
Chapter 1: The Story of Data Engineering and Analytics
    The journey of data 
    Exploring the evolution of data analytics
        Core capabilities of storage and compute resources
        Availability of varying datasets
        The paradigm shift to distributed computing
        Adoption of cloud computing
        Data storytelling
    The monetary power of data 
        Organic growth
    Summary
Chapter 2: Discovering Storage and Compute Data Lakes
    Introducing data lakes
        Exploring the benefits of data lakes
        Adhering to compliance frameworks
        Segregating storage and compute in a data lake
    Discovering data lake architectures
        The CAP theorem
    Summary
Chapter 3: Data Engineering on Microsoft Azure
    Introducing data engineering in Azure
    Performing data engineering in Microsoft Azure
        Self-managed data engineering services (IaaS)
        Azure-managed data engineering services (PaaS)
        Data processing services in Microsoft Azure
        Data engineering as a service (SaaS)
        Data cataloging and sharing services in Microsoft Azure
    Opening a free account with Microsoft Azure
    Summary
Section 2: Data Pipelines and Stages of Data Engineering
Chapter 4: Understanding Data Pipelines
    Exploring data pipelines
        Components of a data pipeline
    Process of creating a data pipeline
        Discovery phase
        Design phase
        Development phase
        Deployment phase
    Running a data pipeline
    Sample lakehouse project
    Summary
Chapter 5: Data Collection Stage – The Bronze Layer
    Architecting the Electroniz data lake
        The cloud architecture
        The pipeline design
        The deployment strategy
    Understanding the bronze layer
    Configuring data sources
        Data preparation
    Configuring data destinations
    Building the ingestion pipelines
        Building a batch ingestion pipeline
        Testing the ingestion pipelines
        Building the streaming ingestion pipeline
    Summary
Chapter 6: Understanding Delta Lake
    Understanding how Delta Lake enables the lakehouse
    Understanding Delta Lake
        Preparing Azure resources
    Creating a Delta Lake table
    Changing data in an existing Delta Lake table
    Performing time travel
    Performing upserts of data
    Understanding isolation levels
    Understanding concurrency control
    Cleaning up Azure resources
    Summary
Chapter 7: Data Curation Stage – The Silver Layer
    The need for curating raw data
        Unstandardized data
        Invalid data
        Non-uniform data
        Inconsistent data
        Duplicate data
        Insecure data
    The process of curating raw data
        Inspecting data
        Getting approval
        Cleaning data
        Verifying data
    Developing a data curation pipeline
        Preparing Azure resources
        Creating the pipeline for the silver layer
    Running the pipeline for the silver layer
    Verifying curated data in the silver layer
        Verifying unstandardized data
        Verifying invalid data
        Verifying non-uniform data
        Verifying duplicate data
        Verifying insecure data
    Cleaning up Azure resources
    Summary
Chapter 8: Data Aggregation Stage – The Gold Layer
    The need to aggregate data
    The process of aggregating data
    Developing a data aggregation pipeline
        Preparing the Azure resources
        Creating the pipeline for the gold layer
    Running the aggregation pipeline
    Understanding data consumption
        Accessing silver layer data 
        Accessing gold layer data 
    Verifying aggregated data in the gold layer
    Meeting customer expectations
    Summary
Section 3: Data Engineering Challenges and Effective Deployment Strategies
Chapter 9: Deploying and Monitoring Pipelines in Production
    The deployment strategy
    Developing the master pipeline
    Testing the master pipeline
    Scheduling the master pipeline
    Monitoring pipelines
        Adding durability features
        Dealing with failure conditions
        Adding alerting features
    Summary
Chapter 10: Solving Data Engineering Challenges
    Schema evolution 
    Sharing data
        Preparing the Azure resources
        Creating a data share
    Data governance 
        Preparing the Azure resources
        Creating a data catalog
    Cleaning up Azure resources
    Summary
Chapter 11: Infrastructure Provisioning
    Infrastructure as code
    Deploying infrastructure using Azure Resource Manager
        Creating ARM templates
        Deploying ARM templates using the Azure portal
        Deploying ARM templates using the Azure CLI
        Deploying ARM templates containing secrets
    Deploying multiple environments using IaC
    Cleaning up Azure resources
    Summary
Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines
    Understanding CI/CD
        Traditional software delivery cycle
        Modern software delivery cycle
    Designing CI/CD pipelines
    Developing CI/CD pipelines
        Creating an Azure DevOps organization
        Creating the Electroniz infrastructure CI/CD pipeline
        Creating the Electroniz code CI/CD pipeline
        Creating the CI/CD life cycle
    Summary
Other Books You May Enjoy
    Packt is searching for authors like you
    Share Your Thoughts

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Data Engineering with Apache Spark, Delta Lake, and Lakehouse. If the full title returns no results, search for the main title only.