
The Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake
- Length: 487 pages
- Edition: 1
- Language: English
- Publisher: Apress
- Publication Date: 2022-07-28
- ISBN-10: 1484282329
- ISBN-13: 9781484282328
- Sales Rank: #1692949
Design and implement a modern data lakehouse on the Azure Data Platform using Delta Lake, Apache Spark, Azure Databricks, Azure Synapse Analytics, and Snowflake. This book teaches you the details of the data lakehouse paradigm and how to design an efficient, cloud-based lakehouse with cutting-edge Apache Spark capabilities across Azure Databricks, Azure Synapse Analytics, and Snowflake. You will learn to write efficient PySpark code for batch and streaming ELT jobs on Azure, and you will follow practical, scenario-based examples showing how to apply the capabilities of Delta Lake and Apache Spark to optimize performance and to secure, share, and manage high-volume, high-velocity, high-variety data in your lakehouse with ease.
The patterns of success that you acquire from reading this book will help you hone your skills to build high-performing, scalable, ACID-compliant lakehouses using flexible and cost-efficient decoupled storage and compute. Extensive coverage of Delta Lake ensures that you are aware of, and can benefit from, all that this new open-source storage layer can offer. In addition to the book's deep examples on Databricks, there is coverage of alternative platforms such as Synapse Analytics and Snowflake so that you can make the right platform choice for your needs.
After reading this book, you will be able to implement Delta Lake capabilities, including schema evolution, change data feed, live tables, sharing, and clones, to enable better business intelligence and advanced analytics on your data within the Azure Data Platform.
What You Will Learn
- Implement the Data Lakehouse Paradigm on Microsoft’s Azure cloud platform
- Benefit from the new Delta Lake open-source storage layer for data lakehouses
- Take advantage of schema evolution, change feeds, live tables, and more
- Write functional PySpark code for data lakehouse ELT jobs
- Optimize Apache Spark performance through partitioning, indexing, and other tuning options
- Choose between alternatives such as Databricks, Synapse Analytics, and Snowflake
Who This Book Is For
Data, analytics, and AI professionals at all levels, including practicing data architects and data engineers. It is also for data professionals seeking patterns of success with which to remain relevant as they learn to build scalable data lakehouses for organizations and customers migrating to the modern Azure Data Platform.
Table of Contents
- About the Author
- About the Technical Reviewer
- Acknowledgments
- Introduction
- Chapter 1: The Data Lakehouse Paradigm (Background; Architecture; Ingestion and Processing; Data Factory; Databricks; Functions and Logic Apps; Synapse Analytics Serverless Pools; Stream Analytics; Messaging Hubs; Storing and Serving; Delta Lake; Synapse Analytics Dedicated SQL Pools; Relational Database Purchasing Models (SQL DTU vs. vCore Database); Service Tiers; Deployment Models; Non-relational Databases; Snowflake; Consumption; Analysis Services; Power BI; Power Apps; Advanced Analytics; Cognitive Services; Machine Learning; Continuous Integration, Deployment, and Governance; DevOps; Purview; Summary)
- Chapter 2: Snowflake (Architecture; Cost; Security; Azure Key Vault; Azure Private Link; Applications; Replication and Failover; Data Integration with Azure Data Lake Storage Gen2; Real-Time Data Loading with ADLS Gen2; Data Factory; Databricks; Data Transformation; Governance; Column-Level Security; Row-Level Security; Access History; Object Tagging; Sharing; Direct Share; Data Marketplace; Data Exchange; Continuous Integration and Deployment; Jenkins; Azure DevOps; Reporting; Power BI; Delta Lake, Machine Learning, and Constraints; Summary)
- Chapter 3: Databricks (Workspaces; Data Science and Engineering; Machine Learning; SQL; Compute; Storage; Mount Data Lake Storage Gen2 Account; Getting Started; Create a Secret Scope; Mount Data Lake Storage Gen2; Read Data Lake Storage Gen2 from Databricks; Delta Lake; Reporting; Real-Time Analytics; Advanced Analytics; Security and Governance; Continuous Integration and Deployment; Integration with Synapse Analytics; Dynamic Data Encryption; Data Profile; Query Profile; Constraints; Identity; Delta Live Tables; Merge; Summary)
- Chapter 4: Synapse Analytics (Workspaces; Storage; SQL Database (SQL Pools); Lake Database; Integration Dataset; External Datasets; Development; Integration; Monitoring; Management; Reporting; Continuous Integration and Deployment; Real-Time Analytics; Structured Streaming; Synapse Link; Advanced Analytics; Security; Governance; Additional Features; Delta Tables; Machine Learning; SQL Server Integration Services Integration Runtime (SSIS IR); Map Data Tool; Data Sharing; SQL Incremental Constraints; Summary)
- Chapter 5: Pipelines and Jobs (Databricks; Data Factory; Mapping Data Flows; HDInsight Spark Activity; Scheduling and Monitoring; Synapse Analytics Workspace; Summary)
- Chapter 6: Notebook Code (PySpark; Excel; XML; JSON; ZIP; Scala; SQL; Optimizing Performance; Summary)
- Chapter 7: Schema Evolution (Schema Evolution Using Parquet Format; Schema Evolution Using Delta Format; Append; Overwrite; Summary)
- Chapter 8: Change Data Feed (Create Database and Tables; Insert Data into Tables; Change Data Capture; Streaming Changes; Summary)
- Chapter 9: Clones (Shallow Clones; Deep Clones; Summary)
- Chapter 10: Live Tables (Advantages of Delta Live Tables; Create a Notebook; Create and Run a Pipeline; Schedule a Pipeline; Explore Event Logs; Summary)
- Chapter 11: Sharing (Architecture; Share Data; Access Data; Sharing Data with Snowflake; Summary)
- Chapter 12: Dynamic Partition Pruning (Partitions; Prerequisites; DPP Commands; Create Cluster; Create Notebook and Mount Data Lake; Create Fact Table; Verify Fact Table Partitions; Create Dimension Table; Join Results Without DPP Filter; Join Results with DPP Filter; Summary)
- Chapter 13: Z-Ordering and Data Skipping (Prepare Data in Delta Lake; Verify Data in Delta Lake; Create Hive Table; Run Optimize and Z-Order Commands; Verify Data Skipping; Summary)
- Chapter 14: Adaptive Query Execution (How It Works; Prerequisites; Comparing AQE Performance on Query with Joins; Create Datasets; Disable AQE; Enable AQE; Summary)
- Chapter 15: Bloom Filter Index (How a Bloom Filter Index Works; Create a Cluster; Create a Notebook and Insert Data; Enable Bloom Filter Index; Create Tables; Create a Bloom Filter Index; Optimize Table with Z-Order; Verify Performance Improvements; Summary)
- Chapter 16: Hyperspace (Prerequisites; Create Parquet Files; Run a Query Without an Index; Import Hyperspace; Read the Parquet Files to a Data Frame; Create a Hyperspace Index; Rerun the Query with Hyperspace Index; Other Hyperspace Management APIs; Summary)
- Chapter 17: Auto Loader (Advanced Schema Evolution; Prerequisites; Generate Data from SQL Database; Load Data to Azure Data Lake Storage Gen2; Configure Resources in Azure Portal; Configure Databricks; Run Auto Loader in Databricks; Configuration Properties; Rescue Data; Schema Hints; Infer Column Types; Add New Columns; Managing Auto Loader Resources; Read a Stream; Write a Stream; Explore Results; Summary)
- Chapter 18: Python Wheels (Install Application Software; Install Visual Studio Code and Python Extension; Install Python; Configure Python Interpreter Path for Visual Studio Code; Verify Python Version in Visual Studio Code Terminal; Set Up Wheel Directory Folders and Files; Create Setup File; Create Readme File; Create License File; Create Init File; Create Package Function File; Install Python Wheel Packages; Install Wheel Package; Install Check Wheel Package; Create and Verify Wheel File; Create Wheel File; Check Wheel Contents; Verify Wheel File; Configure Databricks Environment; Install Wheel to Databricks Library; Create Databricks Notebook; Mount Data Lake Folder; Create Spark Database; Verify Wheel Package; Import Wheel Package; Create Function Parameters; Run Wheel Package Function; Show Spark Tables; Files in Databricks Repos; Continuous Integration and Deployment; Summary)
- Chapter 19: Security and Controls (Implement Cluster, Pool, and Jobs Access Control; Implement Workspace Access Control; Implement Other Access and Visibility Controls; Table Access Control; Personal Access Tokens; Visibility Controls; Example Row-Level Security Implementation; Create New User Groups; Load Sample Data; Create Delta Tables; Run Queries Using Row-Level Security; Create Row-Level Secured Views and Grant Selective User Access; Interaction with Azure Active Directory; Summary)
- Index
How to download the source code?
From GitHub:
1. Go to https://github.com/Apress.
2. In the "Find a repository…" box, search for the book title: The Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.
From the download links on this page:
1. Disable any ad-blocking plugin; otherwise the links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to complete the download.