
The Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake
- Length: 487 pages
- Edition: 1
- Language: English
- Publisher: Apress
- Publication Date: 2022-07-28
- ISBN-10: 1484282329
- ISBN-13: 9781484282328
- Sales Rank: #1692949
Design and implement a modern data lakehouse on the Azure Data Platform using Delta Lake, Apache Spark, Azure Databricks, Azure Synapse Analytics, and Snowflake. This book teaches you the details of the data lakehouse paradigm and how to design an efficient, cloud-based lakehouse with cutting-edge Apache Spark capabilities across Azure Databricks, Azure Synapse Analytics, and Snowflake. You will learn to write efficient PySpark code for batch and streaming ELT jobs on Azure, and you will follow practical, scenario-based examples showing how to apply the capabilities of Delta Lake and Apache Spark to optimize performance and to secure, share, and manage high-volume, high-velocity, high-variety data in your lakehouse with ease.
The patterns of success that you acquire from reading this book will help you hone your skills to build high-performing, scalable, ACID-compliant lakehouses using flexible and cost-efficient decoupled storage and compute. Extensive coverage of Delta Lake ensures that you are aware of, and can benefit from, all that this new open-source storage layer can offer. In addition to the book's deep examples on Databricks, there is coverage of alternative platforms such as Synapse Analytics and Snowflake so that you can make the right platform choice for your needs.
After reading this book, you will be able to implement Delta Lake capabilities, including schema evolution, change data feed, live tables, sharing, and clones, to enable better business intelligence and advanced analytics on your data within the Azure Data Platform.
What You Will Learn
- Implement the Data Lakehouse Paradigm on Microsoft’s Azure cloud platform
- Benefit from the new Delta Lake open-source storage layer for data lakehouses
- Take advantage of schema evolution, change feeds, live tables, and more
- Write functional PySpark code for data lakehouse ELT jobs
- Optimize Apache Spark performance through partitioning, indexing, and other tuning options
- Choose between alternatives such as Databricks, Synapse Analytics, and Snowflake
Who This Book Is For
Data, analytics, and AI professionals at all levels, including practicing data architects and data engineers. It is also for data professionals seeking patterns of success with which to remain relevant as they learn to build scalable data lakehouses for organizations and customers migrating to the modern Azure Data Platform.
Table of Contents
- About the Author
- About the Technical Reviewer
- Acknowledgments
- Introduction
- Chapter 1: The Data Lakehouse Paradigm (Background; Architecture; Ingestion and Processing; Data Factory; Databricks; Functions and Logic Apps; Synapse Analytics Serverless Pools; Stream Analytics; Messaging Hubs; Storing and Serving; Delta Lake; Synapse Analytics Dedicated SQL Pools; Relational Database Purchasing Models (SQL DTU vs. vCore Database); Service Tiers; Deployment Models; Non-relational Databases; Snowflake; Consumption; Analysis Services; Power BI; Power Apps; Advanced Analytics; Cognitive Services; Machine Learning; Continuous Integration, Deployment, and Governance; DevOps; Purview; Summary)
- Chapter 2: Snowflake (Architecture; Cost; Security; Azure Key Vault; Azure Private Link; Applications; Replication and Failover; Data Integration with Azure Data Lake Storage Gen2; Real-Time Data Loading with ADLS Gen2; Data Factory; Databricks; Data Transformation; Governance; Column-Level Security; Row-Level Security; Access History; Object Tagging; Sharing; Direct Share; Data Marketplace; Data Exchange; Continuous Integration and Deployment; Jenkins; Azure DevOps; Reporting; Power BI; Delta Lake, Machine Learning, and Constraints; Summary)
- Chapter 3: Databricks (Workspaces; Data Science and Engineering; Machine Learning; SQL; Compute; Storage; Mount Data Lake Storage Gen2 Account; Getting Started; Create a Secret Scope; Mount Data Lake Storage Gen2; Read Data Lake Storage Gen2 from Databricks; Delta Lake; Reporting; Real-Time Analytics; Advanced Analytics; Security and Governance; Continuous Integration and Deployment; Integration with Synapse Analytics; Dynamic Data Encryption; Data Profile; Query Profile; Constraints; Identity; Delta Live Tables; Merge; Summary)
- Chapter 4: Synapse Analytics (Workspaces; Storage; SQL Database (SQL Pools); Lake Database; Integration Dataset; External Datasets; Development; Integration; Monitoring; Management; Reporting; Continuous Integration and Deployment; Real-Time Analytics; Structured Streaming; Synapse Link; Advanced Analytics; Security; Governance; Additional Features; Delta Tables; Machine Learning; SQL Server Integration Services Integration Runtime (SSIS IR); Map Data Tool; Data Sharing; SQL Incremental Constraints; Summary)
- Chapter 5: Pipelines and Jobs (Databricks; Data Factory; Mapping Data Flows; HDInsight Spark Activity; Scheduling and Monitoring; Synapse Analytics Workspace; Summary)
- Chapter 6: Notebook Code (PySpark; Excel; XML; JSON; ZIP; Scala; SQL; Optimizing Performance; Summary)
- Chapter 7: Schema Evolution (Schema Evolution Using Parquet Format; Schema Evolution Using Delta Format; Append; Overwrite; Summary)
- Chapter 8: Change Data Feed (Create Database and Tables; Insert Data into Tables; Change Data Capture; Streaming Changes; Summary)
- Chapter 9: Clones (Shallow Clones; Deep Clones; Summary)
- Chapter 10: Live Tables (Advantages of Delta Live Tables; Create a Notebook; Create and Run a Pipeline; Schedule a Pipeline; Explore Event Logs; Summary)
- Chapter 11: Sharing (Architecture; Share Data; Access Data; Sharing Data with Snowflake; Summary)
- Chapter 12: Dynamic Partition Pruning (Partitions; Prerequisites; DPP Commands; Create Cluster; Create Notebook and Mount Data Lake; Create Fact Table; Verify Fact Table Partitions; Create Dimension Table; Join Results Without DPP Filter; Join Results with DPP Filter; Summary)
- Chapter 13: Z-Ordering and Data Skipping (Prepare Data in Delta Lake; Verify Data in Delta Lake; Create Hive Table; Run Optimize and Z-Order Commands; Verify Data Skipping; Summary)
- Chapter 14: Adaptive Query Execution (How It Works; Prerequisites; Comparing AQE Performance on Query with Joins; Create Datasets; Disable AQE; Enable AQE; Summary)
- Chapter 15: Bloom Filter Index (How a Bloom Filter Index Works; Create a Cluster; Create a Notebook and Insert Data; Enable Bloom Filter Index; Create Tables; Create a Bloom Filter Index; Optimize Table with Z-Order; Verify Performance Improvements; Summary)
- Chapter 16: Hyperspace (Prerequisites; Create Parquet Files; Run a Query Without an Index; Import Hyperspace; Read the Parquet Files to a Data Frame; Create a Hyperspace Index; Rerun the Query with Hyperspace Index; Other Hyperspace Management APIs; Summary)
- Chapter 17: Auto Loader (Advanced Schema Evolution; Prerequisites; Generate Data from SQL Database; Load Data to Azure Data Lake Storage Gen2; Configure Resources in Azure Portal; Configure Databricks; Run Auto Loader in Databricks; Configuration Properties; Rescue Data; Schema Hints; Infer Column Types; Add New Columns; Managing Auto Loader Resources; Read a Stream; Write a Stream; Explore Results; Summary)
- Chapter 18: Python Wheels (Install Application Software; Install Visual Studio Code and Python Extension; Install Python; Configure Python Interpreter Path for Visual Studio Code; Verify Python Version in Visual Studio Code Terminal; Set Up Wheel Directory Folders and Files; Create Setup File; Create Readme File; Create License File; Create Init File; Create Package Function File; Install Python Wheel Packages; Install Wheel Package; Install Check Wheel Package; Create and Verify Wheel File; Create Wheel File; Check Wheel Contents; Verify Wheel File; Configure Databricks Environment; Install Wheel to Databricks Library; Create Databricks Notebook; Mount Data Lake Folder; Create Spark Database; Verify Wheel Package; Import Wheel Package; Create Function Parameters; Run Wheel Package Function; Show Spark Tables; Files in Databricks Repos; Continuous Integration and Deployment; Summary)
- Chapter 19: Security and Controls (Implement Cluster, Pool, and Jobs Access Control; Implement Workspace Access Control; Implement Other Access and Visibility Controls; Table Access Control; Personal Access Tokens; Visibility Controls; Example Row-Level Security Implementation; Create New User Groups; Load Sample Data; Create Delta Tables; Run Queries Using Row-Level Security; Create Row-Level Secured Views and Grant Selective User Access; Interaction with Azure Active Directory; Summary)
- Index
How to download the source code?
From GitHub:
1. Go to https://github.com/Apress.
2. In the "Find a repository…" box, search for the book title: The Azure Data Lakehouse Toolkit: Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.
From the download links on this page:
1. Disable any ad-blocking plugin; otherwise the links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to complete the download.