Azure Data Engineer Associate Certification Guide: A hands-on reference guide to developing your data engineering skills and preparing for the DP-203 exam
- Length: 574 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-02-28
- ISBN-10: 1801816069
- ISBN-13: 9781801816069
- Sales Rank: #152174
Become well-versed with data engineering concepts and exam objectives to achieve Azure Data Engineer Associate certification
Key Features
- Understand and apply data engineering concepts to real-world problems and prepare for the DP-203 certification exam
- Explore the various Azure services for building end-to-end data solutions
- Gain a solid understanding of building secure and sustainable data solutions using Azure services
Book Description
Azure is one of the leading cloud providers in the world, offering numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with aspiring and experienced data engineers competing to stand out.
Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure Data Engineer. This book will help you prepare for the DP-203 examination in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. The book starts by covering the fundamentals of Azure, and then takes the example of a hypothetical company and walks you through the various stages of building data engineering solutions. Throughout the chapters, you’ll learn about the various Azure components involved in building the data systems and will explore them using a wide range of real-world use cases. Finally, you’ll work on sample questions and answers to familiarize yourself with the pattern of the exam.
By the end of this Azure book, you’ll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.
What you will learn
- Gain intermediate-level knowledge of the Azure data infrastructure
- Design and implement data lake solutions with batch and stream pipelines
- Identify the partition strategies available in Azure storage technologies
- Implement different table geometries in Azure Synapse Analytics
- Use the transformations available in T-SQL, Spark, and Azure Data Factory
- Use Azure Databricks or Synapse Spark to process data using Notebooks
- Design security using RBAC, ACL, encryption, data masking, and more
- Monitor and optimize data pipelines with debugging tips
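To give a flavor of one topic from the list above, here is a toy Python sketch contrasting two of the Synapse Analytics table distribution strategies the book covers (round-robin vs. hash). A dedicated SQL pool really does spread rows across 60 distributions, but the hashing function below is purely illustrative, not the engine's internal hash, and this is not code from the book.

```python
# Toy model of how a Synapse dedicated SQL pool might assign rows
# to distributions. Illustrative only; the real engine uses its own
# internal hash function.
import hashlib
from itertools import count

NUM_DISTRIBUTIONS = 60  # fixed distribution count in dedicated SQL pools

def hash_distribution(key: str) -> int:
    """Rows with the same key always land in the same distribution,
    which keeps joins and aggregations on that key local."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_DISTRIBUTIONS

_rr = count()
def round_robin_distribution() -> int:
    """Rows are spread evenly regardless of content; useful for
    staging tables with no dominant join key."""
    return next(_rr) % NUM_DISTRIBUTIONS

rows = [("cust_1", 10), ("cust_2", 25), ("cust_1", 5)]
hash_targets = [hash_distribution(k) for k, _ in rows]
rr_targets = [round_robin_distribution() for _ in rows]

# Same key -> same distribution under hashing:
assert hash_targets[0] == hash_targets[2]
# Round-robin simply cycles through distributions:
assert rr_targets == [0, 1, 2]
```

The trade-off this sketch illustrates is the one the book's distribution chapters work through: hash distribution co-locates rows that join on the same key, while round-robin maximizes load balance.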
Who this book is for
This book is for data engineers who want to take the DP-203: Azure Data Engineer Associate exam and are looking to gain in-depth knowledge of the Azure cloud stack.
The book will also help engineers and product managers who are new to Azure or interviewing with companies working on Azure technologies, to get hands-on experience of Azure data technologies. A basic understanding of cloud technologies, extract, transform, and load (ETL), and databases will help you get the most out of this book.
Table of Contents

Front matter: Contributors (About the author, About the reviewers); Preface (Who this book is for, What this book covers, Download the example code files, Download the color images, Get in touch, Reviews, Share Your Thoughts)

Part 1: Azure Basics
- Chapter 1: Introducing Azure Basics
  Technical requirements; Introducing the Azure portal; Exploring Azure accounts, subscriptions, and resource groups; Establishing a use case; Introducing Azure services: IaaS, PaaS, SaaS, FaaS; Exploring Azure VMs: creating a VM using the Azure portal and the Azure CLI; Exploring Azure Storage: Blob storage, Data Lake Gen2, Files, Queues, Tables, Managed disks; Exploring Azure Networking (VNet); Exploring Azure Compute: VM Scale Sets, App Service, Kubernetes Service, Functions, Service Fabric, Batch; Summary

Part 2: Data Storage
- Chapter 2: Designing a Data Storage Structure
  Technical requirements; Designing an Azure data lake: how a data lake differs from a data warehouse, when to use a data lake, data lake zones, data lake architecture, Azure technologies for building a data lake; Selecting the right file types for storage: Avro, Parquet, ORC, comparing the three, choosing file types for analytical queries; Designing storage for efficient querying: storage, application, and query layers; Designing storage for data pruning: dedicated SQL pool and Spark examples; Designing folder structures for data transformation: streaming/IoT and batch scenarios; Designing a distribution strategy: round-robin, hash, and replicated tables; Designing a data archiving solution: hot, cold, and archive access tiers, data life cycle management; Summary
- Chapter 3: Designing a Partition Strategy
  Understanding the basics and benefits of partitioning; Designing a partition strategy for files: Azure Blob storage, ADLS Gen2; Designing a partition strategy for analytical workloads: horizontal, vertical, and functional partitioning; Designing a partition strategy for efficiency/performance: the iterative query performance improvement process; Designing a partition strategy for Azure Synapse Analytics: improving load and filter performance; Identifying when partitioning is needed in ADLS Gen2; Summary
- Chapter 4: Designing the Serving Layer
  Technical requirements; Learning the basics of data modeling and schemas: dimensional models; Designing star and snowflake schemas; Designing SCDs: SCD1 through SCD7; Designing a solution for temporal data; Designing a dimensional hierarchy; Designing for incremental loading: watermarks, file timestamps, file partitions and folder structures; Designing analytical stores: security and scalability considerations; Designing metastores in Azure Synapse Analytics and Azure Databricks (including Azure Synapse Spark); Summary
- Chapter 5: Implementing Physical Data Storage Structures
  Technical requirements; Getting started with Azure Synapse Analytics; Implementing compression: using Synapse Pipelines/ADF and Spark; Implementing partitioning: using ADF/Synapse pipelines, partitioning for analytical workloads; Implementing horizontal partitioning or sharding: Synapse dedicated pools, Spark; Implementing distributions: hash, round-robin, replicated; Implementing different table geometries with Azure Synapse Analytics pools: clustered columnstore, heap, and clustered indexing; Implementing data redundancy: Azure storage redundancy in primary and secondary regions, Azure SQL geo-replication, Azure Synapse SQL data replication, Cosmos DB data replication, an Azure Storage redundancy example; Implementing data archiving; Summary
- Chapter 6: Implementing Logical Data Structures
  Technical requirements; Building a temporal data solution; Building a slowly changing dimension: updating new and modified rows; Building a logical folder structure; Implementing file and folder structures for efficient querying and data pruning: deleting old and adding new partitions; Building external tables; Summary
- Chapter 7: Implementing the Serving Layer
  Technical requirements; Delivering data in a relational star schema; Implementing a dimensional hierarchy: Synapse SQL serverless, Synapse Spark, Azure Databricks; Maintaining metadata: using Synapse SQL and Spark pools, using Azure Databricks; Summary

Part 3: Design and Develop Data Processing (25-30%)
- Chapter 8: Ingesting and Transforming Data
  Technical requirements; Transforming data by using Apache Spark: RDDs and DataFrames; Transforming data by using T-SQL; Transforming data by using ADF: schema, row, and multi-I/O transformations, ADF templates; Transforming data by using Azure Synapse pipelines; Transforming data by using Stream Analytics; Cleansing data: handling missing/null values, trimming inputs, standardizing values, handling outliers, removing duplicates; Splitting data: file splits; Shredding JSON: extracting values using Spark, SQL, and ADF; Encoding and decoding data: using SQL, Spark, and ADF; Configuring error handling for the transformation; Normalizing and denormalizing values: Pivot and Unpivot; Transforming data by using Scala; Performing Exploratory Data Analysis (EDA): using Spark, SQL, and ADF; Summary
- Chapter 9: Designing and Developing a Batch Processing Solution
  Technical requirements; Designing a batch processing solution; Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks: storage, data ingestion, data preparation/cleansing, transformation, using PolyBase to ingest data into the analytics data store, using Power BI to display insights; Creating data pipelines; Integrating Jupyter/Python notebooks into a data pipeline; Designing and implementing incremental data loads; Designing and developing slowly changing dimensions; Handling duplicate data; Handling missing data; Handling late-arriving data: in the ingestion/transformation stage and in the serving stage; Upserting data; Regressing to a previous state; Introducing Azure Batch: running a sample Azure Batch job; Configuring the batch size; Scaling resources: Azure Batch, Azure Databricks, Synapse Spark, Synapse SQL; Configuring batch retention; Designing and configuring exception handling: types of errors, remedial actions; Handling security and compliance requirements: the Azure Security Benchmark, best practices for Azure Batch; Summary
- Chapter 10: Designing and Developing a Stream Processing Solution
  Technical requirements; Designing a stream processing solution: introducing Azure Event Hubs, ASA, and Spark Streaming; Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs: Event Hubs with ASA, Event Hubs with Spark Streaming; Processing data using Spark Structured Streaming; Monitoring for performance and functional regressions: Event Hubs, ASA, Spark Streaming; Processing time series data: types of timestamps, windowed aggregates, checkpointing or watermarking, replaying data from a previous timestamp; Designing and creating windowed aggregates: tumbling, hopping, sliding, session, and snapshot windows; Configuring checkpoints/watermarking during processing: ASA, Event Hubs, Spark; Replaying archived stream data; Transformations using streaming analytics: COUNT and DISTINCT, CAST, LIKE; Handling schema drifts: Event Hubs, Spark; Processing across partitions: what partitions are, processing across and within partitions; Scaling resources: Event Hubs, ASA, Azure Databricks Spark Streaming; Handling interruptions: Event Hubs, ASA; Designing and configuring exception handling; Upserting data; Designing and creating tests for data pipelines; Optimizing pipelines for analytical or transactional purposes; Summary
- Chapter 11: Managing Batches and Pipelines
  Technical requirements; Triggering batches; Handling failed batch loads: pool, node, job, and task errors; Validating batch loads; Scheduling and managing data pipelines in Data Factory/Synapse pipelines: integration runtimes, ADF monitoring; Managing Spark jobs in a pipeline; Implementing version control for pipeline artifacts: configuring source control in ADF, integrating with Azure DevOps and GitHub; Summary

Part 4: Design and Implement Data Security (10-15%)
- Chapter 12: Designing Security for Data Policies and Standards
  Technical requirements; Introducing the security and privacy requirements; Designing and implementing data encryption for data at rest and in transit; Designing and implementing a data auditing strategy: storage and SQL auditing; Designing and implementing a data masking strategy; Designing and implementing Azure role-based access control and a POSIX-like access control list for Data Lake Storage Gen2: restricting access using Azure RBAC and ACLs; Designing and implementing row-level and column-level security; Designing and implementing a data retention policy: purging data based on business requirements in Azure Data Lake Storage Gen2 and Azure Synapse SQL; Managing identities, keys, and secrets across different data platform technologies: Azure Active Directory, Azure Key Vault, access keys and shared access keys in Azure Storage; Implementing secure endpoints (private and public); Implementing resource tokens in Azure Databricks: loading a DataFrame with sensitive information, writing encrypted data to tables or Parquet files; Designing for data privacy and managing sensitive information: Microsoft Defender; Summary

Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
- Chapter 13: Monitoring Data Storage and Data Processing
  Technical requirements; Implementing logging used by Azure Monitor: configuring monitoring services, custom logging options; Interpreting Azure Monitor metrics and logs; Measuring the performance of data movement; Monitoring data pipeline performance; Monitoring and updating statistics about data across a system: creating and updating statistics for Synapse dedicated and serverless pools; Measuring query performance: Synapse SQL pool performance, Spark query performance, interpreting a Spark DAG; Monitoring cluster performance: overall and per-node performance, YARN queue/scheduler performance, storage throttling; Scheduling and monitoring pipeline tests; Summary
- Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing
  Technical requirements; Compacting small files; Rewriting user-defined functions (UDFs): Synapse SQL pool, Spark, Stream Analytics; Handling skews in data: fixing skews at the storage and compute levels; Handling data spills: identifying spills in Synapse SQL and Spark; Tuning shuffle partitions: finding shuffling in a pipeline, identifying shuffles in SQL and Spark query plans; Optimizing resource management: Synapse SQL pools, Spark; Tuning queries by using indexers: indexing in Synapse SQL, indexing in the Synapse Spark pool using Hyperspace; Tuning queries by using cache; Optimizing pipelines for analytical or transactional purposes: OLTP and OLAP systems, implementing HTAP using Synapse Link and Cosmos DB; Optimizing pipelines for descriptive versus analytical workloads: common and specific optimizations; Troubleshooting a failed Spark job: debugging environmental and job issues; Troubleshooting a failed pipeline run; Summary

Part 6: Practice Exercises
- Chapter 15: Sample Questions with Solutions
  Exploring the question formats: case study-based (data lake), scenario-based (shared access signature), direct (ADF transformation), ordering sequence (ASA setup steps), code segment (column security); Sample questions from the Design and Implement Data Storage section: case study (data lake), data visualization, data partitioning, Synapse SQL pool table design (1 and 2), slowly changing dimensions, storage tiers, disaster recovery, Synapse SQL external tables; Sample questions from the Design and Develop Data Processing section: data lake design, ASA windows, Spark transformation, ADF integration runtimes, ADF triggers; Sample questions from the Design and Implement Data Security section: TDE/Always Encrypted, auditing Azure SQL/Synapse SQL, dynamic data masking, RBAC and POSIX, row-level security; Sample questions from the Monitor and Optimize Data Storage and Data Processing section: Blob storage monitoring, T-SQL optimization, ADF monitoring, setting up alerts in ASA; Summary

Back matter: Why subscribe?; Other Books You May Enjoy; Packt is searching for authors like you; Share Your Thoughts
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Azure Data Engineer Associate Certification Guide. If no results appear, search for the main title only.
3. Click the book title in the search results.
4. On the repository page, click Code, then Download ZIP.