Azure Data Engineer Associate Certification Guide: A hands-on reference guide to developing your data engineering skills and preparing for the DP-203 exam
- Length: 574 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-02-28
- ISBN-10: 1801816069
- ISBN-13: 9781801816069
- Sales Rank: #152174
Become well-versed with data engineering concepts and exam objectives to achieve Azure Data Engineer Associate certification
Key Features
- Understand and apply data engineering concepts to real-world problems and prepare for the DP-203 certification exam
- Explore the various Azure services for building end-to-end data solutions
- Gain a solid understanding of building secure and sustainable data solutions using Azure services
Book Description
Azure is one of the leading cloud providers in the world, offering numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with aspiring and experienced data engineers competing to stand out.
Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure Data Engineer. This book will help you prepare for the DP-203 examination in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. The book starts by covering the fundamentals of Azure, and then takes the example of a hypothetical company and walks you through the various stages of building data engineering solutions. Throughout the chapters, you’ll learn about the various Azure components involved in building the data systems and will explore them using a wide range of real-world use cases. Finally, you’ll work on sample questions and answers to familiarize yourself with the pattern of the exam.
By the end of this Azure book, you’ll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.
What you will learn
- Gain intermediate-level knowledge of the Azure data infrastructure
- Design and implement data lake solutions with batch and stream pipelines
- Identify the partition strategies available in Azure storage technologies
- Implement different table geometries in Azure Synapse Analytics
- Use the transformations available in T-SQL, Spark, and Azure Data Factory
- Use Azure Databricks or Synapse Spark to process data using Notebooks
- Design security using RBAC, ACL, encryption, data masking, and more
- Monitor and optimize data pipelines with debugging tips
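To give a flavor of one topic from the list above, here is a toy Python sketch contrasting two of the Synapse Analytics table distribution strategies the book covers (round-robin vs. hash). A dedicated SQL pool really does spread rows across 60 distributions, but the hashing function below is purely illustrative, not the engine's internal hash, and this is not code from the book.

```python
# Toy model of how a Synapse dedicated SQL pool might assign rows
# to distributions. Illustrative only; the real engine uses its own
# internal hash function.
import hashlib
from itertools import count

NUM_DISTRIBUTIONS = 60  # fixed distribution count in dedicated SQL pools

def hash_distribution(key: str) -> int:
    """Rows with the same key always land in the same distribution,
    which keeps joins and aggregations on that key local."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_DISTRIBUTIONS

_rr = count()
def round_robin_distribution() -> int:
    """Rows are spread evenly regardless of content; useful for
    staging tables with no dominant join key."""
    return next(_rr) % NUM_DISTRIBUTIONS

rows = [("cust_1", 10), ("cust_2", 25), ("cust_1", 5)]
hash_targets = [hash_distribution(k) for k, _ in rows]
rr_targets = [round_robin_distribution() for _ in rows]

# Same key -> same distribution under hashing:
assert hash_targets[0] == hash_targets[2]
# Round-robin simply cycles through distributions:
assert rr_targets == [0, 1, 2]
```

The trade-off this sketch illustrates is the one the book's distribution chapters work through: hash distribution co-locates rows that join on the same key, while round-robin maximizes load balance.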
Who this book is for
This book is for data engineers who want to take the DP-203: Azure Data Engineer Associate exam and are looking to gain in-depth knowledge of the Azure cloud stack.
The book will also help engineers and product managers who are new to Azure or interviewing with companies working on Azure technologies, to get hands-on experience of Azure data technologies. A basic understanding of cloud technologies, extract, transform, and load (ETL), and databases will help you get the most out of this book.
Table of Contents

Front matter: Contributors (About the author, About the reviewers); Preface (Who this book is for, What this book covers, Download the example code files, Download the color images, Get in touch, Reviews, Share Your Thoughts)

Part 1: Azure Basics
- Chapter 1: Introducing Azure Basics
  Technical requirements; Introducing the Azure portal; Exploring Azure accounts, subscriptions, and resource groups; Establishing a use case; Introducing Azure services: IaaS, PaaS, SaaS, FaaS; Exploring Azure VMs: creating a VM using the Azure portal and the Azure CLI; Exploring Azure Storage: Blob storage, Data Lake Gen2, Files, Queues, Tables, Managed disks; Exploring Azure Networking (VNet); Exploring Azure Compute: VM Scale Sets, App Service, Kubernetes Service, Functions, Service Fabric, Batch; Summary

Part 2: Data Storage
- Chapter 2: Designing a Data Storage Structure
  Technical requirements; Designing an Azure data lake: how a data lake differs from a data warehouse, when to use a data lake, data lake zones, data lake architecture, Azure technologies for building a data lake; Selecting the right file types for storage: Avro, Parquet, ORC, comparing the three, choosing file types for analytical queries; Designing storage for efficient querying: storage, application, and query layers; Designing storage for data pruning: dedicated SQL pool and Spark examples; Designing folder structures for data transformation: streaming/IoT and batch scenarios; Designing a distribution strategy: round-robin, hash, and replicated tables; Designing a data archiving solution: hot, cold, and archive access tiers, data life cycle management; Summary
- Chapter 3: Designing a Partition Strategy
  Understanding the basics and benefits of partitioning; Designing a partition strategy for files: Azure Blob storage, ADLS Gen2; Designing a partition strategy for analytical workloads: horizontal, vertical, and functional partitioning; Designing a partition strategy for efficiency/performance: the iterative query performance improvement process; Designing a partition strategy for Azure Synapse Analytics: improving load and filter performance; Identifying when partitioning is needed in ADLS Gen2; Summary
- Chapter 4: Designing the Serving Layer
  Technical requirements; Learning the basics of data modeling and schemas: dimensional models; Designing star and snowflake schemas; Designing SCDs: SCD1 through SCD7; Designing a solution for temporal data; Designing a dimensional hierarchy; Designing for incremental loading: watermarks, file timestamps, file partitions and folder structures; Designing analytical stores: security and scalability considerations; Designing metastores in Azure Synapse Analytics and Azure Databricks (including Azure Synapse Spark); Summary
- Chapter 5: Implementing Physical Data Storage Structures
  Technical requirements; Getting started with Azure Synapse Analytics; Implementing compression: using Synapse Pipelines/ADF and Spark; Implementing partitioning: using ADF/Synapse pipelines, partitioning for analytical workloads; Implementing horizontal partitioning or sharding: Synapse dedicated pools, Spark; Implementing distributions: hash, round-robin, replicated; Implementing different table geometries with Azure Synapse Analytics pools: clustered columnstore, heap, and clustered indexing; Implementing data redundancy: Azure storage redundancy in primary and secondary regions, Azure SQL geo-replication, Azure Synapse SQL data replication, Cosmos DB data replication, an Azure Storage redundancy example; Implementing data archiving; Summary
- Chapter 6: Implementing Logical Data Structures
  Technical requirements; Building a temporal data solution; Building a slowly changing dimension: updating new and modified rows; Building a logical folder structure; Implementing file and folder structures for efficient querying and data pruning: deleting old and adding new partitions; Building external tables; Summary
- Chapter 7: Implementing the Serving Layer
  Technical requirements; Delivering data in a relational star schema; Implementing a dimensional hierarchy: Synapse SQL serverless, Synapse Spark, Azure Databricks; Maintaining metadata: using Synapse SQL and Spark pools, using Azure Databricks; Summary

Part 3: Design and Develop Data Processing (25-30%)
- Chapter 8: Ingesting and Transforming Data
  Technical requirements; Transforming data by using Apache Spark: RDDs and DataFrames; Transforming data by using T-SQL; Transforming data by using ADF: schema, row, and multi-I/O transformations, ADF templates; Transforming data by using Azure Synapse pipelines; Transforming data by using Stream Analytics; Cleansing data: handling missing/null values, trimming inputs, standardizing values, handling outliers, removing duplicates; Splitting data: file splits; Shredding JSON: extracting values using Spark, SQL, and ADF; Encoding and decoding data: using SQL, Spark, and ADF; Configuring error handling for the transformation; Normalizing and denormalizing values: Pivot and Unpivot; Transforming data by using Scala; Performing Exploratory Data Analysis (EDA): using Spark, SQL, and ADF; Summary
- Chapter 9: Designing and Developing a Batch Processing Solution
  Technical requirements; Designing a batch processing solution; Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks: storage, data ingestion, data preparation/cleansing, transformation, using PolyBase to ingest data into the analytics data store, using Power BI to display insights; Creating data pipelines; Integrating Jupyter/Python notebooks into a data pipeline; Designing and implementing incremental data loads; Designing and developing slowly changing dimensions; Handling duplicate data; Handling missing data; Handling late-arriving data: in the ingestion/transformation stage and in the serving stage; Upserting data; Regressing to a previous state; Introducing Azure Batch: running a sample Azure Batch job; Configuring the batch size; Scaling resources: Azure Batch, Azure Databricks, Synapse Spark, Synapse SQL; Configuring batch retention; Designing and configuring exception handling: types of errors, remedial actions; Handling security and compliance requirements: the Azure Security Benchmark, best practices for Azure Batch; Summary
- Chapter 10: Designing and Developing a Stream Processing Solution
  Technical requirements; Designing a stream processing solution: introducing Azure Event Hubs, ASA, and Spark Streaming; Developing a stream processing solution using ASA, Azure Databricks, and Azure Event Hubs: Event Hubs with ASA, Event Hubs with Spark Streaming; Processing data using Spark Structured Streaming; Monitoring for performance and functional regressions: Event Hubs, ASA, Spark Streaming; Processing time series data: types of timestamps, windowed aggregates, checkpointing or watermarking, replaying data from a previous timestamp; Designing and creating windowed aggregates: tumbling, hopping, sliding, session, and snapshot windows; Configuring checkpoints/watermarking during processing: ASA, Event Hubs, Spark; Replaying archived stream data; Transformations using streaming analytics: COUNT and DISTINCT, CAST, LIKE; Handling schema drifts: Event Hubs, Spark; Processing across partitions: what partitions are, processing across and within partitions; Scaling resources: Event Hubs, ASA, Azure Databricks Spark Streaming; Handling interruptions: Event Hubs, ASA; Designing and configuring exception handling; Upserting data; Designing and creating tests for data pipelines; Optimizing pipelines for analytical or transactional purposes; Summary
- Chapter 11: Managing Batches and Pipelines
  Technical requirements; Triggering batches; Handling failed batch loads: pool, node, job, and task errors; Validating batch loads; Scheduling and managing data pipelines in Data Factory/Synapse pipelines: integration runtimes, ADF monitoring; Managing Spark jobs in a pipeline; Implementing version control for pipeline artifacts: configuring source control in ADF, integrating with Azure DevOps and GitHub; Summary

Part 4: Design and Implement Data Security (10-15%)
- Chapter 12: Designing Security for Data Policies and Standards
  Technical requirements; Introducing the security and privacy requirements; Designing and implementing data encryption for data at rest and in transit; Designing and implementing a data auditing strategy: storage and SQL auditing; Designing and implementing a data masking strategy; Designing and implementing Azure role-based access control and a POSIX-like access control list for Data Lake Storage Gen2: restricting access using Azure RBAC and ACLs; Designing and implementing row-level and column-level security; Designing and implementing a data retention policy: purging data based on business requirements in Azure Data Lake Storage Gen2 and Azure Synapse SQL; Managing identities, keys, and secrets across different data platform technologies: Azure Active Directory, Azure Key Vault, access keys and shared access keys in Azure Storage; Implementing secure endpoints (private and public); Implementing resource tokens in Azure Databricks: loading a DataFrame with sensitive information, writing encrypted data to tables or Parquet files; Designing for data privacy and managing sensitive information: Microsoft Defender; Summary

Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
- Chapter 13: Monitoring Data Storage and Data Processing
  Technical requirements; Implementing logging used by Azure Monitor: configuring monitoring services, custom logging options; Interpreting Azure Monitor metrics and logs; Measuring the performance of data movement; Monitoring data pipeline performance; Monitoring and updating statistics about data across a system: creating and updating statistics for Synapse dedicated and serverless pools; Measuring query performance: Synapse SQL pool performance, Spark query performance, interpreting a Spark DAG; Monitoring cluster performance: overall and per-node performance, YARN queue/scheduler performance, storage throttling; Scheduling and monitoring pipeline tests; Summary
- Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing
  Technical requirements; Compacting small files; Rewriting user-defined functions (UDFs): Synapse SQL pool, Spark, Stream Analytics; Handling skews in data: fixing skews at the storage and compute levels; Handling data spills: identifying spills in Synapse SQL and Spark; Tuning shuffle partitions: finding shuffling in a pipeline, identifying shuffles in SQL and Spark query plans; Optimizing resource management: Synapse SQL pools, Spark; Tuning queries by using indexers: indexing in Synapse SQL, indexing in the Synapse Spark pool using Hyperspace; Tuning queries by using cache; Optimizing pipelines for analytical or transactional purposes: OLTP and OLAP systems, implementing HTAP using Synapse Link and Cosmos DB; Optimizing pipelines for descriptive versus analytical workloads: common and specific optimizations; Troubleshooting a failed Spark job: debugging environmental and job issues; Troubleshooting a failed pipeline run; Summary

Part 6: Practice Exercises
- Chapter 15: Sample Questions with Solutions
  Exploring the question formats: case study-based (data lake), scenario-based (shared access signature), direct (ADF transformation), ordering sequence (ASA setup steps), code segment (column security); Sample questions from the Design and Implement Data Storage section: case study (data lake), data visualization, data partitioning, Synapse SQL pool table design (1 and 2), slowly changing dimensions, storage tiers, disaster recovery, Synapse SQL external tables; Sample questions from the Design and Develop Data Processing section: data lake design, ASA windows, Spark transformation, ADF integration runtimes, ADF triggers; Sample questions from the Design and Implement Data Security section: TDE/Always Encrypted, auditing Azure SQL/Synapse SQL, dynamic data masking, RBAC and POSIX, row-level security; Sample questions from the Monitor and Optimize Data Storage and Data Processing section: Blob storage monitoring, T-SQL optimization, ADF monitoring, setting up alerts in ASA; Summary

Back matter: Why subscribe?; Other Books You May Enjoy; Packt is searching for authors like you; Share Your Thoughts
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Azure Data Engineer Associate Certification Guide. If no results appear, search for the main title only.
3. Click the book title in the search results.
4. On the repository page, click Code, then Download ZIP.