Data Engineering with Google Cloud Platform: A practical guide to operationalizing scalable data analytics systems on GCP

Length: 440 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2022-03-31
ISBN-10: 1800561326
ISBN-13: 9781800561328
Sales Rank: #3614454

Build and deploy your own data pipelines on GCP, make key architectural decisions, and gain the confidence to boost your career as a data engineer

Key Features

Understand data engineering concepts, the role of a data engineer, and the benefits of using GCP for building your solution
Learn how to use the various GCP products to ingest, consume, and transform data and orchestrate pipelines
Discover tips to prepare for and pass the Professional Data Engineer exam

Book Description

With this book, you’ll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines, from storing and processing data and orchestrating workflows to presenting data through visualization dashboards.

Starting with a quick overview of the fundamental concepts of data engineering, you’ll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you’ll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You’ll learn how to design IAM for data governance, deploy ML pipelines with Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you’ll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP.

By the end of this data engineering book, you’ll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

What you will learn

Load data into BigQuery and materialize its output for downstream consumption
Build data pipeline orchestration using Cloud Composer
Develop Airflow jobs to orchestrate and automate a data warehouse
Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
Leverage Pub/Sub for messaging and ingestion for event-driven systems
Use Dataflow to perform ETL on streaming data
Unlock the power of your data with Data Studio
Estimate the GCP cost of your end-to-end data solutions

Who this book is for

This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You’ll find this book useful if you are preparing to take Google’s Professional Data Engineer exam. A beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing in general will help you make the most out of this book.

Data Engineering with Google Cloud Platform
Contributors
About the author
About the reviewer
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
    Download the example code files
    Download the color images
    Conventions used
    Get in touch
    Share Your Thoughts
Section 1: Getting Started with Data Engineering with GCP
Chapter 1: Fundamentals of Data Engineering
    Understanding the data life cycle
        Understanding the need for a data warehouse
    Knowing the roles of a data engineer before starting
        Data engineer versus data scientist
        The focus of data engineers
    Foundational concepts for data engineering
        ETL concept in data engineering 
        The difference between ETL and ELT
        What is NOT big data?
        A quick look at how big data technologies store data
        A quick look at how to process multiple files using MapReduce
    Summary
    Exercise
    See also
Chapter 2: Big Data Capabilities on GCP
    Technical requirements
    Understanding what the cloud is
        The difference between the cloud and non-cloud era
        The on-demand nature of the cloud 
    Getting started with Google Cloud Platform
        Introduction to the GCP console 
        Practicing pinning services
        Creating your first GCP project
        Using GCP Cloud Shell
    A quick overview of GCP services for data engineering
        Understanding the GCP serverless service
        Service mapping and prioritization
        The concept of quotas on GCP services
        User account versus service account
    Summary
Section 2: Building Solutions with GCP Components
Chapter 3: Building a Data Warehouse in BigQuery
    Technical requirements
    Introduction to Google Cloud Storage and BigQuery
        BigQuery data location
    Introduction to the BigQuery console
        Creating a dataset in BigQuery using the console
        Loading a local CSV file into the BigQuery table
        Using public data in BigQuery
        Data types in BigQuery compared to other databases
        Timestamp data in BigQuery compared to other databases
    Preparing the prerequisites before developing our data warehouse
        Step 1: Access your Cloud shell
        Step 2: Check the current setup using the command line 
        Step 3: The gcloud init command
        Step 4: Download example data from Git
        Step 5: Upload data to GCS from Git
    Practicing developing a data warehouse
        Data warehouse in BigQuery – Requirements for scenario 1
        Steps and planning for handling scenario 1
        Data warehouse in BigQuery – Requirements for scenario 2
        Steps and planning for handling scenario 2
    Summary
    Exercise – Scenario 3
    See also
Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer
    Technical requirements
    Introduction to Cloud Composer
    Understanding the working of Airflow 
        Provisioning Cloud Composer in a GCP project
    Exercise: Build data pipeline orchestration using Cloud Composer
        Level 1 DAG – Creating dummy workflows
        Level 2 DAG – Scheduling a pipeline from Cloud SQL to GCS and BigQuery datasets
        Level 3 DAG – Parameterized variables
        Level 4 DAG – Guaranteeing task idempotency in Cloud Composer
        Level 5 DAG – Handling late data using a sensor
    Summary
Chapter 5: Building a Data Lake Using Dataproc
    Technical requirements
    Introduction to Dataproc
        A brief history of the data lake and Hadoop ecosystem
        A deeper look into Hadoop components
        How much Hadoop-related knowledge do you need on GCP? 
        Introducing the Spark RDD and the DataFrame concept
        Introducing the data lake concept
        Hadoop and Dataproc positioning on GCP
    Exercise – Building a data lake on a Dataproc cluster
        Creating a Dataproc cluster on GCP
        Using Cloud Storage as an underlying Dataproc file system 
    Exercise: Creating and running jobs on a Dataproc cluster
        Preparing log data in GCS and HDFS
        Developing Spark ETL from HDFS to HDFS
        Developing Spark ETL from GCS to GCS
        Developing Spark ETL from GCS to BigQuery
    Understanding the concept of the ephemeral cluster
        Practicing using a workflow template on Dataproc
    Building an ephemeral cluster using Dataproc and Cloud Composer
    Summary 
Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow
    Technical requirements
    Processing streaming data
        Streaming data for data engineers
        Introduction to Pub/Sub
        Introduction to Dataflow
    Exercise – Publishing event streams to cloud Pub/Sub
        Creating a Pub/Sub topic
        Creating and running a Pub/Sub publisher using Python
        Creating a Pub/Sub subscription
    Exercise – Using Cloud Dataflow to stream data from Pub/Sub to GCS
        Creating a HelloWorld application using Apache Beam
        Creating a Dataflow streaming job without aggregation
        Creating a streaming job with aggregation
    Summary
Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio
    Technical requirements
    Unlocking the power of your data with Data Studio
    From data to metrics in minutes with an illustrative use case
        Understanding what BigQuery INFORMATION_SCHEMA is
        Exercise – Exploring the BigQuery INFORMATION_SCHEMA table using Data Studio
        Exercise – Creating a Data Studio report using data from a bike-sharing data warehouse
    Understanding how Data Studio can impact the cost of BigQuery
        What kind of table could be 1 TB in size? 
        How can a table be accessed 10,000 times in a month? 
    How to create materialized views and understanding how BI Engine works
        Understanding BI Engine
    Summary
Chapter 8: Building Machine Learning Solutions on Google Cloud Platform
    Technical requirements
    A quick look at machine learning
    Exercise – practicing ML code using Python
        Preparing the ML dataset by using a table from the BigQuery public dataset
        Training the ML model using Random Forest in Python
        Creating Batch Prediction using the training dataset's output
    The MLOps landscape in GCP
        Understanding the basic principles of MLOps
        Introducing GCP services related to MLOps
    Exercise – leveraging pre-built GCP models as a service 
        Uploading the image to a GCS bucket
        Creating a detect text function in Python
    Exercise – using GCP in AutoML to train an ML model
    Exercise – deploying a dummy workflow with Vertex AI Pipeline
        Creating a dedicated regional GCS bucket
        Developing the pipeline on Python
        Monitoring the pipeline on the Vertex AI Pipeline console
    Exercise – deploying a scikit-learn model pipeline with Vertex AI
        Creating the first pipeline, which will result in an ML model file in GCS
        Running the first pipeline in Vertex AI Pipeline
        Creating the second pipeline, which will use the model file from the prediction results as a CSV file in GCS
        Running the second pipeline in Vertex AI Pipeline
    Summary 
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines
Chapter 9: User and Project Management in GCP 
    Technical requirements 
    Understanding IAM in GCP 
    Planning a GCP project structure
        Understanding the GCP organization, folder, and project hierarchy
        Deciding how many projects we should have in a GCP organization
    Controlling user access to our data warehouse
        Use-case scenario – planning a BigQuery ACL on an e-commerce organization
        Column-level security in BigQuery
    Practicing the concept of IaC using Terraform
        Exercise – creating and running basic Terraform scripts
        Self-exercise – managing a GCP project and resources using Terraform
    Summary
Chapter 10: Cost Strategy in GCP
    Technical requirements
    Estimating the cost of your end-to-end data solution in GCP
        Comparing BigQuery on-demand and flat-rate 
        Example – estimating data engineering use case
    Tips for optimizing BigQuery using partitioned and clustered tables 
        Partitioned tables
        Clustered tables
        Exercise – optimizing BigQuery on-demand cost
    Summary
Chapter 11: CI/CD on Google Cloud Platform for Data Engineers
    Technical requirements 
    Introduction to CI/CD
        Understanding the data engineer's relationship with CI/CD practices
    Understanding CI/CD components with GCP services
    Exercise – implementing continuous integration using Cloud Build
        Creating a GitHub repository using Cloud Source Repository
        Developing the code and Cloud Build scripts
        Creating the Cloud Build Trigger
        Pushing the code to the GitHub repository
    Exercise – deploying Cloud Composer jobs using Cloud Build 
        Preparing the CI/CD environment
        Preparing the cloudbuild.yaml configuration file
        Pushing the DAG to our GitHub repository
        Checking the CI/CD result in the GCS bucket and Cloud Composer
    Summary
    Further reading
Chapter 12: Boosting Your Confidence as a Data Engineer
    Overviewing the Google Cloud certification
        Exam preparation tips
        Extra GCP services material
    Quiz – reviewing all the concepts you've learned about
        Questions
        Answers
    The past, present, and future of Data Engineering
    Boosting your confidence and final thoughts
    Summary
    Why subscribe?
Other Books You May Enjoy
    Packt is searching for authors like you
    Share Your Thoughts

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Data Engineering with Google Cloud Platform: A practical guide to operationalizing scalable data analytics systems on GCP. If the full title returns no results, search for the main title only.