Simplifying Data Engineering and Analytics with Delta: Create analytics-ready data that fuels artificial intelligence and business intelligence
- Length: 334 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-07-29
- ISBN-10: 1801814864
- ISBN-13: 9781801814867
- Sales Rank: #498301
Explore how Delta brings reliability, performance, and governance to your data lake and all the AI and BI use cases built on top of it
Key Features
- Learn Delta’s core concepts and features as well as what makes it a perfect match for data engineering and analysis
- Solve business challenges of different industry verticals using a scenario-based approach
- Make optimal choices by understanding the various tradeoffs provided by Delta
Book Description
Delta helps you generate reliable insights at scale and simplifies the architecture around data pipelines, allowing you to focus on refining your use cases rather than on infrastructure. This is especially important because existing architecture is frequently reused for new use cases.
In this book, you’ll learn about the principles of distributed computing, data modeling techniques, and big data design patterns and templates that help solve end-to-end data flow problems for common scenarios and are reusable across use cases and industry verticals. You’ll also learn how to recover from errors and the best practices around handling structured, semi-structured, and unstructured data using Delta. After that, you’ll get to grips with features such as ACID transactions on big data, disciplined schema evolution, time travel to help rewind a dataset to a different time or version, and unified batch and streaming capabilities that will help you build agile and robust data products.
By the end of this Delta book, you’ll be able to use Delta as the foundational block for creating analytics-ready data that fuels all AI/BI use cases.
What you will learn
- Explore the key challenges of traditional data lakes
- Appreciate the unique features of Delta that come out of the box
- Address reliability, performance, and governance concerns using Delta
- Analyze the open data format for an extensible and pluggable architecture
- Handle multiple use cases to support BI, AI, streaming, and data discovery
- Discover how common data and machine learning design patterns are executed on Delta
- Build and deploy data and machine learning pipelines at scale using Delta
Who this book is for
Data engineers, data scientists, ML practitioners, BI analysts, or anyone in the data domain working with big data will be able to put their knowledge to work with this practical guide to executing pipelines and supporting diverse use cases using the Delta protocol. Basic knowledge of SQL, Python programming, and Spark is required to get the most out of this book.
Table of Contents
Foreword
Contributors
- About the author
- About the reviewer
Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Download the color images
- Conventions used
- Get in touch
- Share Your Thoughts
Section 1 – Introduction to Delta Lake and Data Engineering Principles
Chapter 1: Introduction to Data Engineering
- The motivation behind data engineering
- Use cases
- How big is big data?
- But isn't ML and AI all the rage today?
- Understanding the role of data personas
- Big data ecosystem
- What characterizes big data?
- Classifying data
- Reaping value from data
- Top challenges of big data systems
- Evolution of data systems
- Rise of cloud data platforms
- SQL and NoSQL systems
- OLTP and OLAP systems
- Data platform service models
- Distributed computing
- SMP and MPP computing
- Parallel and distributed computing
- Hadoop
- Spark
- Hadoop versus Spark
- Business justification for tech spending
- Strategy for business transformation to use data as an asset
- Big data trends and best practices
- Summary
Chapter 2: Data Modeling and ETL
- Technical requirements
- What is data modeling and why should you care?
- Advantages of a data modeling exercise
- Stages of data modeling
- Data modeling approaches for different data stores
- Relational data modeling
- Non-relational data modeling
- Understanding metadata – data about data
- Data catalog
- Types of metadata
- Why is metadata management the nerve center of data?
- Moving and transforming data using ETL
- Scenarios to consider for building ETL pipelines
- Periodic and continuous ingestion
- Bulk data migration
- Change data capture
- Slowly changing dimensions
- Job orchestration
- How to choose the right data format
- Text format versus binary format
- Row versus column formats
- When to use which format
- Leveraging data compression
- Common big data design patterns
- Ingestion
- Unified API
- Speed layer
- Transformations
- Handling schema changes
- ACID transactions
- Multihop pipeline
- Persist
- Separation of compute from storage
- Multiple destinations
- Denormalization
- In-stream analytics
- Best practices
- Summary
- Further reading
Chapter 3: Delta – The Foundation Block for Big Data
- Technical requirements
- Motivation for Delta
- A case of too many is too little
- Data silos to data swamps
- Characteristics of curated data lakes
- DDL commands
- CREATE
- DML commands
- APPEND
- UPDATE
- DELETE
- MERGE
- Demystifying Delta
- Format layout on disk
- The main features of Delta
- ACID transaction support
- Schema evolution
- Unifying batch and streaming workloads
- Time travel
- Performance
- Data skipping
- Z-Order clustering
- Delta cache
- Life with and without Delta
- Lakehouse
- Characteristics of a Lakehouse
- Summary
Section 2 – End-to-End Process of Building Delta Pipelines
Chapter 4: Unifying Batch and Streaming with Delta
- Technical requirements
- Moving toward real-time systems
- Streaming concepts
- Lambda versus Kappa architectures
- Streaming ETL
- Extract – file-based versus event-based streaming
- Transforming – stream processing
- Loading – persisting the stream
- Handling streaming scenarios
- Joining with other static and dynamic datasets
- Recovering from failures
- Handling late-arriving data
- Stateless and stateful in-stream operations
- Trade-offs in designing streaming architectures
- Cost trade-offs
- Handling latency trade-offs
- Data reprocessing
- Multi-tenancy
- De-duplication
- Streaming best practices
- Summary
Chapter 5: Data Consolidation in Delta Lake
- Technical requirements
- Why consolidate disparate data types?
- Delta unifies all types of data
- Structured data
- Semi-structured data
- Unstructured data
- Avoiding patches of data darkness
- Addressing problems in flight status using Delta
- Augmenting domain knowledge constraints to quality
- Continuous quality monitoring
- Curating data in stages for analytics
- RDD, DataFrames, and datasets
- Spark transformations and actions
- Spark APIs and UDFs
- Ease of extending to existing and new use cases
- Delta Lake connectors
- Specialized Delta Lakes by industry
- Healthcare and life sciences Delta Lake
- Industry 4.0 manufacturing Delta Lake
- Financial services Delta Lake
- Retail Delta Lake
- Data governance
- GDPR and CCPA compliance
- Role-based data access
- Summary
Chapter 6: Solving Common Data Pattern Scenarios with Delta
- Technical requirements
- Understanding use case requirements
- Minimizing data movement with Delta time travel
- Delta cloning
- Handling CDC
- CDC
- Change Data Feed (CDF)
- Handling Slowly Changing Dimensions (SCD)
- SCD Type 1
- SCD Type 2
- Summary
Chapter 7: Delta for Data Warehouse Use Cases
- Technical requirements
- Choosing the right architecture
- Understanding what a data warehouse really solves
- Lacunas of data warehouses
- Discovering when a data lake does not suffice
- Addressing concurrency and latency requirements with Delta
- Visualizing data using BI reporting
- Can cubes be constructed with Delta?
- Analyzing tradeoffs in a push versus pull data flow
- Why is being open such a big deal?
- Considerations around data governance
- The rise of the lakehouse category
- Summary
Chapter 8: Handling Atypical Data Scenarios with Delta
- Technical requirements
- Emphasizing the importance of exploratory data analysis (EDA)
- From big data to good data
- Data profiling
- Statistical analysis
- Applying sampling techniques to address class imbalance
- How to detect and address imbalance
- Synthetic data generation to deal with data imbalance
- Addressing data skew
- Providing data anonymity
- Handling bias and variance in data
- Bias versus variance
- How do we detect bias and variance?
- How do we fix bias and variance?
- Compensating for missing and out-of-range data
- Monitoring data drift
- Summary
Chapter 9: Delta for Reproducible Machine Learning Pipelines
- Technical requirements
- Data science versus machine learning
- Challenges of ML development
- Formalizing the ML development process
- What is a model?
- What is MLOps?
- Aspirations of a modern ML platform
- The role of Delta in an ML pipeline
- Delta-backed feature store
- Delta-backed model training
- Delta-backed model inferencing
- Model monitoring with Delta
- From business problem to insight generation
- Summary
Chapter 10: Delta for Data Products and Services
- Technical requirements
- DaaS
- The need for data democratization
- Delta for unstructured data
- NLP data (text and audio)
- Image and video data
- Data mashups using Delta
- Data blending
- Data harmonization
- Federated query
- Facilitating data sharing with Delta
- Setting up Delta sharing
- Benefits of Delta sharing
- Data clean room
- Summary
Section 3 – Operationalizing and Productionalizing Delta Pipelines
Chapter 11: Operationalizing Data and ML Pipelines
- Technical requirements
- Why operationalize?
- Understanding and monitoring SLAs
- Scaling and high availability
- Planning for DR
- How to decide on the correct DR strategy
- How Delta helps with DR
- Guaranteeing data quality
- Automation of CI/CD pipelines
- Code under version control
- Infrastructure as Code (IaC)
- Unit and integration testing
- Data as code – An intelligent pipeline
- Summary
Chapter 12: Optimizing Cost and Performance with Delta
- Technical requirements
- Improving performance with common strategies
- Where to look and what to look for
- Optimizing with Delta
- Changing the data layout in storage
- Other platform optimizations
- Automation
- Is cost always inversely proportional to performance?
- Best practices for managing performance
- Summary
Chapter 13: Managing Your Data Journey
- Provisioning a multi-tenant infrastructure
- Data democratization via policies and processes
- Capacity planning
- Managing and monitoring
- Data sharing
- Data migration
- COE best practices
- Summary
Why subscribe?
Other Books You May Enjoy
- Packt is searching for authors like you
- Share Your Thoughts
How to download the source code
1. Go to https://github.com/PacktPublishing.
2. In the Find a repository… box, search for the book title: Simplifying Data Engineering and Analytics with Delta: Create analytics-ready data that fuels artificial intelligence and business intelligence. If no results appear, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.