
Data Quality Fundamentals: A Practitioner’s Guide to Building Trustworthy Data Pipelines
- Length: 308 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-10-18
- ISBN-10: 1098112040
- ISBN-13: 9781098112042
- Sales Rank: #1014973
Do your product dashboards look funky? Are your quarterly reports stale? Is the dataset you’re using broken or just plain wrong? These problems affect almost every team, yet they’re usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to any of the questions above, this book is for you.
Many data engineering teams today face the “good pipelines, bad data” problem. It doesn’t matter how advanced your data infrastructure is if the data you’re piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck from the data reliability company Monte Carlo explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world’s most innovative companies.
- Build more trustworthy and reliable data pipelines
- Write scripts to run data checks and identify broken pipelines with data observability
- Program your own data quality monitors from scratch (see the illustrative sketch after this list)
- Develop and lead data quality initiatives at your company
- Generate a dashboard to highlight your company’s key data assets
- Automate data lineage graphs across your data ecosystem
- Build anomaly detectors for your critical data assets
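To give a concrete flavor of what a simple data quality monitor can look like, here is a minimal Python sketch of a freshness check in the spirit of the book's monitoring chapters. It is an illustrative assumption, not code from the book: the orders table, updated_at column, SQLite connection, and 24-hour threshold are all hypothetical placeholders.

```python
# Minimal, hypothetical freshness monitor (illustrative sketch, not from the book).
# It flags a table as stale when its newest record is older than an agreed threshold.
import sqlite3
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=24)  # assumed SLO; tune per data set

def is_fresh(conn: sqlite3.Connection, table: str, ts_column: str) -> bool:
    """Return True if the table's most recent timestamp is within the threshold."""
    # Table/column names are interpolated for brevity; validate them in real use.
    row = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if row[0] is None:
        return False  # an empty table counts as a freshness incident
    latest = datetime.fromisoformat(row[0])
    if latest.tzinfo is None:
        latest = latest.replace(tzinfo=timezone.utc)  # assume UTC if stored naive
    return datetime.now(timezone.utc) - latest <= FRESHNESS_THRESHOLD

if __name__ == "__main__":
    # Demo against an in-memory SQLite database standing in for a warehouse table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
    stale_ts = (datetime.now(timezone.utc) - timedelta(hours=30)).isoformat()
    conn.execute("INSERT INTO orders VALUES (1, ?)", (stale_ts,))
    if not is_fresh(conn, "orders", "updated_at"):
        print("ALERT: orders looks stale; investigate the upstream pipeline.")
```

The book's own walkthroughs target warehouse metadata and query logs (for example, the Snowflake freshness and volume steps in Chapter 2) rather than SQLite, but the same "compare latest timestamp against an SLO" logic carries over.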
Table of Contents

Preface
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments

1. Why Data Quality Deserves Attention—Now
- What Is Data Quality?
- Framing the Current Moment
- Understanding the “Rise of Data Downtime”
- Migration to the cloud
- More data sources
- Increasingly complex data pipelines
- More specialized data teams
- Decentralized data teams
- Other Industry Trends Contributing to the Current Moment
- Data mesh
- Streaming data
- Rise of the data lakehouse
- Summary

2. Assembling the Building Blocks of a Reliable Data System
- Understanding the Difference Between Operational and Analytical Data
- What Makes Them Different?
- Data Warehouses Versus Data Lakes
- Data Warehouses: Table Types at the Schema Level
- Data Lakes: Manipulations at the File Level
- What About the Data Lakehouse?
- Syncing Data Between Warehouses and Lakes
- Collecting Data Quality Metrics
- What Are Data Quality Metrics?
- How to Pull Data Quality Metrics
- Scalability
- Monitoring across other parts of your stack
- Example: Pulling data quality metrics from Snowflake
- Step 1: Map your inventory
- Step 2: Monitor for data freshness and volume
- Step 3: Build your query history
- Step 4: Health check
- Using Query Logs to Understand Data Quality in the Warehouse
- Using Query Logs to Understand Data Quality in the Lake
- Designing a Data Catalog
- Building a Data Catalog
- Summary

3. Collecting, Cleaning, Transforming, and Testing Data
- Collecting Data
- Application Log Data
- API Responses
- Sensor Data
- Cleaning Data
- Batch Versus Stream Processing
- Data Quality for Stream Processing
- AWS Kinesis
- Apache Kafka
- Normalizing Data
- Handling Heterogeneous Data Sources
- Warehouse data versus lake data: heterogeneity edition
- Schema Checking and Type Coercion
- Syntactic Versus Semantic Ambiguity in Data
- Managing Operational Data Transformations Across AWS Kinesis and Apache Kafka
- AWS Kinesis
- Apache Kafka
- Running Analytical Data Transformations
- Ensuring Data Quality During ETL
- Ensuring Data Quality During Transformation
- Alerting and Testing
- dbt Unit Testing
- Great Expectations Unit Testing
- Deequ Unit Testing
- Managing Data Quality with Apache Airflow
- Scheduler SLAs
- Installing Circuit Breakers with Apache Airflow
- SQL Check Operators
- Summary

4. Monitoring and Anomaly Detection for Your Data Pipelines
- Knowing Your Known Unknowns and Unknown Unknowns
- Building an Anomaly Detection Algorithm
- Monitoring for Freshness
- Understanding Distribution
- Building Monitors for Schema and Lineage
- Anomaly Detection for Schema Changes and Lineage
- Visualizing Lineage
- Investigating a Data Anomaly
- Scaling Anomaly Detection with Python and Machine Learning
- Improving Data Monitoring Alerting with Machine Learning
- Accounting for False Positives and False Negatives
- Improving Precision and Recall
- Detecting Freshness Incidents with Data Monitoring
- F-Scores
- Does Model Accuracy Matter?
- Beyond the Surface: Other Useful Anomaly Detection Approaches
- Designing Data Quality Monitors for Warehouses Versus Lakes
- Summary

5. Architecting for Data Reliability
- Measuring and Maintaining High Data Reliability at Ingestion
- Measuring and Maintaining Data Quality in the Pipeline
- Understanding Data Quality Downstream
- Building Your Data Platform
- Data Ingestion
- Data Storage and Processing
- Data Transformation and Modeling
- Business Intelligence and Analytics
- Data Discovery and Governance
- Developing Trust in Your Data
- Data Observability
- Measuring the ROI on Data Quality
- Calculating the cost of data downtime
- Updating your data downtime cost to reflect external factors
- How to Set SLAs, SLOs, and SLIs for Your Data
- Step 1: Defining data reliability with SLAs
- Step 2: Measuring data reliability with SLIs
- Step 3: Tracking data reliability with SLOs
- Case Study: Blinkist
- Summary

6. Fixing Data Quality Issues at Scale
- Fixing Quality Issues in Software Development
- Data Incident Management
- Incident Detection
- Response
- Root Cause Analysis
- Step 1: Look at your lineage
- Step 2: Look at the code
- Step 3: Look at your data
- Step 4: Look at your operational environment
- Step 5: Leverage your peers
- Resolution
- Blameless Postmortem
- Incident Response and Mitigation
- Establishing a Routine of Incident Management
- Step 1: Route notifications to the appropriate team members
- Step 2: Assess the severity of the incident
- Step 3: Communicate status updates as often as possible
- Step 4: Define and align on data SLOs and SLIs to prevent future incidents and downtime
- Why Data Incident Commanders Matter
- Case Study: Data Incident Management at PagerDuty
- The DataOps Landscape at PagerDuty
- Data Challenges at PagerDuty
- Using DevOps Best Practices to Scale Data Incident Management
- Best practice #1: Ensure your incident management covers the entire data life cycle
- Best practice #2: Incident management should include noise suppression
- Best practice #3: Group data assets and incidents to intelligently route alerts
- Summary

7. Building End-to-End Lineage
- Building End-to-End Field-Level Lineage for Modern Data Systems
- Basic Lineage Requirements
- Data Lineage Design
- Parsing the Data
- Building the User Interface
- Case Study: Architecting for Data Reliability at Fox
- Exercise “Controlled Freedom” When Dealing with Stakeholders
- Invest in a Decentralized Data Team
- Avoid Shiny New Toys in Favor of Problem-Solving Tech
- To Make Analytics Self-Serve, Invest in Data Trust
- Summary

8. Democratizing Data Quality
- Treating Your “Data” Like a Product
- Perspectives on Treating Data Like a Product
- Convoy Case Study: Data as a Service or Output
- Uber Case Study: The Rise of the Data Product Manager
- Applying the Data-as-a-Product Approach
- Gain stakeholder alignment early–and often
- Apply a product management mindset
- Invest in self-serve tooling
- Prioritize data quality and reliability
- Find the right team structure for your data organization
- Building Trust in Your Data Platform
- Align Your Product’s Goals with the Goals of the Business
- Gain Feedback and Buy-in from the Right Stakeholders
- Prioritize Long-Term Growth and Sustainability Versus Short-Term Gains
- Sign Off on Baseline Metrics for Your Data and How You Measure Them
- Know When to Build Versus Buy
- Assigning Ownership for Data Quality
- Chief Data Officer
- Business Intelligence Analyst
- Analytics Engineer
- Data Scientist
- Data Governance Lead
- Data Engineer
- Data Product Manager
- Who Is Responsible for Data Reliability?
- Creating Accountability for Data Quality
- Balancing Data Accessibility with Trust
- Certifying Your Data
- Seven Steps to Implementing a Data Certification Program
- Step 1: Build out your data observability capabilities
- Step 2: Determine your data owners
- Step 3: Understand what “good” data looks like
- Step 4: Set clear SLAs, SLOs, and SLIs for your most important data sets
- Step 5: Develop your communication and incident management processes
- Step 6: Determine a mechanism to tag the data as certified
- Step 7: Train your data team and downstream consumers
- Case Study: Toast’s Journey to Finding the Right Structure for Their Data Team
- In the Beginning: When a Small Team Struggles to Meet Data Demands
- Supporting Hypergrowth as a Decentralized Data Operation
- Regrouping, Recentralizing, and Refocusing on Data Trust
- Considerations When Scaling Your Data Team
- Hire data generalists, not specialists—with one exception
- Prioritize building a diverse data team from day one
- Overcommunication is key to change management
- Don’t overvalue a “single source of truth”
- Increasing Data Literacy
- Prioritizing Data Governance and Compliance
- Prioritizing a Data Catalog
- In-house
- Third-party
- Open source
- Beyond Catalogs: Enforcing Data Governance
- Building a Data Quality Strategy
- Make Leadership Accountable for Data Quality
- Set Data Quality KPIs
- Spearhead a Data Governance Program
- Automate Your Lineage and Data Governance Tooling
- Create a Communications Plan
- Summary

9. Data Quality in the Real World: Conversations and Case Studies
- Building a Data Mesh for Greater Data Quality
- Domain-Oriented Data Owners and Pipelines
- Self-Serve Functionality
- Interoperability and Standardization of Communications
- Why Implement a Data Mesh?
- To Mesh or Not to Mesh? That Is the Question
- Calculating Your Data Mesh Score
- A Conversation with Zhamak Dehghani: The Role of Data Quality Across the Data Mesh
- Can You Build a Data Mesh from a Single Solution?
- Is Data Mesh Another Word for Data Virtualization?
- Does Each Data Product Team Manage Their Own Separate Data Stores?
- Is a Self-Serve Data Platform the Same Thing as a Decentralized Data Mesh?
- Is the Data Mesh Right for All Data Teams?
- Does One Person on Your Team “Own” the Data Mesh?
- Does the Data Mesh Cause Friction Between Data Engineers and Data Analysts?
- Case Study: Kolibri Games’ Data Stack Journey
- First Data Needs
- Pursuing Performance Marketing
- 2018: Professionalize and Centralize
- Getting Data-Oriented
- Getting Data-Driven
- Building a Data Mesh
- Five Key Takeaways from a Five-Year Data Evolution
- Making Metadata Work for the Business
- Unlocking the Value of Metadata with Data Discovery
- Data Warehouse and Lake Considerations
- Data Catalogs Can Drown in a Data Lake—or Even a Data Mesh
- Moving from Traditional Data Catalogs to Modern Data Discovery
- Deciding When to Get Started with Data Quality at Your Company
- You’ve Recently Migrated to the Cloud
- Your Data Stack Is Scaling with More Data Sources, More Tables, and More Complexity
- Your Data Team Is Growing
- Your Team Is Spending at Least 30% of Their Time Firefighting Data Quality Issues
- Your Team Has More Data Consumers Than They Did One Year Ago
- Your Company Is Moving to a Self-Service Analytics Model
- Data Is a Key Part of the Customer Value Proposition
- Data Quality Starts with Trust
- Summary

10. Pioneering the Future of Reliable Data Systems
- Be Proactive, Not Reactive
- Predictions for the Future of Data Quality and Reliability
- Data Warehouses and Lakes Will Merge
- Emergence of New Roles on the Data Team
- Rise of Automation
- More Distributed Environments and the Rise of Data Domains
- So Where Do We Go from Here?

Index
How to download the source code
1. Go to https://www.oreilly.com/.
2. Search for the book title: Data Quality Fundamentals: A Practitioner’s Guide to Building Trustworthy Data Pipelines. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.