97 Things Every Data Engineer Should Know

Length: 250 pages
Edition: 1
Language: English
Publisher: O'Reilly Media
Publication Date: 2021-07-13
ISBN-10: 1492062413
ISBN-13: 9781492062417
Sales Rank: #2684894

Take advantage of today’s sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.

Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.

Topics include:

Building pipelines
Stream processing
Data privacy and security
Data governance and lineage
Data storage and architecture
The ecosystem of modern tools
Data team makeup and culture
Career advice

Table of contents

Preface
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. A (Book) Case for Eventual Consistency
    Denise Koessler Gosnell, PhD
2. A/B and How to Be
    Sonia Mehta
3. About the Storage Layer
    Julien Le Dem
4. Analytics as the Secret Glue for Microservice Architectures
    Elias Nema
5. Automate Your Infrastructure
    Christiano Anderson
6. Automate Your Pipeline Tests
    Tom White
        Build an End-to-End Test of the Whole Pipeline
        Use a Small Amount of Representative Data
        Prefer Textual Data Formats over Binary
        Ensure That Tests Can Be Run Locally
        Make Tests Deterministic
        Make It Easy to Add More Tests
7. Be Intentional About the Batching Model in Your Data Pipelines
    Raghotham Murthy
        Data Time Window Batching Model
        Arrival Time Window Batching Model
        ATW and DTW Batching in the Same Pipeline
8. Beware of Silver-Bullet Syndrome
    Thomas Nield
9. Building a Career as a Data Engineer
    Vijay Kiran
10. Business Dashboards for Data Pipelines
    Valliappa (Lak) Lakshmanan
11. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
    Shweta Katre
12. Change Data Capture
    Raghotham Murthy
13. Column Names as Contracts
    Emily Riederer
14. Consensual, Privacy-Aware Data Collection
    Katharine Jarmul
        Attach Consent Metadata
        Track Data Provenance
        Drop or Encrypt Sensitive Fields
15. Cultivate Good Working Relationships with Data Consumers
    Ido Shlomo
        Don’t Let Consumers Solve Engineering Problems
        Adapt Your Expectations
        Understand Consumers’ Jobs
16. Data Engineering != Spark
    Jesse Anderson
        Batch and Real-Time Systems
        Computation Component
        Storage Component
        NoSQL Databases
        Messaging Component
17. Data Engineering for Autonomy and Rapid Innovation
    Jeff Magnusson
        Implement Reusable Patterns in the ETL Framework
        Choose a Framework and Tool Set Accessible Within the Organization
        Move the Logic to the Edges of the Pipelines
        Create and Support Staging Tables
        Bake Data-Flow Logic into Tooling and Infrastructure
18. Data Engineering from a Data Scientist’s Perspective
    Bill Franks
        Database Administration, ETL, and Such
        Why the Need for Data Engineers?
        What’s the Future?
19. Data Pipeline Design Patterns for Reusability and Extensibility
    Mukul Sood
20. Data Quality for Data Engineers
    Katharine Jarmul
21. Data Security for Data Engineers
    Katharine Jarmul
        Learn About Security
        Monitor, Log, and Test Access
        Encrypt Data
        Automate Security Tests
        Ask for Help
22. Data Validation Is More Than Summary Statistics
    Emily Riederer
23. Data Warehouses Are the Past, Present, and Future
    James Densmore
24. Defining and Managing Messages in Log-Centric Architectures
    Boris Lublinsky
25. Demystify the Source and Illuminate the Data Pipeline
    Meghan Kwartler
26. Develop Communities, Not Just Code
    Emily Riederer
27. Effective Data Engineering in the Cloud World
    Dipti Borkar
        Disaggregated Data Stack
        Orchestrate, Orchestrate, Orchestrate
        Copying Data Creates Problems
        S3 Compatibility
        SQL and Structured Data Are Still In
28. Embrace the Data Lake Architecture
    Vinoth Chandar
        Common Pitfalls
        Data Lakes
        Advantages
        Implementation
29. Embracing Data Silos
    Bin Fan and Amelia Wong
        Why Data Silos Exist
        Embracing Data Silos
30. Engineering Reproducible Data Science Projects
    Dr. Tianhui Michael Li
31. Five Best Practices for Stable Data Processing
    Christian Lauer
        Prevent Errors
        Set Fair Processing Times
        Use Data-Quality Measurement Jobs
        Ensure Transaction Security
        Consider Dependency on Other Systems
        Conclusion
32. Focus on Maintainability and Break Up Those ETL Tasks
    Chris Moradi
33. Friends Don’t Let Friends Do Dual-Writes
    Gunnar Morling
34. Fundamental Knowledge
    Pedro Marcelino
35. Getting the “Structured” Back into SQL
    Elias Nema
36. Give Data Products a Frontend with Latent Documentation
    Emily Riederer
37. How Data Pipelines Evolve
    Chris Heinzmann
38. How to Build Your Data Platform like a Product
    Barr Moses and Atul Gupte
        Align Your Product’s Goals with the Goals of the Business
        Gain Feedback and Buy-in from the Right Stakeholders
        Prioritize Long-Term Growth and Sustainability over Short-Term Gains
        Sign Off on Baseline Metrics for Your Data and How You Measure It
39. How to Prevent a Data Mutiny
    Sean Knapp
40. Know the Value per Byte of Your Data
    Dhruba Borthakur
41. Know Your Latencies
    Dhruba Borthakur
42. Learn to Use a NoSQL Database, but Not like an RDBMS
    Kirk Kirkconnell
43. Let the Robots Enforce the Rules
    Anthony Burdi
44. Listen to Your Users—but Not Too Much
    Amanda Tomlinson
45. Low-Cost Sensors and the Quality of Data
    Dr. Shivanand Prabhoolall Guness
46. Maintain Your Mechanical Sympathy
    Tobias Macey
47. Metadata ≥ Data
    Jonathan Seidman
48. Metadata Services as a Core Component of the Data Platform
    Lohit VijayaRenu
        Discoverability
        Security Control
        Schema Management
        Application Interface and Service Guarantee
49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
    Einat Orr
50. Modern Metadata for the Modern Data Stack
    Prukalpa Sankar
        Data Assets > Tables
        Complete Data Visibility, Not Piecemeal Solutions
        Built for Metadata That Itself Is Big Data
        Embedded Collaboration at Its Heart
51. Most Data Problems Are Not Big Data Problems
    Thomas Nield
52. Moving from Software Engineering to Data Engineering
    John Salinas
53. Observability for Data Engineers
    Barr Moses
        How Good Data Turns Bad
        Introducing Data Observability
54. Perfect Is the Enemy of Good
    Bob Haffner
55. Pipe Dreams
    Scott Haines
56. Preventing the Data Lake Abyss
    Scott Haines
        Establishing Data Contracts
        From Generic Data Lake to Data Structure Store
57. Prioritizing User Experience in Messaging Systems
    Jowanza Joseph
58. Privacy Is Your Problem
    Stephen Bailey, PhD
59. QA and All Its Sexiness
    Sonia Mehta
60. Seven Things Data Engineers Need to Watch Out for in ML Projects
    Dr. Sandeep Uttamchandani
61. Six Dimensions for Picking an Analytical Data Warehouse
    Gleb Mezhanskiy
        Scalability
        Price Elasticity
        Interoperability
        Querying and Transformation Features
        Speed
        Zero Maintenance
62. Small Files in a Big Data World
    Adi Polak
        What Are Small Files, and Why Are They a Problem?
        Why Does It Happen?
        Detect and Mitigate
        Conclusion
        References
63. Streaming Is Different from Batch
    Dean Wampler, PhD
64. Tardy Data
    Ariel Shaqed
65. Tech Should Take a Back Seat for Data Project Success
    Andrew Stevenson
66. Ten Must-Ask Questions for Data-Engineering Projects
    Haidar Hadi
        Question 1: What Are the Touch Points?
        Question 2: What Are the Granularities?
        Question 3: What Are the Input and Output Schemas?
        Question 4: What Is the Algorithm?
        Question 5: Do You Need Backfill Data?
        Question 6: When Is the Project Due Date?
        Question 7: Why Was That Due Date Set?
        Question 8: Which Hosting Environment?
        Question 9: What Is the SLA?
        Question 10: Who Will Be Taking Over This Project?
67. The Data Pipeline Is Not About Speed
    Rustem Feyzkhanov
68. The Dos and Don’ts of Data Engineering
    Christopher Bergh
        Don’t Be a Hero
        Don’t Rely on Hope
        Don’t Rely on Caution
        Do DataOps
69. The End of ETL as We Know It
    Paul Singman
        Replacing ETL with Intentional Data Transfer
        Agreeing on a Data Model Contract
        Removing Data Processing Latencies
        Taking the First Steps
70. The Haiku Approach to Writing Software
    Mitch Seymour
        Understand the Constraints Up Front
        Start Strong Since Early Decisions Can Impact the Final Product
        Keep It as Simple as Possible
        Engage the Creative Side of Your Brain
71. The Hidden Cost of Data Input/Output
    Lohit VijayaRenu
        Data Compression
        Data Format
        Data Serialization
72. The Holy War Between Proprietary and Open Source Is a Lie
    Paige Roberts
73. The Implications of the CAP Theorem
    Paul Doran
74. The Importance of Data Lineage
    Julien Le Dem
75. The Many Meanings of Missingness
    Emily Riederer
76. The Six Words That Will Destroy Your Career
    Bartosz Mikulski
77. The Three Invaluable Benefits of Open Source for Testing Data Quality
    Tom Baeyens
78. The Three Rs of Data Engineering
    Tobias Macey
        Reliability
        Reproducibility
        Repeatability
        Conclusion
79. The Two Types of Data Engineering and Data Engineers
    Jesse Anderson
        Types of Data Engineering
        Types of Data Engineers
        Why These Differences Matter to You
80. The Yin and Yang of Big Data Scalability
    Paul Brebner
81. Threading and Concurrency in Data Processing
    Matthew Housley, PhD
        Operating System Threading
        Threading Overhead
        Solving the C10K Problem
        Scaling Is Not a Magic Bullet
        Further Reading
82. Three Important Distributed Programming Concepts
    Adi Polak
        MapReduce Algorithm
        Distributed Shared Memory Model
        Message Passing/Actors Model
        Conclusions
83. Time (Semantics) Won’t Wait
    Marta Paes Moreira and Fabian Hueske
84. Tools Don’t Matter, Patterns and Practices Do
    Bas Geerdink
85. Total Opportunity Cost of Ownership
    Joe Reis
86. Understanding the Ways Different Data Domains Solve Problems
    Matthew Seal
87. What Is a Data Engineer? Clue: We’re Data Science Enablers
    Lewis Gavin
        AI and Machine Learning Models Require Data
        Clean Data == Better Model
        Finally Building a Model
        A Model Is Useful Only If Someone Will Use It
        So What Am I Getting At?
88. What Is a Data Mesh, and How Not to Mesh It Up
    Barr Moses and Lior Gavish
        Why Use a Data Mesh?
        The Final Link: Observability
89. What Is Big Data?
    Ami Levin
90. What to Do When You Don’t Get Any Credit
    Jesse Anderson
91. When Our Data Science Team Didn’t Produce Value
    Joel Nantais
92. When to Avoid the Naive Approach
    Nimrod Parasol
93. When to Be Cautious About Sharing Data
    Thomas Nield
94. When to Talk and When to Listen
    Steven Finkelstein
95. Why Data Science Teams Need Generalists, Not Specialists
    Eric Colson
96. With Great Data Comes Great Responsibility
    Lohit VijayaRenu
        Put Yourself in the User’s Shoes
        Ensure Ethical Use of User Information
        Watch Your Data Footprint
97. Your Data Tests Failed! Now What?
    Sam Bail, PhD
        System Response
        Logging and Alerting
        Alert Response
        Stakeholder Communication
        Root Cause Identification
        Issue Resolution
Contributors
Index