97 Things Every Data Engineer Should Know
- Length: 250 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2021-07-13
- ISBN-10: 1492062413
- ISBN-13: 9781492062417
- Sales Rank: #2684894
Take advantage of today’s sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.
Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.
Topics include:
- Building pipelines
- Stream processing
- Data privacy and security
- Data governance and lineage
- Data storage and architecture
- The ecosystem of modern tools
- Data team makeup and culture
- Career advice
Table of contents
1. A (Book) Case for Eventual Consistency
2. A/B and How to Be
3. About the Storage Layer
4. Analytics as the Secret Glue for Microservice Architectures
5. Automate Your Infrastructure
6. Automate Your Pipeline Tests
7. Be Intentional About the Batching Model in Your Data Pipelines
8. Beware of Silver-Bullet Syndrome
9. Building a Career as a Data Engineer
10. Business Dashboards for Data Pipelines
11. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
12. Change Data Capture
13. Column Names as Contracts
14. Consensual, Privacy-Aware Data Collection
15. Cultivate Good Working Relationships with Data Consumers
16. Data Engineering != Spark
17. Data Engineering for Autonomy and Rapid Innovation
18. Data Engineering from a Data Scientist’s Perspective
19. Data Pipeline Design Patterns for Reusability and Extensibility
20. Data Quality for Data Engineers
21. Data Security for Data Engineers
22. Data Validation Is More Than Summary Statistics
23. Data Warehouses Are the Past, Present, and Future
24. Defining and Managing Messages in Log-Centric Architectures
25. Demystify the Source and Illuminate the Data Pipeline
26. Develop Communities, Not Just Code
27. Effective Data Engineering in the Cloud World
28. Embrace the Data Lake Architecture
29. Embracing Data Silos
30. Engineering Reproducible Data Science Projects
31. Five Best Practices for Stable Data Processing
32. Focus on Maintainability and Break Up Those ETL Tasks
33. Friends Don’t Let Friends Do Dual-Writes
34. Fundamental Knowledge
35. Getting the “Structured” Back into SQL
36. Give Data Products a Frontend with Latent Documentation
37. How Data Pipelines Evolve
38. How to Build Your Data Platform like a Product
39. How to Prevent a Data Mutiny
40. Know the Value per Byte of Your Data
41. Know Your Latencies
42. Learn to Use a NoSQL Database, but Not like an RDBMS
43. Let the Robots Enforce the Rules
44. Listen to Your Users—but Not Too Much
45. Low-Cost Sensors and the Quality of Data
46. Maintain Your Mechanical Sympathy
47. Metadata ≥ Data
48. Metadata Services as a Core Component of the Data Platform
49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
50. Modern Metadata for the Modern Data Stack
51. Most Data Problems Are Not Big Data Problems
52. Moving from Software Engineering to Data Engineering
53. Observability for Data Engineers
54. Perfect Is the Enemy of Good
55. Pipe Dreams
56. Preventing the Data Lake Abyss
57. Prioritizing User Experience in Messaging Systems
58. Privacy Is Your Problem
59. QA and All Its Sexiness
60. Seven Things Data Engineers Need to Watch Out for in ML Projects
61. Six Dimensions for Picking an Analytical Data Warehouse
62. Small Files in a Big Data World
63. Streaming Is Different from Batch
64. Tardy Data
65. Tech Should Take a Back Seat for Data Project Success
66. Ten Must-Ask Questions for Data-Engineering Projects
67. The Data Pipeline Is Not About Speed
68. The Dos and Don’ts of Data Engineering
69. The End of ETL as We Know It
70. The Haiku Approach to Writing Software
71. The Hidden Cost of Data Input/Output
72. The Holy War Between Proprietary and Open Source Is a Lie
73. The Implications of the CAP Theorem
74. The Importance of Data Lineage
75. The Many Meanings of Missingness
76. The Six Words That Will Destroy Your Career
77. The Three Invaluable Benefits of Open Source for Testing Data Quality
78. The Three Rs of Data Engineering
79. The Two Types of Data Engineering and Data Engineers
80. The Yin and Yang of Big Data Scalability
81. Threading and Concurrency in Data Processing
82. Three Important Distributed Programming Concepts
83. Time (Semantics) Won’t Wait
84. Tools Don’t Matter, Patterns and Practices Do
85. Total Opportunity Cost of Ownership
86. Understanding the Ways Different Data Domains Solve Problems
87. What Is a Data Engineer? Clue: We’re Data Science Enablers
88. What Is a Data Mesh, and How Not to Mesh It Up
89. What Is Big Data?
90. What to Do When You Don’t Get Any Credit
91. When Our Data Science Team Didn’t Produce Value
92. When to Avoid the Naive Approach
93. When to Be Cautious About Sharing Data
94. When to Talk and When to Listen
95. Why Data Science Teams Need Generalists, Not Specialists
96. With Great Data Comes Great Responsibility
97. Your Data Tests Failed! Now What?
Preface
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. A (Book) Case for Eventual Consistency by Denise Koessler Gosnell, PhD
2. A/B and How to Be by Sonia Mehta
3. About the Storage Layer by Julien Le Dem
4. Analytics as the Secret Glue for Microservice Architectures by Elias Nema
5. Automate Your Infrastructure by Christiano Anderson
6. Automate Your Pipeline Tests by Tom White
   - Build an End-to-End Test of the Whole Pipeline
   - Use a Small Amount of Representative Data
   - Prefer Textual Data Formats over Binary
   - Ensure That Tests Can Be Run Locally
   - Make Tests Deterministic
   - Make It Easy to Add More Tests
7. Be Intentional About the Batching Model in Your Data Pipelines by Raghotham Murthy
   - Data Time Window Batching Model
   - Arrival Time Window Batching Model
   - ATW and DTW Batching in the Same Pipeline
8. Beware of Silver-Bullet Syndrome by Thomas Nield
9. Building a Career as a Data Engineer by Vijay Kiran
10. Business Dashboards for Data Pipelines by Valliappa (Lak) Lakshmanan
11. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes by Shweta Katre
12. Change Data Capture by Raghotham Murthy
13. Column Names as Contracts by Emily Riederer
14. Consensual, Privacy-Aware Data Collection by Katharine Jarmul
   - Attach Consent Metadata
   - Track Data Provenance
   - Drop or Encrypt Sensitive Fields
15. Cultivate Good Working Relationships with Data Consumers by Ido Shlomo
   - Don’t Let Consumers Solve Engineering Problems
   - Adapt Your Expectations
   - Understand Consumers’ Jobs
16. Data Engineering != Spark by Jesse Anderson
   - Batch and Real-Time Systems
   - Computation Component
   - Storage Component
   - NoSQL Databases
   - Messaging Component
17. Data Engineering for Autonomy and Rapid Innovation by Jeff Magnusson
   - Implement Reusable Patterns in the ETL Framework
   - Choose a Framework and Tool Set Accessible Within the Organization
   - Move the Logic to the Edges of the Pipelines
   - Create and Support Staging Tables
   - Bake Data-Flow Logic into Tooling and Infrastructure
18. Data Engineering from a Data Scientist’s Perspective by Bill Franks
   - Database Administration, ETL, and Such
   - Why the Need for Data Engineers?
   - What’s the Future?
19. Data Pipeline Design Patterns for Reusability and Extensibility by Mukul Sood
20. Data Quality for Data Engineers by Katharine Jarmul
21. Data Security for Data Engineers by Katharine Jarmul
   - Learn About Security
   - Monitor, Log, and Test Access
   - Encrypt Data
   - Automate Security Tests
   - Ask for Help
22. Data Validation Is More Than Summary Statistics by Emily Riederer
23. Data Warehouses Are the Past, Present, and Future by James Densmore
24. Defining and Managing Messages in Log-Centric Architectures by Boris Lublinsky
25. Demystify the Source and Illuminate the Data Pipeline by Meghan Kwartler
26. Develop Communities, Not Just Code by Emily Riederer
27. Effective Data Engineering in the Cloud World by Dipti Borkar
   - Disaggregated Data Stack
   - Orchestrate, Orchestrate, Orchestrate
   - Copying Data Creates Problems
   - S3 Compatibility
   - SQL and Structured Data Are Still In
28. Embrace the Data Lake Architecture by Vinoth Chandar
   - Common Pitfalls
   - Data Lakes
   - Advantages
   - Implementation
29. Embracing Data Silos by Bin Fan and Amelia Wong
   - Why Data Silos Exist
   - Embracing Data Silos
30. Engineering Reproducible Data Science Projects by Dr. Tianhui Michael Li
31. Five Best Practices for Stable Data Processing by Christian Lauer
   - Prevent Errors
   - Set Fair Processing Times
   - Use Data-Quality Measurement Jobs
   - Ensure Transaction Security
   - Consider Dependency on Other Systems
   - Conclusion
32. Focus on Maintainability and Break Up Those ETL Tasks by Chris Moradi
33. Friends Don’t Let Friends Do Dual-Writes by Gunnar Morling
34. Fundamental Knowledge by Pedro Marcelino
35. Getting the “Structured” Back into SQL by Elias Nema
36. Give Data Products a Frontend with Latent Documentation by Emily Riederer
37. How Data Pipelines Evolve by Chris Heinzmann
38. How to Build Your Data Platform like a Product by Barr Moses and Atul Gupte
   - Align Your Product’s Goals with the Goals of the Business
   - Gain Feedback and Buy-in from the Right Stakeholders
   - Prioritize Long-Term Growth and Sustainability over Short-Term Gains
   - Sign Off on Baseline Metrics for Your Data and How You Measure It
39. How to Prevent a Data Mutiny by Sean Knapp
40. Know the Value per Byte of Your Data by Dhruba Borthakur
41. Know Your Latencies by Dhruba Borthakur
42. Learn to Use a NoSQL Database, but Not like an RDBMS by Kirk Kirkconnell
43. Let the Robots Enforce the Rules by Anthony Burdi
44. Listen to Your Users—but Not Too Much by Amanda Tomlinson
45. Low-Cost Sensors and the Quality of Data by Dr. Shivanand Prabhoolall Guness
46. Maintain Your Mechanical Sympathy by Tobias Macey
47. Metadata ≥ Data by Jonathan Seidman
48. Metadata Services as a Core Component of the Data Platform by Lohit VijayaRenu
   - Discoverability
   - Security Control
   - Schema Management
   - Application Interface and Service Guarantee
49. Mind the Gap: Your Data Lake Provides No ACID Guarantees by Einat Orr
50. Modern Metadata for the Modern Data Stack by Prukalpa Sankar
   - Data Assets > Tables
   - Complete Data Visibility, Not Piecemeal Solutions
   - Built for Metadata That Itself Is Big Data
   - Embedded Collaboration at Its Heart
51. Most Data Problems Are Not Big Data Problems by Thomas Nield
52. Moving from Software Engineering to Data Engineering by John Salinas
53. Observability for Data Engineers by Barr Moses
   - How Good Data Turns Bad
   - Introducing Data Observability
54. Perfect Is the Enemy of Good by Bob Haffner
55. Pipe Dreams by Scott Haines
56. Preventing the Data Lake Abyss by Scott Haines
   - Establishing Data Contracts
   - From Generic Data Lake to Data Structure Store
57. Prioritizing User Experience in Messaging Systems by Jowanza Joseph
58. Privacy Is Your Problem by Stephen Bailey, PhD
59. QA and All Its Sexiness by Sonia Mehta
60. Seven Things Data Engineers Need to Watch Out for in ML Projects by Dr. Sandeep Uttamchandani
61. Six Dimensions for Picking an Analytical Data Warehouse by Gleb Mezhanskiy
   - Scalability
   - Price Elasticity
   - Interoperability
   - Querying and Transformation Features
   - Speed
   - Zero Maintenance
62. Small Files in a Big Data World by Adi Polak
   - What Are Small Files, and Why Are They a Problem?
   - Why Does It Happen?
   - Detect and Mitigate
   - Conclusion
   - References
63. Streaming Is Different from Batch by Dean Wampler, PhD
64. Tardy Data by Ariel Shaqed
65. Tech Should Take a Back Seat for Data Project Success by Andrew Stevenson
66. Ten Must-Ask Questions for Data-Engineering Projects by Haidar Hadi
   - Question 1: What Are the Touch Points?
   - Question 2: What Are the Granularities?
   - Question 3: What Are the Input and Output Schemas?
   - Question 4: What Is the Algorithm?
   - Question 5: Do You Need Backfill Data?
   - Question 6: When Is the Project Due Date?
   - Question 7: Why Was That Due Date Set?
   - Question 8: Which Hosting Environment?
   - Question 9: What Is the SLA?
   - Question 10: Who Will Be Taking Over This Project?
67. The Data Pipeline Is Not About Speed by Rustem Feyzkhanov
68. The Dos and Don’ts of Data Engineering by Christopher Bergh
   - Don’t Be a Hero
   - Don’t Rely on Hope
   - Don’t Rely on Caution
   - Do DataOps
69. The End of ETL as We Know It by Paul Singman
   - Replacing ETL with Intentional Data Transfer
   - Agreeing on a Data Model Contract
   - Removing Data Processing Latencies
   - Taking the First Steps
70. The Haiku Approach to Writing Software by Mitch Seymour
   - Understand the Constraints Up Front
   - Start Strong Since Early Decisions Can Impact the Final Product
   - Keep It as Simple as Possible
   - Engage the Creative Side of Your Brain
71. The Hidden Cost of Data Input/Output by Lohit VijayaRenu
   - Data Compression
   - Data Format
   - Data Serialization
72. The Holy War Between Proprietary and Open Source Is a Lie by Paige Roberts
73. The Implications of the CAP Theorem by Paul Doran
74. The Importance of Data Lineage by Julien Le Dem
75. The Many Meanings of Missingness by Emily Riederer
76. The Six Words That Will Destroy Your Career by Bartosz Mikulski
77. The Three Invaluable Benefits of Open Source for Testing Data Quality by Tom Baeyens
78. The Three Rs of Data Engineering by Tobias Macey
   - Reliability
   - Reproducibility
   - Repeatability
   - Conclusion
79. The Two Types of Data Engineering and Data Engineers by Jesse Anderson
   - Types of Data Engineering
   - Types of Data Engineers
   - Why These Differences Matter to You
80. The Yin and Yang of Big Data Scalability by Paul Brebner
81. Threading and Concurrency in Data Processing by Matthew Housley, PhD
   - Operating System Threading
   - Threading Overhead
   - Solving the C10K Problem
   - Scaling Is Not a Magic Bullet
   - Further Reading
82. Three Important Distributed Programming Concepts by Adi Polak
   - MapReduce Algorithm
   - Distributed Shared Memory Model
   - Message Passing/Actors Model
   - Conclusions
83. Time (Semantics) Won’t Wait by Marta Paes Moreira and Fabian Hueske
84. Tools Don’t Matter, Patterns and Practices Do by Bas Geerdink
85. Total Opportunity Cost of Ownership by Joe Reis
86. Understanding the Ways Different Data Domains Solve Problems by Matthew Seal
87. What Is a Data Engineer? Clue: We’re Data Science Enablers by Lewis Gavin
   - AI and Machine Learning Models Require Data
   - Clean Data == Better Model
   - Finally Building a Model
   - A Model Is Useful Only If Someone Will Use It
   - So What Am I Getting At?
88. What Is a Data Mesh, and How Not to Mesh It Up by Barr Moses and Lior Gavish
   - Why Use a Data Mesh?
   - The Final Link: Observability
89. What Is Big Data? by Ami Levin
90. What to Do When You Don’t Get Any Credit by Jesse Anderson
91. When Our Data Science Team Didn’t Produce Value by Joel Nantais
92. When to Avoid the Naive Approach by Nimrod Parasol
93. When to Be Cautious About Sharing Data by Thomas Nield
94. When to Talk and When to Listen by Steven Finkelstein
95. Why Data Science Teams Need Generalists, Not Specialists by Eric Colson
96. With Great Data Comes Great Responsibility by Lohit VijayaRenu
   - Put Yourself in the User’s Shoes
   - Ensure Ethical Use of User Information
   - Watch Your Data Footprint
97. Your Data Tests Failed! Now What? by Sam Bail, PhD
   - System Response
   - Logging and Alerting
   - Alert Response
   - Stakeholder Communication
   - Root Cause Identification
   - Issue Resolution
Contributors
Index
How to download source code?
1. Go to: https://www.oreilly.com/
2. Search for the book title: 97 Things Every Data Engineer Should Know. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.