Fundamentals of Data Engineering: Plan and Build Robust Data Systems
Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you will learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available in the framework of the data engineering lifecycle.
Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You will understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, governance, and deployment that are critical in any data environment regardless of the underlying technology.
This book will help you:
- Assess data engineering problems using an end-to-end data framework of best practices
- Cut through marketing hype when choosing data technologies, architecture, and processes
- Use the data engineering lifecycle to design and build a robust architecture
- Incorporate data governance and security across the data engineering lifecycle
Table of contents:
Preface What This Book Isn’t What This Book Is About Who Should Read This Book Prerequisites What You’ll Learn and How It Will Improve Your Abilities The Book Outline Conventions Used in This Book How to Contact Us Acknowledgments
I. Foundation and Building Blocks
1. Data Engineering Described What Is Data Engineering? Data Engineering Defined The Data Engineering Lifecycle Evolution of the Data Engineer Data Engineering and Data Science Data Engineering Skills and Activities Data Maturity and the Data Engineer The Background and Skills of a Data Engineer Business Responsibilities Technical Responsibilities The Continuum of Data Engineering Roles, from A to B Data Engineers Inside an Organization Internal-Facing Versus External-Facing Data Engineers Data Engineers and Other Technical Roles Data Engineers and Business Leadership Conclusion Additional Resources
2. The Data Engineering Lifecycle What Is the Data Engineering Lifecycle? The Data Lifecycle Versus the Data Engineering Lifecycle Generation: Source Systems Storage Ingestion Transformation Serving Data Major Undercurrents Across the Data Engineering Lifecycle Security Data Management Orchestration DataOps Data Architecture Software Engineering Conclusion Additional Resources
3. Designing Good Data Architecture What Is Data Architecture? Enterprise Architecture, Defined Data Architecture Defined “Good” Data Architecture Principles of Good Data Architecture Principle 1: Choose Common Components Wisely Principle 2: Plan for Failure Principle 3: Architect for Scalability Principle 4: Architecture Is Leadership Principle 5: Always Be Architecting Principle 6: Build Loosely Coupled Systems Principle 7: Make Reversible Decisions Principle 8: Prioritize Security Principle 9: Embrace FinOps Major Architecture Concepts Domains and Services Distributed Systems, Scalability, and Designing for Failure Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices User Access: Single Versus Multitenant Event-Driven Architecture Brownfield Versus Greenfield Projects Examples and Types of Data Architecture Data Warehouse Data Lake Convergence, Next-Generation Data Lakes, and the Data Platform Modern Data Stack Lambda Architecture Kappa Architecture The Dataflow Model and Unified Batch and Streaming Architecture for IoT Data Mesh Other Data Architecture Examples Who’s Involved with Designing a Data Architecture? Conclusion Additional Resources
4. Choosing Technologies Across the Data Engineering Lifecycle Team Size and Capabilities Speed to Market Interoperability Cost Optimization and Business Value Total Cost of Ownership Total Opportunity Cost of Ownership FinOps Today Versus the Future: Immutable Versus Transitory Technologies Our Advice Location On Premises Cloud Hybrid Cloud Multicloud Decentralized: Blockchain and the Edge Our Advice Cloud Repatriation Arguments Build Versus Buy Open Source Software Proprietary Walled Gardens Our Advice Monolith Versus Modular Monolith Modularity The Distributed Monolith Pattern Our Advice Serverless Versus Servers Serverless Containers When Infrastructure Makes Sense Our Advice Optimization, Performance, and the Benchmark Wars Big Data...for the 1990s Nonsensical Cost Comparisons Asymmetric Optimization Caveat Emptor Undercurrents and Their Impacts on Choosing Technologies Data Management DataOps Data Architecture Orchestration Example: Airflow Software Engineering Conclusion
II. The Data Engineering Lifecycle in Depth
5. Data Generation in Source Systems Sources of Data: How Is Data Created? Source Systems: Main Ideas Files and Unstructured Data APIs Application Databases (OLTP systems) Online Analytical Processing System Change Data Capture Logs Database Logs CRUD Insert-Only Messages and Streams Types of Time Source System Practical Details Databases APIs Data Sharing Third-Party Data Sources Message Queues and Event-Streaming Platforms Whom You’ll Work With Undercurrents and Their Impact on Source Systems Security Data Management DataOps Data Architecture Orchestration Software Engineering Conclusion Additional Resources
6. Storage Raw Ingredients of Data Storage Magnetic Disk Drive Solid-State Drive Random Access Memory Networking and CPU Serialization Compression Caching Data Storage Systems Single Machine Versus Distributed Storage Eventual Versus Strong Consistency File Storage Block Storage Object Storage Cache and Memory-Based Storage Systems The Hadoop Distributed File System Streaming Storage Indexes, Partitioning, and Clustering Data Engineering Storage Abstractions The Data Warehouse The Data Lake The Data Lakehouse Data Platforms Stream-to-Batch Storage Architecture Big Ideas and Trends in Storage Data Catalog Data Sharing Schema Separation of Compute from Storage Data Storage Lifecycle and Data Retention Single-Tenant Versus Multitenant Storage Whom You’ll Work With Undercurrents Security Data Management DataOps Data Architecture Orchestration Software Engineering Conclusion Additional Resources
7. Ingestion What Is Data Ingestion? Key Engineering Considerations for the Ingestion Phase Bounded Versus Unbounded Frequency Synchronous Versus Asynchronous Ingestion Serialization and Deserialization Throughput and Scalability Reliability and Durability Payload Push Versus Pull Versus Poll Patterns Batch Ingestion Considerations Snapshot or Differential Extraction File-Based Export and Ingestion ETL Versus ELT Inserts, Updates, and Batch Size Data Migration Message and Stream Ingestion Considerations Schema Evolution Late-Arriving Data Ordering and Multiple Delivery Replay Time to Live Message Size Error Handling and Dead-Letter Queues Consumer Pull and Push Location Ways to Ingest Data Direct Database Connection Change Data Capture APIs Message Queues and Event-Streaming Platforms Managed Data Connectors Moving Data with Object Storage EDI Databases and File Export Practical Issues with Common File Formats Shell SSH SFTP and SCP Webhooks Web Interface Web Scraping Transfer Appliances for Data Migration Data Sharing Whom You’ll Work With Upstream Stakeholders Downstream Stakeholders Undercurrents Security Data Management DataOps Orchestration Software Engineering Conclusion Additional Resources
8. Queries, Modeling, and Transformation Queries What Is a Query? The Life of a Query The Query Optimizer Improving Query Performance Queries on Streaming Data Data Modeling What Is a Data Model? Conceptual, Logical, and Physical Data Models Normalization Techniques for Modeling Batch Analytical Data Modeling Streaming Data Transformations Batch Transformations Materialized Views, Federation, and Query Virtualization Streaming Transformations and Processing Whom You’ll Work With Upstream Stakeholders Downstream Stakeholders Undercurrents Security Data Management DataOps Data Architecture Orchestration Software Engineering Conclusion Additional Resources
9. Serving Data for Analytics, Machine Learning, and Reverse ETL General Considerations for Serving Data Trust What’s the Use Case, and Who’s the User? Data Products Self-Service or Not? Data Definitions and Logic Data Mesh Analytics Business Analytics Operational Analytics Embedded Analytics Machine Learning What a Data Engineer Should Know About ML Ways to Serve Data for Analytics and ML File Exchange Databases Streaming Systems Query Federation Data Sharing Semantic and Metrics Layers Serving Data in Notebooks Reverse ETL Ways to Serve Data with Reverse ETL Whom You’ll Work With Undercurrents Security Data Management DataOps Data Architecture Orchestration Software Engineering Conclusion Additional Resources
III. Security, Privacy, and the Future of Data Engineering
10. Security and Privacy People The Power of Negative Thinking Always Be Paranoid Processes Security Theater Versus Security Habit Active Security The Principle of Least Privilege Shared Responsibility in the Cloud Always Back Up Your Data An Example Security Policy Technology Patch and Update Systems Encryption Logging, Monitoring, and Alerting Network Access Security for Low-Level Data Engineering Conclusion Additional Resources
11. The Future of Data Engineering The Data Engineering Lifecycle Isn’t Going Away The Decline of Complexity and the Rise of Easy-to-Use Data Tools The Cloud-Scale Data OS and Improved Interoperability “Enterprisey” Data Engineering Titles and Responsibilities Will Morph... Moving Beyond the Modern Data Stack, Toward the Live Data Stack The Live Data Stack Streaming Pipelines and Real-Time Analytical Databases The Fusion of Data with Applications The Tight Feedback Between Applications and ML Dark Matter Data and the Rise of...Spreadsheets?! Conclusion
A. Serialization and Compression Technical Details Serialization Formats Row-Based Serialization Columnar Serialization Hybrid Serialization Database Storage Engines Compression: gzip, bzip2, Snappy, etc.
B. Cloud Networking Cloud Network Topology Data Egress Charges Availability Zones Regions GCP-Specific Networking and Multiregional Redundancy Direct Network Connections to the Clouds CDNs The Future of Data Egress Fees
Index
About the Authors
How to download source code?
1. Go to:
2. Search the book title:
Fundamentals of Data Engineering: Plan and Build Robust Data Systems. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.