
The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture
- Length: 250 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2023-01-31
- ISBN-10: 1098116585
- ISBN-13: 9781098116583
- Sales Rank: #4306015
More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.
This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader at Microsoft, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
- Learn the benefits of a cloud-based big data strategy for your organization
- Get guidance and best practices for designing performant and scalable data lakes
- Examine architecture and design choices, and data governance principles and strategies
- Build a data strategy that scales as your organizational and business needs increase
- Implement a scalable data lake in the cloud
- Use cloud-based advanced analytics to gain more value from your data
Preface
- Why I Wrote This Book
- Who Should Read This Book?
- Introducing Klodars Corporation
- Navigating the Book
- Conventions Used in This Book
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments
1. Big Data—Beyond the Buzz
- What Is Big Data?
- Elastic Data Infrastructure—The Challenge
- Cloud Computing Fundamentals
- Cloud Computing Terminology
- Value Proposition of the Cloud
- Cloud Data Lake Architecture
- Limitations of On-Premises Data Warehouse Solutions
- What Is a Cloud Data Lake Architecture?
- Benefits of a Cloud Data Lake Architecture
- Defining Your Cloud Data Lake Journey
- Summary
2. Big Data Architectures on the Cloud
- Why Klodars Corporation Moves to the Cloud
- Fundamentals of Cloud Data Lake Architectures
- A Word on Variety of Data
- Cloud Data Lake Storage
- Big Data Analytics Engines
- MapReduce
- Apache Hadoop
- Apache Spark
- Real-time stream processing pipelines
- Cloud Data Warehouses
- Modern Data Warehouse Architecture
- Reference Architecture
- Sample Use Case for a Modern Data Warehouse Architecture
- Benefits and Challenges of Modern Data Warehouse Architecture
- Data Lakehouse Architecture
- Reference Architecture for the Data Lakehouse
- Data formats
- Metadata
- Compute engines
- Sample Use Case for Data Lakehouse Architecture
- Benefits and Challenges of the Data Lakehouse Architecture
- Data Warehouses and Unstructured Data
- Data Mesh
- Reference Architecture
- Sample Use Case for a Data Mesh Architecture
- Challenges and Benefits of a Data Mesh Architecture
- What Is the Right Architecture for Me?
- Know Your Customers
- Know Your Business Drivers
- Consider Your Growth and Future Scenarios
- Design Considerations
- Hybrid Approaches
- Summary
3. Design Considerations for Your Data Lake
- Setting Up the Cloud Data Lake Infrastructure
- Identify Your Goals
- How Klodars Corporation defined the data lake goals
- Plan Your Architecture and Deliverables
- How Klodars Corporation planned their architecture and deliverables
- Implement the Cloud Data Lake
- Release and Operationalize
- Organizing Data in Your Data Lake
- A Day in the Life of Data
- Data Lake Zones
- Organization Mechanisms
- Introduction to Data Governance
- Actors Involved in Data Governance
- Data Classification
- Metadata Management, Data Catalog, and Data Sharing
- Data Access Management
- Data Quality and Observability
- Data Governance at Klodars Corporation
- Data Governance Wrap-Up
- Manage Data Lake Costs
- Demystifying Data Lake Costs on the Cloud
- Data Lake Cost Strategy
- Data Lake Environments and Associated Costs
- Cost strategy based on data
- Transactions and impact on costs
- Summary
4. Scalable Data Lakes
- A Sneak Peek into Scalability
- What Is Scalability?
- Scale in Our Day-to-Day Life
- Scalability in Data Lake Architectures
- Internals of Data Lake Processing Systems
- Data Copy Internals
- Components of a data copy solution
- Understanding resource utilization of a data copy job
- ELT/ETL Processing Internals
- Components of an Apache Spark application
- Understanding resource utilization of a Spark job
- A Note on Other Interactive Queries
- Considerations for Scalable Data Lake Solutions
- Pick the Right Cloud Offerings
- Hybrid and multicloud solutions
- IaaS versus PaaS versus SaaS solutions
- Cloud offerings for Klodars Corporation
- Plan for Peak Capacity
- Data Formats and Job Profile
- Summary
5. Optimizing Cloud Data Lake Architectures for Performance
- Basics of Measuring Performance
- Goals and Metrics for Performance
- Measuring Performance
- Optimizing for Faster Performance
- Cloud Data Lake Performance
- SLAs, SLOs, and SLIs
- Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
- Drivers of Performance
- Performance Drivers for a Copy Job
- Performance Drivers for a Spark Job
- Optimization Principles and Techniques for Performance Tuning
- Data Formats
- Exploring Apache Parquet
- Other popular data formats
- How Klodars Corporation picked their data formats
- Data Organization and Partitioning
- Optimal data organization strategy for Klodars Corporation
- Choosing the Right Configurations on Apache Spark
- Minimize Overheads with Data Transfer
- Premium Offerings and Performance
- The Case of Bigger Virtual Machines
- The Case of Flash Storage
- Summary
6. Deep Dive on Data Formats
- Why Do We Need These Open Data Formats?
- Why Do We Need to Store Tabular Data?
- Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
- Delta Lake
- Why Was Delta Lake Founded?
- Eliminate data silos across business analysts, data scientists, and data engineers
- Provide a unified data and computational system for batch and real-time streaming data
- Support bulk updates or changes to existing data
- Handle errors due to schema changes and incorrect data
- How Does Delta Lake Work?
- When Do You Use Delta Lake?
- Apache Iceberg
- Why Was Apache Iceberg Founded?
- How Does Apache Iceberg Work?
- When Do You Use Apache Iceberg?
- Apache Hudi
- Why Was Apache Hudi Founded?
- How Does Apache Hudi Work?
- Copy-on-write tables
- Merge-on-read tables
- When Do You Use Apache Hudi?
- Summary
7. Decision Framework for Your Architecture
- Cloud Data Lake Assessment
- Cloud Data Lake Assessment Questionnaire
- Analysis for Your Cloud Data Lake Assessment
- Starting from Scratch
- Migrating an Existing Data Lake or Data Warehouse to the Cloud
- Improving an Existing Cloud Data Lake
- Phase 1 of Decision Framework: Assess
- Understand Customer Requirements
- Understand Opportunities for Improvement
- Know Your Business Drivers
- Complete the Assess Phase by Prioritizing the Requirements
- Phase 2 of Decision Framework: Define
- Finalize the Design Choices for the Cloud Data Lake
- Picking your architecture
- Picking your cloud provider
- Decision points for data lake migrations
- Plan Your Cloud Data Lake Project Deliverables
- Phase 3 of Decision Framework: Implement
- Phase 4 of Decision Framework: Operationalize
- Summary
8. Six Lessons for a Data Informed Future
- Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
- Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception
- Lesson 3: Customers Lead Technology, Not the Other Way Around
- Lesson 4: Change Is Inevitable, so Be Prepared
- Lesson 5: Build Empathy and Prioritize Ruthlessly
- Lesson 6: Big Impact Does Not Happen Overnight
- Summary
A. Cloud Data Lake Decision Framework Template
- Phase 1: Assess Framework
- Phase 2: Define Framework
- Planning the Cloud Data Lake Deliverables
- Phase 3: Implement Framework
Index
How to download the source code?
1. Go to: https://www.oreilly.com/
2. Search for the book title: The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.