The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture

Length: 250 pages
Edition: 1
Language: English
Publisher: O'Reilly Media
Publication Date: 2023-01-31
ISBN-10: 1098116585
ISBN-13: 9781098116583
Sales Rank: #4306015

More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.

This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader at Microsoft, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.

Learn the benefits of a cloud-based big data strategy for your organization
Get guidance and best practices for designing performant and scalable data lakes
Examine architecture and design choices, and data governance principles and strategies
Build a data strategy that scales as your organizational and business needs increase
Implement a scalable data lake in the cloud
Use cloud-based advanced analytics to gain more value from your data

Preface
    Why I Wrote This Book
    Who Should Read This Book?
        Introducing Klodars Corporation
    Navigating the Book
    Conventions Used in This Book
    O’Reilly Online Learning
    How to Contact Us
    Acknowledgments
1. Big Data—Beyond the Buzz
    What Is Big Data?
    Elastic Data Infrastructure—The Challenge
    Cloud Computing Fundamentals
        Cloud Computing Terminology
        Value Proposition of the Cloud
    Cloud Data Lake Architecture
        Limitations of On-Premises Data Warehouse Solutions
        What Is a Cloud Data Lake Architecture?
        Benefits of a Cloud Data Lake Architecture
    Defining Your Cloud Data Lake Journey
    Summary
2. Big Data Architectures on the Cloud
    Why Klodars Corporation Moves to the Cloud
    Fundamentals of Cloud Data Lake Architectures
        A Word on Variety of Data
        Cloud Data Lake Storage
        Big Data Analytics Engines
            MapReduce
            Apache Hadoop
            Apache Spark
            Real-time stream processing pipelines
        Cloud Data Warehouses
    Modern Data Warehouse Architecture
        Reference Architecture
        Sample Use Case for a Modern Data Warehouse Architecture
        Benefits and Challenges of Modern Data Warehouse Architecture
    Data Lakehouse Architecture
        Reference Architecture for the Data Lakehouse
            Data formats
            Metadata
            Compute engines
        Sample Use Case for Data Lakehouse Architecture
        Benefits and Challenges of the Data Lakehouse Architecture
        Data Warehouses and Unstructured Data
    Data Mesh
        Reference Architecture
        Sample Use Case for a Data Mesh Architecture
        Challenges and Benefits of a Data Mesh Architecture
    What Is the Right Architecture for Me?
        Know Your Customers
        Know Your Business Drivers
        Consider Your Growth and Future Scenarios
        Design Considerations
        Hybrid Approaches
    Summary
3. Design Considerations for Your Data Lake
    Setting Up the Cloud Data Lake Infrastructure
        Identify Your Goals
            How Klodars Corporation defined the data lake goals
        Plan Your Architecture and Deliverables
            How Klodars Corporation planned their architecture and deliverables
        Implement the Cloud Data Lake
        Release and Operationalize
    Organizing Data in Your Data Lake
        A Day in the Life of Data
        Data Lake Zones
        Organization Mechanisms
    Introduction to Data Governance
        Actors Involved in Data Governance
        Data Classification
        Metadata Management, Data Catalog, and Data Sharing
        Data Access Management
        Data Quality and Observability
        Data Governance at Klodars Corporation
        Data Governance Wrap-Up
    Manage Data Lake Costs
        Demystifying Data Lake Costs on the Cloud
        Data Lake Cost Strategy
            Data Lake Environments and Associated Costs
            Cost strategy based on data
            Transactions and impact on costs
    Summary
4. Scalable Data Lakes
    A Sneak Peek into Scalability
        What Is Scalability?
        Scale in Our Day-to-Day Life
        Scalability in Data Lake Architectures
    Internals of Data Lake Processing Systems
        Data Copy Internals
            Components of a data copy solution
            Understanding resource utilization of a data copy job
        ELT/ETL Processing Internals
            Components of an Apache Spark application
            Understanding resource utilization of a Spark job
        A Note on Other Interactive Queries
    Considerations for Scalable Data Lake Solutions
        Pick the Right Cloud Offerings
            Hybrid and multicloud solutions
            IaaS versus PaaS versus SaaS solutions
            Cloud offerings for Klodars Corporation
        Plan for Peak Capacity
        Data Formats and Job Profile
    Summary
5. Optimizing Cloud Data Lake Architectures for Performance
    Basics of Measuring Performance
        Goals and Metrics for Performance
        Measuring Performance
        Optimizing for Faster Performance
    Cloud Data Lake Performance
        SLAs, SLOs, and SLIs
        Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
    Drivers of Performance
        Performance Drivers for a Copy Job
        Performance Drivers for a Spark Job
    Optimization Principles and Techniques for Performance Tuning
        Data Formats
            Exploring Apache Parquet
            Other popular data formats
            How Klodars Corporation picked their data formats
        Data Organization and Partitioning
            Optimal data organization strategy for Klodars Corporation
        Choosing the Right Configurations on Apache Spark
    Minimize Overheads with Data Transfer
    Premium Offerings and Performance
        The Case of Bigger Virtual Machines
        The Case of Flash Storage
    Summary
6. Deep Dive on Data Formats
    Why Do We Need These Open Data Formats?
        Why Do We Need to Store Tabular Data?
        Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
    Delta Lake
        Why Was Delta Lake Founded?
            Eliminate data silos across business analysts, data scientists, and data engineers
            Provide a unified data and computational system for batch and real-time streaming data
            Support bulk updates or changes to existing data
            Handle errors due to schema changes and incorrect data
        How Does Delta Lake Work?
        When Do You Use Delta Lake?
    Apache Iceberg
        Why Was Apache Iceberg Founded?
        How Does Apache Iceberg Work?
        When Do You Use Apache Iceberg?
    Apache Hudi
        Why Was Apache Hudi Founded?
        How Does Apache Hudi Work?
            Copy-on-write tables
            Merge-on-read tables
        When Do You Use Apache Hudi?
    Summary
7. Decision Framework for Your Architecture
    Cloud Data Lake Assessment
        Cloud Data Lake Assessment Questionnaire
    Analysis for Your Cloud Data Lake Assessment
        Starting from Scratch
        Migrating an Existing Data Lake or Data Warehouse to the Cloud
        Improving an Existing Cloud Data Lake
    Phase 1 of Decision Framework: Assess
        Understand Customer Requirements
        Understand Opportunities for Improvement
        Know Your Business Drivers
        Complete the Assess Phase by Prioritizing the Requirements
    Phase 2 of Decision Framework: Define
        Finalize the Design Choices for the Cloud Data Lake
            Picking your architecture
            Picking your cloud provider
            Decision points for data lake migrations
        Plan Your Cloud Data Lake Project Deliverables
    Phase 3 of Decision Framework: Implement
    Phase 4 of Decision Framework: Operationalize
    Summary
8. Six Lessons for a Data Informed Future
    Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
    Lesson 2: With Great Power Comes Great Responsibility—Data Is No Exception
    Lesson 3: Customers Lead Technology, Not the Other Way Around
    Lesson 4: Change Is Inevitable, so Be Prepared
    Lesson 5: Build Empathy and Prioritize Ruthlessly
    Lesson 6: Big Impact Does Not Happen Overnight
    Summary
A. Cloud Data Lake Decision Framework Template
    Phase 1: Assess Framework
    Phase 2: Define Framework
        Planning the Cloud Data Lake Deliverables
    Phase 3: Implement Framework
Index

How to download source code?

1. Go to: https://www.oreilly.com/

2. Search for the book title: The Cloud Data Lake: A Guide to Building Robust Cloud Data Architecture. If the full title returns no results, search for the main title only.

3. Click the book title in the search results.

4. In the Publisher resources section, click Download Example Code.
