Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood
- Length: 416 pages
- Edition: 1
- Language: English
- Publisher: Wiley
- Publication Date: 2021-09-08
- ISBN-10: 1119713021
- ISBN-13: 9781119713029
- Sales Rank: #528245
PEEK “UNDER THE HOOD” OF BIG DATA ANALYTICS
The world of big data analytics grows ever more complex. And while many people can work superficially with specific frameworks, far fewer understand the fundamental principles of large-scale, distributed data processing systems and how they operate. In Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood, renowned big-data experts and computer scientists Drs. Supun Kamburugamuve and Saliya Ekanayake deliver a practical guide to applying the principles of big data to software development for optimal performance.
The authors discuss the foundational components of large-scale data systems and walk readers through the major software design decisions that determine performance, application type, and usability. You'll learn how to recognize the performance and distributed-operation problems in your applications, diagnose them, and effectively eliminate them by applying the bedrock big data principles explained within.
Moving beyond individual frameworks and APIs for data processing, this book unlocks the theoretical ideas that operate under the hood of every big data processing system.
Ideal for data scientists, data architects, DevOps engineers, and developers, Foundations of Data Intensive Applications: Large Scale Data Analytics under the Hood shows readers how to:
- Identify the foundations of large-scale, distributed data processing systems
- Make major software design decisions that optimize performance
- Diagnose performance problems and distributed operation issues
- Understand state-of-the-art research in big data
- Explain and use the major big data frameworks and understand what underpins them
- Use big data analytics in the real world to solve practical problems
TABLE OF CONTENTS
- Front matter: Cover; Title Page; Copyright Page; About the Authors; About the Editor; Acknowledgments; Contents at a Glance; Contents
- Introduction: History of Data-Intensive Applications; Data Processing Architecture; Foundations of Data-Intensive Applications; Who Should Read This Book?; Organization of the Book; Scope of the Book; References
- Chapter 1, Data Intensive Applications: Anatomy of a Data-Intensive Application; A Histogram Example; Program; Process Management; Communication; Execution; Data Structures; Putting It Together; Application; Resource Management; Messaging; Data Structures; Tasks and Execution; Fault Tolerance; Remote Execution; Parallel Applications; Serial Applications; Lloyd's K-Means Algorithm; Parallelizing Algorithms; Decomposition; Task Assignment; Orchestration; Mapping; K-Means Algorithm; Parallel and Distributed Computing; Memory Abstractions; Shared Memory; Distributed Memory; Hybrid (Shared + Distributed) Memory; Partitioned Global Address Space Memory; Application Classes and Frameworks; Parallel Interaction Patterns; Pleasingly Parallel; Dataflow; Iterative; Irregular; Data Abstractions; Data-Intensive Frameworks; Components; Workflows; An Example; What Makes It Difficult?; Developing Applications; Concurrency; Data Partitioning; Debugging; Diverse Environments; Computer Networks; Synchronization; Thread Synchronization; Data Synchronization; Ordering of Events; Faults; Consensus; Summary; References
- Chapter 2, Data and Storage: Storage Systems; Storage for Distributed Systems; Direct-Attached Storage; Storage Area Network; Network-Attached Storage; DAS or SAN or NAS?; Storage Abstractions; Block Storage; File Systems; Object Storage; Data Formats; XML; JSON; CSV; Apache Parquet; Apache Avro; Avro Data Definitions (Schema); Code Generation; Without Code Generation; Avro File; Schema Evolution; Protocol Buffers, Flat Buffers, and Thrift; Data Replication; Synchronous and Asynchronous Replication; Single-Leader and Multileader Replication; Data Locality; Disadvantages of Replication; Data Partitioning; Vertical Partitioning; Horizontal Partitioning (Sharding); Hybrid Partitioning; Considerations for Partitioning; NoSQL Databases; Data Models; Key-Value Databases; Document Databases; Wide Column Databases; Graph Databases; CAP Theorem; Message Queuing; Message Processing Guarantees; Durability of Messages; Acknowledgments; Storage First Brokers and Transient Brokers; Summary; References
- Chapter 3, Computing Resources: A Demonstration; Computer Clusters; Anatomy of a Computer Cluster; Data Analytics in Clusters; Dedicated Clusters; Classic Parallel Systems; Big Data Systems; Shared Clusters; OpenMPI on a Slurm Cluster; Spark on a Yarn Cluster; Distributed Application Life Cycle; Life Cycle Steps; Step 1: Preparation of the Job Package; Step 2: Resource Acquisition; Step 3: Distributing the Application (Job) Artifacts; Step 4: Bootstrapping the Distributed Environment; Step 5: Monitoring; Step 6: Termination; Computing Resources; Data Centers; Physical Machines; Network; Virtual Machines; Containers; Processor, Random Access Memory, and Cache; Cache; Multiple Processors in a Computer; Nonuniform Memory Access; Uniform Memory Access; Hard Disk; GPUs; Mapping Resources to Applications; Cluster Resource Managers; Kubernetes; Kubernetes Architecture; Kubernetes Application Concepts; Data-Intensive Applications on Kubernetes; Slurm; Yarn; Job Scheduling; Scheduling Policy; Objective Functions; Throughput and Latency; Priorities; Lowering Distance Among the Processes; Data Locality; Completion Deadline; Algorithms; First in First Out; Gang Scheduling; List Scheduling; Backfill Scheduling; Summary; References
- Chapter 4, Data Structures: Virtual Memory; Paging and TLB; Cache; The Need for Data Structures; Cache and Memory Layout; Memory Fragmentation; Data Transfer; Data Transfer Between Frameworks; Cross-Language Data Transfer; Object and Text Data; Serialization; Vectors and Matrices; 1D Vectors; Matrices; Row-Major and Column-Major Formats; N-Dimensional Arrays/Tensors; NumPy; Sparse Matrices; Table; Table Formats; Column Data Format; Row Data Format; Apache Arrow; Arrow Data Format; Primitive Types; Variable-Length Data; Arrow Serialization; Arrow Example; Pandas DataFrame; Column vs. Row Tables; Summary; References
- Chapter 5, Programming Models: Introduction; Parallel Programming Models; Parallel Process Interaction; Problem Decomposition; Data Structures; Data Structures and Operations; Data Types; Local Operations; Distributed Operations; Array; Tensor; Indexing; Slicing; Broadcasting; Table; Graph Data; Message Passing Model; Model; Message Passing Frameworks; Message Passing Interface; Bulk Synchronous Parallel; K-Means; Distributed Data Model; Eager Model; Dataflow Model; Data Frames, Datasets, and Tables; Input and Output; Task Graphs (Dataflow Graphs); Model; User Program to Task Graph; Tasks and Functions; Source Task; Compute Task; Implicit vs. Explicit Parallel Models; Remote Execution; Components; Batch Dataflow; Data Abstractions; Table Abstraction; Matrix/Tensors; Functions; Source; Compute; Sink; An Example; Caching State; Evaluation Strategy; Lazy Evaluation; Eager Evaluation; Iterative Computations; DOALL Parallel; DOACROSS Parallel; Pipeline Parallel; Task Graph Models for Iterative Computations; K-Means Algorithm; Streaming Dataflow; Data Abstractions; Streams; Distributed Operations; Streaming Functions; Sources; Compute; Sink; An Example; Windowing; Windowing Strategies; Operations on Windows; Handling Late Events; SQL Queries; Summary; References
- Chapter 6, Messaging: Network Services; TCP/IP; RDMA; Messaging for Data Analytics; Anatomy of a Message; Data Packing; Protocol; Message Types; Control Messages; External Data Sources; Data Transfer Messages; Distributed Operations; How Are They Used?; Task Graph; Parallel Processes; Anatomy of a Distributed Operation; Data Abstractions; Distributed Operation API; Streaming and Batch Operations; Streaming Operations; Batch Operations; Distributed Operations on Arrays; Broadcast; Reduce and AllReduce; Gather and AllGather; Scatter; AllToAll; Optimized Operations; Broadcast; Reduce; AllReduce; Gather and AllGather Collective Algorithms; Scatter and AllToAll Collective Algorithms; Distributed Operations on Tables; Shuffle; Partitioning Data; Handling Large Data; Fetch-Based Algorithm (Asynchronous Algorithm); Distributed Synchronization Algorithm; GroupBy; Aggregate; Join; Join Algorithms; Distributed Joins; Performance of Joins; More Operations; Advanced Topics; Data Packing; Memory Considerations; Message Coalescing; Compression; Stragglers; Nonblocking vs. Blocking Operations; Blocking Operations; Nonblocking Operations; Summary; References
- Chapter 7, Parallel Tasks: CPUs; Cache; False Sharing; Vectorization; Threads and Processes; Concurrency and Parallelism; Context Switches and Scheduling; Mutual Exclusion; User-Level Threads; Process Affinity; NUMA-Aware Programming; Accelerators; Task Execution; Scheduling; Static Scheduling; Dynamic Scheduling; Loosely Synchronous and Asynchronous Execution; Loosely Synchronous Parallel System; Asynchronous Parallel System (Fully Distributed); Actor Model; Asynchronous Messages; Actor Frameworks; Execution Models; Process Model; Thread Model; Remote Execution; Tasks for Data Analytics; SPMD and MPMD Execution; Batch Tasks; Data Partitions; Operations; Task Graph Scheduling; Threads, CPU Cores, and Partitions; Data Locality; Execution; Streaming Execution; State; Immutable Data; State in Driver; Distributed State; Streaming Tasks; Streams and Data Partitioning; Partitions; Operations; Scheduling; Uniform Resources; Resource-Aware Scheduling; Execution; Dynamic Scaling; Back Pressure (Flow Control); Rate-Based Flow Control; Credit-Based Flow Control; State; Summary; References
- Chapter 8, Case Studies: Apache Hadoop; Programming Model; Architecture; Cluster Resource Management; Apache Spark; Programming Model; RDD API; SQL, DataFrames, and DataSets; Architecture; Resource Managers; Task Schedulers; Executors; Communication Operations; Apache Spark Streaming; Apache Storm; Programming Model; Architecture; Cluster Resource Managers; Communication Operations; Kafka Streams; Programming Model; Architecture; PyTorch; Programming Model; Execution; Cylon; Programming Model; Architecture; Execution; Communication Operations; Rapids cuDF; Programming Model; Architecture; Summary; References
- Chapter 9, Fault Tolerance: Dependable Systems and Failures; Fault Tolerance Is Not Free; Dependable Systems; Failures; Process Failures; Network Failures; Node Failures; Byzantine Faults; Failure Models; Failure Detection; Recovering from Faults; Recovery Methods; Stateless Programs; Batch Systems; Streaming Systems; Processing Guarantees; Role of Cluster Resource Managers; Checkpointing; State; Consistent Global State; Uncoordinated Checkpointing; Coordinated Checkpointing; Chandy-Lamport Algorithm; Batch Systems; When to Checkpoint?; Snapshot Data; Streaming Systems; Case Study: Apache Storm; Message Tracking; Failure Recovery; Case Study: Apache Flink; Checkpointing; Failure Recovery; Batch Systems; Iterative Programs; Case Study: Apache Spark; RDD Recomputing; Checkpointing; Recovery from Failures; Summary; References
- Chapter 10, Performance and Productivity: Performance Metrics; System Performance Metrics; Parallel Performance Metrics; Speedup; Strong Scaling; Weak Scaling; Parallel Efficiency; Amdahl's Law; Gustafson's Law; Throughput; Latency; Benchmarks; LINPACK Benchmark; NAS Parallel Benchmark; BigDataBench; TPC Benchmarks; HiBench; Performance Factors; Memory; Execution; Distributed Operators; Disk I/O; Garbage Collection; Finding Issues; Serial Programs; Profiling; Scaling; Strong Scaling; Weak Scaling; Debugging Distributed Applications; Programming Languages; C/C++; Java; Memory Management; Data Structures; Interfacing with Python; Python; C/C++ Code Integration; Productivity; Choice of Frameworks; Operating Environment; CPUs and GPUs; Public Clouds; Future of Data-Intensive Applications; Summary; References
- Back matter: Index; EULA