APACHE SPARK: INVENT THE FUTURE

Length: 482 pages
Edition: 1
Language: English
Publisher: Independently published
Publication Date: 2021-06-23
ISBN-10: B097SNB8T6
ISBN-13: 9798525708488
Sales Rank: #3583299

http://Ernesto.Net is the leader in high-tech training courseware in the fields of Data Science, Full Stack Programming, Big Data, and Blockchain. This book is the primary courseware artifact used in the Apache Spark Master Class. In addition to the book, the following materials are also available:

Complete Containerized Lab Environment (zero student setup, with hands-on labs all done in the browser!) powered by: http://FeNAgO.com
Student Workbook
Instructor Workbook
Customized Video Training (Optional)
Customizable Content (Optional)

CHAPTER 1: INTRODUCTION TO APACHE SPARK
    Theory
        An Overview of Big Data
        Quick Introduction to Hadoop
            Why Hadoop?
        Quick Introduction to Hadoop Distributed File System
            Block Placement in HDFS
            HDFS Architecture
        Introduction to MapReduce
            Architecture of MapReduce
LAB EXERCISE
SUMMARY
REFERENCES
CHAPTER 2: PROGRAMMING WITH SCALA
    Theory
        What is Scala?
        Why Scala?
        Data Types in Scala
        Functions in Scala
        Collections in Scala
        Coding Scala
        Conclusion
AIM
LAB EXERCISE 1: PROGRAMMING WITH SCALA
    Task 1: Download and Install JDK
    Task 2: Download and Install Scala
    Task 3: Scala Basics
    Task 4: Loops
    Task 5: Functions
    Task 6: Collections
LAB CHALLENGE
SUMMARY
REFERENCES
CHAPTER 3: HANDS ON SPARK
    Theory
        Introduction to RDD
        Architecture of Spark
AIM
LAB EXERCISE 2: HANDS ON SPARK
    Task 1: Download and Install Spark
    Task 2: Installing Spark on Multi-Node Cluster
    Task 3: Creating RDDs from Spark-Shell
    Task 4: Basic RDD operations
    Task 5: Download and Install IntelliJ IDEA
    Task 6: Configuring IntelliJ IDEA
SUMMARY
REFERENCES
CHAPTER 4: INTERNALS OF SPARK
    Theory
        Characteristics of RDD
        RDD Operations
            RDD Transformations
            RDD Actions
        Lineage Graph
        Directed Acyclic Graph
AIM
LAB EXERCISE 3: SPARK PROGRAM
    Task 1: Creating a new package in IntelliJ IDEA
    Task 2: Spark Program – Loading Data
    Task 3: Spark Program – Performing Operations
    Task 4: Spark Program – Saving Data
    Task 5: Spark Program – Lineage Graph
    Task 6: Spark Web Interface
SUMMARY
REFERENCES
CHAPTER 5: RDD KEY-VALUE PAIRS & CACHING
    Theory
        Paired RDD
            Paired RDD Transformations
            Two Paired RDD Transformations
            Paired RDD Actions
        RDD Caching and Persistence
            Persistence Storage Levels
AIM
LAB EXERCISE 4: PAIRED RDD – HANDS ON
    Task 1: Creating a Tuple
    Task 2: Creating a Paired RDD
    Task 3: Performing Operations on Paired RDD
    Task 4: Performing more Operations on Paired RDD
    Task 5: Performing Joins on Paired RDDs
    Task 6: Performing Actions on Paired RDDs
LAB CHALLENGE
SUMMARY
REFERENCES
CHAPTER 6: SHARED VARIABLES
    Theory
        What are Shared Variables?
        Why Shared Variables?
        Broadcast Variables
            Optimizing Broadcast Variables
        Accumulators
            Points to remember when Accumulators are used
        Scala Monadic Collections
            Either Monadic Collection
            Option Monadic Collection
            Try Monadic Collection
AIM
LAB EXERCISE 5: SHARED VARIABLES – HANDS ON
    Task 1: Using Accumulator method
    Task 2: Implementing Record Parser
    Task 3: Implementing Counters
    Task 4: Implementing Accumulators V2
    Task 5: Implementing Custom Accumulators V2
    Task 6: Using Broadcast Variables
SUMMARY
REFERENCES
CHAPTER 7: SPARK SQL
    Theory
        Types of Data
        What is Spark SQL?
        Why Spark SQL?
        Spark SQL Architecture
AIM
LAB EXERCISE 6: SPARK SQL – HANDS ON
    Task 1: Creating DataFrame using Data Source API
    Task 2: Creating DataFrame from an RDD
    Task 3: Creating DataFrame using StructType
    Task 4: Querying data using Spark SQL
    Task 5: Joins using Spark SQL
    Task 6: Operations using DataFrame API
SUMMARY
REFERENCES
CHAPTER 8: DATASETS
    Theory
        RDD vs. DataFrame
        What are Datasets?
        Why Datasets?
AIM
LAB EXERCISE 7: DATASETS & FUNCTIONS
    Task 1: Creating Dataset using Data Source API
    Task 2: Creating Dataset from an RDD
    Task 3: Aggregate and Collection Functions
        Aggregate Functions
        Collection Functions
    Task 4: Date/Time Functions
    Task 5: Math and String Functions
        Math Functions
        String Functions
    Task 6: Window Functions
SUMMARY
REFERENCES
CHAPTER 9: USER-DEFINED FUNCTIONS
    Theory
        Why User-Defined Functions?
        Steps to implement User-Defined Function
        UDAF Types
        Function currying in Scala
        Partially applied functions in Scala
AIM
LAB EXERCISE 8: USER-DEFINED FUNCTIONS
    Task 1: Defining Currying Functions
    Task 2: Using partially applied functions
    Task 3: Writing a User-Defined Function
    Task 4: Writing Untyped UDAF
    Task 5: Using Untyped UDAF
    Task 6: Typed UDAF
SUMMARY
REFERENCES
CHAPTER 10: FILE FORMATS
    Theory
        DataSource API
            Reading Data
                Read Modes
            Writing Data
                Save Modes
            Text Files
            CSV Files
            JSON
            Parquet Files
            ORC Files
        RDD API
            Text Files
            Sequence Files
            Hadoop Files
AIM
LAB EXERCISE 9: USING FILE FORMATS
    Task 1: Text Files
        RDD API
        DataSource API
    Task 2: CSV Files
    Task 3: JSON Files
    Task 4: Parquet Files
    Task 5: ORC Files
    Task 6: Hadoop and Sequence Files
        Sequence Files
        Hadoop Files
SUMMARY
REFERENCES
CHAPTER 11: SPARK CONFIGURATIONS & OPTIMIZATIONS
    Theory
        Spark Configurations
            Spark Configuration Properties
            Environment Variables
            Logging
        Performance Optimization
            Using Datasets extensively
            Avoiding UDF and UDAF
            Data Serialization
            Spark Memory Tuning
            Level of Parallelism
            Levels of Data Locality
            Use Broadcast Variables
            Filter Data as soon as possible
            Logs
            More Power
AIM
LAB EXERCISE 10: SPARK CONFIGURATIONS & OPTIMIZATIONS
    Task 1: Spark Configuration File
    Task 2: Using spark-submit Tool
    Task 3: Environment Variables File
    Task 4: Logging Properties File
    Task 5: Checking Log Files
SUMMARY
REFERENCES