Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark
- Length: 500 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-01-18
- ISBN-10: 1492082384
- ISBN-13: 9781492082385
- Sales Rank: #4235868
Apache Spark’s speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.
In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You’ll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.
With this book, you will:
- Learn how to select Spark transformations for optimized solutions
- Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
- Understand data partitioning for optimized queries
- Design machine learning algorithms including Naive Bayes, linear regression, and logistic regression
- Build and apply a model using PySpark design patterns
- Apply motif-finding algorithms to graph data
- Analyze graph data by using the GraphFrames API
- Apply PySpark algorithms to clinical and genomics data (such as DNA-Seq)
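The reductions called out above (reduceByKey(), combineByKey()) all merge the values of each key with a user-supplied function. As a rough plain-Python sketch of what reduceByKey() computes on (key, value) pairs (the helper name and sample data are illustrative, not from the book):

```python
from typing import Callable, Iterable, TypeVar

K = TypeVar("K")
V = TypeVar("V")

def reduce_by_key(pairs: Iterable[tuple[K, V]],
                  func: Callable[[V, V], V]) -> list[tuple[K, V]]:
    """Emulate Spark's reduceByKey(): merge all values for each key with func."""
    merged: dict[K, V] = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later values are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# Example: sum the ratings recorded per movie.
ratings = [("A", 4), ("B", 3), ("A", 5), ("B", 2), ("A", 3)]
totals = dict(reduce_by_key(ratings, lambda x, y: x + y))
print(totals)  # {'A': 12, 'B': 5}
```

Because the merge function here is associative and commutative, Spark can apply it within each partition before shuffling, which is what makes reduceByKey() cheaper than groupByKey().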
Table of Contents

Preface
- Why you wrote this book
- Who this book is for
- How this book is organized (Chapters 1-4; Chapters 5-8; Chapters 9-12; Bonus chapters)
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments

1. Introduction to Data Algorithms
- Why Spark for Data Analytics (Spark’s Ecosystem; Spark Architecture)
- The Power of PySpark (PySpark Architecture)
- Spark Data Abstractions (RDD Example)
- Spark’s Operations (Transformations; Actions; Don’t collect() on Large RDDs; DataFrame Example)
- Using PySpark Shell Programming (Step 1: Enter into PySpark Shell; Step 2: Create RDD from Collection; Step 3: Aggregate and Merge Values of Keys; Step 4: Filter RDD’s Elements; Step 5: Group Similar Keys; Step 6: Aggregate Values for Similar Keys)
- ETL Example (Extraction; Transformation; Loading)
- Summary

2. Transformations in Action
- DNA Base Count Example (DNA Base Count Problem; FASTA Format; FASTA Example)
- DNA Base Count Solution 1 (Step 1: Create an RDD of String from Input; Step 2: Define a Mapper Function; Step 3: Find Frequencies of DNA Letters; Pros and Cons of Solution 1)
- DNA Base Count Solution 2 (Step 1: Create an RDD[String] from Input; Step 2: Define a Mapper Function; Step 3: Find Frequencies of DNA Letters; Pros and Cons of Solution 2)
- DNA Base Count Solution 3 (Understanding the mapPartitions() Transformation; Step 1: Create an RDD of String from Input; Step 2: Define a Function to Handle a Partition; Step 3: Apply Custom Function to Each Partition; Pros and Cons of Solution 3)
- Introduction to Handling Empty Partitions
- Summary

3. Mapper Transformations
- Data Abstractions and Mappers
- What Are Transformations? (Creating New RDDs; Lazy Transformations)
- The map() Transformation (Custom Map Functions)
- The flatMap() Transformation (map() vs. flatMap())
- The mapValues() Transformation
- The flatMapValues() Transformation
- The mapPartitions() Transformation (Handling Empty Partitions; Benefits of the mapPartitions() Transformation; Drawbacks of the mapPartitions() Transformation)
- Summary

4. Reductions in Spark
- Creating Pair RDDs
- Reducer Transformations
- Spark’s Reductions
- Simple Warmup Example (Solution by reduceByKey(); Solution by groupByKey(); Solution by aggregateByKey(); Solution by combineByKey())
- What Is a Monoid? (Monoid Examples; Non-Monoid Examples)
- The Movie Problem (Input Data Set to Analyze; How Does aggregateByKey() Work?; First Solution Using aggregateByKey(); Second Solution Using aggregateByKey(); Complete PySpark Solution by groupByKey(); Complete PySpark Solution Using reduceByKey(); Complete PySpark Solution Using combineByKey())
- Shuffle Step in Reductions (Shuffle Step for groupByKey(); Shuffle Step for reduceByKey())
- Summary

5. Partitioning Data
- Introduction to Partitions (Partitions in Spark)
- Managing Partitions (Default Partitioning; Explicit Partitioning)
- Physical Partitioning for SQL Queries (Physical Partitioning Example)
- Physical Partitioning Data in Spark (Partition as Text Format; Partition as Parquet Format)
- How to Query Partitioned Data (Amazon Athena Example)
- Summary

6. Graph Algorithms
- Introduction to Graphs
- GraphFrames API (How to Use GraphFrames; GraphFrames Attributes and Structures)
- GraphFrame’s Algorithms (Finding Triangles; Motif Finding; Subgraphs)
- Real-World Applications (Gene Analysis Problem; Social Recommendation; Facebook Circles; Connected Components; Analyzing Flight Data; Creating Undirected Graph)
- Summary

7. Interacting with External Data Sources
- Relational Databases (Read from JDBC; Write DataFrame to JDBC)
- Reading Text Files
- Read/Write CSV Files (Reading CSV Files; Writing CSV Files)
- Read/Write JSON Files (Reading JSON Files; Writing JSON Files)
- Amazon S3 (Reading from Amazon S3; Writing to Amazon S3)
- Read/Write Hadoop Files (Read Hadoop Text Files; Write Hadoop Text Files; Read/Write HDFS SequenceFiles)
- Parquet Files (Write Parquet Files; Read Parquet Files)
- Handling Avro Files (Read Avro Files; Write Avro Files)
- Image Data Sources (Image Directory; Creating a DataFrame from Images)
- Read/Write MS SQL Server (Write MS SQL Server; Read MS SQL Server)
- Summary

8. Ranking Algorithms
- Rank Product (Application of Rank Product; Calculation of the Rank Product; Formalizing Rank Product)
- PySpark Solution (Input Data Format; Output Data Format; Rank Product Solution by combineByKey(); Rank Product Solution by groupByKey())
- PageRank Algorithm (PageRank’s Iterative Computation; Custom PageRank in PySpark, Solution 1; Custom PageRank in PySpark, Solution 2; PageRank by GraphFrames)
- Summary

9. Fundamental Data Design Patterns
- Input-Map-Output
- Input-Filter-Output (RDD Solution; DataFrame Solution)
- Input-Map-Reduce-Output
- Input-Multiple-Maps-Reduce-Output (Solution by RDD; Solution by DataFrame)
- Input-Map-Combiner-Reduce-Output
- Input-MapPartitions-Reduce-Output
- Inverted Index Pattern (Problem Statement; Input; Output; PySpark Solution)
- Summary

10. Common Data Design Patterns
- InMapper Combining (Basic MapReduce Design Pattern; InMapper Combining Per Record; InMapper Combiner Per Partition)
- Top-10 (Top-N Formalized; PySpark Solution; Implementation in PySpark; How to Find Bottom-10)
- MinMax (Solution 1: Classic MapReduce; Solution 2: Sorting; Solution 3: Spark’s mapPartitions(); PySpark Solution)
- The Composite Pattern and Monoids (Monoids; Monoidic and Non-Monoidic Examples; Non-Monoid Example; Monoid Example; PySpark Implementation of Monoidized Mean; Functors and Monoids; Conclusion on Using Monoids)
- Binning: Data Organization Pattern
- Sorting: Data Organization Pattern
- Summary

11. Join Design Patterns
- Introduction to the Join Operation
- Join in MapReduce (Map Phase; Reducer Phase; Implementation in PySpark)
- Map-Side Join by RDDs
- Map-Side Join by DataFrames (Step 1: Create Cache for Airports; Step 2: Create Cache for Airlines; Step 3: Create Facts Table; Step 4: Apply Map-Side Join)
- Efficient Joins Using Bloom Filters (Bloom Filter; A Simple Bloom Filter Example; Bloom Filter in Python; Using Bloom Filter in PySpark)
- Summary

12. Feature Engineering in PySpark
- Introduction to Feature Engineering
- Adding a New Feature (Applying UDF; Pipeline)
- Binarizer
- Imputer
- Tokenization (Tokenizer; RegexTokenizer; Tokenization with Pipeline)
- Standardization
- Normalization (Scaling a Column Using Pipeline; MinMaxScaler on Multiple Columns; Normalization Using Normalizer)
- String Indexing (Apply StringIndexer to a Single Column; Apply StringIndexer to Several Columns)
- Vector Assembler
- Bucketing (QuantileDiscretizer)
- Logarithm Transformation
- One-Hot Encoding
- TF-IDF
- FeatureHasher
- SQL Transformer
- Summary

About the Author
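Chapter 2’s third DNA base-count solution is built on mapPartitions(), which hands a whole partition of records to one function so each partition emits a single small summary. A rough plain-Python sketch of that per-partition counting idea (the function name and FASTA-like records are illustrative, not taken from the book’s code):

```python
from collections import Counter

def count_bases_in_partition(lines):
    """Handle one partition: fold all of its FASTA lines into a single
    Counter of base frequencies, yielding one small result per partition.
    An empty partition is handled too (it just yields an empty Counter)."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):      # FASTA header line, not sequence data
            continue
        counts.update(line.strip().upper())
    yield counts

# Simulate three partitions of a FASTA file, including an empty one.
partitions = [
    [">seq1", "ACGT", "AACC"],
    [">seq2", "GGTA"],
    [],                               # empty partitions must not crash the job
]

# The final reduce step merges the per-partition Counters into one total.
total = sum((c for part in partitions
               for c in count_bases_in_partition(part)), Counter())
print(dict(total))  # {'A': 4, 'C': 3, 'G': 3, 'T': 2}
```

The payoff mirrors the book’s Solution 3: instead of shuffling one record per DNA letter, each partition contributes one small dictionary, so the reduce step touches far less data.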
How to download the source code?
1. Go to https://www.oreilly.com/
2. Search for the book title: Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.

To download from this site:
1. Disable the AdBlock plugin; otherwise the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to complete the download.