Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark
- Length: 500 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-01-18
- ISBN-10: 1492082384
- ISBN-13: 9781492082385
- Sales Rank: #4235868
Apache Spark’s speed, ease of use, sophisticated analytics, and multilanguage support make practical knowledge of this cluster-computing framework a required skill for data engineers and data scientists. With this hands-on guide, anyone looking for an introduction to Spark will learn practical algorithms and examples using PySpark.
In each chapter, author Mahmoud Parsian shows you how to solve a data problem with a set of Spark transformations and algorithms. You’ll learn how to tackle problems involving ETL, design patterns, machine learning algorithms, data partitioning, and genomics analysis. Each detailed recipe includes PySpark algorithms using the PySpark driver and shell script.
With this book, you will:
- Learn how to select Spark transformations for optimized solutions
- Explore powerful transformations and reductions including reduceByKey(), combineByKey(), and mapPartitions()
- Understand data partitioning for optimized queries
- Design machine learning algorithms including Naive Bayes, linear regression, and logistic regression
- Build and apply a model using PySpark design patterns
- Apply motif-finding algorithms to graph data
- Analyze graph data by using the GraphFrames API
- Apply PySpark algorithms to clinical and genomics data (such as DNA-Seq)
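The reductions called out above (reduceByKey(), combineByKey()) all merge the values of each key with a user-supplied function. As a rough plain-Python sketch of what reduceByKey() computes on (key, value) pairs (the helper name and sample data are illustrative, not from the book):

```python
from typing import Callable, Iterable, TypeVar

K = TypeVar("K")
V = TypeVar("V")

def reduce_by_key(pairs: Iterable[tuple[K, V]],
                  func: Callable[[V, V], V]) -> list[tuple[K, V]]:
    """Emulate Spark's reduceByKey(): merge all values for each key with func."""
    merged: dict[K, V] = {}
    for key, value in pairs:
        # First value for a key is kept as-is; later values are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# Example: sum the ratings recorded per movie.
ratings = [("A", 4), ("B", 3), ("A", 5), ("B", 2), ("A", 3)]
totals = dict(reduce_by_key(ratings, lambda x, y: x + y))
print(totals)  # {'A': 12, 'B': 5}
```

Because the merge function here is associative and commutative, Spark can apply it within each partition before shuffling, which is what makes reduceByKey() cheaper than groupByKey().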
Table of Contents

Preface
- Why you wrote this book
- Who this book is for
- How this book is organized (Chapters 1-4; Chapters 5-8; Chapters 9-12; Bonus chapters)
- Conventions Used in This Book
- Using Code Examples
- O’Reilly Online Learning
- How to Contact Us
- Acknowledgments

1. Introduction to Data Algorithms
- Why Spark for Data Analytics (Spark’s Ecosystem; Spark Architecture)
- The Power of PySpark (PySpark Architecture)
- Spark Data Abstractions (RDD Example)
- Spark’s Operations (Transformations; Actions; Don’t collect() on Large RDDs; DataFrame Example)
- Using PySpark Shell Programming (Step 1: Enter into PySpark Shell; Step 2: Create RDD from Collection; Step 3: Aggregate and Merge Values of Keys; Step 4: Filter RDD’s Elements; Step 5: Group Similar Keys; Step 6: Aggregate Values for Similar Keys)
- ETL Example (Extraction; Transformation; Loading)
- Summary

2. Transformations in Action
- DNA Base Count Example (DNA Base Count Problem; FASTA Format; FASTA Example)
- DNA Base Count Solution 1 (Step 1: Create an RDD of String from Input; Step 2: Define a Mapper Function; Step 3: Find Frequencies of DNA Letters; Pros and Cons of Solution 1)
- DNA Base Count Solution 2 (Step 1: Create an RDD[String] from Input; Step 2: Define a Mapper Function; Step 3: Find Frequencies of DNA Letters; Pros and Cons of Solution 2)
- DNA Base Count Solution 3 (Understanding the mapPartitions() Transformation; Step 1: Create an RDD of String from Input; Step 2: Define a Function to Handle a Partition; Step 3: Apply Custom Function to Each Partition; Pros and Cons of Solution 3)
- Introduction to Handling Empty Partitions
- Summary

3. Mapper Transformations
- Data Abstractions and Mappers
- What Are Transformations? (Creating New RDDs; Lazy Transformations)
- The map() Transformation (Custom Map Functions)
- The flatMap() Transformation (map() vs. flatMap())
- The mapValues() Transformation
- The flatMapValues() Transformation
- The mapPartitions() Transformation (Handling Empty Partitions; Benefits of the mapPartitions() Transformation; Drawbacks of the mapPartitions() Transformation)
- Summary

4. Reductions in Spark
- Creating Pair RDDs
- Reducer Transformations
- Spark’s Reductions
- Simple Warmup Example (Solution by reduceByKey(); Solution by groupByKey(); Solution by aggregateByKey(); Solution by combineByKey())
- What Is a Monoid? (Monoid Examples; Non-Monoid Examples)
- The Movie Problem (Input Data Set to Analyze; How Does aggregateByKey() Work?; First Solution Using aggregateByKey(); Second Solution Using aggregateByKey(); Complete PySpark Solution by groupByKey(); Complete PySpark Solution Using reduceByKey(); Complete PySpark Solution Using combineByKey())
- Shuffle Step in Reductions (Shuffle Step for groupByKey(); Shuffle Step for reduceByKey())
- Summary

5. Partitioning Data
- Introduction to Partitions (Partitions in Spark)
- Managing Partitions (Default Partitioning; Explicit Partitioning)
- Physical Partitioning for SQL Queries (Physical Partitioning Example)
- Physical Partitioning Data in Spark (Partition as Text Format; Partition as Parquet Format)
- How to Query Partitioned Data (Amazon Athena Example)
- Summary

6. Graph Algorithms
- Introduction to Graphs
- GraphFrames API (How to Use GraphFrames; GraphFrames Attributes and Structures)
- GraphFrame’s Algorithms (Finding Triangles; Motif Finding; Subgraphs)
- Real-World Applications (Gene Analysis Problem; Social Recommendation; Facebook Circles; Connected Components; Analyzing Flight Data; Creating Undirected Graph)
- Summary

7. Interacting with External Data Sources
- Relational Databases (Read from JDBC; Write DataFrame to JDBC)
- Reading Text Files
- Read/Write CSV Files (Reading CSV Files; Writing CSV Files)
- Read/Write JSON Files (Reading JSON Files; Writing JSON Files)
- Amazon S3 (Reading from Amazon S3; Writing to Amazon S3)
- Read/Write Hadoop Files (Read Hadoop Text Files; Write Hadoop Text Files; Read/Write HDFS SequenceFiles)
- Parquet Files (Write Parquet Files; Read Parquet Files)
- Handling Avro Files (Read Avro Files; Write Avro Files)
- Image Data Sources (Image Directory; Creating a DataFrame from Images)
- Read/Write MS SQL Server (Write MS SQL Server; Read MS SQL Server)
- Summary

8. Ranking Algorithms
- Rank Product (Application of Rank Product; Calculation of the Rank Product; Formalizing Rank Product)
- PySpark Solution (Input Data Format; Output Data Format; Rank Product Solution by combineByKey(); Rank Product Solution by groupByKey())
- PageRank Algorithm (PageRank’s Iterative Computation; Custom PageRank in PySpark, Solution 1; Custom PageRank in PySpark, Solution 2; PageRank by GraphFrames)
- Summary

9. Fundamental Data Design Patterns
- Input-Map-Output
- Input-Filter-Output (RDD Solution; DataFrame Solution)
- Input-Map-Reduce-Output
- Input-Multiple-Maps-Reduce-Output (Solution by RDD; Solution by DataFrame)
- Input-Map-Combiner-Reduce-Output
- Input-MapPartitions-Reduce-Output
- Inverted Index Pattern (Problem Statement; Input; Output; PySpark Solution)
- Summary

10. Common Data Design Patterns
- InMapper Combining (Basic MapReduce Design Pattern; InMapper Combining Per Record; InMapper Combiner Per Partition)
- Top-10 (Top-N Formalized; PySpark Solution; Implementation in PySpark; How to Find Bottom-10)
- MinMax (Solution 1: Classic MapReduce; Solution 2: Sorting; Solution 3: Spark’s mapPartitions(); PySpark Solution)
- The Composite Pattern and Monoids (Monoids; Monoidic and Non-Monoidic Examples; Non-Monoid Example; Monoid Example; PySpark Implementation of Monoidized Mean; Functors and Monoids; Conclusion on Using Monoids)
- Binning: Data Organization Pattern
- Sorting: Data Organization Pattern
- Summary

11. Join Design Patterns
- Introduction to the Join Operation
- Join in MapReduce (Map Phase; Reducer Phase; Implementation in PySpark)
- Map-Side Join by RDDs
- Map-Side Join by DataFrames (Step 1: Create Cache for Airports; Step 2: Create Cache for Airlines; Step 3: Create Facts Table; Step 4: Apply Map-Side Join)
- Efficient Joins Using Bloom Filters (Bloom Filter; A Simple Bloom Filter Example; Bloom Filter in Python; Using Bloom Filter in PySpark)
- Summary

12. Feature Engineering in PySpark
- Introduction to Feature Engineering
- Adding a New Feature (Applying UDF; Pipeline)
- Binarizer
- Imputer
- Tokenization (Tokenizer; RegexTokenizer; Tokenization with Pipeline)
- Standardization
- Normalization (Scaling a Column Using Pipeline; MinMaxScaler on Multiple Columns; Normalization Using Normalizer)
- String Indexing (Apply StringIndexer to a Single Column; Apply StringIndexer to Several Columns)
- Vector Assembler
- Bucketing (QuantileDiscretizer)
- Logarithm Transformation
- One-Hot Encoding
- TF-IDF
- FeatureHasher
- SQL Transformer
- Summary

About the Author
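Chapter 2’s third DNA base-count solution is built on mapPartitions(), which hands a whole partition of records to one function so each partition emits a single small summary. A rough plain-Python sketch of that per-partition counting idea (the function name and FASTA-like records are illustrative, not taken from the book’s code):

```python
from collections import Counter

def count_bases_in_partition(lines):
    """Handle one partition: fold all of its FASTA lines into a single
    Counter of base frequencies, yielding one small result per partition.
    An empty partition is handled too (it just yields an empty Counter)."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):      # FASTA header line, not sequence data
            continue
        counts.update(line.strip().upper())
    yield counts

# Simulate three partitions of a FASTA file, including an empty one.
partitions = [
    [">seq1", "ACGT", "AACC"],
    [">seq2", "GGTA"],
    [],                               # empty partitions must not crash the job
]

# The final reduce step merges the per-partition Counters into one total.
total = sum((c for part in partitions
               for c in count_bases_in_partition(part)), Counter())
print(dict(total))  # {'A': 4, 'C': 3, 'G': 3, 'T': 2}
```

The payoff mirrors the book’s Solution 3: instead of shuffling one record per DNA letter, each partition contributes one small dictionary, so the reduce step touches far less data.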
How to download the source code?
1. Go to https://www.oreilly.com/
2. Search for the book title: Data Algorithms with Spark: Recipes and Design Patterns for Scaling Up using PySpark. If the search returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.

To download from this site:
1. Disable the AdBlock plugin; otherwise the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to complete the download.