Data Analysis with Python and PySpark
- Length: 425 pages
- Edition: 1
- Language: English
- Publisher: Manning
- Publication Date: 2022-03-15
- ISBN-10: 1617297208
- ISBN-13: 9781617297205
- Sales Rank: #4698696
When it comes to data analytics, it pays to think big. PySpark blends the powerful Spark big data processing engine with the Python programming language to provide a data analysis platform that can scale up for nearly any task. Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects.
Data Analysis with Python and PySpark is a carefully engineered tutorial that helps you use PySpark to deliver your data-driven applications at any scale. This clear and hands-on guide shows you how to enlarge your processing capabilities across multiple machines with data from any source, ranging from Hadoop-based clusters to Excel worksheets. You’ll learn how to break down big analysis tasks into manageable chunks and how to choose and use the best PySpark data abstraction for your unique needs.
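To give a feel for what that looks like in practice, here is a minimal sketch of the kind of data-frame pipeline the book teaches. It is an illustration, not a listing from the book: the application name, the sales.csv path, and the region/amount columns are placeholder assumptions.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start a local Spark session; the same code scales out by pointing
# the session at a cluster manager instead of a single machine.
spark = SparkSession.builder.appName("toy-analysis").getOrCreate()

# "sales.csv" is a placeholder path; any delimited file works here.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A typical data-frame pipeline: select, aggregate, sort, inspect.
(df.select("region", "amount")
   .groupBy("region")
   .agg(F.sum("amount").alias("total_amount"))
   .orderBy(F.desc("total_amount"))
   .show(5))
```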
Contents

Front matter: preface; acknowledgments; about this book (Who should read this book, How this book is organized: A road map, About the code, liveBook discussion forum); about the author; about the cover illustration

1 Introduction
  1.1 What is PySpark?
    1.1.1 Taking it from the start: What is Spark?
    1.1.2 PySpark = Spark + Python
    1.1.3 Why PySpark?
  1.2 Your very own factory: How PySpark works
    1.2.1 Some physical planning with the cluster manager
    1.2.2 A factory made efficient through a lazy leader
  1.3 What will you learn in this book?
  1.4 What do I need to get started?
  Summary

Part 1—Get acquainted: First steps in PySpark

2 Your first data program in PySpark
  2.1 Setting up the PySpark shell
    2.1.1 The SparkSession entry point
    2.1.2 Configuring how chatty spark is: The log level
  2.2 Mapping our program
  2.3 Ingest and explore: Setting the stage for data transformation
    2.3.1 Reading data into a data frame with spark.read
    2.3.2 From structure to content: Exploring our data frame with show()
  2.4 Simple column transformations: Moving from a sentence to a list of words
    2.4.1 Selecting specific columns using select()
    2.4.2 Transforming columns: Splitting a string into a list of words
    2.4.3 Renaming columns: alias and withColumnRenamed
    2.4.4 Reshaping your data: Exploding a list into rows
    2.4.5 Working with words: Changing case and removing punctuation
  2.5 Filtering rows
  Summary
  Additional exercises 2.2–2.7

3 Submitting and scaling your first PySpark program
  3.1 Grouping records: Counting word frequencies
  3.2 Ordering the results on the screen using orderBy
  3.3 Writing data from a data frame
  3.4 Putting it all together: Counting
    3.4.1 Simplifying your dependencies with PySpark’s import conventions
    3.4.2 Simplifying our program via method chaining
  3.5 Using spark-submit to launch your program in batch mode
  3.6 What didn’t happen in this chapter
  3.7 Scaling up our word frequency program
  Summary
  Additional exercises 3.3–3.6

4 Analyzing tabular data with pyspark.sql
  4.1 What is tabular data?
    4.1.1 How does PySpark represent tabular data?
  4.2 PySpark for analyzing and processing tabular data
  4.3 Reading and assessing delimited data in PySpark
    4.3.1 A first pass at the SparkReader specialized for CSV files
    4.3.2 Customizing the SparkReader object to read CSV data files
    4.3.3 Exploring the shape of our data universe
  4.4 The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing
    4.4.1 Knowing what we want: Selecting columns
    4.4.2 Keeping what we need: Deleting columns
    4.4.3 Creating what’s not there: New columns with withColumn()
    4.4.4 Tidying our data frame: Renaming and reordering columns
    4.4.5 Diagnosing a data frame with describe() and summary()
  Summary
  Additional exercises 4.3–4.4

5 Data frame gymnastics: Joining and grouping
  5.1 From many to one: Joining data
    5.1.1 What’s what in the world of joins
    5.1.2 Knowing our left from our right
    5.1.3 The rules to a successful join: The predicates
    5.1.4 How do you do it: The join method
    5.1.5 Naming conventions in the joining world
  5.2 Summarizing the data via groupby and GroupedData
    5.2.1 A simple groupby blueprint
    5.2.2 A column is a column: Using agg() with custom column definitions
  5.3 Taking care of null values: Drop and fill
    5.3.1 Dropping it like it’s hot: Using dropna() to remove records with null values
    5.3.2 Filling values to our heart’s content using fillna()
  5.4 What was our question again? Our end-to-end program
  Summary
  Additional exercises 5.4–5.7

Part 2—Get proficient: Translate your ideas into code

6 Multidimensional data frames: Using PySpark with JSON data
  6.1 Reading JSON data: Getting ready for the schemapocalypse
    6.1.1 Starting small: JSON data as a limited Python dictionary
    6.1.2 Going bigger: Reading JSON data in PySpark
  6.2 Breaking the second dimension with complex data types
    6.2.1 When you have more than one value: The array
    6.2.2 The map type: Keys and values within a column
  6.3 The struct: Nesting columns within columns
    6.3.1 Navigating structs as if they were nested columns
  6.4 Building and using the data frame schema
    6.4.1 Using Spark types as the base blocks of a schema
    6.4.2 Reading a JSON document with a strict schema in place
    6.4.3 Going full circle: Specifying your schemas in JSON
  6.5 Putting it all together: Reducing duplicate data with complex data types
    6.5.1 Getting to the “just right” data frame: Explode and collect
    6.5.2 Building your own hierarchies: Struct as a function
  Summary
  Additional exercises 6.4–6.8

7 Bilingual PySpark: Blending Python and SQL code
  7.1 Banking on what we know: pyspark.sql vs. plain SQL
  7.2 Preparing a data frame for SQL
    7.2.1 Promoting a data frame to a Spark table
    7.2.2 Using the Spark catalog
  7.3 SQL and PySpark
  7.4 Using SQL-like syntax within data frame methods
    7.4.1 Get the rows and columns you want: select and where
    7.4.2 Grouping similar records together: group by and order by
    7.4.3 Filtering after grouping using having
    7.4.4 Creating new tables/views using the CREATE keyword
    7.4.5 Adding data to our table using UNION and JOIN
    7.4.6 Organizing your SQL code better through subqueries and common table expressions
    7.4.7 A quick summary of PySpark vs. SQL syntax
  7.5 Simplifying our code: Blending SQL and Python
    7.5.1 Using Python to increase the resiliency and simplifying the data reading stage
    7.5.2 Using SQL-style expressions in PySpark
  7.6 Conclusion
  Summary
  Additional exercises 7.2–7.5

8 Extending PySpark with Python: RDD and UDFs
  8.1 PySpark, freestyle: The RDD
    8.1.1 Manipulating data the RDD way: map(), filter(), and reduce()
  8.2 Using Python to extend PySpark via UDFs
    8.2.1 It all starts with plain Python: Using typed Python functions
    8.2.2 From Python function to UDFs using udf()
  Summary
  Additional exercises 8.3–8.6

9 Big data is just a lot of small data: Using pandas UDFs
  9.1 Column transformations with pandas: Using Series UDF
    9.1.1 Connecting Spark to Google’s BigQuery
    9.1.2 Series to Series UDF: Column functions, but with pandas
    9.1.3 Scalar UDF + cold start = Iterator of Series UDF
  9.2 UDFs on grouped data: Aggregate and apply
    9.2.1 Group aggregate UDFs
    9.2.2 Group map UDF
  9.3 What to use, when
  Summary
  Additional exercises 9.2–9.5

10 Your data under a different lens: Window functions
  10.1 Growing and using a simple window function
    10.1.1 Identifying the coldest day of each year, the long way
    10.1.2 Creating and using a simple window function to get the coldest days
    10.1.3 Comparing both approaches
  10.2 Beyond summarizing: Using ranking and analytical functions
    10.2.1 Ranking functions: Quick, who’s first?
    10.2.2 Analytic functions: Looking back and ahead
  10.3 Flex those windows! Using row and range boundaries
    10.3.1 Counting, window style: Static, growing, unbounded
    10.3.2 What you are vs. where you are: Range vs. rows
  10.4 Going full circle: Using UDFs within windows
  10.5 Look in the window: The main steps to a successful window function
  Summary
  Additional exercises 10.4–10.7

11 Faster PySpark: Understanding Spark’s query planning
  11.1 Open sesame: Navigating the Spark UI to understand the environment
    11.1.1 Reviewing the configuration: The environment tab
    11.1.2 Greater than the sum of its parts: The Executors tab and resource management
    11.1.3 Look at what you’ve done: Diagnosing a completed job via the Spark UI
    11.1.4 Mapping the operations via Spark query plans: The SQL tab
    11.1.5 The core of Spark: The parsed, analyzed, optimized, and physical plans
  11.2 Thinking about performance: Operations and memory
    11.2.1 Narrow vs. wide operations
    11.2.2 Caching a data frame: Powerful, but often deadly (for perf)
  Summary

Part 3—Get confident: Using machine learning with PySpark

12 Setting the stage: Preparing features for machine learning
  12.1 Reading, exploring, and preparing our machine learning data set
    12.1.1 Standardizing column names using toDF()
    12.1.2 Exploring our data and getting our first feature columns
    12.1.3 Addressing data mishaps and building our first feature set
    12.1.4 Weeding out useless records and imputing binary features
    12.1.5 Taking care of extreme values: Cleaning continuous columns
    12.1.6 Weeding out the rare binary occurrence columns
  12.2 Feature creation and refinement
    12.2.1 Creating custom features
    12.2.2 Removing highly correlated features
  12.3 Feature preparation with transformers and estimators
    12.3.1 Imputing continuous features using the Imputer estimator
    12.3.2 Scaling our features using the MinMaxScaler estimator
  Summary

13 Robust machine learning with ML Pipelines
  13.1 Transformers and estimators: The building blocks of ML in Spark
    13.1.1 Data comes in, data comes out: The Transformer
    13.1.2 Data comes in, transformer comes out: The Estimator
  13.2 Building a (complete) machine learning pipeline
    13.2.1 Assembling the final data set with the vector column type
    13.2.2 Training an ML model using a LogisticRegression classifier
  13.3 Evaluating and optimizing our model
    13.3.1 Assessing model accuracy: Confusion matrix and evaluator object
    13.3.2 True positives vs. false positives: The ROC curve
    13.3.3 Optimizing hyperparameters with cross-validation
  13.4 Getting the biggest drivers from our model: Extracting the coefficients
  Summary

14 Building custom ML transformers and estimators
  14.1 Creating your own transformer
    14.1.1 Designing a transformer: Thinking in terms of Params and transformation
    14.1.2 Creating the Params of a transformer
    14.1.3 Getters and setters: Being a nice PySpark citizen
    14.1.4 Creating a custom transformer’s initialization function
    14.1.5 Creating our transformation function
    14.1.6 Using our transformer
  14.2 Creating your own estimator
    14.2.1 Designing our estimator: From model to params
    14.2.2 Implementing the companion model: Creating our own Mixin
    14.2.3 Creating the ExtremeValueCapper estimator
    14.2.4 Trying out our custom estimator
  14.3 Using our transformer and estimator in an ML pipeline
    14.3.1 Dealing with multiple inputCols
    14.3.2 In practice: Inserting custom components into an ML pipeline
  Summary

Conclusion: Have data, am happy!

Appendix A—Solutions to the exercises (chapters 2–11 and 13)

Appendix B—Installing PySpark
  B.1 Installing PySpark on your local machine
  B.2 Windows
    B.2.1 Install Java
    B.2.2 Install 7-zip
    B.2.3 Download and install Apache Spark
    B.2.4 Configure Spark to work seamlessly with Python
    B.2.5 Install Python
    B.2.6 Launching an IPython REPL and starting PySpark
    B.2.7 (Optional) Install and run Jupyter to use a Jupyter notebook
  B.3 macOS
    B.3.1 Install Homebrew
    B.3.2 Install Java and Spark
    B.3.3 Configure Spark to work seamlessly with Python
    B.3.4 Install Anaconda/Python
    B.3.5 Launching an IPython REPL and starting PySpark
    B.3.6 (Optional) Install and run Jupyter to use a Jupyter notebook
  B.4 GNU/Linux and WSL
    B.4.1 Install Java
    B.4.2 Installing Spark
    B.4.3 Configure Spark to work seamlessly with Python
    B.4.4 Install Python 3, IPython, and the PySpark package
    B.4.5 Launch PySpark with IPython
    B.4.6 (Optional) Install and run Jupyter to use a Jupyter notebook
  B.5 PySpark in the cloud
  B.6 AWS
  B.7 Azure
  B.8 GCP
  B.9 Databricks

Appendix C—Some useful Python concepts
  C.1 List comprehensions
  C.2 Packing and unpacking arguments (*args and **kwargs)
    C.2.1 Argument unpacking
    C.2.2 Argument packing
    C.2.3 Packing and unpacking keyword arguments
  C.3 Python’s typing and mypy/pyright
  C.4 Python closures and the PySpark transform() method
  C.5 Python decorators: Wrapping a function to change its behavior

Index
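As a taste of part 1, the sketch below approximates the word-frequency program that chapters 2 and 3 build up: read a text file, split lines into words, normalize and clean them, then group, count, order, and write the results. It is an illustration under assumptions, not the book's own listing; book.txt and the word_counts_csv output directory are placeholder names.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("word-frequency").getOrCreate()

# "book.txt" stands in for whatever plain-text source you analyze.
lines = spark.read.text("book.txt")

word_counts = (
    lines.select(F.split(F.col("value"), " ").alias("line"))   # sentence -> list of words
    .select(F.explode(F.col("line")).alias("word"))            # list -> one row per word
    .select(F.lower(F.col("word")).alias("word"))              # normalize case
    .select(F.regexp_extract(F.col("word"), "[a-z']+", 0).alias("word"))  # strip punctuation
    .where(F.col("word") != "")                                # drop empty tokens
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

word_counts.show(10)
word_counts.coalesce(1).write.csv("word_counts_csv", mode="overwrite")
```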