Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark
The amount of data being generated today is staggering, and it keeps growing. Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark’s Python API, along with other best practices in Spark programming.
Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques (classification, clustering, collaborative filtering, and anomaly detection) to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.
If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.
- Familiarize yourself with Spark’s programming model and ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public datasets
- Discover which machine learning tools make sense for particular problems
- Explore code that can be adapted to many uses
1. Analyzing Big Data: Working with Big Data; Introducing Apache Spark and PySpark; Components; PySpark; Ecosystem; Spark 3.0; PySpark Addresses Challenges of Data Science; Where to Go from Here
2. Introduction to Data Analysis with PySpark: Spark Architecture; Installing PySpark; Setting Up Our Data; Analyzing Data with the DataFrame API; Fast Summary Statistics for DataFrames; Pivoting and Reshaping DataFrames; Joining DataFrames and Selecting Features; Scoring and Model Evaluation; Where to Go from Here
3. Recommending Music and the Audioscrobbler Data Set: Setting Up the Data; Our Requirements from a Recommender System; Alternating Least Squares Algorithm; Preparing the Data; Building a First Model; Spot Checking Recommendations; Evaluating Recommendation Quality; Computing AUC; Hyperparameter Selection; Making Recommendations; Where to Go from Here
4. Predicting Forest Cover with Decision Trees: Decision Trees and Forests; Preparing the Data; Our First Decision Tree; Decision Tree Hyperparameters; Tuning Decision Trees; Categorical Features Revisited; Random Forests; Making Predictions; Where to Go from Here
5. Anomaly Detection in Network Traffic with K-means Clustering: Anomaly Detection; K-means Clustering; Network Intrusion; KDD Cup 1999 Data Set; A First Take on Clustering; Choosing k; Visualization with SparkR; Feature Normalization; Categorical Variables; Using Labels with Entropy; Clustering in Action; Where to Go from Here
6. Geospatial and Temporal Data Analysis on New York City Taxi Trip Data: Preparing the Data; Converting Datetime Strings to Timestamps; Handling Invalid Records; Geospatial Analysis; Intro to GeoJSON; GeoPandas; Sessionization in PySpark; Building Sessions: Secondary Sorts in PySpark; Where to Go from Here
7. Estimating Financial Risk: Terminology; Methods for Calculating VaR; Variance-Covariance; Historical Simulation; Monte Carlo Simulation; Our Model; Getting the Data; Preprocessing; Determining the Factor Weights; Sampling; The Multivariate Normal Distribution; Running the Trials; Visualizing the Distribution of Returns; Evaluating Our Results; Where to Go from Here
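To give a flavor of the kind of metric the book works through, consider "Computing AUC" from Chapter 3. This is not the book's own code (the book presumably uses Spark's built-in evaluators at scale); it is a minimal plain-Python sketch of the rank-statistic (Mann-Whitney) formulation of AUC, assuming no tied scores:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-statistic formulation.

    scores: predicted scores; labels: 0/1 ground truth.
    Assumes no tied scores.
    """
    # Pair each score with its label and sort by score, ascending.
    ranked = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum the 1-based ranks of the positive examples.
    rank_sum = sum(rank for rank, (_, label) in enumerate(ranked, start=1)
                   if label == 1)
    # Subtract the minimum possible rank sum, normalize by total pos/neg pairs.
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The result equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, which is the intuition the chapter builds on when evaluating recommendation quality.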
How to download the source code
1. Go to:
2. Search for the book title: Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark. If no results appear, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher Resources section, click Download Example Code.
1. Disable any ad-blocking plugin; otherwise, the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be taken to the download server to download the file.