Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling

by Edgar Ruiz, Javier Luraschi, Kevin Kuo

Length: 296 pages
Edition: 1
Language: English
Publisher: O'Reilly Media
Publication Date: 2019-11-05
ISBN-10: 149204637X
ISBN-13: 9781492046370
Sales Rank: #1135456

If you’re like most R users, you have deep knowledge and love for statistics. But as your organization continues to collect huge amounts of data, adding tools such as Apache Spark makes a lot of sense. With this practical book, data scientists and professionals working with large-scale data applications will learn how to use Spark from R to tackle big data and big compute problems.

Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to use R with Spark to solve different data analysis problems. This book covers relevant data science topics, cluster computing, and issues that should interest even the most advanced users.

Analyze, explore, transform, and visualize data in Apache Spark with R
Create statistical models to extract information and predict outcomes; automate the process in production-ready workflows
Perform analysis and modeling across many machines using distributed computing techniques
Use large-scale data from multiple sources and different formats with ease from within Spark
Learn about alternative modeling frameworks for graph processing, geospatial analysis, and genomics at scale
Dive into advanced topics including custom transformations, real-time data processing, and creating custom Spark extensions

Cover
Copyright
Table of Contents
Foreword
Preface
	Formatting
	Acknowledgments
	Conventions Used in This Book
	Using Code Examples
	O’Reilly Online Learning
	How to Contact Us
Chapter 1. Introduction
	Overview
	Hadoop
	Spark
	R
	sparklyr
	Recap
Chapter 2. Getting Started
	Overview
	Prerequisites
		Installing sparklyr
		Installing Spark
	Connecting
	Using Spark
		Web Interface
		Analysis
		Modeling
		Data
		Extensions
		Distributed R
		Streaming
		Logs
	Disconnecting
	Using RStudio
	Resources
	Recap
Chapter 3. Analysis
	Overview
	Import
	Wrangle
		Built-in Functions
		Correlations
	Visualize
		Using ggplot2
		Using dbplot
	Model
		Caching
	Communicate
	Recap
Chapter 4. Modeling
	Overview
	Exploratory Data Analysis
	Feature Engineering
	Supervised Learning
		Generalized Linear Regression
		Other Models
	Unsupervised Learning
		Data Preparation
		Topic Modeling
	Recap
Chapter 5. Pipelines
	Overview
	Creation
	Use Cases
		Hyperparameter Tuning
	Operating Modes
	Interoperability
	Deployment
		Batch Scoring
		Real-Time Scoring
	Recap
Chapter 6. Clusters
	Overview
	On-Premises
		Managers
		Distributions
	Cloud
		Amazon
		Databricks
		Google
		IBM
		Microsoft
		Qubole
	Kubernetes
	Tools
		RStudio
		Jupyter
		Livy
	Recap
Chapter 7. Connections
	Overview
		Edge Nodes
		Spark Home
	Local
	Standalone
	YARN
		YARN Client
		YARN Cluster
	Livy
	Mesos
	Kubernetes
	Cloud
	Batches
	Tools
	Multiple Connections
	Troubleshooting
		Logging
		Spark Submit
		Windows
	Recap
Chapter 8. Data
	Overview
	Reading Data
		Paths
		Schema
		Memory
		Columns
	Writing Data
	Copying Data
	File Formats
		CSV
		JSON
		Parquet
		Others
	File Systems
	Storage Systems
		Hive
		Cassandra
		JDBC
	Recap
Chapter 9. Tuning
	Overview
		Graph
		Timeline
	Configuring
		Connect Settings
		Submit Settings
		Runtime Settings
		sparklyr Settings
	Partitioning
		Implicit Partitions
		Explicit Partitions
	Caching
		Checkpointing
		Memory
	Shuffling
	Serialization
	Configuration Files
	Recap
Chapter 10. Extensions
	Overview
	H2O
	Graphs
	XGBoost
	Deep Learning
	Genomics
	Spatial
	Troubleshooting
	Recap
Chapter 11. Distributed R
	Overview
	Use Cases
		Custom Parsers
		Partitioned Modeling
		Grid Search
		Web APIs
		Simulations
	Partitions
	Grouping
	Columns
	Context
	Functions
	Packages
	Cluster Requirements
		Installing R
		Apache Arrow
	Troubleshooting
		Worker Logs
		Resolving Timeouts
		Inspecting Partitions
		Debugging Workers
	Recap
Chapter 12. Streaming
	Overview
	Transformations
		Analysis
		Modeling
		Pipelines
		Distributed R
	Kafka
	Shiny
	Recap
Chapter 13. Contributing
	Overview
	The Spark API
	Spark Extensions
	Using Scala Code
	Recap
Appendix A. Supplemental Code References
	Preface
		Formatting
	Chapter 1
		The World’s Capacity to Store Information
		Daily Downloads of CRAN Packages
	Chapter 2
		Prerequisites
	Chapter 3
		Hive Functions
	Chapter 4
		MLlib Functions
	Chapter 6
		Google Trends for On-Premises (Mainframes), Cloud Computing, and Kubernetes
	Chapter 12
		Stream Generator
		Installing Kafka
Index
About the Authors
Colophon

Data Mining, Data Modeling & Design, Data Processing, Database Storage & Design, Mathematical Analysis, Mathematics, Programming Languages

How to download source code?

1. Go to: https://www.oreilly.com/

2. Search for the book title, Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling. If the full title returns no results, search for the main title only.

3. Click the book title in the search results

4. In the Publisher resources section, click Download Example Code.