Julia for Data Analysis

Length: 246 pages
Edition: 1
Language: English
Publisher: Manning
Publication Date: 2023-01-10
ISBN-10: 1633439364
ISBN-13: 9781633439368
Sales Rank: #1857150

Master core data analysis skills using Julia. Interesting hands-on projects guide you through time series data, predictive models, popularity ranking, and more.

Julia was designed for the unique needs of data scientists: it’s expressive and easy to use while also delivering fast code execution.

Julia for Data Analysis teaches you how to perform core data science tasks with this amazing language. It’s written by Bogumił Kamiński, a top contributor to Julia, the #1 Julia answerer on Stack Overflow, and a lead developer of Julia’s core data package, DataFrames.jl. You’ll learn how to write production-quality code in Julia and use Julia’s core features for data gathering, visualization, and working with data frames. Plus, the engaging hands-on projects get you into the action quickly.

Julia for Data Analysis
brief contents
contents
foreword
preface
acknowledgments
about this book
	Who should read this book
	How this book is organized: A roadmap
	About the code
	liveBook discussion forum
	Other online resources
about the author
about the cover illustration
1 Introduction
	1.1 What is Julia and why is it useful?
	1.2 Key features of Julia from a data scientist’s perspective
		1.2.1 Julia is fast because it is a compiled language
		1.2.2 Julia provides full support for interactive workflows
		1.2.3 Julia programs are highly reusable and easy to compose together
		1.2.4 Julia has a built-in state-of-the-art package manager
		1.2.5 It is easy to integrate existing code with Julia
	1.3 Usage scenarios of tools presented in the book
	1.4 Julia’s drawbacks
	1.5 What data analysis skills will you learn?
	1.6 How can Julia be used for data analysis?
	Summary
Part 1 Essential Julia skills
	2 Getting started with Julia
		2.1 Representing values
		2.2 Defining variables
		2.3 Using the most important control-flow constructs
			2.3.1 Computations depending on a Boolean condition
			2.3.2 Loops
			2.3.3 Compound expressions
			2.3.4 A first approach to calculating the winsorized mean
		2.4 Defining functions
			2.4.1 Defining functions using the function keyword
			2.4.2 Positional and keyword arguments of functions
			2.4.3 Rules for passing arguments to functions
			2.4.4 Short syntax for defining simple functions
			2.4.5 Anonymous functions
			2.4.6 Do blocks
			2.4.7 Function-naming convention in Julia
			2.4.8 A simplified definition of a function computing the winsorized mean
		2.5 Understanding variable scoping rules
		Summary
	3 Julia’s support for scaling projects
		3.1 Understanding Julia’s type system
			3.1.1 A single function in Julia may have multiple methods
			3.1.2 Types in Julia are arranged in a hierarchy
			3.1.3 Finding all supertypes of a type
			3.1.4 Finding all subtypes of a type
			3.1.5 Union of types
			3.1.6 Deciding what type restrictions to put in method signature
		3.2 Using multiple dispatch in Julia
			3.2.1 Rules for defining methods of a function
			3.2.2 Method ambiguity problem
			3.2.3 Improved implementation of winsorized mean
		3.3 Working with packages and modules
			3.3.1 What is a module in Julia?
			3.3.2 How can packages be used in Julia?
			3.3.3 Using StatsBase.jl to compute the winsorized mean
		3.4 Using macros
		Summary
	4 Working with collections in Julia
		4.1 Working with arrays
			4.1.1 Getting the data into a matrix
			4.1.2 Computing basic statistics of the data stored in a matrix
			4.1.3 Indexing into arrays
			4.1.4 Performance considerations of copying vs. making a view
			4.1.5 Calculating correlations between variables
			4.1.6 Fitting a linear regression
			4.1.7 Plotting the Anscombe’s quartet data
		4.2 Mapping key-value pairs with dictionaries
		4.3 Structuring your data by using named tuples
			4.3.1 Defining named tuples and accessing their contents
			4.3.2 Analyzing Anscombe’s quartet data stored in a named tuple
			4.3.3 Understanding composite types and mutability of values in Julia
		Summary
	5 Advanced topics on handling collections
		5.1 Vectorizing your code using broadcasting
			5.1.1 Understanding syntax and meaning of broadcasting in Julia
			5.1.2 Expanding length-1 dimensions in broadcasting
			5.1.3 Protecting collections from being broadcasted over
			5.1.4 Analyzing Anscombe’s quartet data using broadcasting
		5.2 Defining methods with parametric types
			5.2.1 Most collection types in Julia are parametric
			5.2.2 Rules for subtyping of parametric types
			5.2.3 Using subtyping rules to define the covariance function
		5.3 Integrating with Python
			5.3.1 Preparing data for dimensionality reduction using t-SNE
			5.3.2 Calling Python from Julia
			5.3.3 Visualizing the results of the t-SNE algorithm
		Summary
	6 Working with strings
		6.1 Getting and inspecting the data
			6.1.1 Downloading files from the web
			6.1.2 Using common techniques of string construction
			6.1.3 Reading the contents of a file
		6.2 Splitting strings
		6.3 Using regular expressions to work with strings
			6.3.1 Working with regular expressions
			6.3.2 Writing a parser of a single line of movies.dat file
		6.4 Extracting a subset from a string with indexing
			6.4.1 UTF-8 encoding of strings in Julia
			6.4.2 Character vs. byte indexing of strings
			6.4.3 ASCII strings
			6.4.4 The Char type
		6.5 Analyzing genre frequency in movies.dat
			6.5.1 Finding common movie genres
			6.5.2 Understanding genre popularity evolution over the years
		6.6 Introducing symbols
			6.6.1 Creating symbols
			6.6.2 Using symbols
		6.7 Using fixed-width string types to improve performance
			6.7.1 Available fixed-width strings
			6.7.2 Performance of fixed-width strings
		6.8 Compressing vectors of strings with PooledArrays.jl
			6.8.1 Creating a file containing flower names
			6.8.2 Reading in the data to a vector and compressing it
			6.8.3 Understanding the internal design of PooledArray
		6.9 Choosing appropriate storage for collections of strings
		Summary
	7 Handling time-series data and missing values
		7.1 Understanding the NBP Web API
			7.1.1 Getting the data via a web browser
			7.1.2 Getting the data by using Julia
			7.1.3 Handling cases when an NBP Web API query fails
		7.2 Working with missing data in Julia
			7.2.1 Definition of the missing value
			7.2.2 Working with missing values
		7.3 Getting time-series data from the NBP Web API
			7.3.1 Working with dates
			7.3.2 Fetching data from the NBP Web API for a range of dates
		7.4 Analyzing data fetched from the NBP Web API
			7.4.1 Computing summary statistics
			7.4.2 Finding which days of the week have the most missing values
			7.4.3 Plotting the PLN/USD exchange rate
		Summary
Part 2 Toolbox for data analysis
	8 First steps with data frames
		8.1 Fetching, unpacking, and inspecting the data
			8.1.1 Downloading the file from the web
			8.1.2 Working with bzip2 archives
			8.1.3 Inspecting the CSV file
		8.2 Loading the data to a data frame
			8.2.1 Reading a CSV file into a data frame
			8.2.2 Inspecting the contents of a data frame
			8.2.3 Saving a data frame to a CSV file
		8.3 Getting a column out of a data frame
			8.3.1 Understanding the data frame’s storage model
			8.3.2 Treating a data frame column as a property
			8.3.3 Getting a column by using data frame indexing
			8.3.4 Visualizing data stored in columns of a data frame
		8.4 Reading and writing data frames using different formats
			8.4.1 Apache Arrow
			8.4.2 SQLite
		Summary
	9 Getting data from a data frame
		9.1 Advanced data frame indexing
			9.1.1 Getting a reduced puzzles data frame
			9.1.2 Overview of allowed column selectors
			9.1.3 Overview of allowed row-subsetting values
			9.1.4 Making views of data frame objects
		9.2 Analyzing the relationship between puzzle difficulty and popularity
			9.2.1 Calculating mean puzzle popularity by its rating
			9.2.2 Fitting LOESS regression
		Summary
	10 Creating data frame objects
		10.1 Reviewing the most important ways to create a data frame
			10.1.1 Creating a data frame from a matrix
			10.1.2 Creating a data frame from vectors
			10.1.3 Creating a data frame using a Tables.jl interface
			10.1.4 Plotting a correlation matrix of data stored in a data frame
		10.2 Creating data frames incrementally
			10.2.1 Vertically concatenating data frames
			10.2.2 Appending a table to a data frame
			10.2.3 Adding a new row to an existing data frame
			10.2.4 Storing simulation results in a data frame
		Summary
	11 Converting and grouping data frames
		11.1 Converting a data frame to other value types
			11.1.1 Conversion to a matrix
			11.1.2 Conversion to a named tuple of vectors
			11.1.3 Other common conversions
		11.2 Grouping data frame objects
			11.2.1 Preparing the source data frame
			11.2.2 Grouping a data frame
			11.2.3 Getting group keys of a grouped data frame
			11.2.4 Indexing a grouped data frame with a single value
			11.2.5 Comparing performance of indexing methods
			11.2.6 Indexing a grouped data frame with multiple values
			11.2.7 Iterating a grouped data frame
		Summary
	12 Mutating and transforming data frames
		12.1 Getting and loading the GitHub developers data set
			12.1.1 Understanding graphs
			12.1.2 Fetching GitHub developer data from the web
			12.1.3 Implementing a function that extracts data from a ZIP file
			12.1.4 Reading the GitHub developer data into a data frame
		12.2 Computing additional node features
			12.2.1 Creating a SimpleGraph object
			12.2.2 Computing features of nodes by using the Graphs.jl package
			12.2.3 Counting a node’s web and machine learning neighbors
		12.3 Using the split-apply-combine approach to predict the developer’s type
			12.3.1 Computing summary statistics of web and machine learning developer features
			12.3.2 Visualizing the relationship between the number of web and machine learning neighbors of a node
			12.3.3 Fitting a logistic regression model predicting developer type
		12.4 Reviewing data frame mutation operations
			12.4.1 Performing low-level API operations
			12.4.2 Using the insertcols! function to mutate a data frame
		Summary
	13 Advanced transformations of data frames
		13.1 Getting and preprocessing the police stop data set
			13.1.1 Loading all required packages
			13.1.2 Introducing the @chain macro
			13.1.3 Getting the police stop data set
			13.1.4 Comparing functions that perform operations on columns
			13.1.5 Using short forms of operation specification syntax
		13.2 Investigating the violation column
			13.2.1 Finding the most frequent violations
			13.2.2 Vectorizing functions by using the ByRow wrapper
			13.2.3 Flattening data frames
			13.2.4 Using convenience syntax to get the number of rows of a data frame
			13.2.5 Sorting data frames
			13.2.6 Using advanced functionalities of DataFramesMeta.jl
		13.3 Preparing data for making predictions
			13.3.1 Performing initial transformation of the data
			13.3.2 Working with categorical data
			13.3.3 Joining data frames
			13.3.4 Reshaping data frames
			13.3.5 Dropping rows of a data frame that hold missing values
		13.4 Building a predictive model of arrest probability
			13.4.1 Splitting the data into train and test data sets
			13.4.2 Fitting a logistic regression model
			13.4.3 Evaluating the quality of a model’s predictions
		13.5 Reviewing functionalities provided by DataFrames.jl
		Summary
	14 Creating web services for sharing data analysis results
		14.1 Pricing financial options by using a Monte Carlo simulation
			14.1.1 Calculating the payoff of an Asian option definition
			14.1.2 Computing the value of an Asian option
			14.1.3 Understanding GBM
			14.1.4 Using a numerical approach to computing the Asian option value
		14.2 Implementing the option pricing simulator
			14.2.1 Starting Julia with multiple-thread support
			14.2.2 Computing the option payoff for a single sample of stock prices
			14.2.3 Computing the option value
		14.3 Creating a web service serving the Asian option valuation
			14.3.1 A general approach to building a web service
			14.3.2 Creating a web service using Genie.jl
			14.3.3 Running the web service
		14.4 Using the Asian option pricing web service
			14.4.1 Sending a single request to the web service
			14.4.2 Collecting responses to multiple requests from a web service in a data frame
			14.4.3 Unnesting a column of a data frame
			14.4.4 Plotting the results of Asian option pricing
		Summary
appendix A First steps with Julia
	A.1 Installing and setting up Julia
	A.2 Getting help in and about Julia
	A.3 Managing packages in Julia
		A.3.1 Project environments
		A.3.2 Activating project environments
		A.3.3 Potential issues with installing packages
		A.3.4 Managing packages
		A.3.5 Setting up integration with Python
		A.3.6 Setting up integration with R
	A.4 Reviewing standard ways to work with Julia
		A.4.1 Using a terminal
		A.4.2 Using Visual Studio Code
		A.4.3 Using Jupyter Notebook
		A.4.4 Using Pluto notebooks
appendix B Solutions to exercises
appendix C Julia packages for data science
	C.1 Plotting ecosystems in Julia
	C.2 Scaling computing with Julia
	C.3 Working with databases and data storage formats
	C.4 Using data science methods
	Summary
index
	Symbols
	Numerics
	A
	B
	C
	D
	E
	F
	G
	H
	I
	J
	K
	L
	M
	N
	O
	P
	Q
	R
	S
	T
	U
	V
	W
	Z
Julia for Data Analysis - back