Effective Data Science Infrastructure: How to make data scientists productive

Length: 325 pages
Edition: 1
Language: English
Publisher: Manning
Publication Date: 2022-06-28
ISBN-10: 1617299197
ISBN-13: 9781617299193
Sales Rank: #2912268

Simplify data science infrastructure to give data scientists an efficient path from prototype to production.

Effective Data Science Infrastructure is a hands-on guide to assembling infrastructure for data science and machine learning applications. It reveals the processes used at Netflix and other data-driven companies to manage their cutting-edge data infrastructure.

As you work through this easy-to-follow guide, you’ll set up end-to-end infrastructure from the ground up, with a fully customizable process you can easily adapt to your company. You’ll learn how you can make data scientists more productive with your existing cloud infrastructure, a stack of open source software, and idiomatic Python. Throughout, you’ll follow a human-centric approach focused on user experience and meeting the unique needs of data scientists.

Effective Data Science Infrastructure MEAP V07
Copyright
Welcome
Brief contents
Chapter 1: Introduction
	1.1 Why Data Science Infrastructure
		1.1.1 Lifecycle of a Data Science Project
	1.2 What is Data Science Infrastructure
		1.2.1 The Infrastructure Stack for Data Science
		1.2.2 Taming Complexity
		1.2.3 Leveraging Existing Platforms
	1.3 Human-Centric Infrastructure
		1.3.1 Data scientist autonomy
	1.4 Summary
Chapter 2: The Toolchain of Data Science
	2.1 Setting up a Development Environment
		2.1.1 Cloud Account
		2.1.2 Data Science Workstation
		2.1.3 Notebooks
		2.1.4 Putting things together
	2.2 Introducing Workflows
		2.2.1 The basics of workflows
		2.2.2 Executing workflows
		2.2.3 The world of workflow frameworks
	2.3 Summary
Chapter 3: Introducing Metaflow
	3.1 Basics of Metaflow
		3.1.1 Writing a basic workflow
		3.1.2 Managing data flow in workflows
		3.1.3 Parameters
	3.2 Branching and merging
		3.2.1 Valid DAG structures
		3.2.2 Static branches
		3.2.3 Dynamic branches
		3.2.4 Controlling Concurrency
	3.3 Metaflow in Action
		3.3.1 Starting a new project
		3.3.2 Accessing results with the Client API
		3.3.3 Debugging failures
		3.3.4 Finishing touches
	3.4 Summary
Chapter 4: Scaling with The Compute Layer
	4.1 What is Scalability
		4.1.1 Scaling organizations
	4.2 The Compute Layer
		4.2.1 Batch processing with containers
		4.2.2 Examples of compute layers
	4.3 The compute layer in Metaflow
		4.3.1 Configuring AWS Batch for Metaflow
		4.3.2 @batch and @resources decorators
	4.4 Handling failures
		4.4.1 Recovering from transient errors with @retry
		4.4.2 Killing zombies with @timeout
		4.4.3 The decorator of the last resort: @catch
	4.5 Summary
Chapter 5: Practicing Scalability and Performance
	5.1 Starting simple: Vertical scalability
		5.1.1 Example: Clustering Yelp Reviews
		5.1.2 Practicing vertical scalability
		5.1.3 Why vertical scalability
	5.2 Practicing Horizontal Scalability
		5.2.1 Why horizontal scalability
		5.2.2 Example: Hyperparameter search
	5.3 Practicing performance optimization
		5.3.1 Example: Computing a co-occurrence matrix
		5.3.2 Recipe for fast enough workflows
	5.4 Summary
Chapter 6: Going to Production
	6.1 Stable workflow scheduling
		6.1.1 Centralized metadata
		6.1.2 Using AWS Step Functions with Metaflow
		6.1.3 Scheduling runs with @schedule
	6.2 Stable execution environments
		6.2.1 How Metaflow packages flows
		6.2.2 Why dependency management matters
		6.2.3 Using the @conda decorator
	6.3 Stable operations
		6.3.1 Namespaces during prototyping
		6.3.2 Production namespaces
		6.3.3 Parallel deployments with @project
	6.4 Summary
Chapter 7: Processing Data
	7.1 Foundations of Fast Data
		7.1.1 Loading data from S3
		7.1.2 Working with tabular data
		7.1.3 The in-memory data stack
	7.2 Interfacing with Data Infrastructure
		7.2.1 Modern data infrastructure
		7.2.2 Preparing datasets in SQL
		7.2.3 Distributed data processing
	7.3 From Data to Features
		7.3.1 Encoding features
	7.4 Summary
Chapter 8: Using and Operating Models
	8.1 Producing Predictions
		Batch, streaming, and real-time predictions
		8.1.1 Example: Recommendation system
		Training a rudimentary recommendations model
		8.1.2 Batch predictions
		Producing recommendations
		Sharing results robustly
		Producing a batch of recommendations
		Using recommendations in a web app
		8.1.3 Real-time predictions
		Example: Real-time movie recommendations
	8.2 Summary
Chapter 9: Machine Learning With the Full Stack
	9.1 Pluggable Feature Encoders and Models
		9.1.1 Pluggable models and feature encoders
		Defining a feature encoder
		Loading and executing plugins
		9.1.2 Benchmarking models
		Model workflow
	9.2 Deep Regression Model
		9.2.1 Encoding input tensors
		Data loader
		9.2.2 Defining a deep regression model
		9.2.3 Training a deep regression model
		Small-scale training
		Large-scale training
	9.3 Summarizing Lessons Learned
	9.4 Summary