Reproducible Data Science with Pachyderm: Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0
- Length: 364 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-03-18
- ISBN-10: 1801074488
- ISBN-13: 9781801074483
Create scalable and reliable data pipelines easily with Pachyderm
Key Features
- Learn how to build an enterprise-level reproducible data science platform with Pachyderm
- Deploy Pachyderm on cloud platforms such as AWS EKS, Google Kubernetes Engine, and Microsoft Azure Kubernetes Service
- Integrate Pachyderm with other data science tools, such as Pachyderm Notebooks
Book Description
Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale.
You’ll begin by exploring why data reproducibility matters and comparing different data science platforms. Next, you’ll see where Pachyderm fits into the picture, then learn how to install it locally on your computer or on a cloud platform of your choice. You’ll then discover Pachyderm’s architectural components and its main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and then covers common data operations, such as moving data into and out of Pachyderm, so that you can build more complex pipelines. Based on what you’ve learned, you’ll develop an end-to-end ML workflow before trying out hyperparameter tuning and the supported Pachyderm language clients. Finally, you’ll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks.
By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and be able to manage them on a day-to-day basis.
What you will learn
- Understand the importance of reproducible data science for enterprise
- Explore the basics of Pachyderm, such as commits and branches
- Move data into and out of Pachyderm
- Implement common pipeline operations in Pachyderm
- Create a real-life example of hyperparameter tuning in Pachyderm
- Combine Pachyderm with Pachyderm language clients in Python and Go
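To make the pipeline topics above more concrete, here is a minimal sketch of what a Pachyderm pipeline specification might look like when built in Python and serialized to JSON. It is not taken from the book: the helper `make_pipeline_spec`, the pipeline name `word-count`, the repository name `documents`, the container image, and the script path are all hypothetical and chosen only for illustration.

```python
import json


def make_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Build a minimal pipeline specification as a plain dict.

    Hypothetical helper for illustration; it only assembles the
    commonly used top-level fields of a pipeline specification.
    """
    return {
        "pipeline": {"name": name},
        "description": f"Example pipeline reading from the '{input_repo}' repository.",
        # A pfs input names the source repository and a glob pattern
        # that determines how the data is split into datums.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The transform section names the container image and the command to run.
        "transform": {"image": image, "cmd": list(cmd)},
    }


# All values below are assumptions made for this example.
spec = make_pipeline_spec(
    name="word-count",
    input_repo="documents",
    image="python:3.10-slim",
    cmd=["python3", "/app/count.py"],
)

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Saved to a file, a specification like this would typically be submitted with `pachctl create pipeline -f <file>`. The script and image would need to exist for the pipeline to actually run.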
Who this book is for
This book is for new as well as experienced data scientists and machine learning engineers who want to build scalable infrastructures for their data science projects. Basic knowledge of Python programming and Kubernetes will be beneficial. Familiarity with Golang will be helpful.
Table of Contents
Front matter: Contributors; About the author; About the reviewers
Preface: Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share Your Thoughts
Section 1: Introduction to Pachyderm and Reproducible Data Science
Chapter 1: The Problem of Data Reproducibility
- Why is reproducibility important?; What is a model?; The main principles of reproducibility; The reproducibility crisis in science; Data fishing; Better reproducibility in science research guidelines; Common practices to improve reproducibility; Demystifying MLOps; Types of data science platforms; End-to-end platforms; Pluggable solutions; Data ingestion tools; Data transformation tools; Model serving tools; Data monitoring tools; Putting it all together; Explaining ethical AI; Trustworthy AI; Summary; Further reading
Chapter 2: Pachyderm Basics
- Reviewing Pachyderm architecture; Why can't I use Git for my data pipelines?; Pachyderm architecture diagram; Kubernetes; Helm; Pachyderm internals; Other components; Container runtimes; Learning about version control primitives; Repository; Branch; Commit; Discovering pipeline elements; Types of pipelines; Datum; Summary; Further reading
Chapter 3: Pachyderm Pipeline Specification
- Pipeline specification overview; Understanding inputs; pfs; Exploring informational parameters; name; description; metadata; Exploring transformation; image; stdin; err_cmd; err_stdin; env; secrets; image_pull_secrets; accept_return_code; debug; user; working_dir; dockerfile; Optimizing your pipeline; parallelism_spec; reprocess_spec; cache_size; max_queue_size; chunk_spec; resource_limits; resource_requests; sidecar_resource_limits; scheduling_spec; job_timeout; datum_timeout; datum_tries; Exploring service parameters; enable_stats; pod_patch; Exploring output parameters; Summary; Further reading
Section 2: Getting Started with Pachyderm
Chapter 4: Installing Pachyderm Locally
- Technical requirements; Installing the required tools; Installing Homebrew (macOS only); Installing Windows Subsystem for Linux (for Windows only); Installing the Kubernetes command-line tool; Installing Helm v3; Installing minikube; Installing Docker Desktop; Installing Docker Desktop for macOS; Installing the Pachyderm command-line interface; Enabling autocompletion for Pachyderm; Enabling Pachyderm autocompletion for bash; Enabling Pachyderm autocompletion for zsh; Preparing the Kubernetes environment; Enabling Kubernetes on Docker Desktop; Enabling Kubernetes using minikube; Deploying Pachyderm; Accessing the Pachyderm Console; Deleting an existing Pachyderm deployment; Summary; Further reading
Chapter 5: Installing Pachyderm on a Cloud Platform
- Technical requirements; Installing the required tools; Installing the AWS Command Line Interface to manage AWS; Installing the AWS IAM authenticator for Kubernetes; Installing eksctl to manage Amazon EKS; Installing the Google Cloud SDK to manage Google Cloud; Installing the Azure CLI to manage Microsoft Azure; Deploying Pachyderm on Amazon EKS; Preparing an Amazon EKS cluster to run Pachyderm; Creating an S3 object storage bucket; Deploying the cluster; Deleting a Pachyderm deployment on Amazon EKS; Deploying Pachyderm on GKE; Preparing a GKE cluster to run Pachyderm; Creating a Google Cloud object storage bucket; Deploying the cluster; Deleting a Pachyderm deployment on GKE; Deploying Pachyderm on Microsoft AKS; Preparing an AKS cluster to run Pachyderm; Creating an Azure storage container; Deploying the cluster; Deleting a Pachyderm deployment on AKS; Accessing the Pachyderm console; Summary; Further reading
Chapter 6: Creating Your First Pipeline
- Technical requirements; Pipeline overview; Creating a repository; Creating a pipeline specification; Viewing the pipeline result; Adding another pipeline step; Cleaning up; Summary; Further reading
Chapter 7: Pachyderm Operations
- Technical requirements; Downloading the source files; Reviewing the standard Pachyderm workflow; Executing data operations; Uploading data to Pachyderm; About data lineage; Exploring data lineage; Mounting a Pachyderm repository to a local filesystem; Executing pipeline operations; Updating your pipeline specification; Updating your code; Running maintenance operations; Troubleshooting your pipeline; Upgrading your Pachyderm cluster; Cleaning up; Summary; Further reading
Chapter 8: Creating an End-to-End Machine Learning Workflow
- Technical requirements; Adjusting virtual machine parameters; NLP example overview; Introduction to NLP; Learning the NLP phases; Reviewing the NLP example; Creating repositories and pipelines; Creating the data cleaning pipeline; Creating the POS tagging pipeline; Creating an NER pipeline; Retraining an NER model; Creating the retrain pipeline; Deploying the retrained pipeline; Cleaning up; Summary; Further reading
Chapter 9: Distributed Hyperparameter Tuning with Pachyderm
- Technical requirements; Reviewing hyperparameter tuning techniques and strategies; Grid search; Random search; Bayesian optimization; Regression evaluation metrics; Creating a hyperparameter tuning pipeline in Pachyderm; Example overview; Creating an exploratory analysis pipeline; Creating a data cleaning pipeline; Creating a pipeline that removes outliers; Creating a training pipeline; Creating an evaluation pipeline; Cleaning up; Summary; Further reading
Section 3: Pachyderm Clients and Tools
Chapter 10: Pachyderm Language Clients
- Technical requirements; Downloading the source files; Using the Pachyderm Go client; Installing Go on your computer; Configuring $GOPATH; Cloning the Pachyderm source repository; Connecting to Pachyderm with the Go client; Creating a repository with the Go client; Putting data into a Pachyderm repository with the Go client; Creating pipelines with the Go client; Cleaning up the cluster with the Go client; Using the Pachyderm Python client; Installing the Pachyderm Python client; Connecting to your Pachyderm cluster with the Python client; Creating a Pachyderm repository with the Python client; Putting data into a Pachyderm repository with the Python client; Creating pipelines with the Pachyderm Python client; Cleaning up the cluster with the Python client; Summary; Further reading
Chapter 11: Using Pachyderm Notebooks
- Technical requirements; Downloading the source files; Enabling Pachyderm Notebooks in Pachyderm Hub; Create a workspace; Connect to your Pachyderm Hub workspace with pachctl; Connect to a Pachyderm notebook; Running basic Pachyderm operations in Pachyderm Notebooks; Using the integrated terminal; Using Pachyderm Notebooks; Creating and running an example pipeline in Pachyderm Notebooks; Pipeline methodology; Creating the pipelines; Summary; Further reading
Back matter: Why subscribe?; Other Books You May Enjoy; Packt is searching for authors like you; Share Your Thoughts
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Reproducible Data Science with Pachyderm: Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.