Reproducible Data Science with Pachyderm: Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0

by Svetlana Karslioglu

Length: 364 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2022-03-18
ISBN-10: 1801074488
ISBN-13: 9781801074483

Create scalable and reliable data pipelines easily with Pachyderm

Key Features

Learn how to build an enterprise-level reproducible data science platform with Pachyderm
Deploy Pachyderm on cloud platforms such as AWS EKS, Google Kubernetes Engine, and Microsoft Azure Kubernetes Service
Integrate Pachyderm with other data science tools, such as Pachyderm Notebooks

Book Description

Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale.

You’ll begin by exploring the importance of data reproducibility and comparing different data science platforms. Next, you’ll see where Pachyderm fits into the picture and why it matters, and then learn how to install Pachyderm locally on your computer or on a cloud platform of your choice. You’ll then discover Pachyderm’s architectural components and its main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and then covers common data operations, such as uploading data to and downloading data from Pachyderm, so that you can create more complex pipelines. Based on what you’ve learned, you’ll develop an end-to-end ML workflow before trying out hyperparameter tuning and the different supported Pachyderm language clients. Finally, you’ll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks.

By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and be able to manage them on a day-to-day basis.

What you will learn

Understand the importance of reproducible data science for enterprise
Explore the basics of Pachyderm, such as commits and branches
Upload data to and download data from Pachyderm
Implement common pipeline operations in Pachyderm
Create a real-life example of hyperparameter tuning in Pachyderm
Combine Pachyderm with Pachyderm language clients in Python and Go

Who this book is for

This book is for new as well as experienced data scientists and machine learning engineers who want to build scalable infrastructures for their data science projects. Basic knowledge of Python programming and Kubernetes will be beneficial. Familiarity with Golang will be helpful.

Reproducible Data Science with Pachyderm
Contributors
About the author
About the reviewers
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
    Download the example code files
    Download the color images
    Conventions used
    Get in touch
    Share Your Thoughts
Section 1: Introduction to Pachyderm and Reproducible Data Science
Chapter 1: The Problem of Data Reproducibility
    Why is reproducibility important?
        What is a model?
        The main principles of reproducibility
    The reproducibility crisis in science
        Data fishing
        Better reproducibility in science research guidelines
        Common practices to improve reproducibility
    Demystifying MLOps
    Types of data science platforms
        End-to-end platforms
        Pluggable solutions
        Data ingestion tools
        Data transformation tools
        Model serving tools
        Data monitoring tools
        Putting it all together
    Explaining ethical AI
        Trustworthy AI
    Summary
    Further reading
Chapter 2: Pachyderm Basics
    Reviewing Pachyderm architecture
        Why can't I use Git for my data pipelines?
        Pachyderm architecture diagram
        Kubernetes
        Helm
        Pachyderm internals
        Other components
        Container runtimes
    Learning about version control primitives
        Repository
        Branch
        Commit
    Discovering pipeline elements
        Types of pipelines
        Datum
    Summary
    Further reading
Chapter 3: Pachyderm Pipeline Specification
    Pipeline specification overview 
    Understanding inputs
        pfs
    Exploring informational parameters
        name
        description
        metadata
    Exploring transformation
        image
        stdin
        err_cmd
        err_stdin
        env
        secrets
        image_pull_secrets
        accept_return_code
        debug
        user
        working_dir
        dockerfile
    Optimizing your pipeline
        parallelism_spec
        reprocess_spec
        cache_size
        max_queue_size
        chunk_spec
        resource_limits
        resource_requests
        sidecar_resource_limits
        scheduling_spec
        job_timeout
        datum_timeout
        datum_tries
    Exploring service parameters
        enable_stats
        pod_patch 
    Exploring output parameters
    Summary
    Further reading
Section 2: Getting Started with Pachyderm
Chapter 4: Installing Pachyderm Locally
    Technical requirements
    Installing the required tools
        Installing Homebrew (macOS only)
        Installing Windows Subsystem for Linux (for Windows only)
        Installing the Kubernetes command-line tool
        Installing Helm v3
    Installing minikube
    Installing Docker Desktop
        Installing Docker Desktop for macOS
    Installing the Pachyderm command-line interface
    Enabling autocompletion for Pachyderm 
        Enabling Pachyderm autocompletion for bash
        Enabling Pachyderm autocompletion for zsh
    Preparing the Kubernetes environment
        Enabling Kubernetes on Docker Desktop
        Enabling Kubernetes using minikube
    Deploying Pachyderm
    Accessing the Pachyderm Console
    Deleting an existing Pachyderm deployment
    Summary
    Further reading
Chapter 5: Installing Pachyderm on a Cloud Platform
    Technical requirements
    Installing the required tools
        Installing the AWS Command Line Interface to manage AWS
        Installing the AWS IAM authenticator for Kubernetes
        Installing eksctl to manage Amazon EKS
        Installing the Google Cloud SDK to manage Google Cloud
        Installing the Azure CLI to manage Microsoft Azure
    Deploying Pachyderm on Amazon EKS
        Preparing an Amazon EKS cluster to run Pachyderm
        Creating an S3 object storage bucket 
    Deploying the cluster
        Deleting a Pachyderm deployment on Amazon EKS
    Deploying Pachyderm on GKE
        Preparing a GKE cluster to run Pachyderm
        Creating a Google Cloud object storage bucket 
    Deploying the cluster
        Deleting a Pachyderm deployment on GKE
    Deploying Pachyderm on Microsoft AKS
        Preparing an AKS cluster to run Pachyderm
        Creating an Azure storage container 
    Deploying the cluster
        Deleting a Pachyderm deployment on AKS
    Accessing the Pachyderm console 
    Summary
    Further reading
Chapter 6: Creating Your First Pipeline
    Technical requirements
    Pipeline overview
    Creating a repository
    Creating a pipeline specification
    Viewing the pipeline result
    Adding another pipeline step
        Cleaning up
    Summary
    Further reading
Chapter 7: Pachyderm Operations
    Technical requirements
        Downloading the source files
    Reviewing the standard Pachyderm workflow
    Executing data operations
        Uploading data to Pachyderm
        About data lineage
        Exploring data lineage
        Mounting a Pachyderm repository to a local filesystem
    Executing pipeline operations
        Updating your pipeline specification
        Updating your code
    Running maintenance operations
        Troubleshooting your pipeline
        Upgrading your Pachyderm cluster
        Cleaning up
    Summary
    Further reading
Chapter 8: Creating an End-to-End Machine Learning Workflow
    Technical requirements
        Adjusting virtual machine parameters
    NLP example overview
        Introduction to NLP
        Learning the NLP phases
        Reviewing the NLP example
    Creating repositories and pipelines
        Creating the data cleaning pipeline
        Creating the POS tagging pipeline
    Creating an NER pipeline
    Retraining an NER model
        Creating the retrain pipeline
        Deploying the retrained pipeline
        Cleaning up
    Summary
    Further reading
Chapter 9: Distributed Hyperparameter Tuning with Pachyderm
    Technical requirements
    Reviewing hyperparameter tuning techniques and strategies
        Grid search
        Random search
        Bayesian optimization
        Regression evaluation metrics
    Creating a hyperparameter tuning pipeline in Pachyderm
        Example overview
        Creating an exploratory analysis pipeline
        Creating a data cleaning pipeline
        Creating a pipeline that removes outliers
        Creating a training pipeline
        Creating an evaluation pipeline
        Cleaning up
    Summary
    Further reading
Section 3: Pachyderm Clients and Tools
Chapter 10: Pachyderm Language Clients
    Technical requirements
        Downloading the source files
    Using the Pachyderm Go client
        Installing Go on your computer
        Configuring $GOPATH
        Cloning the Pachyderm source repository
        Connecting to Pachyderm with the Go client 
        Creating a repository with the Go client
        Putting data into a Pachyderm repository with the Go client
        Creating pipelines with the Go client
        Cleaning up the cluster with the Go client
    Using the Pachyderm Python client
        Installing the Pachyderm Python client
        Connecting to your Pachyderm cluster with the Python client
        Creating a Pachyderm repository with the Python client
        Putting data into a Pachyderm repository with the Python client
        Creating pipelines with the Pachyderm Python client
        Cleaning up the cluster with the Python client
    Summary
    Further reading
Chapter 11: Using Pachyderm Notebooks
    Technical requirements
        Downloading the source files
    Enabling Pachyderm Notebooks in Pachyderm Hub
        Create a workspace
        Connect to your Pachyderm Hub workspace with pachctl
        Connect to a Pachyderm notebook
    Running basic Pachyderm operations in Pachyderm Notebooks
        Using the integrated terminal
        Using Pachyderm Notebooks
    Creating and running an example pipeline in Pachyderm Notebooks
        Pipeline methodology
        Creating the pipelines
    Summary
    Further reading
    Why subscribe?
Other Books You May Enjoy
    Packt is searching for authors like you
    Share Your Thoughts

Categories: AI & Machine Learning, Artificial Intelligence, Data Processing, Intelligence & Semantics, Machine Theory

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Reproducible Data Science with Pachyderm: Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0. If that returns no results, search for the main title only.
