Practical Weak Supervision: Doing More with Less Data
- Length: 200 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2021-12-21
- ISBN-10: 1492077062
- ISBN-13: 9781492077060
- Sales Rank: #5330202
Most data scientists and engineers today rely on quality labeled data to train their machine learning models. But building training sets manually is time-consuming and expensive, leaving many companies with unfinished ML projects. There’s a more practical approach. In this book, Amit Bahree, Senja Filipi, and Wee Hyong Tok from Microsoft show you how to create products using weakly supervised learning models.
You’ll learn how to build natural language processing and computer vision projects using weakly labeled datasets from Snorkel, a spin-off from the Stanford AI Lab. Because so many companies pursue ML projects that never go beyond their labs, this book also provides a guide on how to ship the deep learning models you build.
- Get a practical overview of weak supervision
- Dive into data programming with help from Snorkel
- Perform text classification using Snorkel’s weakly labeled dataset
- Use Snorkel’s labeled indoor-outdoor dataset for computer vision tasks
- Scale up weak supervision using scaling strategies and underlying technologies
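The data-programming idea the book covers can be sketched without Snorkel itself: several noisy, hand-written "labeling functions" each vote on an example, and their votes are combined into a single training label. The heuristics below and the plain majority vote are illustrative stand-ins invented for this sketch; Snorkel's actual LabelModel learns the accuracies and correlations of the labeling functions instead of voting naively.

```python
# Minimal, library-free sketch of data programming.
# The heuristics are hypothetical examples, not from the book.
ABSTAIN, REAL, FAKE = -1, 0, 1

def lf_clickbait(text):
    # Hypothetical heuristic: clickbait phrasing suggests fake news.
    return FAKE if "you won't believe" in text.lower() else ABSTAIN

def lf_cites_source(text):
    # Hypothetical heuristic: explicit attribution suggests real news.
    return REAL if "according to" in text.lower() else ABSTAIN

def majority_vote(text, lfs):
    """Combine labeling-function votes; a stand-in for LabelModel."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN  # no function fired on this example
    return max(set(votes), key=votes.count)

lfs = [lf_clickbait, lf_cites_source]
print(majority_vote("You won't believe this trick!", lfs))  # 1 (FAKE)
```

In Snorkel proper, the votes form a label matrix that a `LabelModel` turns into probabilistic labels, which then train a downstream classifier.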
Table of Contents

Preface
- Who Should Read This Book
- Navigating This Book
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Online Learning
- How to Contact Us
- Acknowledgments

1. Introduction to Weak Supervision
- What Is Weak Supervision?
- Real-World Weak Supervision with Snorkel
- Approaches to Weak Supervision
  - Incomplete supervision
  - Inexact supervision
  - Inaccurate supervision
  - Data programming
  - Getting training data
  - How data programming is helping accelerate Software 2.0
- Summary
- Bibliography

2. Diving into Data Programming with Snorkel
- Snorkel, a Data Programming Framework
- Getting Started with Labeling Functions
  - Applying the labels to the datasets
  - Analyzing the labeling performance
  - Using a validation set
  - Reaching labeling consensus with LabelModel
  - Strategies to improve the labeling functions
- Data Augmentation with Snorkel Transformers
  - Data augmentation through word removal
  - Snorkel preprocessors
  - Data augmentation through GPT-2 prediction
  - Data augmentation through translation
  - Applying the transformation functions to the dataset
- Summary
- Bibliography

3. Labeling in Action
- Labeling a Text Dataset: Identifying Fake News
  - Exploring the fake news detection (FakeNewsNet) dataset
  - Importing Snorkel and setting up representative constants
  - Fact-checking sites
  - Is the speaker a "liar"?
  - Twitter profile and Botometer score
  - Generating agreements between weak classifiers
- Labeling an Images Dataset: Determining Indoor Versus Outdoor Images
  - Creating a dataset of images from Bing
  - Defining and training weak classifiers in TensorFlow
  - Training the various classifiers
  - Weak classifiers out of image tags
  - Deploying the Computer Vision Service
  - Interacting with the Computer Vision Service
  - Preparing the data frame
  - Learning a label model
- Summary
- Bibliography

4. Using the Snorkel-Labeled Dataset for Text Classification
- Getting Started with Natural Language Processing (NLP)
  - Transformers
  - Hard vs. probabilistic labels
- Using ktrain for Performing Text Classification
  - Data preparation
  - Dealing with an imbalanced dataset
  - Training the model
  - Using the text classification model for prediction
  - Finding a good learning rate
- Using Hugging Face and Transformers
  - Loading the relevant Python packages
  - Dataset preparation
  - Checking whether GPU hardware is available
  - Performing tokenization
  - Model training
  - Testing the fine-tuned model
- Summary
- Bibliography

5. Using the Snorkel-Labeled Dataset for Image Classification
- Visual Object Recognition Overview
- Representing Image Features
- Transfer Learning for Computer Vision
- Using PyTorch for Image Classification
  - Loading the indoor/outdoor dataset
  - Utility functions
  - Visualizing the training data
  - Fine-tuning the pretrained model
- Summary
- Bibliography

6. Scalability and Distributed Training
- The Need for Scalability
- Distributed Training
- Apache Spark: An Introduction
  - Spark application design
- Using Azure Databricks to Scale
  - Cluster setup for weak supervision
  - Fake news detection dataset on Databricks
  - Labeling functions for Snorkel
  - Setting up dependencies
  - Loading the data
  - Fact-checking sites
  - Transfer learning using the LIAR dataset
  - Weak classifiers: generating agreement
  - Type conversions needed for Spark runtime
- Summary
- Bibliography
How to download the source code
1. Go to https://www.oreilly.com/.
2. Search for the book title: Practical Weak Supervision: Doing More with Less Data. If no results appear, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.
1. Disable the AdBlock plugin; otherwise you may not see any links.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to download the file.