Machine Learning for Data Streams: with Practical Examples in MOA

by Albert Bifet, Bernhard Pfahringer, Geoff Holmes, Ricard Gavaldà

Length: 288 pages
Edition: 1
Language: English
Publisher: The MIT Press
Publication Date: 2018-03-02
ISBN-10: 0262037793
ISBN-13: 9780262037792
Sales Rank: #1162545

A hands-on approach to tasks and techniques in data stream mining and real-time analytics, with examples in MOA, a popular freely available open-source software framework.

Today many information sources—including sensor networks, financial markets, social networks, and healthcare monitoring—are so-called data streams, arriving sequentially and at high speed. Analysis must take place in real time, with partial data and without the capacity to store the entire data set. This book presents algorithms and techniques used in data stream mining and real-time analytics. Taking a hands-on approach, the book demonstrates the techniques using MOA (Massive Online Analysis), a popular, freely available open-source software framework, allowing readers to try out the techniques after reading the explanations.

The book first offers a brief introduction to the topic, covering big data mining, basic methodologies for mining data streams, and a simple example of MOA. More detailed discussions follow, with chapters on sketching techniques, change, classification, ensemble methods, regression, clustering, and frequent pattern mining. Most of these chapters include exercises, an MOA-based lab session, or both. Finally, the book discusses the MOA software, covering the MOA graphical user interface, the command line, use of its API, and the development of new methods within MOA. The book will be an essential reference for readers who want to use data stream mining as a tool, researchers in innovation or data stream mining, and programmers who want to create new algorithms for MOA.

Cover
Series Page
Title Page
Copyright Page
Table of Contents
List of Figures
List of Tables
Preface
I: Introduction
    1. Introduction
        1.1. Big Data
            1.1.1. Tools: Open-Source Revolution
            1.1.2. Challenges in Big Data
        1.2. Real-Time Analytics
            1.2.1. Data Streams
            1.2.2. Time and Memory
            1.2.3. Applications
        1.3. What This Book Is About
    2. Big Data Stream Mining
        2.1. Algorithms
        2.2. Classification
            2.2.1. Classifier Evaluation in Data Streams
            2.2.2. Majority Class Classifier
            2.2.3. No-Change Classifier
            2.2.4. Lazy Classifier
            2.2.5. Naive Bayes
            2.2.6. Decision Trees
            2.2.7. Ensembles
        2.3. Regression
        2.4. Clustering
        2.5. Frequent Pattern Mining
    3. Hands-on Introduction to MOA
        3.1. Getting Started
        3.2. The Graphical User Interface for Classification
            3.2.1. Drift Stream Generators
        3.3. Using the Command Line
II: Stream Mining
    4. Streams and Sketches
        4.1. Setting: Approximation Algorithms
        4.2. Concentration Inequalities
        4.3. Sampling
        4.4. Counting Total Items
        4.5. Counting Distinct Elements
            4.5.1. Linear Counting
            4.5.2. Cohen’s Logarithmic Counter
            4.5.3. The Flajolet-Martin Counter and HyperLogLog
            4.5.4. An Application: Computing Distance Functions in Graphs
            4.5.5. Discussion: Log vs. Linear
        4.6. Frequency Problems
            4.6.1. The SpaceSaving Sketch
            4.6.2. The CM-Sketch Algorithm
            4.6.3. CountSketch
            4.6.4. Moment Computation
        4.7. Exponential Histograms for Sliding Windows
        4.8. Distributed Sketching: Mergeability
        4.9. Some Technical Discussions and Additional Material
            4.9.1. Hash Functions
            4.9.2. Creating (ε, δ)-Approximation Algorithms
            4.9.3. Other Sketching Techniques
        4.10. Exercises
    5. Dealing with Change
        5.1. Notion of Change in Streams
        5.2. Estimators
            5.2.1. Sliding Windows and Linear Estimators
            5.2.2. Exponentially Weighted Moving Average
            5.2.3. Unidimensional Kalman Filter
        5.3. Change Detection
            5.3.1. Evaluating Change Detection
            5.3.2. The CUSUM and Page-Hinkley Tests
            5.3.3. Statistical Tests
            5.3.4. Drift Detection Method
            5.3.5. ADWIN
        5.4. Combination with Other Sketches and Multidimensional Data
        5.5. Exercises
    6. Classification
        6.1. Classifier Evaluation
            6.1.1. Error Estimation
            6.1.2. Distributed Evaluation
            6.1.3. Performance Evaluation Measures
            6.1.4. Statistical Significance
            6.1.5. A Cost Measure for the Mining Process
        6.2. Baseline Classifiers
            6.2.1. Majority Class
            6.2.2. No-change Classifier
            6.2.3. Naive Bayes
            6.2.4. Multinomial Naive Bayes
        6.3. Decision Trees
            6.3.1. Estimating Split Criteria
            6.3.2. The Hoeffding Tree
            6.3.3. CVFDT
            6.3.4. VFDTc and UFFT
            6.3.5. Hoeffding Adaptive Tree
        6.4. Handling Numeric Attributes
            6.4.1. VFML
            6.4.2. Exhaustive Binary Tree
            6.4.3. Greenwald and Khanna’s Quantile Summaries
            6.4.4. Gaussian Approximation
        6.5. Perceptron
        6.6. Lazy Learning
        6.7. Multi-label Classification
            6.7.1. Multi-label Hoeffding Trees
        6.8. Active Learning
            6.8.1. Random Strategy
            6.8.2. Fixed Uncertainty Strategy
            6.8.3. Variable Uncertainty Strategy
            6.8.4. Uncertainty Strategy with Randomization
        6.9. Concept Evolution
        6.10. Lab Session with MOA
    7. Ensemble Methods
        7.1. Accuracy-Weighted Ensembles
        7.2. Weighted Majority
        7.3. Stacking
        7.4. Bagging
            7.4.1. Online Bagging Algorithm
            7.4.2. Bagging with a Change Detector
            7.4.3. Leveraging Bagging
        7.5. Boosting
        7.6. Ensembles of Hoeffding Trees
            7.6.1. Hoeffding Option Trees
            7.6.2. Random Forests
            7.6.3. Perceptron Stacking of Restricted Hoeffding Trees
            7.6.4. Adaptive-Size Hoeffding Trees
        7.7. Recurrent Concepts
        7.8. Lab Session with MOA
    8. Regression
        8.1. Introduction
        8.2. Evaluation
        8.3. Perceptron Learning
        8.4. Lazy Learning
        8.5. Decision Tree Learning
        8.6. Decision Rules
        8.7. Regression in MOA
    9. Clustering
        9.1. Evaluation Measures
        9.2. The k-means Algorithm
        9.3. BIRCH, BICO, and CluStream
        9.4. Density-Based Methods: DBSCAN and Den-Stream
        9.5. ClusTree
        9.6. StreamKM++: Coresets
        9.7. Additional Material
        9.8. Lab Session with MOA
    10. Frequent Pattern Mining
        10.1. An Introduction to Pattern Mining
            10.1.1. Patterns: Definitions and Examples
            10.1.2. Batch Algorithms for Frequent Pattern Mining
            10.1.3. Closed and Maximal Patterns
        10.2. Frequent Pattern Mining in Streams: Approaches
            10.2.1. Coresets of Closed Patterns
        10.3. Frequent Itemset Mining on Streams
            10.3.1. Reduction to Heavy Hitters
            10.3.2. Moment
            10.3.3. FP-Stream
            10.3.4. IncMine
        10.4. Frequent Subgraph Mining on Streams
            10.4.1. WinGraphMiner
            10.4.2. AdaGraphMiner
        10.5. Additional Material
        10.6. Exercises
III: The MOA Software
    11. Introduction to MOA and Its Ecosystem
        11.1. MOA Architecture
        11.2. Installation
        11.3. Recent Developments in MOA
        11.4. Extensions to MOA
        11.5. ADAMS
        11.6. MEKA
        11.7. OpenML
        11.8. StreamDM
        11.9. Streams
        11.10. Apache SAMOA
    12. The Graphical User Interface
        12.1. Getting Started with the GUI
        12.2. Classification and Regression
            12.2.1. Tasks
            12.2.2. Data Feeds and Data Generators
            12.2.3. Bayesian Classifiers
            12.2.4. Decision Trees
            12.2.5. Meta Classifiers (Ensembles)
            12.2.6. Function Classifiers
            12.2.7. Drift Classifiers
            12.2.8. Active Learning Classifiers
        12.3. Clustering
            12.3.1. Data Feeds and Data Generators
            12.3.2. Stream Clustering Algorithms
            12.3.3. Visualization and Analysis
    13. Using the Command Line
        13.1. Learning Task for Classification and Regression
        13.2. Evaluation Tasks for Classification and Regression
        13.3. Learning and Evaluation Tasks for Classification and Regression
        13.4. Comparing Two Classifiers
    14. Using the API
        14.1. MOA Objects
        14.2. Options
        14.3. Prequential Evaluation Example
    15. Developing New Methods in MOA
        15.1. Main Classes in MOA
        15.2. Creating a New Classifier
        15.3. Compiling a Classifier
        15.4. Good Programming Practices in MOA
Bibliography
Index
Series List