Data Mining: Concepts and Techniques, 4th Edition
- Length: 752 pages
- Edition: 4
- Language: English
- Publisher: Morgan Kaufmann
- Publication Date: 2022-10-17
- ISBN-10: 0128117605
- ISBN-13: 9780128117606
- Sales Rank: #977738
Data Mining: Concepts and Techniques, Fourth Edition introduces concepts, principles, and methods for mining patterns, knowledge, and models from various kinds of data for diverse applications. Specifically, it delves into the processes for uncovering patterns and knowledge from massive collections of data, known as knowledge discovery from data, or KDD. It focuses on the feasibility, usefulness, effectiveness, and scalability of data mining techniques for large data sets.
After an introduction to the concept of data mining, the authors explain the methods for preprocessing, characterizing, and warehousing data. They then partition the data mining methods into several major tasks, introducing concepts and methods for mining frequent patterns, associations, and correlations in large data sets; data classification and model construction; cluster analysis; and outlier detection. Concepts and methods for deep learning are systematically introduced in a dedicated chapter. Finally, the book covers the trends, applications, and research frontiers in data mining.
Table of contents:
- Front matter: Front Cover; Data Mining; Copyright; Contents; Foreword; Foreword to second edition; Preface; Acknowledgments; About the authors
- 1 Introduction
  - 1.1 What is data mining?
  - 1.2 Data mining: an essential step in knowledge discovery
  - 1.3 Diversity of data types for data mining
  - 1.4 Mining various kinds of knowledge: 1.4.1 Multidimensional data summarization; 1.4.2 Mining frequent patterns, associations, and correlations; 1.4.3 Classification and regression for predictive analysis; 1.4.4 Cluster analysis; 1.4.5 Deep learning; 1.4.6 Outlier analysis; 1.4.7 Are all mining results interesting?
  - 1.5 Data mining: confluence of multiple disciplines: 1.5.1 Statistics and data mining; 1.5.2 Machine learning and data mining; 1.5.3 Database technology and data mining; 1.5.4 Data mining and data science; 1.5.5 Data mining and other disciplines
  - 1.6 Data mining and applications
  - 1.7 Data mining and society
  - 1.8 Summary; 1.9 Exercises; 1.10 Bibliographic notes
- 2 Data, measurements, and data preprocessing
  - 2.1 Data types: 2.1.1 Nominal attributes; 2.1.2 Binary attributes; 2.1.3 Ordinal attributes; 2.1.4 Numeric attributes (interval-scaled attributes; ratio-scaled attributes); 2.1.5 Discrete vs. continuous attributes
  - 2.2 Statistics of data: 2.2.1 Measuring the central tendency; 2.2.2 Measuring the dispersion of data (range, quartiles, and interquartile range; five-number summary, boxplots, and outliers; variance and standard deviation); 2.2.3 Covariance and correlation analysis (covariance of numeric data; correlation coefficient for numeric data; χ² correlation test for nominal data); 2.2.4 Graphic displays of basic statistics of data (quantile plot; quantile-quantile plot; histograms; scatter plots and data correlation)
  - 2.3 Similarity and distance measures: 2.3.1 Data matrix vs. dissimilarity matrix; 2.3.2 Proximity measures for nominal attributes; 2.3.3 Proximity measures for binary attributes; 2.3.4 Dissimilarity of numeric data: Minkowski distance; 2.3.5 Proximity measures for ordinal attributes; 2.3.6 Dissimilarity for attributes of mixed types; 2.3.7 Cosine similarity; 2.3.8 Measuring similar distributions: the Kullback-Leibler divergence; 2.3.9 Capturing hidden semantics in similarity measures
  - 2.4 Data quality, data cleaning, and data integration: 2.4.1 Data quality measures; 2.4.2 Data cleaning (missing values; noisy data; data cleaning as a process); 2.4.3 Data integration (entity identification problem; redundancy and correlation analysis; tuple duplication; data value conflict detection and resolution)
  - 2.5 Data transformation: 2.5.1 Normalization; 2.5.2 Discretization (discretization by binning; discretization by histogram analysis); 2.5.3 Data compression; 2.5.4 Sampling
  - 2.6 Dimensionality reduction: 2.6.1 Principal components analysis; 2.6.2 Attribute subset selection; 2.6.3 Nonlinear dimensionality reduction methods (general procedure; kernel PCA; stochastic neighbor embedding)
  - 2.7 Summary; 2.8 Exercises; 2.9 Bibliographic notes
- 3 Data warehousing and online analytical processing
  - 3.1 Data warehouse: 3.1.1 Data warehouse: what and why?; 3.1.2 Architecture of data warehouses: enterprise data warehouses and data marts (the three-tier architecture; ETL for data warehouses; enterprise data warehouse and data mart); 3.1.3 Data lakes
  - 3.2 Data warehouse modeling: schema and measures: 3.2.1 Data cube: a multidimensional data model; 3.2.2 Schemas for multidimensional data models: stars, snowflakes, and fact constellations; 3.2.3 Concept hierarchies; 3.2.4 Measures: categorization and computation
  - 3.3 OLAP operations: 3.3.1 Typical OLAP operations; 3.3.2 Indexing OLAP data: bitmap index and join index (bitmap indexing; join indexing); 3.3.3 Storage implementation: column-based databases
  - 3.4 Data cube computation: 3.4.1 Terminology of data cube computation; 3.4.2 Data cube materialization: ideas; 3.4.3 OLAP server architectures: ROLAP vs. MOLAP vs. HOLAP; 3.4.4 General strategies for data cube computation
  - 3.5 Data cube computation methods: 3.5.1 Multiway array aggregation for full cube computation; 3.5.2 BUC: computing iceberg cubes from the apex cuboid downward; 3.5.3 Precomputing shell fragments for fast high-dimensional OLAP; 3.5.4 Efficient processing of OLAP queries using cuboids
  - 3.6 Summary; 3.7 Exercises; 3.8 Bibliographic notes
- 4 Pattern mining: basic concepts and methods
  - 4.1 Basic concepts: 4.1.1 Market basket analysis: a motivating example; 4.1.2 Frequent itemsets, closed itemsets, and association rules
  - 4.2 Frequent itemset mining methods: 4.2.1 Apriori algorithm: finding frequent itemsets by confined candidate generation; 4.2.2 Generating association rules from frequent itemsets; 4.2.3 Improving the efficiency of Apriori; 4.2.4 A pattern-growth approach for mining frequent itemsets; 4.2.5 Mining frequent itemsets using the vertical data format; 4.2.6 Mining closed and max patterns
  - 4.3 Which patterns are interesting? Pattern evaluation methods: 4.3.1 Strong rules are not necessarily interesting; 4.3.2 From association analysis to correlation analysis; 4.3.3 A comparison of pattern evaluation measures
  - 4.4 Summary; 4.5 Exercises; 4.6 Bibliographic notes
- 5 Pattern mining: advanced methods
  - 5.1 Mining various kinds of patterns: 5.1.1 Mining multilevel associations; 5.1.2 Mining multidimensional associations; 5.1.3 Mining quantitative association rules (data cube–based mining of quantitative associations; mining clustering-based quantitative associations; using statistical theory to disclose exceptional behavior); 5.1.4 Mining high-dimensional data; 5.1.5 Mining rare patterns and negative patterns
  - 5.2 Mining compressed or approximate patterns: 5.2.1 Mining compressed patterns by pattern clustering; 5.2.2 Extracting redundancy-aware top-k patterns
  - 5.3 Constraint-based pattern mining: 5.3.1 Pruning pattern space with pattern pruning constraints (pattern antimonotonicity; pattern monotonicity; convertible constraints: ordering data in transactions); 5.3.2 Pruning data space with data pruning constraints; 5.3.3 Mining space pruning with succinctness constraints
  - 5.4 Mining sequential patterns: 5.4.1 Sequential pattern mining: concepts and primitives; 5.4.2 Scalable methods for mining sequential patterns (GSP: a sequential pattern mining algorithm based on candidate generate-and-test; SPADE: an Apriori-based vertical data format sequential pattern mining algorithm; PrefixSpan: prefix-projected sequential pattern growth; mining closed sequential patterns; mining multidimensional, multilevel sequential patterns); 5.4.3 Constraint-based mining of sequential patterns
  - 5.5 Mining subgraph patterns: 5.5.1 Methods for mining frequent subgraphs (Apriori-based approach; pattern-growth approach); 5.5.2 Mining variant and constrained substructure patterns (mining closed frequent substructures; extension of pattern-growth approach: mining alternative substructure patterns; mining substructure patterns with user-specified constraints; mining approximate frequent substructures; mining coherent substructures)
  - 5.6 Pattern mining: application examples: 5.6.1 Phrase mining in massive text data (how to judge the quality of a phrase?; phrasal segmentation and computing phrase quality; phrase mining methods); 5.6.2 Mining copy-and-paste bugs in software programs
  - 5.7 Summary; 5.8 Exercises; 5.9 Bibliographic notes
- 6 Classification: basic concepts and methods
  - 6.1 Basic concepts: 6.1.1 What is classification?; 6.1.2 General approach to classification
  - 6.2 Decision tree induction: 6.2.1 Decision tree induction; 6.2.2 Attribute selection measures (information gain; gain ratio; Gini impurity; other attribute selection measures); 6.2.3 Tree pruning
  - 6.3 Bayes classification methods: 6.3.1 Bayes' theorem; 6.3.2 Naïve Bayesian classification
  - 6.4 Lazy learners (or learning from your neighbors): 6.4.1 k-nearest-neighbor classifiers; 6.4.2 Case-based reasoning
  - 6.5 Linear classifiers: 6.5.1 Linear regression; 6.5.2 Perceptron: turning linear regression to classification; 6.5.3 Logistic regression
  - 6.6 Model evaluation and selection: 6.6.1 Metrics for evaluating classifier performance; 6.6.2 Holdout method and random subsampling; 6.6.3 Cross-validation; 6.6.4 Bootstrap; 6.6.5 Model selection using statistical tests of significance; 6.6.6 Comparing classifiers based on cost–benefit and ROC curves
  - 6.7 Techniques to improve classification accuracy: 6.7.1 Introducing ensemble methods; 6.7.2 Bagging; 6.7.3 Boosting; 6.7.4 Random forests; 6.7.5 Improving classification accuracy of class-imbalanced data
  - 6.8 Summary; 6.9 Exercises; 6.10 Bibliographic notes
- 7 Classification: advanced methods
  - 7.1 Feature selection and engineering: 7.1.1 Filter methods; 7.1.2 Wrapper methods; 7.1.3 Embedded methods
  - 7.2 Bayesian belief networks: 7.2.1 Concepts and mechanisms; 7.2.2 Training Bayesian belief networks
  - 7.3 Support vector machines: 7.3.1 Linear support vector machines; 7.3.2 Nonlinear support vector machines
  - 7.4 Rule-based and pattern-based classification: 7.4.1 Using IF-THEN rules for classification; 7.4.2 Rule extraction from a decision tree; 7.4.3 Rule induction using a sequential covering algorithm (rule quality measures; rule pruning); 7.4.4 Associative classification; 7.4.5 Discriminative frequent pattern–based classification
  - 7.5 Classification with weak supervision: 7.5.1 Semisupervised classification; 7.5.2 Active learning; 7.5.3 Transfer learning; 7.5.4 Distant supervision; 7.5.5 Zero-shot learning
  - 7.6 Classification with rich data types: 7.6.1 Stream data classification; 7.6.2 Sequence classification; 7.6.3 Graph data classification
  - 7.7 Potpourri: other related techniques: 7.7.1 Multiclass classification; 7.7.2 Distance metric learning; 7.7.3 Interpretability of classification; 7.7.4 Genetic algorithms; 7.7.5 Reinforcement learning
  - 7.8 Summary; 7.9 Exercises; 7.10 Bibliographic notes
- 8 Cluster analysis: basic concepts and methods
  - 8.1 Cluster analysis: 8.1.1 What is cluster analysis?; 8.1.2 Requirements for cluster analysis; 8.1.3 Overview of basic clustering methods
  - 8.2 Partitioning methods: 8.2.1 k-means: a centroid-based technique; 8.2.2 Variations of k-means (k-medoids: a representative object-based technique; k-modes: clustering nominal data; initialization in partitioning methods; estimating the number of clusters; applying feature transformation)
  - 8.3 Hierarchical methods: 8.3.1 Basic concepts of hierarchical clustering; 8.3.2 Agglomerative hierarchical clustering (similarity measures in hierarchical clustering; connecting agglomerative hierarchical clustering and partitioning methods; the Lance-Williams algorithm); 8.3.3 Divisive hierarchical clustering (the minimum spanning tree–based approach; dendrogram); 8.3.4 BIRCH: scalable hierarchical clustering using clustering feature trees; 8.3.5 Probabilistic hierarchical clustering
  - 8.4 Density-based and grid-based methods: 8.4.1 DBSCAN: density-based clustering based on connected regions with high density; 8.4.2 DENCLUE: clustering based on density distribution functions; 8.4.3 Grid-based methods
  - 8.5 Evaluation of clustering: 8.5.1 Assessing clustering tendency; 8.5.2 Determining the number of clusters; 8.5.3 Measuring clustering quality: extrinsic methods (extrinsic vs. intrinsic methods; desiderata of extrinsic methods; categories of extrinsic methods: matching-based, information theory–based, and pairwise comparison–based methods); 8.5.4 Intrinsic methods
  - 8.6 Summary; 8.7 Exercises; 8.8 Bibliographic notes
- 9 Cluster analysis: advanced methods
  - 9.1 Probabilistic model-based clustering: 9.1.1 Fuzzy clusters; 9.1.2 Probabilistic model-based clusters; 9.1.3 Expectation-maximization algorithm
  - 9.2 Clustering high-dimensional data: 9.2.1 Why is clustering high-dimensional data challenging? (motivations of clustering analysis on high-dimensional data; high-dimensional clustering models; categorization of high-dimensional clustering methods); 9.2.2 Axis-parallel subspace approaches (CLIQUE: a subspace clustering method; PROCLUS: a projected clustering method; soft projected clustering methods); 9.2.3 Arbitrarily oriented subspace approaches
  - 9.3 Biclustering: 9.3.1 Why and where is biclustering useful?; 9.3.2 Types of biclusters; 9.3.3 Biclustering methods (optimization using the δ-cluster algorithm); 9.3.4 Enumerating all biclusters using MaPle
  - 9.4 Dimensionality reduction for clustering: 9.4.1 Linear dimensionality reduction methods for clustering; 9.4.2 Nonnegative matrix factorization (NMF); 9.4.3 Spectral clustering (similarity graph; finding a new space; extracting clusters)
  - 9.5 Clustering graph and network data: 9.5.1 Applications and challenges; 9.5.2 Similarity measures (geodesic distance; SimRank: similarity based on random walk and structural context; personalized PageRank and topical PageRank); 9.5.3 Graph clustering methods (generic high-dimensional clustering methods on graphs; specific clustering methods by searching graph structures; probabilistic graphical model-based methods)
  - 9.6 Semisupervised clustering: 9.6.1 Semisupervised clustering on partially labeled data; 9.6.2 Semisupervised clustering on pairwise constraints; 9.6.3 Other types of background knowledge for semisupervised clustering (semisupervised hierarchical clustering; clusters associated with outcome variables; active and interactive learning for semisupervised clustering)
  - 9.7 Summary; 9.8 Exercises; 9.9 Bibliographic notes
- 10 Deep learning
  - 10.1 Basic concepts: 10.1.1 What is deep learning?; 10.1.2 Backpropagation algorithm; 10.1.3 Key challenges for training deep learning models; 10.1.4 Overview of deep learning architecture
  - 10.2 Improve training of deep learning models: 10.2.1 Responsive activation functions; 10.2.2 Adaptive learning rate; 10.2.3 Dropout; 10.2.4 Pretraining; 10.2.5 Cross-entropy; 10.2.6 Autoencoder: unsupervised deep learning; 10.2.7 Other techniques
  - 10.3 Convolutional neural networks: 10.3.1 Introducing convolution operation; 10.3.2 Multidimensional convolution; 10.3.3 Convolutional layer
  - 10.4 Recurrent neural networks: 10.4.1 Basic RNN models and applications; 10.4.2 Gated RNNs; 10.4.3 Other techniques for addressing long-term dependence
  - 10.5 Graph neural networks: 10.5.1 Basic concepts; 10.5.2 Graph convolutional networks; 10.5.3 Other types of GNNs
  - 10.6 Summary; 10.7 Exercises; 10.8 Bibliographic notes
- 11 Outlier detection
  - 11.1 Basic concepts: 11.1.1 What are outliers?; 11.1.2 Types of outliers (global outliers; contextual outliers; collective outliers); 11.1.3 Challenges of outlier detection; 11.1.4 An overview of outlier detection methods (supervised, semisupervised, and unsupervised methods; statistical methods, proximity-based methods, and reconstruction-based methods)
  - 11.2 Statistical approaches: 11.2.1 Parametric methods (detection of univariate outliers based on normal distribution; detection of multivariate outliers; using a mixture of parametric distributions); 11.2.2 Nonparametric methods
  - 11.3 Proximity-based approaches: 11.3.1 Distance-based outlier detection; 11.3.2 Density-based outlier detection
  - 11.4 Reconstruction-based approaches: 11.4.1 Matrix factorization–based methods for numerical data; 11.4.2 Pattern-based compression methods for categorical data
  - 11.5 Clustering- vs. classification-based approaches: 11.5.1 Clustering-based approaches; 11.5.2 Classification-based approaches
  - 11.6 Mining contextual and collective outliers: 11.6.1 Transforming contextual outlier detection to conventional outlier detection; 11.6.2 Modeling normal behavior with respect to contexts; 11.6.3 Mining collective outliers
  - 11.7 Outlier detection in high-dimensional data: 11.7.1 Extending conventional outlier detection; 11.7.2 Finding outliers in subspaces; 11.7.3 Outlier detection ensemble; 11.7.4 Taming high dimensionality by deep learning; 11.7.5 Modeling high-dimensional outliers
  - 11.8 Summary; 11.9 Exercises; 11.10 Bibliographic notes
- 12 Data mining trends and research frontiers (each subsection closes with its own bibliographic notes)
  - 12.1 Mining rich data types: 12.1.1 Mining text data; 12.1.2 Spatial-temporal data (auto-correlation and heterogeneity in spatial and temporal data; spatial and temporal data types; spatial and temporal data models); 12.1.3 Graph and networks
  - 12.2 Data mining applications: 12.2.1 Data mining for sentiment and opinion (what are sentiments and opinions?; sentiment analysis and opinion mining techniques; sentiment analysis and opinion mining applications); 12.2.2 Truth discovery and misinformation identification (truth discovery; identification of misinformation); 12.2.3 Information and disease propagation; 12.2.4 Productivity and team science
  - 12.3 Data mining methodologies and systems: 12.3.1 Structuring unstructured data for knowledge mining: a data-driven approach; 12.3.2 Data augmentation; 12.3.3 From correlation to causality; 12.3.4 Network as a context; 12.3.5 Auto-ML: methods and systems
  - 12.4 Data mining, people, and society: 12.4.1 Privacy-preserving data mining; 12.4.2 Human-algorithm interaction; 12.4.3 Mining beyond maximizing accuracy: fairness, interpretability, and robustness; 12.4.4 Data mining for social good
- A Mathematical background
  - A.1 Probability and statistics: A.1.1 PDF of typical distributions; A.1.2 MLE and MAP; A.1.3 Significance test; A.1.4 Density estimation; A.1.5 Bias-variance tradeoff; A.1.6 Cross-validation and jackknife
  - A.2 Numerical optimization: A.2.1 Gradient descent; A.2.2 Variants of gradient descent; A.2.3 Newton's method; A.2.4 Coordinate descent; A.2.5 Quadratic programming
  - A.3 Matrix and linear algebra: A.3.1 Linear system Ax=b (standard square system; overdetermined system; underdetermined system); A.3.2 Norms of vectors and matrices; A.3.3 Matrix decompositions (eigenvalues and eigendecomposition; singular value decomposition (SVD)); A.3.4 Subspace; A.3.5 Orthogonality
  - A.4 Concepts and tools from signal processing: A.4.1 Entropy; A.4.2 Kullback-Leibler divergence (KL-divergence); A.4.3 Mutual information; A.4.4 Discrete Fourier transform (DFT) and fast Fourier transform (FFT)
  - A.5 Bibliographic notes
- Back matter: Bibliography; Index; Back Cover