Post-Shrinkage Strategies in Statistical and Machine Learning for High Dimensional Data
- Length: 378 pages
- Edition: 1
- Language: English
- Publisher: Chapman and Hall/CRC
- Publication Date: 2023-05-25
- ISBN-10: 0367763443
- ISBN-13: 9780367763442
This book presents post-estimation and prediction strategies for a host of useful statistical models with applications in data science. It combines statistical learning and machine learning techniques in a unique and optimal way. It is well known that machine learning methods are subject to many issues relating to bias, and consequently the mean squared error and prediction error may explode. For this reason, we suggest shrinkage strategies that control the bias by combining a submodel selected by a penalized method with a model containing many features. Further, the suggested shrinkage methodology can be successfully implemented for high-dimensional data analysis. Many researchers in statistics and the medical sciences work with big data and need to analyse it through statistical modelling. Estimating the model parameters accurately is an important part of data analysis. This book serves as a repository of improved estimation strategies for statisticians. It will help researchers and practitioners in their teaching and advanced research, and is an excellent textbook for advanced undergraduate and graduate courses involving shrinkage, statistical, and machine learning methods.
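The core idea sketched above, pulling a full-model estimate toward a penalty-selected submodel estimate by a data-driven weight, is the classical Stein-type shrinkage construction. As a rough illustration only (a minimal sketch in plain Python, not the book's own code, which is written in R; the function names, the toy numbers, and the two-coefficient example are ours), the shrinkage and positive-part shrinkage estimators can be written as:

```python
def shrinkage(beta_fm, beta_sm, t_n, k):
    """Stein-type shrinkage: pull the full-model estimate beta_fm
    toward the submodel estimate beta_sm. t_n is a distance (test)
    statistic between the two fits, k the number of restricted
    coefficients (k > 2)."""
    w = 1.0 - (k - 2) / t_n  # shrinkage weight
    return [sm + w * (fm - sm) for fm, sm in zip(beta_fm, beta_sm)]

def positive_shrinkage(beta_fm, beta_sm, t_n, k):
    """Positive-part version: truncate the weight at zero so the
    estimator never over-shrinks past the submodel."""
    w = max(0.0, 1.0 - (k - 2) / t_n)
    return [sm + w * (fm - sm) for fm, sm in zip(beta_fm, beta_sm)]

# Toy example: full-model fit [1.0, 2.0], submodel fit [0.0, 0.0],
# distance statistic 8, k = 4 restricted coefficients.
print(shrinkage([1.0, 2.0], [0.0, 0.0], 8.0, 4))           # [0.75, 1.5]
print(positive_shrinkage([1.0, 2.0], [0.0, 0.0], 1.0, 4))  # [0.0, 0.0]
```

When the distance statistic is large (the submodel fits poorly), the weight is near one and the estimator stays close to the full model; when it is small, the estimate is pulled toward the submodel, which is how the bias of an aggressive submodel is traded against the variance of the full model.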
The book succinctly reveals the bias inherent in machine learning methods and provides tools, tricks, and tips to deal with it.
Expertly sheds light on the fundamental reasoning behind model selection and post-estimation using shrinkage and related strategies.
This presentation is fundamental, because shrinkage and related methods are well suited to model selection and estimation problems, and there is growing interest in this area in closing the gap between competing strategies.
Applies these strategies to real-life data sets from many walks of life.
Analytical results are fully corroborated by numerical work, and each chapter includes numerous worked examples with graphs for data visualization.
The presentation and style of the book clearly makes it accessible to a broad audience. It offers rich, concise expositions of each strategy and clearly describes how to use each estimation strategy for the problem at hand.
This book emphasizes that statistics and statisticians can play a dominant role in solving Big Data problems, and will put them at the forefront of scientific discovery.
The book contributes novel methodologies for HDDA and will open a door for continued research in this hot area. The practical impact of the proposed work stems from wide applications. The developed computational packages will aid in analyzing a broad range of applications in many walks of life.
Table of Contents:
- Front matter: Cover Page; Half-Title Page; Title Page; Copyright Page; Dedication Page; Contents; Preface; Acknowledgments; Author/Editor Biographies; List of Figures; List of Tables; Contributors; Abbreviations
- 1 Introduction
  - 1.1 Least Absolute Shrinkage and Selection Operator
  - 1.2 Elastic Net
  - 1.3 Adaptive LASSO
  - 1.4 Smoothly Clipped Absolute Deviation
  - 1.5 Minimax Concave Penalty
  - 1.6 High-Dimensional Weak-Sparse Regression Model
  - 1.7 Estimation Strategies: 1.7.1 Pretest Estimation Strategy; 1.7.2 Shrinkage Estimation Strategy
  - 1.8 Asymptotic Properties of Non-Penalty Estimators: 1.8.1 Bias of Estimators; 1.8.2 Risk of Estimators
  - 1.9 Organization of the Book
- 2 Introduction to Machine Learning
  - 2.1 What is Learning?
  - 2.2 Unsupervised Learning: Principal Component Analysis and k-Means Clustering: 2.2.1 Principal Component Analysis (PCA); 2.2.2 k-Means Clustering; 2.2.3 Extension: Unsupervised Text Analysis
  - 2.3 Supervised Learning: 2.3.1 Logistic Regression; 2.3.2 Multivariate Adaptive Regression Splines (MARS); 2.3.3 k Nearest Neighbours (kNN); 2.3.4 Random Forest; 2.3.5 Support Vector Machine (SVM); 2.3.6 Linear Discriminant Analysis (LDA); 2.3.7 Artificial Neural Network (ANN); 2.3.8 Gradient Boosting Machine (GBM)
  - 2.4 Implementation in R
  - 2.5 Case Study: Genomics: 2.5.1 Data Exploration; 2.5.2 Modeling
- 3 Post-Shrinkage Strategies in Sparse Regression Models
  - 3.1 Introduction
  - 3.2 Estimation Strategies: 3.2.1 Least Squares Estimation Strategies; 3.2.2 Maximum Likelihood Estimator; 3.2.3 Full Model and Submodel Estimators; 3.2.4 Shrinkage Strategies
  - 3.3 Asymptotic Analysis: 3.3.1 Asymptotic Distributional Bias; 3.3.2 Asymptotic Distributional Risk
  - 3.4 Relative Risk Assessment: 3.4.1 Risk Comparison of β̂1^FM and β̂1^SM; 3.4.2 Risk Comparison of β̂1^FM and β̂1^S; 3.4.3 Risk Comparison of β̂1^S and β̂1^SM; 3.4.4 Risk Comparison of β̂1^PS and β̂1^FM; 3.4.5 Risk Comparison of β̂1^PS and β̂1^S; 3.4.6 Mean Squared Prediction Error
  - 3.5 Simulation Experiments: 3.5.1 Strong Signals and Noises; 3.5.2 Signals and Noises; 3.5.3 Comparing Shrinkage Estimators with Penalty Estimators
  - 3.6 Prostate Cancer Data Example: 3.6.1 Classical Strategy; 3.6.2 Shrinkage and Penalty Strategies; 3.6.3 Prediction Error via Bootstrapping; 3.6.4 Machine Learning Strategies
  - 3.7 R-Codes
  - 3.8 Concluding Remarks
- 4 Shrinkage Strategies in High-Dimensional Regression Models
  - 4.1 Introduction
  - 4.2 Estimation Strategies
  - 4.3 Integrating Submodels: 4.3.1 Sparse Regression Model; 4.3.2 Overfitted Regression Model; 4.3.3 Underfitted Regression Model; 4.3.4 Non-Linear Shrinkage Estimation Strategies
  - 4.4 Simulation Experiments
  - 4.5 Real Data Examples: 4.5.1 Eye Data; 4.5.2 Expression Data; 4.5.3 Riboflavin Data
  - 4.6 R-Codes
  - 4.7 Concluding Remarks
- 5 Shrinkage Estimation Strategies in Partially Linear Models
  - 5.1 Introduction: 5.1.1 Statement of the Problem
  - 5.2 Estimation Strategies
  - 5.3 Asymptotic Properties
  - 5.4 Simulation Experiments: 5.4.1 Comparing with Penalty Estimators
  - 5.5 Real Data Examples: 5.5.1 Housing Prices (HP) Data; 5.5.2 Investment Data of Turkey
  - 5.6 High-Dimensional Model: 5.6.1 Real Data Example
  - 5.7 R-Codes
  - 5.8 Concluding Remarks
- 6 Shrinkage Strategies: Generalized Linear Models
  - 6.1 Introduction
  - 6.2 Maximum Likelihood Estimation
  - 6.3 A Gentle Introduction of Logistic Regression Model: 6.3.1 Statement of the Problem
  - 6.4 Estimation Strategies: 6.4.1 The Shrinkage Estimation Strategies
  - 6.5 Asymptotic Properties
  - 6.6 Simulation Experiment: 6.6.1 Penalized Strategies
  - 6.7 Real Data Examples: 6.7.1 Pima Indians Diabetes (PID) Data; 6.7.2 South Africa Heart-Attack Data; 6.7.3 Orinda Longitudinal Study of Myopia (OLSM) Data
  - 6.8 High-Dimensional Data: 6.8.1 Simulation Experiments; 6.8.2 Gene Expression Data
  - 6.9 A Gentle Introduction of Negative Binomial Models: 6.9.1 Sparse NB Regression Model
  - 6.10 Shrinkage and Penalized Strategies
  - 6.11 Asymptotic Analysis
  - 6.12 Simulation Experiments
  - 6.13 Real Data Examples: 6.13.1 Resume Data; 6.13.2 Labor Supply Data
  - 6.14 High-Dimensional Data
  - 6.15 R-Codes
  - 6.16 Concluding Remarks
- 7 Post-Shrinkage Strategy in Sparse Linear Mixed Models
  - 7.1 Introduction
  - 7.2 Estimation Strategies: 7.2.1 A Gentle Introduction to Linear Mixed Model; 7.2.2 Ridge Regression; 7.2.3 Shrinkage Estimation Strategy
  - 7.3 Asymptotic Results
  - 7.4 High-Dimensional Simulation Studies: 7.4.1 Comparing with Penalized Estimation Strategies; 7.4.2 Weak Signals
  - 7.5 Real Data Applications: 7.5.1 Amsterdam Growth and Health Data (AGHD); 7.5.2 Resting-State Effective Brain Connectivity and Genetic Data
  - 7.6 Concluding Remarks
- 8 Shrinkage Estimation in Sparse Nonlinear Regression Models
  - 8.1 Introduction
  - 8.2 Model and Estimation Strategies: 8.2.1 Shrinkage Strategy
  - 8.3 Asymptotic Analysis
  - 8.4 Simulation Experiments: 8.4.1 High-Dimensional Data; 8.4.1.1 Post-Selection Estimation Strategy
  - 8.5 Application to Rice Yield Data
  - 8.6 R-Codes
  - 8.7 Concluding Remarks
- 9 Shrinkage Strategies in Sparse Robust Regression Models
  - 9.1 Introduction
  - 9.2 LAD Shrinkage Strategies: 9.2.1 Asymptotic Properties; 9.2.2 Bias of the Estimators; 9.2.3 Risk of Estimators
  - 9.3 Simulation Experiments
  - 9.4 Penalized Estimation
  - 9.5 Real Data Applications: 9.5.1 US Crime Data; 9.5.2 Barro Data; 9.5.3 Murder Rate Data
  - 9.6 High-Dimensional Data: 9.6.1 Simulation Experiments; 9.6.2 Real Data Application
  - 9.7 R-Codes
  - 9.8 Concluding Remarks
- 10 Liu-type Shrinkage Estimations in Linear Sparse Models
  - 10.1 Introduction
  - 10.2 Estimation Strategies: 10.2.1 Estimation Under a Sparsity Assumption; 10.2.2 Shrinkage Liu Estimation
  - 10.3 Asymptotic Analysis
  - 10.4 Simulation Experiments: 10.4.1 Comparisons with Penalty Estimators
  - 10.5 Application to Air Pollution Data
  - 10.6 R-Codes
  - 10.7 Concluding Remarks
- Bibliography
- Index