Data Science Projects with Python: A case study approach to gaining valuable insights from real data with machine learning, 2nd Edition

Length: 420 pages
Edition: 2
Language: English
Publisher: Packt Publishing
Publication Date: 2021-08-10
ISBN-10: 1800564481
ISBN-13: 9781800564480
Sales Rank: #576596

Gain hands-on experience in Python programming with industry-standard machine learning tools such as pandas, scikit-learn, and XGBoost

Key Features

Think critically about data by exploring and cleaning it
Choose an appropriate machine learning model and train it on your data
Communicate data-driven insights with confidence and clarity

Book Description

If data is the new oil, then machine learning is the drill. As companies gain access to ever-increasing quantities of raw data, the ability to deliver state-of-the-art predictive models that support business decision-making becomes more and more valuable.

In this book, you’ll work on an end-to-end project based around a realistic data set and split up into bite-sized practical exercises. This creates a case-study approach that simulates the working conditions you’ll experience in real-world data science projects.

You’ll learn how to use key Python packages, including pandas, Matplotlib, and scikit-learn, and master the process of data exploration and data processing, before moving on to fitting, evaluating, and tuning algorithms such as regularized logistic regression and random forest.

Now in its second edition, this book will take you through the process of exploring data and delivering machine learning models. Updated to the latest version of Python, this new edition for 2021 includes brand new content on XGBoost, SHAP values, and how to evaluate and monitor machine learning models.

By the end of this data science book, you’ll have the skills, understanding, and confidence to build your own machine learning models and gain insights from real data.

What You Will Learn

Load, explore, and process data using the pandas Python package
Use Matplotlib to create compelling data visualizations
Implement predictive machine learning models with scikit-learn
Use lasso and ridge regression to reduce model overfitting
Evaluate random forest and logistic regression model performance
Create state-of-the-art models with XGBoost
Learn to use SHAP values to explain model predictions
Deliver business insights by presenting clear, convincing conclusions

Who This Book Is For

Data Science Projects with Python – Second Edition is for anyone who wants to get started with data science and machine learning. If you’re keen to advance your career by using data analysis and predictive modeling to generate business insights, then this book is the perfect place to begin. To quickly grasp the concepts covered, it is recommended that you have basic experience of programming with Python or another similar language, and a general interest in statistics.

Data Science Projects with Python
Second Edition
Preface
    About the Book
        About the Author
        Objectives
        Audience
        Approach
        About the Chapters
        Hardware Requirements
        Software Requirements
        Installation and Setup
            Code Bundle
            Anaconda and Setting up Your Environment
        Conventions
        Code Presentation
        Get in Touch
        Please Leave a Review
1. Data Exploration and Cleaning
    Introduction
    Python and the Anaconda Package Management System
        Indexing and the Slice Operator
        Exercise 1.01: Examining Anaconda and Getting Familiar with Python
    Different Types of Data Science Problems
    Loading the Case Study Data with Jupyter and pandas
        Exercise 1.02: Loading the Case Study Data in a Jupyter Notebook
        Getting Familiar with Data and Performing Data Cleaning
        The Business Problem
        Data Exploration Steps 
        Exercise 1.03: Verifying Basic Data Integrity
        Boolean Masks
        Exercise 1.04: Continuing Verification of Data Integrity
        Exercise 1.05: Exploring and Cleaning the Data 
    Data Quality Assurance and Exploration
        Exercise 1.06: Exploring the Credit Limit and Demographic Features
        Deep Dive: Categorical Features
        Exercise 1.07: Implementing OHE for a Categorical Feature
    Exploring the Financial History Features in the Dataset
        Activity 1.01: Exploring the Remaining Financial Features in the Dataset
    Summary
2. Introduction to Scikit-Learn and Model Evaluation
    Introduction
    Exploring the Response Variable and Concluding the Initial Exploration
    Introduction to Scikit-Learn
        Generating Synthetic Data
        Data for Linear Regression
        Exercise 2.01: Linear Regression in Scikit-Learn
    Model Performance Metrics for Binary Classification
        Splitting the Data: Training and Test Sets
        Classification Accuracy
        True Positive Rate, False Positive Rate, and Confusion Matrix
        Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python
        Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?
        Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model
        The Receiver Operating Characteristic (ROC) Curve
        Precision
        Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
    Summary
3. Details of Logistic Regression and Feature Exploration
    Introduction
    Examining the Relationships Between Features and the Response Variable
        Pearson Correlation
        Mathematics of Linear Correlation
        F-test
        Exercise 3.01: F-test and Univariate Feature Selection
        Finer Points of the F-test: Equivalence to the t-test for Two Classes and Cautions
        Hypotheses and Next Steps
        Exercise 3.02: Visualizing the Relationship Between the Features and Response Variable
    Univariate Feature Selection: What it Does and Doesn't Do
        Understanding Logistic Regression and the Sigmoid Function Using Function Syntax in Python
        Exercise 3.03: Plotting the Sigmoid Function
        Scope of Functions
        Why Is Logistic Regression Considered a Linear Model?
        Exercise 3.04: Examining the Appropriateness of Features for Logistic Regression
        From Logistic Regression Coefficients to Predictions Using Sigmoid
        Exercise 3.05: Linear Decision Boundary of Logistic Regression
        Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients
    Summary
4. The Bias-Variance Trade-Off
    Introduction
    Estimating the Coefficients and Intercepts of Logistic Regression
        Gradient Descent to Find Optimal Parameter Values
        Exercise 4.01: Using Gradient Descent to Minimize a Cost Function
        Assumptions of Logistic Regression
        The Motivation for Regularization: The Bias-Variance Trade-Off
        Exercise 4.02: Generating and Modeling Synthetic Classification Data 
        Lasso (L1) and Ridge (L2) Regularization
    Cross-Validation: Choosing the Regularization Parameter 
        Exercise 4.03: Reducing Overfitting on the Synthetic Data Classification Problem
        Options for Logistic Regression in Scikit-Learn
        Scaling Data, Pipelines, and Interaction Features in Scikit-Learn
        Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data
    Summary
5. Decision Trees and Random Forests
    Introduction
    Decision Trees
        The Terminology of Decision Trees and Connections to Machine Learning
        Exercise 5.01: A Decision Tree in Scikit-Learn
        Training Decision Trees: Node Impurity
        Features Used for the First Splits: Connections to Univariate Feature Selection and Interactions
        Training Decision Trees: A Greedy Algorithm
        Training Decision Trees: Different Stopping Criteria and Other Options
        Using Decision Trees: Advantages and Predicted Probabilities
        A More Convenient Approach to Cross-Validation
        Exercise 5.02: Finding Optimal Hyperparameters for a Decision Tree
    Random Forests: Ensembles of Decision Trees
        Random Forest: Predictions and Interpretability
        Exercise 5.03: Fitting a Random Forest
        Checkerboard Graph
        Activity 5.01: Cross-Validation Grid Search with Random Forest
    Summary
6. Gradient Boosting, XGBoost, and SHAP Values
    Introduction
    Gradient Boosting and XGBoost
        What Is Boosting?
        Gradient Boosting and XGBoost
    XGBoost Hyperparameters
        Early Stopping
        Tuning the Learning Rate
        Other Important Hyperparameters in XGBoost
        Exercise 6.01: Randomized Grid Search for Tuning XGBoost Hyperparameters
    Another Way of Growing Trees: XGBoost's grow_policy
    Explaining Model Predictions with SHAP Values
        Exercise 6.02: Plotting SHAP Interactions, Feature Importance, and Reconstructing Predicted Probabilities from SHAP Values
    Missing Data
        Saving Python Variables to a File
        Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP
    Summary
7. Test Set Analysis, Financial Insights, and Delivery to the Client
    Introduction
    Review of Modeling Results
        Feature Engineering
        Ensembling Multiple Models
        Different Modeling Techniques
        Balancing Classes
    Model Performance on the Test Set
        Distribution of Predicted Probability and Decile Chart
        Exercise 7.01: Equal-Interval Chart
        Calibration of Predicted Probabilities
    Financial Analysis
        Financial Conversation with the Client
        Exercise 7.02: Characterizing Costs and Savings
        Activity 7.01: Deriving Financial Insights
    Final Thoughts on Delivering a Predictive Model to the Client
        Model Monitoring
        Ethics in Predictive Modeling
    Summary
Appendix
    1. Data Exploration and Cleaning
        Activity 1.01: Exploring the Remaining Financial Features in the Dataset
    2. Introduction to Scikit-Learn and Model Evaluation
        Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
    3. Details of Logistic Regression and Feature Exploration
        Activity 3.01: Fitting a Logistic Regression Model and Directly Using the Coefficients
    4. The Bias-Variance Trade-Off
        Activity 4.01: Cross-Validation and Feature Engineering with the Case Study Data
    5. Decision Trees and Random Forests
        Activity 5.01: Cross-Validation Grid Search with Random Forest
    6. Gradient Boosting, XGBoost, and SHAP Values
        Activity 6.01: Modeling the Case Study Data with XGBoost and Explaining the Model with SHAP 
    7. Test Set Analysis, Financial Insights, and Delivery to the Client
        Activity 7.01: Deriving Financial Insights

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Data Science Projects with Python: A case study approach to gaining valuable insights from real data with machine learning, 2nd Edition. If the full title returns no results, search for the main title only.
