Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data

by Nathan George

Length: 620 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2021-09-30
ISBN-10: 1801071977
ISBN-13: 9781801071970

Learn to effectively manage data and execute data science projects from start to finish using Python

Key Features

Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling
Build a strong data science foundation with the best data science tools available in Python
Add value to yourself, your organization, and society by extracting actionable insights from raw data

Book Description

Practical Data Science with Python teaches core data science concepts through real-world and realistic examples. It strengthens your grasp of both basic and advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, giving you a solid foundation for gaining proficiency in data science.

The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute the techniques. You’ll understand the code by working through the examples. The code has been broken down into small chunks (a few lines or a function at a time) to enable thorough discussion.

As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new data science developments.

By the end of the book, you should be able to comfortably use Python for basic data science projects and should have the skills to execute the data science process on any data source.

What you will learn

Use Python data science packages effectively
Clean and prepare data for data science work, including feature engineering and feature selection
Model data with classic statistical methods (such as t-tests) and essential machine learning algorithms (such as random forests and boosted models)
Evaluate model performance
Compare and understand different machine learning methods
Interact with Excel spreadsheets through Python
Create automated data science reports through Python
Get to grips with text analytics techniques

Who this book is for

The book is intended for beginners, including students starting or about to start a data science, analytics, or related program (e.g. Bachelor’s, Master’s, bootcamp, online courses), recent college graduates who want to learn new skills to set them apart in the job market, professionals who want to learn hands-on data science techniques in Python, and those who want to shift their career to data science.

The book requires basic familiarity with Python. A “getting started with Python” section has been included to get complete novices up to speed.

Table of Contents

Introduction to Data Science
Getting Started with Python
SQL and Built-in File Handling Modules in Python
Loading and Wrangling Data with Pandas and NumPy
Exploratory Data Analysis and Visualization
Data Wrangling Documents and Spreadsheets
Web Scraping
Probability, Distributions, and Sampling
Statistical Testing for Data Science
Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
Machine Learning for Classification
Evaluating Machine Learning Classification Models and Sampling for Classification
Machine Learning with Regression

Preface
    Who this book is for
    What this book covers
    To get the most out of this book
    Get in touch
Part I - An Introduction and the Basics
Introduction to Data Science
    The data science origin story
    The top data science tools and skills
        Python
        Other programming languages
        GUIs and platforms
        Cloud tools
        Statistical methods and math
        Collecting, organizing, and preparing data
        Software development
        Business understanding and communication
    Specializations in and around data science
        Machine learning
        Business intelligence
        Deep learning
        Data engineering
        Big data
        Statistical methods
        Natural Language Processing (NLP)
        Artificial Intelligence (AI)
        Choosing how to specialize
    Data science project methodologies
        Using data science in other fields
        CRISP-DM
        TDSP
            Further reading on data science project management strategies
        Other tools
    Test your knowledge
    Summary
Getting Started with Python
    Installing Python with Anaconda and getting started
        Installing Anaconda
        Running Python code
            The Python shell
            The IPython shell
            Jupyter
            Why the command line?
            Command line basics
        Installing and using a code text editor – VS Code
            Editing Python code with VS Code
            Running a Python file
        Installing Python packages and creating virtual environments
    Python basics
        Numbers
        Strings
        Variables
        Lists, tuples, sets, and dictionaries
            Lists
            Tuples
            Sets
            Dictionaries
        Loops and comprehensions
        Booleans and conditionals
        Packages and modules
        Functions
        Classes
        Multithreading and multiprocessing
    Software engineering best practices
        Debugging errors and utilizing documentation
            Debugging
            Documentation
        Version control with Git
        Code style
        Productivity tips
    Test your knowledge
    Summary
Part II - Dealing with Data
SQL and Built-in File Handling Modules in Python
    Introduction
    Loading, reading, and writing files with base Python
        Opening a file and reading its contents
        Using the built-in JSON module
        Saving credentials or data in a Python file
        Saving Python objects with pickle
    Using SQLite and SQL
        Creating a SQLite database and storing data
    Using the SQLAlchemy package in Python
    Test your knowledge
    Summary
Loading and Wrangling Data with Pandas and NumPy
    Data wrangling and analyzing iTunes data
        Loading and saving data with Pandas
            Understanding the DataFrame structure and combining/concatenating multiple DataFrames
        Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
            Examining the top and bottom of the data
            Examining the data's dimensions, datatypes, and missing values
            Investigating statistical properties of the data
            Plotting with DataFrames
        Cleaning data
        Filtering DataFrames
            Removing irrelevant data
            Dealing with missing values
            Dealing with outliers
            Dealing with duplicate values
            Ensuring datatypes are correct
            Standardizing data formats
        Data transformations
        Using replace, map, and apply to clean and transform data
        Using GroupBy
        Writing DataFrames to disk
        Wrangling and analyzing Bitcoin price data
    Understanding NumPy basics
    Using NumPy mathematical functions
    Test your knowledge
    Summary
Exploratory Data Analysis and Visualization 
    EDA and visualization libraries in Python
    Performing EDA with Seaborn and pandas
        Making boxplots and letter-value plots
        Making histograms and violin plots
        Making scatter plots with Matplotlib and Seaborn
        Examining correlations and making correlograms
        Making missing value plots
    Using EDA Python packages
    Using visualization best practices
        Saving plots for sharing and reports
    Making plots with Plotly
    Test your knowledge
    Summary
Data Wrangling Documents and Spreadsheets
    Parsing and processing Word and PDF documents
        Reading text from Word documents
            Extracting insights from Word documents: common words and phrases
            Analyzing words and phrases from the text
        Reading text from PDFs
    Reading and writing data with Excel files
        Using pandas for wrangling Excel files
            Analyzing the data
        Using openpyxl for wrangling Excel files
    Test your knowledge
    Summary
Web Scraping
    Understanding the structure of the internet
        GET and POST requests, and HTML
    Performing simple web scraping
        Using urllib
        Using the requests package
        Scraping several files
            Extracting the data from the scraped files
    Parsing HTML from scraped pages
        Using XPath, lxml, and bs4 to extract data from webpages
            Collecting data from several pages
    Using APIs to collect data
        Using API wrappers
    The ethics and legality of web scraping
    Test your knowledge
    Summary
Part III - Statistics for Data Science
Probability, Distributions, and Sampling
    Probability basics
        Independent and conditional probabilities
        Bayes' Theorem
        Frequentist versus Bayesian
    Distributions
        The normal distribution and using scipy to generate distributions
        Descriptive statistics of distributions
            Variants of the normal distribution
        Fitting distributions to data to get parameters
        The Student's t-distribution
        The Bernoulli distribution
        The binomial distribution
        The uniform distribution
        The exponential and Poisson distributions
        The Weibull distribution
        The Zipfian distribution
    Sampling from data
        The law of large numbers
        The central limit theorem
        Random sampling
        Bootstrap sampling and confidence intervals
    Test your knowledge
    Summary
Statistical Testing for Data Science
    Statistical testing basics and sample comparison tests
        The t-test and z-test
            One-sample, two-sided t-test
            The z-test
            One-sided tests
            Two-sample t- and z-tests: A/B testing
            Paired t- and z-tests
            Other A/B testing methods
            Testing between several groups with ANOVA
            Post-hoc ANOVA tests
            Assumptions for these methods
    Other statistical tests
        Testing if data belongs to a distribution
        Generalized ESD outlier test
        The Pearson correlation test
    Test your knowledge
    Summary
Part IV - Machine Learning
Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
    Types of machine learning
    Feature selection
        The curse of dimensionality
        Overfitting and underfitting, and the bias-variance trade-off
        Methods for feature selection
        Variance thresholding – removing features with too much and too little variance
        Univariate statistics feature selection
            Correlation
        Mutual information score and chi-squared
            The chi-squared test
            ANOVA
        Using the univariate statistics for feature selection
    Feature engineering
        Data cleaning and preparation
            Converting strings to dates
            Outlier cleaning strategies
        Combining multiple columns
        Transforming numeric data
            Standardization
            Making data more Gaussian with the Yeo-Johnson transform
        Extracting datetime features
        Binning
        One-hot encoding and label encoding
        Simplifying categorical columns
            One-hot encoding
    Dimensionality reduction
        Principal Component Analysis (PCA)
    Test your knowledge
    Summary
Machine Learning for Classification
    Machine learning classification algorithms
        Logistic regression for binary classification
            Getting predictions from our model
        How logistic regression works
            Odds ratio and the logit
            Examining feature importances with sklearn
            Using statsmodels for logistic regression
            Maximum likelihood estimation, optimizers, and the logistic regression algorithm
            Regularization
            Hyperparameters and cross-validation
            Logistic regression (and other models) with big data
        Naïve Bayes for binary classification
        k-nearest neighbors (KNN)
        Multiclass classification
            Logistic regression
            One-versus-rest and one-versus-one formulations
            Multi-label classification
        Choosing a model to use
            The "no free lunch" theorem
            Computational complexity of models
    Test your knowledge
    Summary
Evaluating Machine Learning Classification Models and Sampling for Classification
    Evaluating classification algorithm performance with metrics
        Train-validation-test splits
        Accuracy
        Cohen's Kappa
        Confusion matrix
        Precision, recall, and F1 score
        AUC score and the ROC curve
        Choosing the optimal cutoff threshold 
    Sampling and balancing classification data
        Downsampling
        Oversampling
        SMOTE and other synthetic sampling methods
    Test your knowledge
    Summary
Machine Learning with Regression
    Linear regression
        Linear regression with sklearn
        Linear regression with statsmodels
        Regularized linear regression
        Regression with KNN in sklearn
        Evaluating regression models
            R2 or the coefficient of determination
            Adjusted R2
            Information criteria
            Mean squared error
            Mean absolute error
        Linear regression assumptions
    Regression models on big data
    Forecasting
    Test your knowledge
    Summary
Optimizing Models and Using AutoML
    Hyperparameter optimization with search methods
        Using grid search
        Using random search
        Using Bayesian search
        Other advanced search methods
    Using learning curves
    Optimizing the number of features with ML models
    Using AutoML with PyCaret
        The no free lunch theorem
        AutoML solutions
        Using PyCaret
    Test your knowledge
    Summary
Tree-Based Machine Learning Models
    Decision trees
        Random forests
        Random forests with sklearn
        Random forests with H2O
    Feature importance from tree-based methods
        Using H2O for feature importance
        Using sklearn random forest feature importances
    Boosted trees: AdaBoost, XGBoost, LightGBM, and CatBoost
        AdaBoost
        XGBoost
            XGBoost with PyCaret
            XGBoost with the xgboost package
            Training boosted models on a GPU
        LightGBM
            LightGBM plotting
            Using LightGBM directly
        CatBoost
            Using CatBoost natively
        Using early stopping with boosting algorithms
    Test your knowledge
    Summary
Support Vector Machine (SVM) Machine Learning Models
    How SVMs work
        SVMs for classification
        SVMs for regression
    Using SVMs
        Using SVMs in sklearn
        Tuning SVMs with PyCaret
    Test your knowledge
    Summary
Part V - Text Analysis and Reporting
Clustering with Machine Learning
    Using k-means clustering
        Clustering metrics
        Optimizing k in k-means
            Examining the clusters
    Hierarchical clustering
    DBSCAN
    Other unsupervised methods
    Test your knowledge
    Summary
Working with Text
    Text preprocessing
        Basic text cleaning
        Stemming and Lemmatizing
        Preparing text with spaCy
        Word vectors
        TFIDF vectors
    Basic text analysis
        Word frequency plots
        Wordclouds
        Zipf's law
        Word collocations
        Parts of speech
    Unsupervised learning
        Topic modeling
        Topic modeling with PyCaret
        Topic modeling with Top2Vec
    Supervised learning
        Classification
    Sentiment analysis
    Test your knowledge
    Summary
Part VI - Wrapping Up
Data Storytelling and Automated Reporting/Dashboarding
    Data storytelling
        Data storytelling example
    Automated reporting and dashboarding
        Automated reporting options
        Automated dashboarding
        Scheduling tasks to run automatically 
    Test your knowledge
    Summary
Ethics and Privacy
    The ethics of machine learning algorithms
        Bias
        How to decrease ML biases
        Carefully evaluating performance and consequences
    Data privacy
    Data privacy regulations and laws
    k-anonymity, l-diversity, and t-closeness
    Differential privacy
    Using data science for the common good
    Other ethical considerations
    Test your knowledge
    Summary
Staying Up to Date and the Future of Data Science
    Blogs, newsletters, books, and academic sources
        Blogs
        Newsletters
        Books
        Academic sources
    Data science competition websites
    Online learning platforms
    Cloud services
    Other places to keep an eye on
    Strategies for staying up to date
    Other data science topics we didn't cover
    The future of data science
    Summary
Other Books You May Enjoy
Index

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the "Find a repository…" box, search for the book title: Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data. If the full title returns no results, search for the main title only.