Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data
- Length: 620 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2021-09-30
- ISBN-10: 1801071977
- ISBN-13: 9781801071970
- Sales Rank: #6551944
Learn to effectively manage data and execute data science projects from start to finish using Python
Key Features
- Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling
- Build a strong data science foundation with the best data science tools available in Python
- Add value to yourself, your organization, and society by extracting actionable insights from raw data
Book Description
Practical Data Science with Python teaches you core data science concepts through real-world and realistic examples. It strengthens your grasp of both basic and advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, giving you a solid foundation for gaining proficiency in data science.
The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute the techniques. You’ll understand the code by working through the examples. The code has been broken down into small chunks (a few lines or a function at a time) to enable thorough discussion.
As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new data science developments.
By the end of the book, you should be able to comfortably use Python for basic data science projects and should have the skills to execute the data science process on any data source.
What you will learn
- Use Python data science packages effectively
- Clean and prepare data for data science work, including feature engineering and feature selection
- Build data models, including classic statistical models (such as t-tests) and essential machine learning algorithms, such as random forests and boosted models
- Evaluate model performance
- Compare and understand different machine learning methods
- Interact with Excel spreadsheets through Python
- Create automated data science reports through Python
- Get to grips with text analytics techniques
Who this book is for
The book is intended for beginners, including students starting or about to start a data science, analytics, or related program (e.g. Bachelor’s, Master’s, bootcamp, online courses), recent college graduates who want to learn new skills to set them apart in the job market, professionals who want to learn hands-on data science techniques in Python, and those who want to shift their career to data science.
The book requires basic familiarity with Python. A “getting started with Python” section has been included to get complete novices up to speed.
Table of Contents
- Introduction to Data Science
- Getting Started with Python
- SQL and Built-in File Handling Modules in Python
- Loading and Wrangling Data with Pandas and NumPy
- Exploratory Data Analysis and Visualization
- Data Wrangling Documents and Spreadsheets
- Web Scraping
- Probability, Distributions, and Sampling
- Statistical Testing for Data Science
- Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
- Machine Learning for Classification
- Evaluating Machine Learning Classification Models and Sampling for Classification
- Machine Learning with Regression
- Preface: Who this book is for; What this book covers; To get the most out of this book; Get in touch
Part I - An Introduction and the Basics
- Introduction to Data Science: The data science origin story; The top data science tools and skills; Python; Other programming languages; GUIs and platforms; Cloud tools; Statistical methods and math; Collecting, organizing, and preparing data; Software development; Business understanding and communication; Specializations in and around data science; Machine learning; Business intelligence; Deep learning; Data engineering; Big data; Statistical methods; Natural Language Processing (NLP); Artificial Intelligence (AI); Choosing how to specialize; Data science project methodologies; Using data science in other fields; CRISP-DM; TDSP; Further reading on data science project management strategies; Other tools; Test your knowledge; Summary
- Getting Started with Python: Installing Python with Anaconda and getting started; Installing Anaconda; Running Python code; The Python shell; The IPython shell; Jupyter; Why the command line?; Command line basics; Installing and using a code text editor – VS Code; Editing Python code with VS Code; Running a Python file; Installing Python packages and creating virtual environments; Python basics; Numbers; Strings; Variables; Lists, tuples, sets, and dictionaries; Lists; Tuples; Sets; Dictionaries; Loops and comprehensions; Booleans and conditionals; Packages and modules; Functions; Classes; Multithreading and multiprocessing; Software engineering best practices; Debugging errors and utilizing documentation; Debugging; Documentation; Version control with Git; Code style; Productivity tips; Test your knowledge; Summary
Part II - Dealing with Data
- SQL and Built-in File Handling Modules in Python: Introduction; Loading, reading, and writing files with base Python; Opening a file and reading its contents; Using the built-in JSON module; Saving credentials or data in a Python file; Saving Python objects with pickle; Using SQLite and SQL; Creating a SQLite database and storing data; Using the SQLAlchemy package in Python; Test your knowledge; Summary
- Loading and Wrangling Data with Pandas and NumPy: Data wrangling and analyzing iTunes data; Loading and saving data with Pandas; Understanding the DataFrame structure and combining/concatenating multiple DataFrames; Exploratory Data Analysis (EDA) and basic data cleaning with Pandas; Examining the top and bottom of the data; Examining the data's dimensions, datatypes, and missing values; Investigating statistical properties of the data; Plotting with DataFrames; Cleaning data; Filtering DataFrames; Removing irrelevant data; Dealing with missing values; Dealing with outliers; Dealing with duplicate values; Ensuring datatypes are correct; Standardizing data formats; Data transformations; Using replace, map, and apply to clean and transform data; Using GroupBy; Writing DataFrames to disk; Wrangling and analyzing Bitcoin price data; Understanding NumPy basics; Using NumPy mathematical functions; Test your knowledge; Summary
- Exploratory Data Analysis and Visualization: EDA and visualization libraries in Python; Performing EDA with Seaborn and pandas; Making boxplots and letter-value plots; Making histograms and violin plots; Making scatter plots with Matplotlib and Seaborn; Examining correlations and making correlograms; Making missing value plots; Using EDA Python packages; Using visualization best practices; Saving plots for sharing and reports; Making plots with Plotly; Test your knowledge; Summary
- Data Wrangling Documents and Spreadsheets: Parsing and processing Word and PDF documents; Reading text from Word documents; Extracting insights from Word documents: common words and phrases; Analyzing words and phrases from the text; Reading text from PDFs; Reading and writing data with Excel files; Using pandas for wrangling Excel files; Analyzing the data; Using openpyxl for wrangling Excel files; Test your knowledge; Summary
- Web Scraping: Understanding the structure of the internet; GET and POST requests, and HTML; Performing simple web scraping; Using urllib; Using the requests package; Scraping several files; Extracting the data from the scraped files; Parsing HTML from scraped pages; Using XPath, lxml, and bs4 to extract data from webpages; Collecting data from several pages; Using APIs to collect data; Using API wrappers; The ethics and legality of web scraping; Test your knowledge; Summary
Part III - Statistics for Data Science
- Probability, Distributions, and Sampling: Probability basics; Independent and conditional probabilities; Bayes' Theorem; Frequentist versus Bayesian; Distributions; The normal distribution and using scipy to generate distributions; Descriptive statistics of distributions; Variants of the normal distribution; Fitting distributions to data to get parameters; The Student's t-distribution; The Bernoulli distribution; The binomial distribution; The uniform distribution; The exponential and Poisson distributions; The Weibull distribution; The Zipfian distribution; Sampling from data; The law of large numbers; The central limit theorem; Random sampling; Bootstrap sampling and confidence intervals; Test your knowledge; Summary
- Statistical Testing for Data Science: Statistical testing basics and sample comparison tests; The t-test and z-test; One-sample, two-sided t-test; The z-test; One-sided tests; Two-sample t- and z-tests: A/B testing; Paired t- and z-tests; Other A/B testing methods; Testing between several groups with ANOVA; Post-hoc ANOVA tests; Assumptions for these methods; Other statistical tests; Testing if data belongs to a distribution; Generalized ESD outlier test; The Pearson correlation test; Test your knowledge; Summary
Part IV - Machine Learning
- Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction: Types of machine learning; Feature selection; The curse of dimensionality; Overfitting and underfitting, and the bias-variance trade-off; Methods for feature selection; Variance thresholding – removing features with too much and too little variance; Univariate statistics feature selection; Correlation; Mutual information score and chi-squared; The chi-squared test; ANOVA; Using the univariate statistics for feature selection; Feature engineering; Data cleaning and preparation; Converting strings to dates; Outlier cleaning strategies; Combining multiple columns; Transforming numeric data; Standardization; Making data more Gaussian with the Yeo-Johnson transform; Extracting datetime features; Binning; One-hot encoding and label encoding; Simplifying categorical columns; One-hot encoding; Dimensionality reduction; Principle Component Analysis (PCA); Test your knowledge; Summary
- Machine Learning for Classification: Machine learning classification algorithms; Logistic regression for binary classification; Getting predictions from our model; How logistic regression works; Odds ratio and the logit; Examining feature importances with sklearn; Using statmodels for logistic regression; Maximum likelihood estimation, optimizers, and the logistic regression algorithm; Regularization; Hyperparameters and cross-validation; Logistic regression (and other models) with big data; Naïve Bayes for binary classification; k-nearest neighbors (KNN); Multiclass classification; Logistic regression; One-versus-rest and one-versus-one formulations; Multi-label classification; Choosing a model to use; The "no free lunch" theorem; Computational complexity of models; Test your knowledge; Summary
- Evaluating Machine Learning Classification Models and Sampling for Classification: Evaluating classification algorithm performance with metrics; Train-validation-test splits; Accuracy; Cohen's Kappa; Confusion matrix; Precision, recall, and F1 score; AUC score and the ROC curve; Choosing the optimal cutoff threshold; Sampling and balancing classification data; Downsampling; Oversampling; SMOTE and other synthetic sampling methods; Test your knowledge; Summary
- Machine Learning with Regression: Linear regression; Linear regression with sklearn; Linear regression with statsmodels; Regularized linear regression; Regression with KNN in sklearn; Evaluating regression models; R2 or the coefficient of determination; Adjusted R2; Information criteria; Mean squared error; Mean absolute error; Linear regression assumptions; Regression models on big data; Forecasting; Test your knowledge; Summary
- Optimizing Models and Using AutoML: Hyperparameter optimization with search methods; Using grid search; Using random search; Using Bayesian search; Other advanced search methods; Using learning curves; Optimizing the number of features with ML models; Using AutoML with PyCaret; The no free lunch theorem; AutoML solutions; Using PyCaret; Test your knowledge; Summary
- Tree-Based Machine Learning Models: Decision trees; Random forests; Random forests with sklearn; Random forests with H2O; Feature importance from tree-based methods; Using H2O for feature importance; Using sklearn random forest feature importances; Boosted trees: AdaBoost, XGboost, LightGBM, and CatBoost; AdaBoost; XGBoost; XGBoost with PyCaret; XGBoost with the xgboost package; Training boosted models on a GPU; LightGBM; LightGBM plotting; Using LightGBM directly; CatBoost; Using CatBoost natively; Using early stopping with boosting algorithms; Test your knowledge; Summary
- Support Vector Machine (SVM) Machine Learning Models: How SVMs work; SVMs for classification; SVMs for regression; Using SVMs; Using SVMs in sklearn; Tuning SVMs with pycaret; Test your knowledge; Summary
Part V - Text Analysis and Reporting
- Clustering with Machine Learning: Using k-means clustering; Clustering metrics; Optimizing k in k-means; Examining the clusters; Hierarchical clustering; DBSCAN; Other unsupervised methods; Test your knowledge; Summary
- Working with Text: Text preprocessing; Basic text cleaning; Stemming and Lemmatizing; Preparing text with spaCy; Word vectors; TFIDF vectors; Basic text analysis; Word frequency plots; Wordclouds; Zipf's law; Word collocations; Parts of speech; Unsupervised learning; Topic modeling; Topic modeling with pycaret; Topic modeling with Top2Vec; Supervised learning; Classification; Sentiment analysis; Test your knowledge; Summary
Part VI - Wrapping Up
- Data Storytelling and Automated Reporting/Dashboarding: Data storytelling; Data storytelling example; Automated reporting and dashboarding; Automated reporting options; Automated dashboarding; Scheduling tasks to run automatically; Test your knowledge; Summary
- Ethics and Privacy: The ethics of machine learning algorithms; Bias; How to decrease ML biases; Carefully evaluating performance and consequences; Data privacy; Data privacy regulations and laws; k-anonymity, l-diversity, and t-closeness; Differential privacy; Using data science for the common good; Other ethical considerations; Test your knowledge; Summary
- Staying Up to Date and the Future of Data Science: Blogs, newsletters, books, and academic sources; Blogs; Newsletters; Books; Academic sources; Data science competition websites; Online learning platforms; Cloud services; Other places to keep an eye on; Strategies for staying up to date; Other data science topics we didn't cover; The future of data science; Summary
- Other Books You May Enjoy
- Index
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Practical Data Science with Python: Learn tools and techniques from hands-on examples to extract insights from data. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.
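The repository search can also be reached through a direct URL. The sketch below is not part of the original instructions. It assumes GitHub's general repository search with the `org:` qualifier rather than the Find a repository… box, and the helper name `packt_repo_search_url` is hypothetical. It builds one URL for the full title and one for the main title only, as a fallback when the full title returns nothing.

```python
from urllib.parse import urlencode


def packt_repo_search_url(title: str, main_title_only: bool = False) -> str:
    """Build a GitHub repository-search URL limited to the PacktPublishing org.

    Hypothetical helper: assumes GitHub's general search page and its
    `org:` qualifier, not the organization page's own search box.
    """
    if main_title_only:
        # Keep only the text before the first colon (the main title).
        title = title.split(":", 1)[0]
    query = f"org:PacktPublishing {title.strip()}"
    # urlencode percent-encodes the colon and turns spaces into '+'.
    return "https://github.com/search?" + urlencode(
        {"q": query, "type": "repositories"}
    )


full_title = (
    "Practical Data Science with Python: Learn tools and techniques "
    "from hands-on examples to extract insights from data"
)

print(packt_repo_search_url(full_title))
print(packt_repo_search_url(full_title, main_title_only=True))
```

Opening the second URL in a browser corresponds to searching for the main title only.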