Data Science Bookcamp: Five real-world Python projects
- Length: 704 pages
- Edition: 1
- Language: English
- Publisher: Manning
- Publication Date: 2021-11-23
- ISBN-10: 1617296252
- ISBN-13: 9781617296253
- Sales Rank: #4795259
Learn data science with Python by building five real-world projects! Experiment with card game predictions, disease outbreak tracking, and more, as you build a flexible and intuitive understanding of data science.
In Data Science Bookcamp you will learn:
- Techniques for computing and plotting probabilities (see the sketch after this list)
- Statistical analysis using SciPy
- How to organize datasets with clustering algorithms
- How to visualize complex multi-variable datasets
- How to train a decision tree machine learning algorithm
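To give a flavor of the first two items, here is a minimal sketch (not taken from the book) of computing and plotting coin-flip probabilities with SciPy's binomial distribution and Matplotlib; the flip count of 20 is an arbitrary example:

```python
# A minimal sketch: probability of seeing k heads in 20 fair coin flips,
# computed with SciPy's binomial distribution and plotted with Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

num_flips = 20
head_counts = np.arange(num_flips + 1)              # possible head counts: 0..20
probabilities = binom.pmf(head_counts, num_flips, 0.5)  # P(k heads | fair coin)

plt.bar(head_counts, probabilities)
plt.xlabel('Number of heads')
plt.ylabel('Probability')
plt.title(f'Head-count probabilities across {num_flips} fair coin flips')
plt.show()
```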
In Data Science Bookcamp you’ll test and build your knowledge of Python with the kind of open-ended problems that professional data scientists work on every day. Downloadable datasets and thoroughly explained solutions help you lock in what you’ve learned, building your confidence and preparing you for an exciting new data science career.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
A data science project has a lot of moving parts, and it takes practice and skill to get all the code, algorithms, datasets, formats, and visualizations working together harmoniously. This unique book guides you through five realistic projects, including tracking disease outbreaks from news headlines, analyzing social networks, and finding relevant patterns in ad click data.
About the book
Data Science Bookcamp doesn’t stop with surface-level theory and toy examples. As you work through each project, you’ll learn how to troubleshoot common problems like missing data, messy data, and algorithms that don’t quite fit the model you’re building. You’ll appreciate the detailed setup instructions and the fully explained solutions that highlight common failure points. In the end, you’ll be confident in your skills because you can see the results.
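Missing values are often the first such obstacle. A minimal illustration of spotting and filling gaps with Pandas (the `ad_id` and `clicks` columns here are hypothetical examples, not the book's datasets):

```python
# A minimal illustration of handling missing data with Pandas.
# The 'ad_id' and 'clicks' columns and their values are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ad_id': [1, 2, 3, 4],
                   'clicks': [10, np.nan, 7, np.nan]})

print(df['clicks'].isna().sum())                          # count the gaps: 2
df['clicks'] = df['clicks'].fillna(df['clicks'].mean())   # impute with the mean
print(df)
```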
What’s inside
- Web scraping
- Organizing datasets with clustering algorithms
- Visualizing complex multi-variable datasets
- Training a decision tree machine learning algorithm (see the sketch after this list)
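As a taste of the last item, a minimal sketch of training a decision tree with scikit-learn; the bundled iris dataset stands in for the book's case-study data:

```python
# A minimal sketch of training a decision tree classifier with scikit-learn.
# The iris dataset is a stand-in, not one of the book's case-study datasets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)  # learns nested if/else split rules
clf.fit(X_train, y_train)
print(f'Test accuracy: {clf.score(X_test, y_test):.2f}')
```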
About the reader
For readers who know the basics of Python. No prior data science or machine learning skills required.
About the author
Leonard Apeltsin is the Head of Data Science at Anomaly, where his team applies advanced analytics to uncover healthcare fraud, waste, and abuse.
Table of contents
- brief contents
- contents
- preface
- acknowledgments
- about this book
  - Who should read this book
  - How this book is organized
  - About the code
- about the author
- about the cover illustration

Case study 1—Finding the winning strategy in a card game
- Section 1—Computing probabilities using Python
  - 1.1 Sample space analysis: An equation-free approach for measuring uncertainty in outcomes
    - 1.1.1 Analyzing a biased coin
  - 1.2 Computing nontrivial probabilities
    - 1.2.1 Problem 1: Analyzing a family with four children
    - 1.2.2 Problem 2: Analyzing multiple die rolls
    - 1.2.3 Problem 3: Computing die-roll probabilities using weighted sample spaces
  - 1.3 Computing probabilities over interval ranges
    - 1.3.1 Evaluating extremes using interval analysis
  - Summary
- Section 2—Plotting probabilities using Matplotlib
  - 2.1 Basic Matplotlib plots
  - 2.2 Plotting coin-flip probabilities
    - 2.2.1 Comparing multiple coin-flip probability distributions
  - Summary
- Section 3—Running random simulations in NumPy
  - 3.1 Simulating random coin flips and die rolls using NumPy
    - 3.1.1 Analyzing biased coin flips
  - 3.2 Computing confidence intervals using histograms and NumPy arrays
    - 3.2.1 Binning similar points in histogram plots
    - 3.2.2 Deriving probabilities from histograms
    - 3.2.3 Shrinking the range of a high confidence interval
    - 3.2.4 Computing histograms in NumPy
  - 3.3 Using confidence intervals to analyze a biased deck of cards
  - 3.4 Using permutations to shuffle cards
  - Summary
- Section 4—Case study 1 solution
  - 4.1 Predicting red cards in a shuffled deck
    - 4.1.1 Estimating the probability of strategy success
  - 4.2 Optimizing strategies using the sample space for a 10-card deck
  - Summary

Case study 2—Assessing online ad clicks for significance
- Section 5—Basic probability and statistical analysis using SciPy
  - 5.1 Exploring the relationships between data and probability using SciPy
  - 5.2 Mean as a measure of centrality
    - 5.2.1 Finding the mean of a probability distribution
  - 5.3 Variance as a measure of dispersion
    - 5.3.1 Finding the variance of a probability distribution
  - Summary
- Section 6—Making predictions using the central limit theorem and SciPy
  - 6.1 Manipulating the normal distribution using SciPy
    - 6.1.1 Comparing two sampled normal curves
  - 6.2 Determining the mean and variance of a population through random sampling
  - 6.3 Making predictions using the mean and variance
    - 6.3.1 Computing the area beneath a normal curve
    - 6.3.2 Interpreting the computed probability
  - Summary
- Section 7—Statistical hypothesis testing
  - 7.1 Assessing the divergence between sample mean and population mean
  - 7.2 Data dredging: Coming to false conclusions through oversampling
  - 7.3 Bootstrapping with replacement: Testing a hypothesis when the population variance is unknown
  - 7.4 Permutation testing: Comparing means of samples when the population parameters are unknown
  - Summary
- Section 8—Analyzing tables using Pandas
  - 8.1 Storing tables using basic Python
  - 8.2 Exploring tables using Pandas
  - 8.3 Retrieving table columns
  - 8.4 Retrieving table rows
  - 8.5 Modifying table rows and columns
  - 8.6 Saving and loading table data
  - 8.7 Visualizing tables using Seaborn
  - Summary
- Section 9—Case study 2 solution
  - 9.1 Processing the ad-click table in Pandas
  - 9.2 Computing p-values from differences in means
  - 9.3 Determining statistical significance
  - 9.4 41 shades of blue: A real-life cautionary tale
  - Summary

Case study 3—Tracking disease outbreaks using news headlines
- Section 10—Clustering data into groups
  - 10.1 Using centrality to discover clusters
  - 10.2 K-means: A clustering algorithm for grouping data into K central groups
    - 10.2.1 K-means clustering using scikit-learn
    - 10.2.2 Selecting the optimal K using the elbow method
  - 10.3 Using density to discover clusters
  - 10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density
    - 10.4.1 Comparing DBSCAN and K-means
    - 10.4.2 Clustering based on non-Euclidean distance
  - 10.5 Analyzing clusters using Pandas
  - Summary
- Section 11—Geographic location visualization and analysis
  - 11.1 The great-circle distance: A metric for computing the distance between two global points
  - 11.2 Plotting maps using Cartopy
    - 11.2.1 Manually installing GEOS and Cartopy
    - 11.2.2 Utilizing the Conda package manager
    - 11.2.3 Visualizing maps
  - 11.3 Location tracking using GeoNamesCache
    - 11.3.1 Accessing country information
    - 11.3.2 Accessing city information
    - 11.3.3 Limitations of the GeoNamesCache library
  - 11.4 Matching location names in text
  - Summary
- Section 12—Case study 3 solution
  - 12.1 Extracting locations from headline data
  - 12.2 Visualizing and clustering the extracted location data
  - 12.3 Extracting insights from location clusters
  - Summary

Case study 4—Using online job postings to improve your data science resume
- Section 13—Measuring text similarities
  - 13.1 Simple text comparison
    - 13.1.1 Exploring the Jaccard similarity
    - 13.1.2 Replacing words with numeric values
  - 13.2 Vectorizing texts using word counts
    - 13.2.1 Using normalization to improve TF vector similarity
    - 13.2.2 Using unit vector dot products to convert between relevance metrics
  - 13.3 Matrix multiplication for efficient similarity calculation
    - 13.3.1 Basic matrix operations
    - 13.3.2 Computing all-by-all matrix similarities
  - 13.4 Computational limits of matrix multiplication
  - Summary
- Section 14—Dimension reduction of matrix data
  - 14.1 Clustering 2D data in one dimension
    - 14.1.1 Reducing dimensions using rotation
  - 14.2 Dimension reduction using PCA and scikit-learn
  - 14.3 Clustering 4D data in two dimensions
    - 14.3.1 Limitations of PCA
  - 14.4 Computing principal components without rotation
    - 14.4.1 Extracting eigenvectors using power iteration
  - 14.5 Efficient dimension reduction using SVD and scikit-learn
  - Summary
- Section 15—NLP analysis of large text datasets
  - 15.1 Loading online forum discussions using scikit-learn
  - 15.2 Vectorizing documents using scikit-learn
  - 15.3 Ranking words by both post frequency and count
    - 15.3.1 Computing TFIDF vectors with scikit-learn
  - 15.4 Computing similarities across large document datasets
  - 15.5 Clustering texts by topic
    - 15.5.1 Exploring a single text cluster
  - 15.6 Visualizing text clusters
    - 15.6.1 Using subplots to display multiple word clouds
  - Summary
- Section 16—Extracting text from web pages
  - 16.1 The structure of HTML documents
  - 16.2 Parsing HTML using Beautiful Soup
  - 16.3 Downloading and parsing online data
  - Summary
- Section 17—Case study 4 solution
  - 17.1 Extracting skill requirements from job posting data
    - 17.1.1 Exploring the HTML for skill descriptions
  - 17.2 Filtering jobs by relevance
  - 17.3 Clustering skills in relevant job postings
    - 17.3.1 Grouping the job skills into 15 clusters
    - 17.3.2 Investigating the technical skill clusters
    - 17.3.3 Investigating the soft-skill clusters
    - 17.3.4 Exploring clusters at alternative values of K
    - 17.3.5 Analyzing the 700 most relevant postings
  - 17.4 Conclusion
  - Summary

Case study 5—Predicting future friendships from social network data
- Section 18—An introduction to graph theory and network analysis
  - 18.1 Using basic graph theory to rank websites by popularity
    - 18.1.1 Analyzing web networks using NetworkX
  - 18.2 Utilizing undirected graphs to optimize the travel time between towns
    - 18.2.1 Modeling a complex network of towns and counties
    - 18.2.2 Computing the fastest travel time between nodes
  - Summary
- Section 19—Dynamic graph theory techniques for node ranking and social network analysis
  - 19.1 Uncovering central nodes based on expected traffic in a network
    - 19.1.1 Measuring centrality using traffic simulations
  - 19.2 Computing travel probabilities using matrix multiplication
    - 19.2.1 Deriving PageRank centrality from probability theory
    - 19.2.2 Computing PageRank centrality using NetworkX
  - 19.3 Community detection using Markov clustering
  - 19.4 Uncovering friend groups in social networks
  - Summary
- Section 20—Network-driven supervised machine learning
  - 20.1 The basics of supervised machine learning
  - 20.2 Measuring predicted label accuracy
    - 20.2.1 Scikit-learn’s prediction measurement functions
  - 20.3 Optimizing KNN performance
  - 20.4 Running a grid search using scikit-learn
  - 20.5 Limitations of the KNN algorithm
  - Summary
- Section 21—Training linear classifiers with logistic regression
  - 21.1 Linearly separating customers by size
  - 21.2 Training a linear classifier
    - 21.2.1 Improving perceptron performance through standardization
  - 21.3 Improving linear classification with logistic regression
    - 21.3.1 Running logistic regression on more than two features
  - 21.4 Training linear classifiers using scikit-learn
    - 21.4.1 Training multiclass linear models
  - 21.5 Measuring feature importance with coefficients
  - 21.6 Linear classifier limitations
  - Summary
- Section 22—Training nonlinear classifiers with decision tree techniques
  - 22.1 Automated learning of logical rules
    - 22.1.1 Training a nested if/else model using two features
    - 22.1.2 Deciding which feature to split on
    - 22.1.3 Training if/else models with more than two features
  - 22.2 Training decision tree classifiers using scikit-learn
    - 22.2.1 Studying cancerous cells using feature importance
  - 22.3 Decision tree classifier limitations
  - 22.4 Improving performance using random forest classification
  - 22.5 Training random forest classifiers using scikit-learn
  - Summary
- Section 23—Case study 5 solution
  - 23.1 Exploring the data
    - 23.1.1 Examining the profiles
    - 23.1.2 Exploring the experimental observations
    - 23.1.3 Exploring the Friendships linkage table
  - 23.2 Training a predictive model using network features
  - 23.3 Adding profile features to the model
  - 23.4 Optimizing performance across a steady set of features
  - 23.5 Interpreting the trained model
    - 23.5.1 Why are generalizable models so important?
  - Summary

index