Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, 3rd Edition
- Length: 579 pages
- Edition: 3
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-09-27
- ISBN-10: 109810403X
- ISBN-13: 9781098104030
- Sales Rank: #86558
Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this hands-on guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You’ll learn the latest versions of pandas, NumPy, and Jupyter in the process.
Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It’s ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.
- Use the Jupyter notebook and IPython shell for exploratory computing
- Learn basic and advanced features in NumPy
- Get started with data analysis tools in the pandas library
- Use flexible tools to load, clean, transform, merge, and reshape data
- Create informative visualizations with matplotlib
- Apply the pandas groupby facility to slice, dice, and summarize datasets
- Analyze and manipulate regular and irregular time series data
- Learn how to solve real-world data analysis problems with thorough, detailed examples
Preface
1. Conventions Used in This Book
2. Using Code Examples
3. O’Reilly Online Learning
4. How to Contact Us
5. Acknowledgments
  In Memoriam: John D. Hunter (1968–2012); Acknowledgments for the Third Edition (2022); Acknowledgments for the Second Edition (2017); Acknowledgments for the First Edition (2012)

1. Preliminaries
1.1. What Is This Book About?
  What Kinds of Data?
1.2. Why Python for Data Analysis?
  Python as Glue; Solving the “Two-Language” Problem; Why Not Python?
1.3. Essential Python Libraries
  NumPy; pandas; matplotlib; IPython and Jupyter; SciPy; scikit-learn; statsmodels; Other Packages
1.4. Installation and Setup
  Miniconda on Windows; GNU/Linux; Miniconda on macOS; Installing Necessary Packages; Integrated Development Environments and Text Editors
1.5. Community and Conferences
1.6. Navigating This Book
  Code Examples; Data for Examples; Import Conventions

2. Python Language Basics, IPython, and Jupyter Notebooks
2.1. The Python Interpreter
2.2. IPython Basics
  Running the IPython Shell; Running the Jupyter Notebook; Tab Completion; Introspection
2.3. Python Language Basics
  Language Semantics; Indentation, not braces; Everything is an object; Comments; Function and object method calls; Variables and argument passing; Dynamic references, strong types; Attributes and methods; Duck typing; Imports; Binary operators and comparisons; Mutable and immutable objects; Scalar Types; Numeric types; Strings; Bytes and Unicode; Booleans; Type casting; None; Dates and times; Control Flow; if, elif, and else; for loops; while loops; pass; range
2.4. Conclusion

3. Built-In Data Structures, Functions, and Files
3.1. Data Structures and Sequences
  Tuple; Unpacking tuples; Tuple methods; List; Adding and removing elements; Concatenating and combining lists; Sorting; Slicing; Dictionary; Creating dictionaries from sequences; Default values; Valid dictionary key types; Set; Built-In Sequence Functions; enumerate; sorted; zip; reversed; List, Set, and Dictionary Comprehensions; Nested list comprehensions
3.2. Functions
  Namespaces, Scope, and Local Functions; Returning Multiple Values; Functions Are Objects; Anonymous (Lambda) Functions; Generators; Generator expressions; itertools module; Errors and Exception Handling; Exceptions in IPython
3.3. Files and the Operating System
  Bytes and Unicode with Files
3.4. Conclusion

4. NumPy Basics: Arrays and Vectorized Computation
4.1. The NumPy ndarray: A Multidimensional Array Object
  Creating ndarrays; Data Types for ndarrays; Arithmetic with NumPy Arrays; Basic Indexing and Slicing; Indexing with slices; Boolean Indexing; Fancy Indexing; Transposing Arrays and Swapping Axes
4.2. Pseudorandom Number Generation
4.3. Universal Functions: Fast Element-Wise Array Functions
4.4. Array-Oriented Programming with Arrays
  Expressing Conditional Logic as Array Operations; Mathematical and Statistical Methods; Methods for Boolean Arrays; Sorting; Unique and Other Set Logic
4.5. File Input and Output with Arrays
4.6. Linear Algebra
4.7. Example: Random Walks
  Simulating Many Random Walks at Once
4.8. Conclusion

5. Getting Started with pandas
5.1. Introduction to pandas Data Structures
  Series; DataFrame; Index Objects
5.2. Essential Functionality
  Reindexing; Dropping Entries from an Axis; Indexing, Selection, and Filtering; Selection on DataFrame with loc and iloc; Integer indexing pitfalls; Pitfalls with chained indexing; Arithmetic and Data Alignment; Arithmetic methods with fill values; Operations between DataFrame and Series; Function Application and Mapping; Sorting and Ranking; Axis Indexes with Duplicate Labels
5.3. Summarizing and Computing Descriptive Statistics
  Correlation and Covariance; Unique Values, Value Counts, and Membership
5.4. Conclusion

6. Data Loading, Storage, and File Formats
6.1. Reading and Writing Data in Text Format
  Reading Text Files in Pieces; Writing Data to Text Format; Working with Other Delimited Formats; JSON Data; XML and HTML: Web Scraping; Parsing XML with lxml.objectify
6.2. Binary Data Formats
  Reading Microsoft Excel Files; Using HDF5 Format
6.3. Interacting with Web APIs
6.4. Interacting with Databases
6.5. Conclusion

7. Data Cleaning and Preparation
7.1. Handling Missing Data
  Filtering Out Missing Data; Filling In Missing Data
7.2. Data Transformation
  Removing Duplicates; Transforming Data Using a Function or Mapping; Replacing Values; Renaming Axis Indexes; Discretization and Binning; Detecting and Filtering Outliers; Permutation and Random Sampling; Computing Indicator/Dummy Variables
7.3. Extension Data Types
7.4. String Manipulation
  Python Built-In String Object Methods; Regular Expressions; String Functions in pandas
7.5. Categorical Data
  Background and Motivation; Categorical Extension Type in pandas; Computations with Categoricals; Better performance with categoricals; Categorical Methods; Creating dummy variables for modeling
7.6. Conclusion

8. Data Wrangling: Join, Combine, and Reshape
8.1. Hierarchical Indexing
  Reordering and Sorting Levels; Summary Statistics by Level; Indexing with a DataFrame’s columns
8.2. Combining and Merging Datasets
  Database-Style DataFrame Joins; Merging on Index; Concatenating Along an Axis; Combining Data with Overlap
8.3. Reshaping and Pivoting
  Reshaping with Hierarchical Indexing; Pivoting “Long” to “Wide” Format; Pivoting “Wide” to “Long” Format
8.4. Conclusion

9. Plotting and Visualization
9.1. A Brief matplotlib API Primer
  Figures and Subplots; Adjusting the spacing around subplots; Colors, Markers, and Line Styles; Ticks, Labels, and Legends; Setting the title, axis labels, ticks, and tick labels; Adding legends; Annotations and Drawing on a Subplot; Saving Plots to File; matplotlib Configuration
9.2. Plotting with pandas and seaborn
  Line Plots; Bar Plots; Histograms and Density Plots; Scatter or Point Plots; Facet Grids and Categorical Data
9.3. Other Python Visualization Tools
9.4. Conclusion

10. Data Aggregation and Group Operations
10.1. How to Think About Group Operations
  Iterating over Groups; Selecting a Column or Subset of Columns; Grouping with Dictionaries and Series; Grouping with Functions; Grouping by Index Levels
10.2. Data Aggregation
  Column-Wise and Multiple Function Application; Returning Aggregated Data Without Row Indexes
10.3. Apply: General split-apply-combine
  Suppressing the Group Keys; Quantile and Bucket Analysis; Example: Filling Missing Values with Group-Specific Values; Example: Random Sampling and Permutation; Example: Group Weighted Average and Correlation; Example: Group-Wise Linear Regression
10.4. Group Transforms and “Unwrapped” GroupBys
10.5. Pivot Tables and Cross-Tabulation
  Cross-Tabulations: Crosstab
10.6. Conclusion

11. Time Series
11.1. Date and Time Data Types and Tools
  Converting Between String and Datetime
11.2. Time Series Basics
  Indexing, Selection, Subsetting; Time Series with Duplicate Indices
11.3. Date Ranges, Frequencies, and Shifting
  Generating Date Ranges; Frequencies and Date Offsets; Week of month dates; Shifting (Leading and Lagging) Data; Shifting dates with offsets
11.4. Time Zone Handling
  Time Zone Localization and Conversion; Operations with Time Zone-Aware Timestamp Objects; Operations Between Different Time Zones
11.5. Periods and Period Arithmetic
  Period Frequency Conversion; Quarterly Period Frequencies; Converting Timestamps to Periods (and Back); Creating a PeriodIndex from Arrays
11.6. Resampling and Frequency Conversion
  Downsampling; Open-high-low-close (OHLC) resampling; Upsampling and Interpolation; Resampling with Periods; Grouped Time Resampling
11.7. Moving Window Functions
  Exponentially Weighted Functions; Binary Moving Window Functions; User-Defined Moving Window Functions
11.8. Conclusion

12. Introduction to Modeling Libraries in Python
12.1. Interfacing Between pandas and Model Code
12.2. Creating Model Descriptions with Patsy
  Data Transformations in Patsy Formulas; Categorical Data and Patsy
12.3. Introduction to statsmodels
  Estimating Linear Models; Estimating Time Series Processes
12.4. Introduction to scikit-learn
12.5. Conclusion

13. Data Analysis Examples
13.1. Bitly Data from 1.USA.gov
  Counting Time Zones in Pure Python; Counting Time Zones with pandas
13.2. MovieLens 1M Dataset
  Measuring Rating Disagreement
13.3. US Baby Names 1880–2010
  Analyzing Naming Trends; Measuring the increase in naming diversity; The “last letter” revolution; Boy names that became girl names (and vice versa)
13.4. USDA Food Database
13.5. 2012 Federal Election Commission Database
  Donation Statistics by Occupation and Employer; Bucketing Donation Amounts; Donation Statistics by State
13.6. Conclusion

A. Advanced NumPy
A.1. ndarray Object Internals
  NumPy Data Type Hierarchy
A.2. Advanced Array Manipulation
  Reshaping Arrays; C Versus FORTRAN Order; Concatenating and Splitting Arrays; Stacking helpers: r_ and c_; Repeating Elements: tile and repeat; Fancy Indexing Equivalents: take and put
A.3. Broadcasting
  Broadcasting over Other Axes; Setting Array Values by Broadcasting
A.4. Advanced ufunc Usage
  ufunc Instance Methods; Writing New ufuncs in Python
A.5. Structured and Record Arrays
  Nested Data Types and Multidimensional Fields; Why Use Structured Arrays?
A.6. More About Sorting
  Indirect Sorts: argsort and lexsort; Alternative Sort Algorithms; Partially Sorting Arrays; numpy.searchsorted: Finding Elements in a Sorted Array
A.7. Writing Fast NumPy Functions with Numba
  Creating Custom numpy.ufunc Objects with Numba
A.8. Advanced Array Input and Output
  Memory-Mapped Files; HDF5 and Other Array Storage Options
A.9. Performance Tips
  The Importance of Contiguous Memory

B. More on the IPython System
B.1. Terminal Keyboard Shortcuts
B.2. About Magic Commands
  The %run Command; Interrupting running code; Executing Code from the Clipboard
B.3. Using the Command History
  Searching and Reusing the Command History; Input and Output Variables
B.4. Interacting with the Operating System
  Shell Commands and Aliases; Directory Bookmark System
B.5. Software Development Tools
  Interactive Debugger; Other ways to use the debugger; Timing Code: %time and %timeit; Basic Profiling: %prun and %run -p; Profiling a Function Line by Line
B.6. Tips for Productive Code Development Using IPython
  Reloading Module Dependencies; Code Design Tips; Keep relevant objects and data alive; Flat is better than nested; Overcome a fear of longer files
B.7. Advanced IPython Features
  Profiles and Configuration
B.8. Conclusion

Index
How to download source code?
1. Go to: https://www.oreilly.com/
2. Search for the book title: Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, 3rd Edition. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.