Practical Python Data Wrangling and Data Quality
- Length: 500 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2022-01-18
- ISBN-10: 1492091502
- ISBN-13: 9781492091509
There are awesome discoveries to be made and valuable stories to be told in datasets–and this book will help you uncover them. Whether you already work with data or just want to understand its possibilities, the techniques and advice in this practical book will help you learn how to better clean, evaluate, and analyze data to generate meaningful insights and compelling visualizations.
Through foundational concepts and worked examples, author Susan McGregor provides the tools you need to evaluate and analyze all kinds of data and communicate your findings effectively. This book provides a methodical, jargon-free way for practitioners of all levels to harness the power of data.
- Use Python 3.8+ to read, write, and transform data from a variety of sources
- Understand and use programming basics in Python to wrangle data at scale
- Organize, document, and structure your code using best practices
- Complete exercises either on your own machine or on the web
- Collect data from structured data files, web pages, and APIs
- Perform basic statistical analysis to make meaning from data sets
- Visualize and present data in clear and compelling ways
Preface Who Should Read This Book? Who Shouldn’t Read This Book? What to Expect from This Volume Conventions Used in This Book Using Code Examples O’Reilly Online Learning How to Contact Us Acknowledgments
1. Introduction to Data Wrangling and Data Quality What Is “Data Wrangling”? What Is “Data Quality”? Data Integrity Data “Fit” Why Python? Versatility Accessibility Readability Community Python Alternatives Writing and “Running” Python Working with Python on Your Own Device Getting Started with the Command Line Installing Python, Jupyter Notebook, and a Code Editor Working with Python Online Hello World! Using Atom to Create a Standalone Python File Using Jupyter to Create a New Python Notebook Using Google Colab to Create a New Python Notebook Adding the Code In a Standalone File In a Notebook Running the Code In a Standalone File In a Notebook Documenting, Saving, and Versioning Your Work Documenting Saving Versioning Conclusion
2. Introduction to Python The Programming “Parts of Speech” Nouns ≈ Variables Verbs ≈ Functions Cooking with Custom Functions Libraries: Borrowing Custom Functions from Other Coders Taking Control: Loops and Conditionals In the Loop One Condition… Understanding Errors Syntax Snafus Runtime Runaround Logic Loss Hitting the Road with Citi Bike Data Starting with Pseudocode Seeking Scale Conclusion
3. Understanding Data Quality Assessing Data Fit Validity Reliability Representativeness Assessing Data Integrity Necessary, but Not Sufficient Important Achievable Improving Data Quality Data Cleaning Data Augmentation Conclusion
4. Working with File-Based and Feed-Based Data in Python Structured Versus Unstructured Data Working with Structured Data File-Based, Table-Type Data—Take It to Delimit Wrangling Table-Type Data with Python Real-World Data Wrangling: Understanding Unemployment XLSX, ODS, and All the Rest Finally, Fixed-Width Feed-Based Data—Web-Driven Live Updates Wrangling Feed-Type Data with Python Working with Unstructured Data Image-Based Text: Accessing Data in PDFs Wrangling PDFs with Python Accessing PDF Tables with Tabula Conclusion
5. Accessing Web-Based Data Accessing Online XML and JSON Introducing APIs Basic APIs: A Search Engine Example Specialized APIs: Adding Basic Authentication Getting a FRED API Key Using Your API key to Request Data Reading API Documentation Protecting Your API Key When Using Python Creating Your “Credentials” File Using Your Credentials in a Separate Script Getting Started with .gitignore Specialized APIs: Working With OAuth Applying for a Twitter Developer Account Creating Your Twitter “App” and Credentials Encoding Your API Key and Secret Requesting an Access Token and Data from the Twitter API API Ethics Web Scraping: The Data Source of Last Resort Carefully Scraping the MTA Using Browser Inspection Tools The Python Web Scraping Solution: Beautiful Soup Conclusion
6. Assessing Data Quality The Pandemic and the PPP Assessing Data Integrity Is It of Known Pedigree? Is It Timely? Is It Complete? Is It Well-Annotated? Is It High Volume? Is It Consistent? Is It Multivariate? Is It Atomic? Is It Clear? Is It Dimensionally Structured? Assessing Data Fit Validity Reliability Representativeness Conclusion
7. Cleaning, Transforming, and Augmenting Data Selecting a Subset of Citi Bike Data A Simple Split Regular Expressions: Supercharged String Matching Making a Date De-crufting Data Files Decrypting Excel Dates Generating True CSVs from Fixed-Width Data Correcting for Spelling Inconsistencies The Circuitous Path to “Simple” Solutions Gotchas That Will Get Ya! Augmenting Your Data Conclusion
8. Structuring and Refactoring Your Code Revisiting Custom Functions Will You Use It More Than Once? Is It Ugly and Confusing? Do You Just Really Hate the Default Functionality? Understanding Scope Defining the Parameters for Function “Ingredients” What Are Your Options? Getting Into Arguments? Return Values Climbing the “Stack” Refactoring for Fun and Profit A Function for Identifying Weekdays Metadata Without the Mess Documenting Your Custom Scripts and Functions with pydoc The Case for Command-Line Arguments Where Scripts and Notebooks Diverge Conclusion
9. Introduction to Data Analysis Context Is Everything Same but Different What’s Typical? Evaluating Central Tendency What’s That Mean? Embrace the Median Think Different: Identifying Outliers Visualization for Data Analysis What’s Our Data’s Shape? Understanding Histograms The Significance of Symmetry Counting “Clusters” The $2 Million Question Proportional Response Conclusion
10. Presenting Your Data Foundations for Visual Eloquence Making Your Data Statement Charts, Graphs, and Maps: Oh My! Pie Charts Bar and Column Charts Line Charts Scatter Charts Maps Elements of Eloquent Visuals The “Finicky” Details Really Do Make a Difference Trust Your Eyes (and the Experts) Selecting Scales Choosing Colors Above All, Annotate! From Basic to Beautiful: Customizing a Visualization with seaborn and matplotlib Beyond the Basics Conclusion
11. Beyond Python Additional Tools for Data Review Spreadsheet Programs OpenRefine Additional Tools for Sharing and Presenting Data Image Editing for JPGs, PNGs, and GIFs Software for Editing SVGs and Other Vector Formats Reflecting on Ethics Conclusion
A. More Python Programming Resources Official Python Documentation Installing Python Resources Where to Look for Libraries Keeping Your Tools Sharp Where to Learn More
B. A Bit More About Git You Run git push/pull and End Up in a Weird Text Editor Your git push/pull Command Gets Rejected Run git pull Git Quick Reference
C. Finding Data Data Repositories and APIs Subject Matter Experts FOIA/L Requests Custom Data Collection
D. Resources for Visualization and Information Design Foundational Books on Information Visualization The Quick Reference You’ll Reach For Sources of Inspiration
Index
About the Author
How to download source code?
1. Go to: https://www.oreilly.com/
2. Search for the book title: Practical Python Data Wrangling and Data Quality. Sometimes the search returns no results; in that case, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.