Data Science at the Command Line, 2nd Edition: Obtain, Scrub, Explore, and Model Data with Unix Power Tools
- Length: 250 pages
- Edition: 2
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2021-09-14
- ISBN-10: 1492087912
- ISBN-13: 9781492087915
- Sales Rank: #1161402
This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 80 tools, useful whether you work with Windows, macOS, or Linux.
You’ll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you’re comfortable processing data with Python or R, you’ll learn how to greatly improve your data science workflow by leveraging the command line’s power. This book is ideal for data scientists, analysts, and engineers; software and machine learning engineers; and system administrators.
- Obtain data from websites, APIs, databases, and spreadsheets
- Perform scrub operations on text, CSV, HTML, XML, and JSON files
- Explore data, compute descriptive statistics, and create visualizations
- Manage your data science workflow
- Create reusable command-line tools from one-liners and existing Python or R code
- Parallelize and distribute data-intensive pipelines
- Model data with dimensionality reduction, clustering, regression, and classification algorithms
Foreword
Preface: What to Expect from This Book; Changes for the Second Edition; How to Read This Book; Who This Book Is For; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments for the Second Edition (2021); Acknowledgments for the First Edition (2014)
1. Introduction: Data Science Is OSEMN; Obtaining Data; Scrubbing Data; Exploring Data; Modeling Data; Interpreting Data; Intermezzo Chapters; What Is the Command Line?; Why Data Science at the Command Line?; The Command Line Is Agile; The Command Line Is Augmenting; The Command Line Is Scalable; The Command Line Is Extensible; The Command Line Is Ubiquitous; Summary; For Further Exploration
2. Getting Started: Getting the Data; Installing the Docker Image; Essential Unix Concepts; The Environment; Executing a Command-Line Tool; Five Types of Command-Line Tools; Combining Command-Line Tools; Redirecting Input and Output; Working with Files and Directories; Managing Output; Help!; Summary; For Further Exploration
3. Obtaining Data: Overview; Copying Local Files to the Docker Container; Downloading from the Internet; Introducing curl; Saving; Other Protocols; Following Redirects; Decompressing Files; Converting Microsoft Excel Spreadsheets to CSV; Querying Relational Databases; Calling Web APIs; Authentication; Streaming APIs; Summary; For Further Exploration
4. Creating Command-Line Tools: Overview; Converting One-Liners into Shell Scripts; Step 1: Create a File; Step 2: Give Permission to Execute; Step 3: Define a Shebang; Step 4: Remove the Fixed Input; Step 5: Add Arguments; Step 6: Extend Your PATH; Creating Command-Line Tools with Python and R; Porting the Shell Script; Processing Streaming Data from Standard Input; Summary; For Further Exploration
5. Scrubbing Data: Overview; Transformations, Transformations Everywhere; Plain Text; Filtering Lines; Extracting Values; Replacing and Deleting Values; CSV; Bodies and Headers and Columns, Oh My!; Performing SQL Queries on CSV; Extracting and Reordering Columns; Filtering Rows; Merging Columns; Combining Multiple CSV Files; Working with XML/HTML and JSON; Summary; For Further Exploration
6. Project Management with Make: Overview; Introducing Make; Running Tasks; Building, for Real; Adding Dependencies; Summary; For Further Exploration
7. Exploring Data: Overview; Inspecting Data and Its Properties; Header or Not, Here I Come; Inspect All the Data; Feature Names and Data Types; Unique Identifiers, Continuous Variables, and Factors; Computing Descriptive Statistics; Column Statistics; R One-Liners on the Shell; Creating Visualizations; Displaying Images from the Command Line; Plotting in a Rush; Creating Bar Charts; Creating Histograms; Creating Density Plots; Happy Little Accidents; Creating Scatter Plots; Creating Trend Lines; Creating Box Plots; Adding Labels; Going Beyond Basic Plots; Summary; For Further Exploration
8. Parallel Pipelines: Overview; Serial Processing; Looping Over Numbers; Looping Over Lines; Looping Over Files; Parallel Processing; Introducing GNU Parallel; Specifying Input; Controlling the Number of Concurrent Jobs; Logging and Output; Creating Parallel Tools; Distributed Processing; Get List of Running AWS EC2 Instances; Running Commands on Remote Machines; Distributing Local Data Among Remote Machines; Processing Files on Remote Machines; Summary; For Further Exploration
9. Modeling Data: Overview; More Wine, Please!; Dimensionality Reduction with Tapkee; Introducing Tapkee; Linear and Nonlinear Mappings; Regression with Vowpal Wabbit; Preparing the Data; Training the Model; Testing the Model; Classification with SciKit-Learn Laboratory; Preparing the Data; Running the Experiment; Parsing the Results; Summary; For Further Exploration
10. Polyglot Data Science: Overview; Jupyter; Python; R; RStudio; Apache Spark; Summary; For Further Exploration
11. Conclusion: Let’s Recap; Three Pieces of Advice; Be Patient; Be Creative; Be Practical; Where to Go from Here; The Command Line; Shell Programming; Python, R, and SQL; APIs; Machine Learning; Getting in Touch
A. List of Command-Line Tools: alias awk aws bash bat bc body cat cd chmod cols column cowsay cp csv2vw csvcut csvgrep csvjoin csvlook csvquote csvsort csvsql csvstack csvstat curl cut display dseq echo env export fc find fold for fx git grep gron head header history hostname in2csv jq json2csv l less ls make man mkdir mv nano nl parallel paste pbc pip pup pwd python R rev rm rush sample scp sed seq servewd shuf skll sort split sponge sql2csv ssh sudo tail tapkee tar tee telnet tldr tr tree trim ts type uniq unpack unrar unzip vw wc which xml2json xmlstarlet xsv zcat zsh
Index
How to download source code?
1. Go to: https://www.oreilly.com/
2. Search for the book title: Data Science at the Command Line, 2nd Edition: Obtain, Scrub, Explore, and Model Data with Unix Power Tools. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.