In-Memory Analytics with Apache Arrow: Perform fast and efficient data analytics on both flat and hierarchical structured data
- Length: 392 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-06-24
- ISBN-10: 1801071039
- ISBN-13: 9781801071031
- Sales Rank: #51618
Process tabular data and build high-performance query engines on modern CPUs and GPUs using Apache Arrow, a standardized, language-independent columnar memory format designed for optimal performance
Key Features
- Learn about Apache Arrow’s data types and interoperability with pandas and Parquet
- Work with Apache Arrow Flight RPC, Compute, and Dataset APIs to produce and consume tabular data
- Reviewed, contributed to, and supported by Dremio, the co-creator of Apache Arrow
Book Description
Apache Arrow is designed to accelerate analytics and to make it easy to exchange data across big data systems.
In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, before moving on to helping you to understand Arrow’s versatility and benefits as you walk through a variety of real-world use cases. You’ll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you’ll explore the different data interchange and storage formats and become well-versed with the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you’ll learn about Dremio’s usage of Apache Arrow to enhance SQL analytics and discover how Arrow can be used in web-based browser apps. Finally, you’ll get to grips with the upcoming features of Arrow to help you stay ahead of the curve.
By the end of this book, you will have all the building blocks to create useful, efficient, and powerful analytical services and utilities with Apache Arrow.
What you will learn
- Use Apache Arrow libraries to access data files both locally and in the cloud
- Understand the zero-copy elements of the Apache Arrow format
- Improve read performance by memory-mapping files with Apache Arrow
- Produce or consume Apache Arrow data efficiently using a C API
- Use the Apache Arrow Compute APIs to perform complex operations
- Create Arrow Flight servers and clients for transferring data quickly
- Build the Arrow libraries locally and contribute back to the community
Who this book is for
This book is for developers, data analysts, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. This book will also be useful for any engineers who are working on building utilities for data analytics and query engines, or otherwise working with tabular data, regardless of the programming language. Some familiarity with basic concepts of data analysis will help you to get the most out of this book but isn’t required. Code examples are provided in the C++, Go, and Python programming languages.
Table of Contents
In-Memory Analytics with Apache Arrow
- Foreword
- Acknowledgments
- Contributors (About the author; About the reviewers)
- Preface (Who this book is for; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Share Your Thoughts)
Section 1: Overview of What Arrow Is, its Capabilities, Benefits, and Goals
- Chapter 1: Getting Started with Apache Arrow (Technical requirements; Understanding the Arrow format and specifications; Why does Arrow use a columnar in-memory format?; Learning the terminology and physical memory layout; Quick summary of physical layouts, or TL;DR; How to speak Arrow; Arrow format versioning and stability; Would you download a library? Of course!; Setting up your shooting range; Using pyarrow for Python; C++ for the 1337 coders; Go Arrow go!; Summary; References)
- Chapter 2: Working with Key Arrow Specifications (Technical requirements; Playing with data, wherever it might be!; Working with Arrow tables; Accessing data files with pyarrow; Accessing data files with Arrow in C++; pandas firing Arrow; Putting pandas in your quiver; Making pandas run fast; Keeping pandas from running wild; Sharing is caring… especially when it's your memory; Diving into memory management; Managing buffers for performance; Crossing the boundaries; Summary)
- Chapter 3: Data Science with Apache Arrow (Technical requirements; ODBC takes an Arrow to the knee; Lost in translation; SPARKing new ideas on Jupyter; Understanding the integration; Everyone gets a containerized development environment!; SPARKing joy with Arrow and PySpark; Interactive charting powered by Arrow; Stretching workflows onto Elasticsearch; Indexing the data; Summary)
Section 2: Interoperability with Arrow: pandas, Parquet, Flight, and Datasets
- Chapter 4: Format and Memory Handling (Technical requirements; Storage versus runtime in-memory versus message-passing formats; Long-term storage formats; In-memory runtime formats; Message-passing formats; Summing up; Passing your Arrows around; What is this sorcery?!; Producing and consuming Arrows; Learning about memory cartography; The base case; Parquet versus CSV; Mapping data into memory; Too long; didn't read (TL;DR) – Computers are magic; Summary)
- Chapter 5: Crossing the Language Barrier with the Arrow C Data API (Technical requirements; Using the Arrow C data interface; The ArrowSchema structure; The ArrowArray structure; Example use cases; Using the C Data API to export Arrow-formatted data; Importing Arrow data with Python; Exporting Arrow data with the C Data API from Python to Go; Streaming across the C Data API; Streaming record batches from Python to Go; Other use cases; Some exercises; Summary)
- Chapter 6: Leveraging the Arrow Compute APIs (Technical requirements; Letting Arrow do the work for you; Input shaping; Value casting; Types of functions; Executing compute functions; Using the C++ compute library; Using the compute library in Python; Picking the right tools; Adding a constant value to an array; Summary)
- Chapter 7: Using the Arrow Datasets API (Technical requirements; Querying multifile datasets; Creating a sample dataset; Discovering dataset fragments; Filtering data programmatically; Expressing yourself – a quick detour; Using expressions for filtering data; Deriving and renaming columns (projecting); Using the Datasets API in Python; Creating our sample dataset; Discovering the dataset; Using different file formats; Filtering and projecting columns with Python; Streaming results; Working with partitioned datasets; Summary)
- Chapter 8: Exploring Apache Arrow Flight RPC (Technical requirements; The basics and complications of gRPC; Building modern APIs for data; Efficiency and streaming are important; Arrow Flight's building blocks; Horizontal scalability with Arrow Flight; Adding your business logic to Flight; Other bells and whistles; Understanding the Flight Protocol Buffer definitions; Using Flight, choose your language!; Building a Python Flight server; Building a Go Flight server; What is Flight SQL?; Setting up a performance test; Running the performance test; Flight SQL, the new kid on the block; Summary)
Section 3: Real-World Examples, Use Cases, and Future Development
- Chapter 9: Powered by Apache Arrow (Swimming in data with Dremio Sonar; Clarifying Dremio Sonar's architecture; The library of the Gods… of data analysis; Spicing up your ML workflows; Bringing the AI engine to where the data lives; Arrow in the browser using JavaScript; Gaining a little perspective; Taking flight with Falcon; Summary)
- Chapter 10: How to Leave Your Mark on Arrow (Technical requirements; Contributing to open source projects; Communication is key; You don't necessarily have to contribute code; There are a lot of reasons why you should contribute!; Preparing your first pull request; Navigating JIRA; Setting up Git; Orienting yourself in the code base; Building the Arrow libraries; Creating the PR; Understanding the CI configuration; Development using Archery; Find your interest and expand on it; Getting that sweet, sweet approval; Finishing up with style!; C++ styling; Python code styling; Go code styling; Summary)
- Chapter 11: Future Development and Plans (Examining Flight SQL (redux); Why Flight SQL?; Defining the Flight SQL protocol; Firing a Ballista using Data(Fusion); What about Spark?; Looking at Ballista's development roadmap; Building a cross-language compute serialization; Why Substrait?; Working with Substrait serialization; Getting involved with Substrait development; Final words)
- Why subscribe?
- Other Books You May Enjoy (Packt is searching for authors like you; Share Your Thoughts)
How do I download the source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: In-Memory Analytics with Apache Arrow: Perform fast and efficient data analytics on both flat and hierarchical structured data. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.
To download the ebook from this site:
1. Disable the AdBlock plugin; otherwise, the download links may not appear.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to get the file.