Big Data Analytics with Hadoop 3: Build highly effective analytics solutions to gain valuable insight into your big data

by Sridhar Alla

Length: 482 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2018-05-31
ISBN-10: 1788628845
ISBN-13: 9781788628846
Sales Rank: #3390829

Explore big data concepts, platforms, analytics, and their applications using the power of Hadoop 3

Key Features

Learn Hadoop 3 to build effective big data analytics solutions on-premises and in the cloud
Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink
Exploit big data using Hadoop 3 with real-world examples

Book Description

Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.

Once you have taken a tour of Hadoop 3’s latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then learn how to integrate Hadoop with open source tools such as Python and R to analyze and visualize data and perform statistical computing on big data. Next, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. Finally, you will see how to use Hadoop to build analytics solutions in the cloud, and how to build an end-to-end pipeline for big data analysis using practical use cases.

By the end of this book, you will be well versed in the analytical capabilities of the Hadoop ecosystem. You will be able to build powerful solutions that perform big data analytics and extract insights from your data.

What you will learn

Explore the new features of Hadoop 3 along with HDFS, YARN, and MapReduce
Become well versed in the analytical capabilities of the Hadoop ecosystem using practical examples
Integrate Hadoop with R and Python for more efficient big data processing
Learn to use Hadoop with Apache Spark and Apache Flink for real-time data analytics
Set up a Hadoop cluster on AWS cloud
Perform big data analytics on AWS using Amazon Elastic MapReduce (EMR)

Who This Book Is For

Big Data Analytics with Hadoop 3 is for you if you are looking to build high-performance analytics solutions for your enterprise or business using Hadoop 3’s powerful features, or if you are new to big data analytics. A basic understanding of the Java programming language is required.

Introduction to Hadoop
Overview of Big Data Analytics
Big Data Processing with MapReduce
Scientific Computing and Big Data Analysis with Python and Hadoop
Statistical Big Data Computing with R and Hadoop
Batch Analytics with Apache Spark
Real-Time Analytics with Apache Spark
Batch Analytics with Apache Flink
Stream Processing with Apache Flink
Visualizing Big Data
Introduction to Cloud Computing
Using Amazon Web Services

Title Page
Copyright and Credits
    Big Data Analytics with Hadoop 3
Packt Upsell
    Why subscribe?
    PacktPub.com
Contributors
    About the author
    About the reviewers
    Packt is searching for authors like you
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
        Download the example code files
        Download the color images
        Conventions used
    Get in touch
        Reviews
Introduction to Hadoop
    Hadoop Distributed File System
        High availability
        Intra-DataNode balancer
        Erasure coding
        Port numbers
    MapReduce framework
        Task-level native optimization
    YARN
        Opportunistic containers
            Types of container execution 
        YARN timeline service v.2
            Enhancing scalability and reliability
            Usability improvements
            Architecture
    Other changes
        Minimum required Java version 
        Shell script rewrite
        Shaded-client JARs
    Installing Hadoop 3 
        Prerequisites
        Downloading
        Installation
        Set up password-less SSH
        Setting up the NameNode
        Starting HDFS
        Setting up the YARN service
        Erasure Coding
        Intra-DataNode balancer
        Installing YARN timeline service v.2
            Setting up the HBase cluster
                Simple deployment for HBase
            Enabling the co-processor
            Enabling timeline service v.2
                Running timeline service v.2
                Enabling MapReduce to write to timeline service v.2
    Summary
Overview of Big Data Analytics
    Introduction to data analytics
        Inside the data analytics process
    Introduction to big data
        Variety of data
        Velocity of data
        Volume of data
        Veracity of data
        Variability of data
        Visualization
        Value
    Distributed computing using Apache Hadoop
    The MapReduce framework
    Hive
        Downloading and extracting the Hive binaries
        Installing Derby
        Using Hive
            Creating a database
            Creating a table
        SELECT statement syntax
            WHERE clauses
        INSERT statement syntax
        Primitive types
        Complex types
        Built-in operators and functions
            Built-in operators
            Built-in functions
        Language capabilities
            A cheat sheet on retrieving information 
    Apache Spark
    Visualization using Tableau
    Summary
Big Data Processing with MapReduce
    The MapReduce framework
        Dataset
        Record reader
        Map
        Combiner
        Partitioner
        Shuffle and sort
        Reduce
        Output format
    MapReduce job types
        Single mapper job
        Single mapper reducer job
        Multiple mappers reducer job
        SingleMapperCombinerReducer job
        Scenario
    MapReduce patterns
        Aggregation patterns
            Average temperature by city
                Record count
                Min/max/count
                Average/median/standard deviation
        Filtering patterns
        Join patterns
            Inner join
            Left anti join
            Left outer join
            Right outer join
            Full outer join
            Left semi join
            Cross join
    Summary
Scientific Computing and Big Data Analysis with Python and Hadoop
    Installation
        Installing standard Python
        Installing Anaconda
            Using Conda
    Data analysis
    Summary
Statistical Big Data Computing with R and Hadoop
    Introduction
        Install R on workstations and connect to the data in Hadoop
        Install R on a shared server and connect to Hadoop
        Utilize Revolution R Open
        Execute R inside of MapReduce using RMR2
            Summary and outlook for pure open source options
    Methods of integrating R and Hadoop
        RHADOOP – install R on workstations and connect to data in Hadoop
        RHIPE – execute R inside Hadoop MapReduce
        R and Hadoop Streaming
        RHIVE – install R on workstations and connect to data in Hadoop
        ORCH – Oracle connector for Hadoop
    Data analytics
    Summary
Batch Analytics with Apache Spark
    SparkSQL and DataFrames
    DataFrame APIs and the SQL API
        Pivots
        Filters
        User-defined functions
    Schema – structure of data
        Implicit schema
        Explicit schema
        Encoders
    Loading datasets
    Saving datasets
    Aggregations
        Aggregate functions
            count
            first
            last
            approx_count_distinct
            min
            max
            avg
            sum
            kurtosis
            skewness
            Variance
            Standard deviation
            Covariance
            groupBy
            Rollup
            Cube
        Window functions
        ntiles
    Joins
        Inner workings of join
        Shuffle join
        Broadcast join
        Join types
        Inner join
        Left outer join
        Right outer join
        Outer join
        Left anti join
        Left semi join
        Cross join
        Performance implications of join
    Summary
Real-Time Analytics with Apache Spark
    Streaming
        At-least-once processing
        At-most-once processing
        Exactly-once processing
    Spark Streaming
        StreamingContext
        Creating StreamingContext
        Starting StreamingContext
        Stopping StreamingContext
            Input streams
                receiverStream
                socketTextStream
                rawSocketStream
    fileStream
        textFileStream
        binaryRecordsStream
        queueStream
            textFileStream example
            twitterStream example
        Discretized Streams
    Transformations
        Windows operations
        Stateful/stateless transformations
            Stateless transformations
            Stateful transformations
    Checkpointing
        Metadata checkpointing
        Data checkpointing
    Driver failure recovery
    Interoperability with streaming platforms (Apache Kafka)
        Receiver-based
        Direct Stream
        Structured Streaming
            Getting deeper into Structured Streaming
    Handling event time and late data
    Fault-tolerance semantics
    Summary
Batch Analytics with Apache Flink
    Introduction to Apache Flink
        Continuous processing for unbounded datasets
        Flink, the streaming model, and bounded datasets
    Installing Flink
        Downloading Flink
        Installing Flink
            Starting a local Flink cluster
    Using the Flink cluster UI
    Batch analytics
        Reading file
            File-based
            Collection-based
            Generic
        Transformations
        GroupBy
        Aggregation
        Joins
            Inner join
            Left outer join
            Right outer join
            Full outer join
        Writing to a file
    Summary
Stream Processing with Apache Flink
    Introduction to streaming execution model
    Data processing using the DataStream API
        Execution environment
        Data sources
            Socket-based
            File-based
        Transformations
            map
            flatMap
            filter
            keyBy
            reduce
            fold
            Aggregations
            window
                Global windows
                Tumbling windows
                Sliding windows
                Session windows
            windowAll
            union
            Window join
            split
            Select
            Project
            Physical partitioning
                Custom partitioning
                Random partitioning
                Rebalancing partitioning
            Rescaling
            Broadcasting
            Event time and watermarks
            Connectors
                Kafka connector
                Twitter connector
                RabbitMQ connector
                Elasticsearch connector
                Cassandra connector
    Summary
Visualizing Big Data
    Introduction
    Tableau
    Chart types
        Line charts
        Pie chart
        Bar chart
        Heat map
    Using Python to visualize data
    Using R to visualize data
    Big data visualization tools
    Summary
Introduction to Cloud Computing
    Concepts and terminology
        Cloud
        IT resource
        On-premise
        Cloud consumers and Cloud providers
        Scaling
            Types of scaling
                Horizontal scaling
                Vertical scaling
            Cloud service
            Cloud service consumer
    Goals and benefits
        Increased scalability
        Increased availability and reliability
    Risks and challenges
        Increased security vulnerabilities
        Reduced operational governance control
        Limited portability between Cloud providers
    Roles and boundaries
        Cloud provider
        Cloud consumer
        Cloud service owner
        Cloud resource administrator
            Additional roles
            Organizational boundary
            Trust boundary
    Cloud characteristics
        On-demand usage
        Ubiquitous access
        Multi-tenancy (and resource pooling)
        Elasticity
        Measured usage
        Resiliency
    Cloud delivery models
        Infrastructure as a Service
        Platform as a Service
        Software as a Service
        Combining Cloud delivery models
            IaaS + PaaS
            IaaS + PaaS + SaaS
    Cloud deployment models
        Public Clouds
        Community Clouds
        Private Clouds
        Hybrid Clouds
    Summary
Using Amazon Web Services
    Amazon Elastic Compute Cloud
        Elastic web-scale computing
        Complete control of operations
        Flexible Cloud hosting services
        Integration
        High reliability
        Security
        Inexpensive
        Easy to start
        Instances and Amazon Machine Images
    Launching multiple instances of an AMI
        Instances
        AMIs
        Regions and availability zones
        Region and availability zone concepts
        Regions
        Availability zones
        Available regions
        Regions and endpoints
        Instance types
            Tag basics
            Amazon EC2 key pairs
            Amazon EC2 security groups for Linux instances
            Elastic IP addresses
        Amazon EC2 and Amazon Virtual Private Cloud
            Amazon Elastic Block Store
            Amazon EC2 instance store
    What is AWS Lambda?
        When should I use AWS Lambda?
    Introduction to Amazon S3
        Getting started with Amazon S3
        Comprehensive security and compliance capabilities
        Query in place
        Flexible management
        Most supported platform with the largest ecosystem
        Easy and flexible data transfer
        Backup and recovery
        Data archiving
        Data lakes and big data analytics
        Hybrid Cloud storage
        Cloud-native application data
        Disaster recovery
    Amazon DynamoDB
    Amazon Kinesis Data Streams
        What can I do with Kinesis Data Streams?
            Accelerated log and data feed intake and processing
            Real-time metrics and reporting
            Real-time data analytics
            Complex stream processing
            Benefits of using Kinesis Data Streams
    AWS Glue
        When should I use AWS Glue?
    Amazon EMR
        Practical AWS EMR cluster
    Summary

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the "Find a repository…" box, search for the book title: Big Data Analytics with Hadoop 3: Build highly effective analytics solutions to gain valuable insight into your big data. If the full title returns no results, search for the main title only.