Big Data Analytics with Hadoop 3: Build highly effective analytics solutions to gain valuable insight into your big data
- Length: 482 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2018-05-31
- ISBN-10: 1788628845
- ISBN-13: 9781788628846
- Sales Rank: #3390829
Explore big data concepts, platforms, analytics, and their applications using the power of Hadoop 3
Key Features
- Learn Hadoop 3 to build effective big data analytics solutions on-premises and in the cloud
- Integrate Hadoop with other big data tools such as R, Python, Apache Spark, and Apache Flink
- Exploit big data using Hadoop 3 with real-world examples
Book Description
Apache Hadoop is the most popular platform for big data processing, and can be combined with a host of other big data tools to build powerful analytics solutions. Big Data Analytics with Hadoop 3 shows you how to do just that, by providing insights into the software as well as its benefits with the help of practical examples.
Once you have taken a tour of Hadoop 3’s latest features, you will get an overview of HDFS, MapReduce, and YARN, and how they enable faster, more efficient big data processing. You will then learn how to integrate Hadoop with open source tools such as Python and R to analyze and visualize data and perform statistical computing on big data. Next, you will explore how to use Hadoop 3 with Apache Spark and Apache Flink for real-time data analytics and stream processing. Finally, you will learn how to use Hadoop to build analytics solutions on the cloud, and how to build an end-to-end pipeline for big data analysis using practical use cases.
By the end of this book, you will be well-versed in the analytical capabilities of the Hadoop ecosystem and able to build powerful solutions that perform big data analytics and yield insight.
What you will learn
- Explore the new features of Hadoop 3 along with HDFS, YARN, and MapReduce
- Get well-versed in the analytical capabilities of the Hadoop ecosystem using practical examples
- Integrate Hadoop with R and Python for more efficient big data processing
- Learn to use Hadoop with Apache Spark and Apache Flink for real-time data analytics
- Set up a Hadoop cluster on the AWS cloud
- Perform big data analytics on AWS using Amazon EMR (Elastic MapReduce)
Who This Book Is For
Big Data Analytics with Hadoop 3 is for you if you want to build high-performance analytics solutions for your enterprise or business using Hadoop 3’s powerful features, or if you are new to big data analytics. A basic understanding of the Java programming language is required.
Table of Contents
- Introduction to Hadoop
- Overview of Big Data Analytics
- Big Data Processing with MapReduce
- Scientific Computing and Big Data Analysis with Python and Hadoop
- Statistical Big Data Computing with R and Hadoop
- Batch Analytics with Apache Spark
- Real-Time Analytics with Apache Spark
- Batch Analytics with Apache Flink
- Stream Processing with Apache Flink
- Visualizing Big Data
- Introduction to Cloud Computing
- Using Amazon Web Services
- Title Page
- Copyright and Credits: Big Data Analytics with Hadoop 3
- Packt Upsell: Why subscribe?; PacktPub.com
- Contributors: About the author; About the reviewers; Packt is searching for authors like you
- Preface: Who this book is for; What this book covers; To get the most out of this book; Download the example code files; Download the color images; Conventions used; Get in touch; Reviews
- Introduction to Hadoop: Hadoop Distributed File System; High availability; Intra-DataNode balancer; Erasure coding; Port numbers; MapReduce framework; Task-level native optimization; YARN; Opportunistic containers; Types of container execution; YARN timeline service v.2; Enhancing scalability and reliability; Usability improvements; Architecture; Other changes; Minimum required Java version; Shell script rewrite; Shaded-client JARs; Installing Hadoop 3; Prerequisites; Downloading; Installation; Setup password-less ssh; Setting up the NameNode; Starting HDFS; Setting up the YARN service; Erasure Coding; Intra-DataNode balancer; Installing YARN timeline service v.2; Setting up the HBase cluster; Simple deployment for HBase; Enabling the co-processor; Enabling timeline service v.2; Running timeline service v.2; Enabling MapReduce to write to timeline service v.2; Summary
- Overview of Big Data Analytics: Introduction to data analytics; Inside the data analytics process; Introduction to big data; Variety of data; Velocity of data; Volume of data; Veracity of data; Variability of data; Visualization; Value; Distributed computing using Apache Hadoop; The MapReduce framework; Hive; Downloading and extracting the Hive binaries; Installing Derby; Using Hive; Creating a database; Creating a table; SELECT statement syntax; WHERE clauses; INSERT statement syntax; Primitive types; Complex types; Built-in operators and functions; Built-in operators; Built-in functions; Language capabilities; A cheat sheet on retrieving information; Apache Spark; Visualization using Tableau; Summary
- Big Data Processing with MapReduce: The MapReduce framework; Dataset; Record reader; Map; Combiner; Partitioner; Shuffle and sort; Reduce; Output format; MapReduce job types; Single mapper job; Single mapper reducer job; Multiple mappers reducer job; SingleMapperCombinerReducer job; Scenario; MapReduce patterns; Aggregation patterns; Average temperature by city; Record count; Min/max/count; Average/median/standard deviation; Filtering patterns; Join patterns; Inner join; Left anti join; Left outer join; Right outer join; Full outer join; Left semi join; Cross join; Summary
- Scientific Computing and Big Data Analysis with Python and Hadoop: Installation; Installing standard Python; Installing Anaconda; Using Conda; Data analysis; Summary
- Statistical Big Data Computing with R and Hadoop: Introduction; Install R on workstations and connect to the data in Hadoop; Install R on a shared server and connect to Hadoop; Utilize Revolution R Open; Execute R inside of MapReduce using RMR2; Summary and outlook for pure open source options; Methods of integrating R and Hadoop; RHADOOP – install R on workstations and connect to data in Hadoop; RHIPE – execute R inside Hadoop MapReduce; R and Hadoop Streaming; RHIVE – install R on workstations and connect to data in Hadoop; ORCH – Oracle connector for Hadoop; Data analytics; Summary
- Batch Analytics with Apache Spark: SparkSQL and DataFrames; DataFrame APIs and the SQL API; Pivots; Filters; User-defined functions; Schema – structure of data; Implicit schema; Explicit schema; Encoders; Loading datasets; Saving datasets; Aggregations; Aggregate functions; count; first; last; approx_count_distinct; min; max; avg; sum; kurtosis; skewness; Variance; Standard deviation; Covariance; groupBy; Rollup; Cube; Window functions; ntiles; Joins; Inner workings of join; Shuffle join; Broadcast join; Join types; Inner join; Left outer join; Right outer join; Outer join; Left anti join; Left semi join; Cross join; Performance implications of join; Summary
- Real-Time Analytics with Apache Spark: Streaming; At-least-once processing; At-most-once processing; Exactly-once processing; Spark Streaming; StreamingContext; Creating StreamingContext; Starting StreamingContext; Stopping StreamingContext; Input streams; receiverStream; socketTextStream; rawSocketStream; fileStream; textFileStream; binaryRecordsStream; queueStream; textFileStream example; twitterStream example; Discretized Streams; Transformations; Windows operations; Stateful/stateless transformations; Stateless transformations; Stateful transformations; Checkpointing; Metadata checkpointing; Data checkpointing; Driver failure recovery; Interoperability with streaming platforms (Apache Kafka); Receiver-based; Direct Stream; Structured Streaming; Getting deeper into Structured Streaming; Handling event time and late date; Fault-tolerance semantics; Summary
- Batch Analytics with Apache Flink: Introduction to Apache Flink; Continuous processing for unbounded datasets; Flink, the streaming model, and bounded datasets; Installing Flink; Downloading Flink; Installing Flink; Starting a local Flink cluster; Using the Flink cluster UI; Batch analytics; Reading file; File-based; Collection-based; Generic; Transformations; GroupBy; Aggregation; Joins; Inner join; Left outer join; Right outer join; Full outer join; Writing to a file; Summary
- Stream Processing with Apache Flink: Introduction to streaming execution model; Data processing using the DataStream API; Execution environment; Data sources; Socket-based; File-based; Transformations; map; flatMap; filter; keyBy; reduce; fold; Aggregations; window; Global windows; Tumbling windows; Sliding windows; Session windows; windowAll; union; Window join; split; Select; Project; Physical partitioning; Custom partitioning; Random partitioning; Rebalancing partitioning; Rescaling; Broadcasting; Event time and watermarks; Connectors; Kafka connector; Twitter connector; RabbitMQ connector; Elasticsearch connector; Cassandra connector; Summary
- Visualizing Big Data: Introduction; Tableau; Chart types; Line charts; Pie chart; Bar chart; Heat map; Using Python to visualize data; Using R to visualize data; Big data visualization tools; Summary
- Introduction to Cloud Computing: Concepts and terminology; Cloud; IT resource; On-premise; Cloud consumers and Cloud providers; Scaling; Types of scaling; Horizontal scaling; Vertical scaling; Cloud service; Cloud service consumer; Goals and benefits; Increased scalability; Increased availability and reliability; Risks and challenges; Increased security vulnerabilities; Reduced operational governance control; Limited portability between Cloud providers; Roles and boundaries; Cloud provider; Cloud consumer; Cloud service owner; Cloud resource administrator; Additional roles; Organizational boundary; Trust boundary; Cloud characteristics; On-demand usage; Ubiquitous access; Multi-tenancy (and resource pooling); Elasticity; Measured usage; Resiliency; Cloud delivery models; Infrastructure as a Service; Platform as a Service; Software as a Service; Combining Cloud delivery models; IaaS + PaaS; IaaS + PaaS + SaaS; Cloud deployment models; Public Clouds; Community Clouds; Private Clouds; Hybrid Clouds; Summary
- Using Amazon Web Services: Amazon Elastic Compute Cloud; Elastic web-scale computing; Complete control of operations; Flexible Cloud hosting services; Integration; High reliability; Security; Inexpensive; Easy to start; Instances and Amazon Machine Images; Launching multiple instances of an AMI; Instances; AMIs; Regions and availability zones; Region and availability zone concepts; Regions; Availability zones; Available regions; Regions and endpoints; Instance types; Tag basics; Amazon EC2 key pairs; Amazon EC2 security groups for Linux instances; Elastic IP addresses; Amazon EC2 and Amazon Virtual Private Cloud; Amazon Elastic Block Store; Amazon EC2 instance store; What is AWS Lambda?; When should I use AWS Lambda?; Introduction to Amazon S3; Getting started with Amazon S3; Comprehensive security and compliance capabilities; Query in place; Flexible management; Most supported platform with the largest ecosystem; Easy and flexible data transfer; Backup and recovery; Data archiving; Data lakes and big data analytics; Hybrid Cloud storage; Cloud-native application data; Disaster recovery; Amazon DynamoDB; Amazon Kinesis Data Streams; What can I do with Kinesis Data Streams?; Accelerated log and data feed intake and processing; Real-time metrics and reporting; Real-time data analytics; Complex stream processing; Benefits of using Kinesis Data Streams; AWS Glue; When should I use AWS Glue?; Amazon EMR; Practical AWS EMR cluster; Summary
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Big Data Analytics with Hadoop 3: Build highly effective analytics solutions to gain valuable insight into your big data. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download.