Simplify Big Data Analytics with Amazon EMR: A beginner’s guide to learning and implementing Amazon EMR for building data analytics solutions

by Sakti Mishra

Length: 430 pages
Edition: 1
Language: English
Publisher: Packt Publishing
Publication Date: 2022-03-25
ISBN-10: 1801071071
ISBN-13: 9781801071079
Sales Rank: #1237609

Design scalable big data solutions using Hadoop, Spark, and AWS cloud native services

Key Features

Build data pipelines that require distributed processing capabilities on a large volume of data
Discover the security features of EMR such as data protection and granular permission management
Explore best practices and optimization techniques for building data analytics solutions in Amazon EMR

Book Description

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.

This book is a practical guide to Amazon EMR for building data pipelines. You’ll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You’ll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you’ll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. Along the way, you’ll explore best practices and cost optimization techniques for implementing your data analytics pipeline in EMR.

By the end of this book, you’ll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.

What you will learn

Explore Amazon EMR features, architecture, Hadoop interfaces, and EMR Studio
Configure, deploy, and orchestrate Hadoop or Spark jobs in production
Implement the security, data governance, and monitoring capabilities of EMR
Build applications for batch and real-time streaming data analytics solutions
Perform interactive development with a persistent EMR cluster and Notebook
Orchestrate an EMR Spark job using AWS Step Functions and Apache Airflow

Who this book is for

This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with the Hadoop ecosystem services and Amazon EMR. Prior experience in Python, Scala, or Java programming and a basic understanding of Hadoop and AWS will help you get the most out of this book.

Simplify Big Data Analytics with Amazon EMR
Contributors
About the author
About the reviewers
Preface
    Who this book is for
    What this book covers
    To get the most out of this book
    Download the example code files
    Code in Action
    Download the color images
    Conventions used
    Get in touch
    Share Your Thoughts
Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Chapter 1: An Overview of Amazon EMR
    What is Amazon EMR?
        What is big data?
        Hadoop – processing framework to handle big data
        Overview of Amazon EMR – managed and scalable Hadoop cluster in AWS
        A brief history of the major big data releases
    Benefits of Amazon EMR 
    Decoupling compute and storage
        Persistent versus transient clusters
    Integration with other AWS services
        Amazon S3 with EMR File System (EMRFS)
        Amazon Kinesis Data Streams (KDS)
        Amazon Managed Streaming for Kafka (MSK)
        AWS Glue Data Catalog
        Amazon Relational Database Service (RDS)
        Amazon DynamoDB
        Amazon Redshift
        AWS Lake Formation
        AWS Identity and Access Management (IAM)
        AWS Key Management Service (KMS)
        Lake House architecture overview
    EMR release history
    Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
        AWS Glue
        AWS Glue DataBrew
        Choosing the right service for your use case
    Summary
    Test your knowledge
    Further reading
Chapter 2: Exploring the Architecture and Deployment Options
    EMR architecture deep dive
        Distributed storage layer
        YARN – cluster resource manager
        Distributed processing frameworks
        Hadoop applications
    Understanding clusters and nodes
        Uniform instance groups
        Instance fleet
    Using S3 versus HDFS for cluster storage
        HDFS as cluster-persistent storage
        Amazon S3 as a persistent data store
    Understanding the cluster life cycle
        Options to submit work to the cluster
        Submitting jobs to the cluster as EMR steps
    Building Hadoop jobs with dependencies in a specific EMR release version
    EMR deployment options 
        Amazon EMR on Amazon EC2
        Amazon EMR on Amazon EKS
        Amazon EMR on AWS Outposts
        EMR pricing for different deployment options
        Monitoring and controlling your costs with AWS Budgets and Cost Explorer
    Summary
    Test your knowledge
    Further reading
Chapter 3: Common Use Cases and Architecture Patterns
    Reference architecture for batch ETL workloads
        Use case overview
        Reference architecture walkthrough
        Best practices to follow during implementation
    Reference architecture for clickstream analytics
        Use case overview
        Reference architecture walkthrough
        Best practices to follow during implementation
    Reference architecture for interactive analytics and ML
        Use case overview
        Reference architecture walkthrough
        Best practices to follow during implementation
    Reference architecture for real-time streaming analytics
        Use case overview
        Reference architecture walkthrough
        Best practices to follow during implementation
    Reference architecture for genomics data analytics
        Use case overview
        Reference architecture walkthrough
        Best practices to follow during implementation
    Reference architecture for log analytics
        Use case overview
        Reference architecture walkthrough
        Best practices to follow during implementation
    Summary
    Test your knowledge
    Further reading
Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
    Technical requirements
    Understanding popular big data applications in EMR
        Hive
        Presto
        Spark
        HBase
        Hue
        Ganglia
    Machine learning frameworks available in EMR
        TensorFlow
        MXNet
    Notebook options available in EMR
        EMR Notebooks 
        JupyterHub 
        EMR Studio 
        Zeppelin
    Summary
    Test your knowledge
    Further reading
Section 2: Configuration, Scaling, Data Security, and Governance
Chapter 5: Setting Up and Configuring EMR Clusters
    Technical requirements
    Setting up and configuring clusters with the EMR console's quick create option
    Advanced configuration for cluster hardware and software 
        Understanding the Software Configuration section
        Understanding Steps 
        Understanding the Hardware Configuration section
        Understanding general configurations
        Understanding Security Options 
    Working with AMIs and controlling cluster termination 
        Working with AMIs
        Controlling the EMR cluster termination process
    Troubleshooting and logging in your EMR cluster 
        Tools available to debug your EMR cluster 
        Viewing and restarting cluster application processes
        Troubleshooting a failed cluster
        Troubleshooting a slow cluster
        Logging in your EMR cluster 
    Summary
    Test your knowledge
    Further reading
Chapter 6: Monitoring, Scaling, and High Availability
    Technical requirements
    Monitoring your EMR cluster
        Monitoring clusters and applications with web user interfaces 
        Monitoring cluster metrics with CloudWatch monitoring 
        EMR API audit logging with AWS CloudTrail 
    Scaling cluster resources 
        Managed scaling in EMR 
        Autoscaling in EMR with a custom policy for instance groups
        Manually resizing your EMR cluster
        Comparing managed scaling with autoscaling 
    Cluster cloning and high availability with multiple master nodes
        High availability with multiple master nodes 
        Cloning an existing EMR cluster 
    Summary
    Test your knowledge
    Further reading
Chapter 7: Understanding Security in Amazon EMR
    Technical requirements
    Understanding the basics of security 
        Creating security configurations 
        Specifying a security configuration for your cluster 
    AWS IAM integration with Amazon EMR 
        Configuring an IAM service role for your EMR cluster 
        Configuring IAM roles for EMRFS 
        Integrating IAM roles in applications that invoke AWS services directly 
        Allowing users and groups to create and modify roles 
        Identity-based policies and best practices 
        Understanding authentication to cluster nodes 
    Understanding data protection in EMR 
        Encrypting data at rest for EMRFS on Amazon S3 data 
        Encrypting data in transit for EMRFS on Amazon S3 data 
    Role of security groups and interface VPC endpoints 
        Controlling cluster network traffic with security groups 
        Connecting to Amazon EMR on an EC2 cluster using an interface VPC endpoint 
        Connecting to Amazon EMR on an EKS cluster using an interface VPC endpoint 
    Summary 
    Test your knowledge
    Further reading 
Chapter 8: Understanding Data Governance in Amazon EMR
    Technical requirements
    Understanding data catalog and access management options 
        Using AWS Glue Data Catalog 
        Integrating AWS Glue Data Catalog with Amazon EMR 
        Permission management on top of a data catalog
    Understanding Amazon EMR integration with AWS Lake Formation 
        Integrating Lake Formation with Amazon EMR 
        Launching an EMR cluster with Lake Formation 
        Setting up EMR notebooks to work with Lake Formation 
    Understanding Amazon EMR integration with Apache Ranger 
        Setting up Apache Ranger in EMR 
        Understanding Apache Ranger plugins 
    Summary
    Test your knowledge
    Further reading
Section 3: Implementing Common Use Cases and Best Practices
Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
    Technical requirements
    Use case and architecture overview
        Architecture overview 
    Implementation steps
        Creating Amazon S3 buckets 
        Creating the AWS Lambda function 
        Configuring an S3 file arrival event to trigger the Lambda function 
        Triggering the EMR job 
    Validating the output using Amazon Athena 
        Defining a virtual Glue Data Catalog table on top of Amazon S3 data 
        Querying output data using Amazon Athena standard SQL 
    Spark ETL and Lambda function code walk-through
        Understanding the AWS Lambda function code 
        Understanding the PySpark script integrated into the EMR step 
    Summary
    Test your knowledge
    Further reading
Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
    Technical requirements
    Use case and architecture overview
        Architecture overview 
    Implementation steps
        Creating Amazon S3 buckets 
        Creating the Amazon Kinesis data stream 
        Creating and configuring the Kinesis Data Generator tool
        Creating an Amazon EMR cluster and configuring a Spark Streaming job 
    Validating output using Amazon Athena 
        Defining a virtual Glue Catalog table on top of Amazon S3 data 
        Querying output data using a standard SQL query in Amazon Athena
    Spark Streaming code walk-through
    Summary
    Test your knowledge
    Further reading
Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
    Technical requirements
    Apache Hudi overview
        Popular use cases
        Registering Hudi data with your Hive or Glue Data Catalog metastore 
    Creating an EMR cluster and an EMR notebook
        Creating an EMR cluster 
        Creating an EMR notebook
        Creating an Amazon S3 bucket
    Interactive development with Spark and Hudi
        Creating a PySpark notebook for development 
        Integrating Hudi with our PySpark notebook 
        Executing Spark and Hudi scripts in your notebook 
    Summary
    Test your knowledge
    Further reading
Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
    Technical requirements
    Overview of AWS Step Functions
    Integrating AWS Step Functions to orchestrate EMR jobs
    Overview of Apache Airflow and MWAA
    Integrating Airflow to trigger EMR jobs
    Summary
    Test your knowledge
    Further reading
Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
    Understanding migration approaches
        Lift and shift 
        Re-architecting 
        Hybrid architecture
    Migrating data and metadata catalogs
        Migrating data 
        Migrating metadata catalogs
    Migrating ETL jobs and Oozie workflows 
        Migrating Oozie workflows
    Testing and validation 
        Validating metadata quality 
        Validating data quality 
    Best practices for migration 
    Summary
    Test your knowledge
    Further reading
Chapter 14: Best Practices and Cost-Optimization Techniques
    Best practices around EMR cluster configurations
        Choosing the correct cluster type (transient versus long-running) 
        Best practices around sizing your cluster
    Optimization techniques for data processing and storage
        Best practices for cluster persistent storage 
        Best practices while processing data using EMR 
    Security best practices
        Configuring edge nodes outside of the cluster to limit connectivity 
        Integrating logging, monitoring, and audit controls into your cluster
        Blocking public access to your EMR cluster 
        Protecting your data at rest and in transit
    Cost-optimization techniques
        Cost savings with compute resources 
        Cost savings with storage 
        Integrating AWS Budgets and Cost Explorer 
        AWS Trusted Advisor
        Cost allocation tags 
    Limitations of Amazon EMR and possible workarounds
    Summary
    Test your knowledge
    Further reading
    Why subscribe?
Other Books You May Enjoy
    Packt is searching for authors like you
    Share Your Thoughts

How to download source code?

1. Go to: https://github.com/PacktPublishing

2. In the Find a repository… box, search for the book title: Simplify Big Data Analytics with Amazon EMR: A beginner’s guide to learning and implementing Amazon EMR for building data analytics solutions. If the full title returns no results, search for the main title only.