Simplify Big Data Analytics with Amazon EMR: A beginner’s guide to learning and implementing Amazon EMR for building data analytics solutions
- Length: 430 pages
- Edition: 1
- Language: English
- Publisher: Packt Publishing
- Publication Date: 2022-03-25
- ISBN-10: 1801071071
- ISBN-13: 9781801071079
- Sales Rank: #1237609
Design scalable big data solutions using Hadoop, Spark, and AWS cloud native services
Key Features
- Build data pipelines that require distributed processing capabilities on a large volume of data
- Discover the security features of EMR such as data protection and granular permission management
- Explore best practices and optimization techniques for building data analytics solutions in Amazon EMR
Book Description
Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.
This book is a practical guide to Amazon EMR for building data pipelines. You’ll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You’ll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you’ll orchestrate your EMR jobs and plan the migration of an on-premises Hadoop cluster to EMR. You’ll also explore best practices and cost optimization techniques while implementing your data analytics pipeline in EMR.
By the end of this book, you’ll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.
What you will learn
- Explore Amazon EMR features, architecture, Hadoop interfaces, and EMR Studio
- Configure, deploy, and orchestrate Hadoop or Spark jobs in production
- Implement the security, data governance, and monitoring capabilities of EMR
- Build applications for batch and real-time streaming data analytics solutions
- Perform interactive development with a persistent EMR cluster and Notebook
- Orchestrate an EMR Spark job using AWS Step Functions and Apache Airflow
Who this book is for
This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with the Hadoop ecosystem services and Amazon EMR. Prior experience with Python, Scala, or Java and a basic understanding of Hadoop and AWS will help you make the most out of this book.
Table of Contents
- Simplify Big Data Analytics with Amazon EMR
- Contributors
- About the author
- About the reviewers
- Preface
- Who this book is for
- What this book covers
- To get the most out of this book
- Download the example code files
- Code in Action
- Download the color images
- Conventions used
- Get in touch
- Share Your Thoughts
Section 1: Overview, Architecture, Big Data Applications, and Common Use Cases of Amazon EMR
Chapter 1: An Overview of Amazon EMR
- What is Amazon EMR?
- What is big data?
- Hadoop – processing framework to handle big data
- Overview of Amazon EMR – managed and scalable Hadoop cluster in AWS
- A brief history of the major big data releases
- Benefits of Amazon EMR
- Decoupling compute and storage
- Persistent versus transient clusters
- Integration with other AWS services
- Amazon S3 with EMR File System (EMRFS)
- Amazon Kinesis Data Streams (KDS)
- Amazon Managed Streaming for Kafka (MSK)
- AWS Glue Data Catalog
- Amazon Relational Database Service (RDS)
- Amazon DynamoDB
- Amazon Redshift
- AWS Lake Formation
- AWS Identity and Access Management (IAM)
- AWS Key Management Service (KMS)
- Lake House architecture overview
- EMR release history
- Comparing Amazon EMR with AWS Glue and AWS Glue DataBrew
- AWS Glue
- AWS Glue DataBrew
- Choosing the right service for your use case
- Summary
- Test your knowledge
- Further reading
Chapter 2: Exploring the Architecture and Deployment Options
- EMR architecture deep dive
- Distributed storage layer
- YARN – cluster resource manager
- Distributed processing frameworks
- Hadoop applications
- Understanding clusters and nodes
- Uniform instance groups
- Instance fleet
- Using S3 versus HDFS for cluster storage
- HDFS as cluster-persistent storage
- Amazon S3 as a persistent data store
- Understanding the cluster life cycle
- Options to submit work to the cluster
- Submitting jobs to the cluster as EMR steps
- Building Hadoop jobs with dependencies in a specific EMR release version
- EMR deployment options
- Amazon EMR on Amazon EC2
- Amazon EMR on Amazon EKS
- Amazon EMR on AWS Outposts
- EMR pricing for different deployment options
- Monitoring and controlling your costs with AWS Budgets and Cost Explorer
- Summary
- Test your knowledge
- Further reading
Chapter 3: Common Use Cases and Architecture Patterns
- Reference architecture for batch ETL workloads
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for clickstream analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for interactive analytics and ML
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for real-time streaming analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for genomics data analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Reference architecture for log analytics
- Use case overview
- Reference architecture walkthrough
- Best practices to follow during implementation
- Summary
- Test your knowledge
- Further reading
Chapter 4: Big Data Applications and Notebooks Available in Amazon EMR
- Technical requirements
- Understanding popular big data applications in EMR
- Hive
- Presto
- Spark
- HBase
- Hue
- Ganglia
- Machine learning frameworks available in EMR
- TensorFlow
- MXNet
- Notebook options available in EMR
- EMR Notebooks
- JupyterHub
- EMR Studio
- Zeppelin
- Summary
- Test your knowledge
- Further reading
Section 2: Configuration, Scaling, Data Security, and Governance
Chapter 5: Setting Up and Configuring EMR Clusters
- Technical requirements
- Setting up and configuring clusters with the EMR console's quick create option
- Advanced configuration for cluster hardware and software
- Understanding the Software Configuration section
- Understanding Steps
- Understanding the Hardware Configuration section
- Understanding general configurations
- Understanding Security Options
- Working with AMIs and controlling cluster termination
- Working with AMIs
- Controlling the EMR cluster termination process
- Troubleshooting and logging in your EMR cluster
- Tools available to debug your EMR cluster
- Viewing and restarting cluster application processes
- Troubleshooting a failed cluster
- Troubleshooting a slow cluster
- Logging in your EMR cluster
- Summary
- Test your knowledge
- Further reading
Chapter 6: Monitoring, Scaling, and High Availability
- Technical requirements
- Monitoring your EMR cluster
- Monitoring clusters and applications with web user interfaces
- Monitoring cluster metrics with CloudWatch monitoring
- EMR API audit logging with AWS CloudTrail
- Scaling cluster resources
- Managed scaling in EMR
- Autoscaling in EMR with a custom policy for instance groups
- Manually resizing your EMR cluster
- Comparing managed scaling with autoscaling
- Cluster cloning and high availability with multiple master nodes
- High availability with multiple master nodes
- Cloning an existing EMR cluster
- Summary
- Test your knowledge
- Further reading
Chapter 7: Understanding Security in Amazon EMR
- Technical requirements
- Understanding the basics of security
- Creating security configurations
- Specifying a security configuration for your cluster
- AWS IAM integration with Amazon EMR
- Configuring an IAM service role for your EMR cluster
- Configuring IAM roles for EMRFS
- Integrating IAM roles in applications that invoke AWS services directly
- Allowing users and groups to create and modify roles
- Identity-based policies and best practices
- Understanding authentication to cluster nodes
- Understanding data protection in EMR
- Encrypting data at rest for EMRFS on Amazon S3 data
- Encrypting data in transit for EMRFS on Amazon S3 data
- Role of security groups and interface VPC endpoints
- Controlling cluster network traffic with security groups
- Connecting to Amazon EMR on an EC2 cluster using an interface VPC endpoint
- Connecting to Amazon EMR on an EKS cluster using an interface VPC endpoint
- Summary
- Test your knowledge
- Further reading
Chapter 8: Understanding Data Governance in Amazon EMR
- Technical requirements
- Understanding data catalog and access management options
- Using AWS Glue Data Catalog
- Integrating AWS Glue Data Catalog with Amazon EMR
- Permission management on top of a data catalog
- Understanding Amazon EMR integration with AWS Lake Formation
- Integrating Lake Formation with Amazon EMR
- Launching an EMR cluster with Lake Formation
- Setting up EMR notebooks to work with Lake Formation
- Understanding Amazon EMR integration with Apache Ranger
- Setting up Apache Ranger in EMR
- Understanding Apache Ranger plugins
- Summary
- Test your knowledge
- Further reading
Section 3: Implementing Common Use Cases and Best Practices
Chapter 9: Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
- Technical requirements
- Use case and architecture overview
- Architecture overview
- Implementation steps
- Creating Amazon S3 buckets
- Creating the AWS Lambda function
- Configuring an S3 file arrival event to trigger the Lambda function
- Triggering the EMR job
- Validating the output using Amazon Athena
- Defining a virtual Glue Data Catalog table on top of Amazon S3 data
- Querying output data using Amazon Athena standard SQL
- Spark ETL and Lambda function code walk-through
- Understanding the AWS Lambda function code
- Understanding the PySpark script integrated into the EMR step
- Summary
- Test your knowledge
- Further reading
Chapter 10: Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
- Technical requirements
- Use case and architecture overview
- Architecture overview
- Implementation steps
- Creating Amazon S3 buckets
- Creating the Amazon Kinesis data stream
- Creating and configuring the Kinesis Data Generator tool
- Creating an Amazon EMR cluster and configuring a Spark Streaming job
- Validating output using Amazon Athena
- Defining a virtual Glue Catalog table on top of Amazon S3 data
- Querying output data using a standard SQL query in Amazon Athena
- Spark Streaming code walk-through
- Summary
- Test your knowledge
- Further reading
Chapter 11: Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
- Technical requirements
- Apache Hudi overview
- Popular use cases
- Registering Hudi data with your Hive or Glue Data Catalog metastore
- Creating an EMR cluster and an EMR notebook
- Creating an EMR cluster
- Creating an EMR notebook
- Creating an Amazon S3 bucket
- Interactive development with Spark and Hudi
- Creating a PySpark notebook for development
- Integrating Hudi with our PySpark notebook
- Executing Spark and Hudi scripts in your notebook
- Summary
- Test your knowledge
- Further reading
Chapter 12: Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
- Technical requirements
- Overview of AWS Step Functions
- Integrating AWS Step Functions to orchestrate EMR jobs
- Overview of Apache Airflow and MWAA
- Integrating Airflow to trigger EMR jobs
- Summary
- Test your knowledge
- Further reading
Chapter 13: Migrating On-Premises Hadoop Workloads to Amazon EMR
- Understanding migration approaches
- Lift and shift
- Re-architecting
- Hybrid architecture
- Migrating data and metadata catalogs
- Migrating data
- Migrating metadata catalogs
- Migrating ETL jobs and Oozie workflows
- Migrating Oozie workflows
- Testing and validation
- Validating metadata quality
- Validating data quality
- Best practices for migration
- Summary
- Test your knowledge
- Further reading
Chapter 14: Best Practices and Cost-Optimization Techniques
- Best practices around EMR cluster configurations
- Choosing the correct cluster type (transient versus long-running)
- Best practices around sizing your cluster
- Optimization techniques for data processing and storage
- Best practices for cluster persistent storage
- Best practices while processing data using EMR
- Security best practices
- Configuring edge nodes outside of the cluster to limit connectivity
- Integrating logging, monitoring, and audit controls into your cluster
- Blocking public access to your EMR cluster
- Protecting your data at rest and in transit
- Cost-optimization techniques
- Cost savings with compute resources
- Cost savings with storage
- Integrating AWS Budgets and Cost Explorer
- AWS Trusted Advisor
- Cost allocation tags
- Limitations of Amazon EMR and possible workarounds
- Summary
- Test your knowledge
- Further reading
- Why subscribe?
- Other Books You May Enjoy
- Packt is searching for authors like you
- Share Your Thoughts
How to download source code?
1. Go to: https://github.com/PacktPublishing
2. In the Find a repository… box, search for the book title: Simplify Big Data Analytics with Amazon EMR: A beginner’s guide to learning and implementing Amazon EMR for building data analytics solutions. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. Click Code to download the source.