Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS

Length: 416 pages
Edition: 1
Language: English
Publisher: Sybex
Publication Date: 2023-05-02
ISBN-10: 1119909244
ISBN-13: 9781119909248
Sales Rank: #5942895

A comprehensive and accessible roadmap to performing data analytics in the AWS cloud

In Data Analytics in the AWS Cloud: Building a Data Platform for BI and Predictive Analytics on AWS, accomplished software engineer and data architect Joe Minichino delivers an expert blueprint for storing, processing, and analyzing data on the Amazon Web Services cloud platform. In the book, you’ll explore every relevant aspect of data analytics (from data engineering to analysis, business intelligence, DevOps, and MLOps) as you discover how to integrate machine learning predictions with analytics engines and visualization tools.

You’ll also find:

Real-world use cases of AWS architectures that demystify the applications of data analytics
Accessible introductions to data acquisition, importation, storage, visualization, and reporting
Expert insights into serverless data engineering and how to use it to reduce overhead and costs, improve stability, and simplify maintenance

A can’t-miss for data architects, analysts, engineers and technical professionals, Data Analytics in the AWS Cloud will also earn a place on the bookshelves of business leaders seeking a better understanding of data analytics on the AWS cloud platform.

Cover
Title Page
Copyright Page
About the Author
About the Technical Editor
Acknowledgments
Contents at a Glance
Contents
Introduction
	What Is a Data Lake?
		When You Do Not Need a Data Lake
		When Do You Need Analytics?
		When Do You Need a Data Lake for Analytics?
		How About an Analytics Team?
	The Data Platform
	The End of the Beginning
Chapter 1 AWS Data Lakes and Analytics Technology Overview
	Why AWS?
	What Does a Data Lake Look Like in AWS?
	Analytics on AWS
	Skills Required to Build and Maintain an AWS Analytics Pipeline
Chapter 2 The Path to Analytics: Setting Up a Data and Analytics Team
	The Data Vision
		Support
	DA Team Roles
		Early Stage Roles
			Team Lead
			Data Architect
			Data Engineer
			Data Analyst
		Maturity Stage Roles
			Data Scientist
			Cloud Engineer
			Business Intelligence (BI) Developer
			Machine Learning Engineer
			Business Analyst
		Niche Roles
	Analytics Flow at a Process Level
		Workflow Methodology
	The DA Team Mantra: “Automate Everything”
	Analytics Models in the Wild: Centralized, Distributed, Center of Excellence
		Centralized
		Distributed
		Center of Excellence
	Summary
Chapter 3 Working on AWS
	Accessing AWS
	Everything Is a Resource
		S3: An Important Exception
	IAM: Policies, Roles, and Users
		Policies
			Identity-Based Policies
			Resource-Based Policies
			Roles
			Users and User Groups
		Summarizing IAM
	Working with the Web Console
	The AWS Command-Line Interface
		Installing AWS CLI
			Linux Installation
			macOS Installation
			Windows
		Configuring AWS CLI
			A Note on Region
			Setting Individual Parameters
			Using Profiles and Configuration Files
			Final Notes on Configuration
		Using the AWS CLI
		Using Skeletons and File Inputs
		Cleaning Up!
	Infrastructure-as-Code: CloudFormation and Terraform
		CloudFormation
			CloudFormation Stacks
			CloudFormation Template Anatomy
			CloudFormation Changesets
			Getting Stack Information
			Cleaning Up Again
			CloudFormation Conclusions
		Terraform
			Coding Style
			Modularity
			Limitations
		Terraform vs. CloudFormation
		Infrastructure-as-Code: CDK, Pulumi, Cloudcraft, and Other Solutions
			AWS CDK
			Pulumi
			Cloudcraft
		Infrastructure Management Conclusions
Chapter 4 Serverless Computing and Data Engineering
	Serverless vs. Fully Managed
	AWS Serverless Technologies
		AWS Lambda
			Pricing Model
			Laser Focus on Code
			The Lambda Paradigm Shift
			Virtually Infinite Scalability
			Geographical Distribution
			A Lambda Hello World
			Lambda Configuration
			Runtime
			Container-Based Lambdas
			Architectures
			Memory
			Networking
			Execution Role
			Environment Variables
		AWS EventBridge
		AWS Fargate
		AWS DynamoDB
		AWS SNS
		Amazon SQS
		AWS CloudWatch
		Amazon QuickSight
		AWS Step Functions
		Amazon API Gateway
		Amazon Cognito
	AWS Serverless Application Model (SAM)
		Ephemeral Infrastructure
		AWS SAM Installation
		Configuration
		Creating Your First AWS SAM Project
		Application Structure
		SAM Resource Types
		SAM Lambda Template
		!! Recursive Lambda Invocation !!
		Function Metadata
		Outputs
		Implicitly Generated Resources
		Other Template Sections
		Lambda Code
		Building Your First SAM Application
		Testing the AWS SAM Application Locally
		Deployment
		Cleaning Up
	Summary
Chapter 5 Data Ingestion
	AWS Data Lake Architecture
		Serverless Data Lake Architecture Structure
			Ingestion
			Storage and Processing
			Cataloging, Governance, and Search
			Security and Monitoring
			Consumption
	Sample Processing Architecture: Cataloging Images into DynamoDB
		Use Case Description
		SAM Application Creation
			S3-Triggered Lambda
		Adding DynamoDB
		Lambda Execution Context
		Inserting into DynamoDB
		Cleaning Up
	Serverless Ingestion
		AWS Fargate
		AWS Lambda
		Example Architecture: Fargate-Based Periodic Batch Import
			The Basic Importer
			ECS CLI
			AWS Copilot CLI
			Clean Up
		AWS Kinesis Ingestion
			Example Architecture: Two-Pronged Delivery
	Fully Managed Ingestion with AppFlow
	Operational Data Ingestion with Database Migration Service
		DMS Concepts
			DMS Instance
			DMS Endpoints
			DMS Tasks
			Summary of the Workflow
		Common Use of DMS
		Example Architecture: DMS to S3
			DMS Instance
			DMS Endpoints
			DMS Task
	Summary
Chapter 6 Processing Data
	Phases of Data Preparation
		What Is ETL? Why Should I Care?
		ETL Job vs. Streaming Job
	Overview of ETL in AWS
		ETL with AWS Glue
		ETL with Lambda Functions
		ETL with Hadoop/EMR
		Other Ways to Perform ETL
	ETL Job Design Concepts
		Source Identification
		Destination Identification
		Mappings
		Validation
		Filter
		Join, Denormalization, Relationalization
	AWS Glue for ETL
		Really, It’s Just Spark
		Visual
		Spark Script Editor
		Python Shell Script Editor
		Jupyter Notebook
	Connectors
		Creating Connections
			Creating Connections with the Web Console
			Creating Connections with the AWS CLI
	Creating ETL Jobs with AWS Glue Visual Editor
		ETL Example: Format Switch from Raw (JSON) to Cleaned (Parquet)
		Job Bookmarks
		Transformations
			Apply Mapping
			Filter
			Other Available Transforms
			Run the Edited Job
		Visual Editor with Source and Target Conclusions
	Creating ETL Jobs with AWS Glue Visual Editor (without Source and Target)
	Creating ETL Jobs with the Spark Script Editor
	Developing ETL Jobs with AWS Glue Notebooks
		What Is a Notebook?
		Notebook Structure
		Step 1: Load Code into a DynamicFrame
		Step 2: Apply Field Mapping
		Step 3: Apply the Filter
		Step 4: Write to S3 in Parquet Format
		Example: Joining and Denormalizing Data from Two S3 Locations
		Conclusions for Manually Authored Jobs with Notebooks
	Creating ETL Jobs with AWS Glue Interactive Sessions
		It’s Magic
		Development Workflow
	Streaming Jobs
		Differences with a Standard ETL Job
		Streaming Sources
		Example: Process Kinesis Streams with a Streaming Job
		Streaming ETL Jobs Conclusions
	Summary
Chapter 7 Cataloging, Governance, and Search
	Cataloging with AWS Glue
		AWS Glue and the AWS Glue Data Catalog
		Glue Databases and Tables
			Databases
			The Idea of Schema-on-Read
			Tables
			Create Table Manually
			Creating a Table from an Existing Schema
			Creating a Table with a Crawler
			Summary on Databases and Tables
			Crawlers
			Updating or Not Updating?
			Running the Crawler
			Creating a Crawler from the AWS CLI
			Retrieving Table Information from the CLI
		Classifiers
			Classifier Example
		Crawlers and Classifiers Summary
	Search with Amazon Athena: The Heart of Analytics in AWS
		A Bit of History
		Interface Overview
		Creating Tables Manually
		Athena Data Types
			Complex Types
		Running a Query
		Connecting with JDBC and ODBC
		Query Stats
		Recent Queries and Saved Queries
		The Power of Partitions
			Athena Pricing Model
			Automatic Naming
		Athena Query Output
		Athena Peculiarities (SQL and Not)
			Computed Fields Gotcha and WITH Statement Workaround
			Lowercase!
			Query Explain
			Deduplicating Records
			Working with JSON, Flattening, and Unnesting
		Athena Views
		CREATE TABLE AS SELECT (CTAS)
		Saving Queries and Reusing Saved Queries
		Running Parameterized Queries
		Athena Federated Queries
			Athena Lambda Connectors
			Note on Connection Errors
		Performing Federated Queries
		Creating a View from a Federated Query
	Governing: Athena Workgroups, Lake Formation, and More
		Athena Workgroups
		Fine-Grained Athena Access with IAM
		Recap of Athena-Based Governance
	AWS Lake Formation
		Registering a Location in Lake Formation
		Creating a Database in Lake Formation
			Assigning Permissions in Lake Formation
		LF-Tags and Permissions in Lake Formation
		Data Filters
		Governance Conclusions
	Summary
Chapter 8 Data Consumption: BI, Visualization, and Reporting
	QuickSight
		Signing Up for QuickSight
			Standard Plan
			Enterprise Plan
		Users and User Groups
			Managing Users and Groups
		Managing QuickSight
			Users and Groups
			Your Subscriptions
			SPICE Capacity
			Account Settings
			Security and Permissions
			VPC Connections
			Mobile Settings
			Domains and Embedding
			Single Sign-On
		Data Sources and Datasets
			Creating an Athena Data Source
			Creating Other Data Sources
			Creating a Data Source from the AWS CLI
			Creating a Dataset from a Table
			Creating a Dataset from a SQL Query
			Duplicating Datasets
			Note on Creating Datasets
		QuickSight Favorites, Recent, and Folders
		SPICE
			Manage SPICE Capacity
			Refresh Schedule
		QuickSight Data Editor
			QuickSight Data Types
			Change Data Types
			Calculated Fields
			Joining Data
			Excluding Fields
			Filtering Data
			Removing Data
			Geospatial Hierarchies and Adding Fields to Hierarchies
			Unsupported Format Dates
		Visualizing Data: QuickSight Analysis
			Adding a Title and a Description to Your Analysis
			Renaming the Sheet
			Your First Visual with AutoGraph
			Field Wells
			Visual Types
			Saving and Autosaving
			A First Example: Pie Chart
			Renaming a Visual
			Filtering Data
			Adding Drill-Downs
			Parameters
			Actions
			Insights
			ML-Powered Insights
			Sharing an Analysis
		Dashboards
			Dashboard Layouts and Themes
			Publishing a Dashboard
			Embedding Visuals and Dashboards
	Data Consumption: Not Only Dashboards
	Summary
Chapter 9 Machine Learning at Scale
	Machine Learning and Artificial Intelligence
		What Are ML/AI Use Cases?
		Types of ML Models
		Overview of ML/AI AWS Solutions
	Amazon SageMaker
		SageMaker Domains
			Adding a User to the Domain
		SageMaker Studio
		SageMaker Example Notebook
			Step 1: Prerequisites and Preprocessing
			Step 2: Data Ingestion
			Step 3: Data Inspection
			Step 4: Data Conversion
			Step 5: Upload Training Data
			Step 6: Train the Model
			Step 7: Set Up Hosting and Deploy the Model
			Step 8: Validate the Model
			Step 9: Use the Model
		Inference
			Real Time
			Asynchronous
			Serverless
			Batch Transform
		Data Wrangler
		SageMaker Canvas
	Summary
Appendix Example Data Architectures in AWS
	Modern Data Lake Architecture
		ETL in a Lake House
		Consuming Data in the Lake House
		The Modern Data Lake Architecture
	Batch Processing
	Stream Processing
	Architecture Design Recommendations
		Automate Everything
		Build on Events
		Performance = Cost Savings
		AWS Glue Catalog and Athena-Centric Workflow
		Design Flexible
		Pick Your Battles
		Parquet
	Summary
Index
EULA