The Self-Service Data Roadmap: Democratize Data and Reduce Time to Insight
- Length: 286 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2020-09-29
- ISBN-10: 1492075256
- ISBN-13: 9781492075257
- Sales Rank: #760397
Data-driven insights are a key competitive advantage for any industry today, but deriving insights from raw data can still take days or weeks. Most organizations can’t scale data science teams fast enough to keep up with the growing amounts of data to transform. What’s the answer? Self-service data.
With this practical book, data engineers, data scientists, and team managers will learn how to build a self-service data science platform that helps anyone in your organization extract insights from data. Sandeep Uttamchandani provides a scorecard to track and address bottlenecks that slow down time to insight across data discovery, transformation, processing, and production. This book bridges the gap between data scientists bottlenecked by engineering realities and data engineers unclear about ways to make self-service work.
- Build a self-service portal to support data discovery, quality, lineage, and governance
- Select the best approach for each self-service capability using open source cloud technologies
- Tailor self-service for the people, processes, and technology maturity of your data platform
- Implement capabilities to democratize data and reduce time to insight
- Scale your self-service portal to support a large number of users within your organization
Preface: Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us
1. Introduction: Journey Map from Raw Data to Insights; Discover; Prep; Build; Operationalize; Defining Your Time-to-Insight Scorecard; Build Your Self-Service Data Roadmap
I. Self-Service Data Discovery
2. Metadata Catalog Service: Journey Map; Understanding Datasets; Analyzing Datasets; Knowledge Scaling; Minimizing Time to Interpret; Extracting Technical Metadata; Extracting Operational Metadata; Gathering Team Knowledge; Defining Requirements; Technical Metadata Extractor Requirements; Operational Metadata Requirements; Team Knowledge Aggregator Requirements; Implementation Patterns; Source-Specific Connectors Pattern; Lineage Correlation Pattern; Team Knowledge Pattern; Summary
3. Search Service: Journey Map; Determining Feasibility of the Business Problem; Selecting Relevant Datasets for Data Prep; Reusing Existing Artifacts for Prototyping; Minimizing Time to Find; Indexing Datasets and Artifacts; Ranking Results; Access Control; Defining Requirements; Indexer Requirements; Ranking Requirements; Access Control Requirements; Nonfunctional Requirements; Implementation Patterns; Push-Pull Indexer Pattern; Hybrid Search Ranking Pattern; Catalog Access Control Pattern; Summary
4. Feature Store Service: Journey Map; Finding Available Features; Training Set Generation; Feature Pipeline for Online Inference; Minimize Time to Featurize; Feature Computation; Feature Serving; Defining Requirements; Feature Computation; Feature Serving; Nonfunctional Requirements; Implementation Patterns; Hybrid Feature Computation Pattern; Feature Registry Pattern; Summary
5. Data Movement Service: Journey Map; Aggregating Data Across Sources; Moving Raw Data to Specialized Query Engines; Moving Processed Data to Serving Stores; Exploratory Analysis Across Sources; Minimizing Time to Data Availability; Data Ingestion Configuration and Change Management; Compliance; Data Quality Verification; Defining Requirements; Ingestion Requirements; Transformation Requirements; Compliance Requirements; Verification Requirements; Nonfunctional Requirements; Implementation Patterns; Batch Ingestion Pattern; Change Data Capture Ingestion Pattern; Event Aggregation Pattern; Summary
6. Clickstream Tracking Service: Journey Map; Minimizing Time to Click Metrics; Managing Instrumentation; Event Enrichment; Building Insights; Defining Requirements; Instrumentation Requirements Checklist; Enrichment Requirements Checklist; Implementation Patterns; Instrumentation Pattern; Rule-Based Enrichment Patterns; Consumption Patterns; Summary
II. Self-Service Data Prep
7. Data Lake Management Service: Journey Map; Primitive Life Cycle Management; Managing Data Updates; Managing Batching and Streaming Data Flows; Minimizing Time to Data Lake Management; Requirements; Implementation Patterns; Data Life Cycle Primitives Pattern; Transactional Pattern; Advanced Data Management Pattern; Summary
8. Data Wrangling Service: Journey Map; Minimizing Time to Wrangle; Defining Requirements; Curating Data; Operational Monitoring; Defining Requirements; Implementation Patterns; Exploratory Data Analysis Patterns; Analytical Transformation Patterns; Summary
9. Data Rights Governance Service: Journey Map; Executing Data Rights Requests; Discovery of Datasets; Model Retraining; Minimizing Time to Comply; Tracking the Customer Data Life Cycle; Executing Customer Data Rights Requests; Limiting Data Access; Defining Requirements; Current Pain Point Questionnaire; Interop Checklist; Functional Requirements; Nonfunctional Requirements; Implementation Patterns; Sensitive Data Discovery and Classification Pattern; Data Lake Deletion Pattern; Use Case–Dependent Access Control; Summary
III. Self-Service Build
10. Data Virtualization Service: Journey Map; Exploring Data Sources; Picking a Processing Cluster; Minimizing Time to Query; Picking the Execution Environment; Formulating Polyglot Queries; Joining Data Across Silos; Defining Requirements; Current Pain Point Analysis; Operational Requirements; Functional Requirements; Nonfunctional Requirements; Implementation Patterns; Automatic Query Routing Pattern; Unified Query Pattern; Federated Query Pattern; Summary
11. Data Transformation Service: Journey Map; Production Dashboard and ML Pipelines; Data-Driven Storytelling; Minimizing Time to Transform; Transformation Implementation; Transformation Execution; Transformation Operations; Defining Requirements; Current State Questionnaire; Functional Requirements; Nonfunctional Requirements; Implementation Patterns; Implementation Pattern; Execution Patterns; Summary
12. Model Training Service: Journey Map; Model Prototyping; Continuous Training; Model Debugging; Minimizing Time to Train; Training Orchestration; Tuning; Continuous Training; Defining Requirements; Training Orchestration; Tuning; Continuous Training; Nonfunctional Requirements; Implementation Patterns; Distributed Training Orchestrator Pattern; Automated Tuning Pattern; Data-Aware Continuous Training; Summary
13. Continuous Integration Service: Journey Map; Collaborating on an ML Pipeline; Integrating ETL Changes; Validating Schema Changes; Minimizing Time to Integrate; Experiment Tracking; Reproducible Deployment; Testing Validation; Defining Requirements; Experiment Tracking Module; Pipeline Packaging Module; Testing Automation Module; Implementation Patterns; Programmable Tracking Pattern; Reproducible Project Pattern; Summary
14. A/B Testing Service: Journey Map; Minimizing Time to A/B Test; Experiment Design; Execution at Scale; Experiment Optimization; Implementation Patterns; Experiment Specification Pattern; Metrics Definition Pattern; Automated Experiment Optimization; Summary
IV. Self-Service Operationalize
15. Query Optimization Service: Journey Map; Avoiding Cluster Clogs; Resolving Runtime Query Issues; Speeding Up Applications; Minimizing Time to Optimize; Aggregating Statistics; Analyzing Statistics; Optimizing Jobs; Defining Requirements; Current Pain Points Questionnaire; Interop Requirements; Functionality Requirements; Nonfunctional Requirements; Implementation Patterns; Avoidance Pattern; Operational Insights Pattern; Automated Tuning Pattern; Summary
16. Pipeline Orchestration Service: Journey Map; Invoke Exploratory Pipelines; Run SLA-Bound Pipelines; Minimizing Time to Orchestrate; Defining Job Dependencies; Distributed Execution; Production Monitoring; Defining Requirements; Current Pain Points Questionnaire; Operational Requirements; Functional Requirements; Nonfunctional Requirements; Implementation Patterns; Dependency Authoring Patterns; Orchestration Observability Patterns; Distributed Execution Pattern; Summary
17. Model Deploy Service: Journey Map; Model Deployment in Production; Model Maintenance and Upgrade; Minimizing Time to Deploy; Deployment Orchestration; Performance Scaling; Drift Monitoring; Defining Requirements; Orchestration; Model Scaling and Performance; Drift Verification; Nonfunctional Requirements; Implementation Patterns; Universal Deployment Pattern; Autoscaling Deployment Pattern; Model Drift Tracking Pattern; Summary
18. Quality Observability Service: Journey Map; Daily Data Quality Monitoring Reports; Debugging Quality Issues; Handling Low-Quality Data Records; Minimizing Time to Insight Quality; Verify the Accuracy of the Data; Detect Quality Anomalies; Prevent Data Quality Issues; Defining Requirements; Detection and Handling Data Quality Issues; Functional Requirements; Nonfunctional Requirements; Implementation Patterns; Accuracy Models Pattern; Profiling-Based Anomaly Detection Pattern; Avoidance Pattern; Summary
19. Cost Management Service: Journey Map; Monitoring Cost Usage; Continuous Cost Optimization; Minimizing Time to Optimize Cost; Expenditure Observability; Matching Supply and Demand; Continuous Cost Optimization; Defining Requirements; Pain Points Questionnaire; Functional Requirements; Nonfunctional Requirements; Implementation Patterns; Continuous Cost Monitoring Pattern; Automated Scaling Pattern; Cost Advisor Pattern; Summary
Index
How to download source code?
1. Go to: https://www.oreilly.com/
2. Search for the book title: The Self-Service Data Roadmap: Democratize Data and Reduce Time to Insight. If the full title returns no results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher resources section, click Download Example Code.