Reinforcement Learning: Industrial Applications of Intelligent Agents
- Length: 408 pages
- Edition: 1
- Language: English
- Publisher: O'Reilly Media
- Publication Date: 2020-12-01
- ISBN-10: 1098114833
- ISBN-13: 9781098114831
- Sales Rank: #159220
Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to use reinforcement learning to enable a machine to learn by itself.
Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You’ll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook: it doesn’t shy away from math, and it expects familiarity with ML.
- Learn what RL is and how the algorithms help solve problems
- Become grounded in RL fundamentals including Markov decision processes, dynamic programming, and temporal difference learning (see the short sketch after this list)
- Dive deep into a range of value and policy gradient methods
- Apply advanced RL solutions such as meta learning, hierarchical learning, multi-agent RL, and imitation learning
- Understand cutting-edge deep RL algorithms including Rainbow, PPO, TD3, SAC, and more
- Get practical examples through the accompanying website
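For a taste of the fundamentals listed above, here is a minimal, self-contained sketch of tabular Q-learning with ε-greedy exploration. It is not taken from the book or its accompanying website; the state/action space sizes and hyperparameters are placeholder assumptions for illustration only.

```python
import numpy as np

# Minimal tabular Q-learning sketch (not the book's code).
# Assumes a small discrete environment; n_states, n_actions,
# alpha, gamma, and epsilon are illustrative placeholders.
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def act(state: int) -> int:
    """Epsilon-greedy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state: int, action: int, reward: float,
           next_state: int, done: bool) -> None:
    """Temporal-difference (Q-learning) update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state, action] += alpha * (target - Q[state, action])
```

Chapters 2 and 3 of the book build up to this update and extend it with SARSA, double and delayed Q-learning, and n-step methods.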
Table of Contents

Preface: Objective · Who Should Read This Book? · Guiding Principles and Style · Prerequisites · Scope and Outline · Supplementary Materials · Conventions Used in This Book · Acronyms · Mathematical Notation · Fair Use Policy · O’Reilly Online Learning · How to Contact Us · Acknowledgments
1. Why Reinforcement Learning?: Why Now? · Machine Learning · Reinforcement Learning · When Should You Use RL? · RL Applications · Taxonomy of RL Approaches · Model-Free or Model-Based · How Agents Use and Update Their Strategy · Discrete or Continuous Actions · Optimization Methods · Policy Evaluation and Improvement · Fundamental Concepts in Reinforcement Learning · The First RL Algorithm · Value estimation · Prediction error · Weight update rule · Is RL the Same as ML? · Reward and Feedback · Delayed rewards · Hindsight · Reinforcement Learning as a Discipline · Summary · Further Reading
2. Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods: Multi-Arm Bandit Testing · Reward Engineering · Policy Evaluation: The Value Function · Policy Improvement: Choosing the Best Action · Simulating the Environment · Running the Experiment · Improving the ε-greedy Algorithm · Markov Decision Processes · Inventory Control · Transition table · Transition graph · Transition matrix · Inventory Control Simulation · Policies and Value Functions · Discounted Rewards · Predicting Rewards with the State-Value Function · Simulation using the state-value function · Predicting Rewards with the Action-Value Function · Optimal Policies · Monte Carlo Policy Generation · Value Iteration with Dynamic Programming · Implementing Value Iteration · Results of Value Iteration · Summary · Further Reading
3. Temporal-Difference Learning, Q-Learning, and n-Step Algorithms: Formulation of Temporal-Difference Learning · Q-Learning · SARSA · Q-Learning Versus SARSA · Case Study: Automatically Scaling Application Containers to Reduce Cost · Industrial Example: Real-Time Bidding in Advertising · Defining the MDP · Results of the Real-Time Bidding Environments · Further Improvements · Extensions to Q-Learning · Double Q-Learning · Delayed Q-Learning · Comparing Standard, Double, and Delayed Q-Learning · Opposition Learning · n-Step Algorithms · n-Step Algorithms on Grid Environments · Eligibility Traces · Extensions to Eligibility Traces · Watkins’s Q(λ) · Fuzzy Wipes in Watkins’s Q(λ) · Speedy Q-Learning · Accumulating Versus Replacing Eligibility Traces · Summary · Further Reading
4. Deep Q-Networks: Deep Learning Architectures · Fundamentals · Common Neural Network Architectures · Deep Learning Frameworks · Deep Reinforcement Learning · Deep Q-Learning · Experience Replay · Q-Network Clones · Neural Network Architecture · Implementing DQN · Example: DQN on the CartPole Environment · Why train online? · Which is better? DQN versus Q-learning · Case Study: Reducing Energy Usage in Buildings · Rainbow DQN · Distributional RL · Prioritized Experience Replay · Noisy Nets · Dueling Networks · Example: Rainbow DQN on Atari Games · Results · Discussion · Other DQN Improvements · Improving Exploration · Improving Rewards · Learning from Offline Data · Summary · Further Reading
5. Policy Gradient Methods: Benefits of Learning a Policy Directly · How to Calculate the Gradient of a Policy · Policy Gradient Theorem · Policy Functions · Linear Policies · Logistic policy · Softmax policy · Arbitrary Policies · Basic Implementations · Monte Carlo (REINFORCE) · Example: REINFORCE on the CartPole environment · REINFORCE with Baseline · Example: REINFORCE with baseline on the CartPole environment · Gradient Variance Reduction · n-Step Actor-Critic and Advantage Actor-Critic (A2C) · Example: n-step actor-critic on the CartPole environment · State-value learning decay rates versus policy decay rates · Eligibility Traces Actor-Critic · Example: Eligibility trace actor-critic on the CartPole environment · A Comparison of Basic Policy Gradient Algorithms · Industrial Example: Automatically Purchasing Products for Customers · The Environment: Gym-Shopping-Cart · Expectations · Results from the Shopping Cart Environment · Summary · Further Reading
6. Beyond Policy Gradients: Off-Policy Algorithms · Importance Sampling · Behavior and Target Policies · Off-Policy Q-Learning · Gradient Temporal-Difference Learning · Greedy-GQ · Off-Policy Actor-Critics · Deterministic Policy Gradients · Deep Deterministic Policy Gradients · DDPG derivation · DDPG implementation · Twin Delayed DDPG · Delayed policy updates (DPU) · Clipped double Q-learning (CDQ) · Target policy smoothing (TPS) · TD3 implementation · Case Study: Recommendations Using Reviews · Improvements to DPG · Trust Region Methods · Kullback–Leibler Divergence · KL divergence experiments · Natural Policy Gradients and Trust Region Policy Optimization · Proximal Policy Optimization · PPO’s clipped objective · PPO’s value function and exploration objectives · Example: Using Servos for a Real-Life Reacher · Experiment Setup · RL Algorithm Implementation · Increasing the Complexity of the Algorithm · Hyperparameter Tuning in a Simulation · Resulting Policies · Other Policy Gradient Algorithms · Retrace(λ) · Actor-Critic with Experience Replay (ACER) · Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR) · Emphatic Methods · Extensions to Policy Gradient Algorithms · Quantile Regression in Policy Gradient Algorithms · Summary · Which Algorithm Should I Use? · A Note on Asynchronous Methods · Further Reading
7. Learning All Possible Policies with Entropy Methods: What Is Entropy? · Maximum Entropy Reinforcement Learning · Soft Actor-Critic · SAC Implementation Details and Discrete Action Spaces · Automatically Adjusting Temperature · Case Study: Automated Traffic Management to Reduce Queuing · Extensions to Maximum Entropy Methods · Other Measures of Entropy (and Ensembles) · Optimistic Exploration Using the Upper Bound of Double Q-Learning · Tinkering with Experience Replay · Soft Policy Gradient · Soft Q-Learning (and Derivatives) · Path Consistency Learning · Performance Comparison: SAC Versus PPO · How Does Entropy Encourage Exploration? · How Does the Temperature Parameter Alter Exploration? · Industrial Example: Learning to Drive with a Remote Control Car · Description of the Problem · Minimizing Training Time · Dramatic Actions · Hyperparameter Search · Final Policy · Further Improvements · Summary · Equivalence Between Policy Gradients and Soft Q-Learning · What Does This Mean For the Future? · What Does This Mean Now?
8. Improving How an Agent Learns: Rethinking the MDP · Partially Observable Markov Decision Process · Predicting the belief state · Case Study: Using POMDPs in Autonomous Vehicles · Contextual Markov Decision Processes · MDPs with Changing Actions · Regularized MDPs · Hierarchical Reinforcement Learning · Naive HRL · High-Low Hierarchies with Intrinsic Rewards (HIRO) · Learning Skills and Unsupervised RL · Using Skills in HRL · HRL Conclusions · Multi-Agent Reinforcement Learning · MARL Frameworks · Centralized or Decentralized · Single-Agent Algorithms · Case Study: Using Single-Agent Decentralized Learning in UAVs · Centralized Learning, Decentralized Execution · Decentralized Learning · Other Combinations · Challenges of MARL · MARL Conclusions · Expert Guidance · Behavior Cloning · Imitation RL · Inverse RL · Curriculum Learning · Other Paradigms · Meta-Learning · Transfer Learning · Summary · Further Reading
9. Practical Reinforcement Learning: The RL Project Life Cycle · Life Cycle Definition · Data science life cycle · Reinforcement learning life cycle · Problem Definition: What Is an RL Project? · RL Problems Are Sequential · RL Problems Are Strategic · Low-Level RL Indicators · An entity · An environment · A state · An action · Quantify success or failure · Types of Learning · Online learning · Offline or batch learning · Concurrent learning · Reset-free learning · RL Engineering and Refinement Process · Environment Engineering · Implementation · Simulation · Interacting with real life · State Engineering or State Representation Learning · Learning forward models · Constraints · Transformation (dimensionality reduction, autoencoders, and world models) · Policy Engineering · Discrete states · Continuous states · Converting to discrete states · Mixed state spaces · Mapping Policies to Action Spaces · Binary actions · Continuous actions · Hybrid action spaces · When to perform actions · Massive action spaces · Exploration · Is intrinsic motivation exploration? · Visitation counts (sampling) · Information gain (surprise) · State prediction (curiosity or self-reflection) · Curious challenges · Random embeddings (random distillation networks) · Distance to novelty (episodic curiosity) · Exploration conclusions · Reward Engineering · Reward engineering guidelines · Reward shaping · Common rewards · Reward conclusions · Summary · Further Reading
10. Operational Reinforcement Learning: Implementation · Frameworks · RL frameworks · Other frameworks · Scaling RL · Distributed training (Gorila) · Single-machine training (A3C, PAAC) · Distributed replay (Ape-X) · Synchronous distribution (DD-PPO) · Improving utilization (IMPALA, SEED) · Scaling conclusions · Evaluation · Policy performance measures · Statistical policy comparisons · Algorithm performance measures · Problem-specific performance measures · Explainability · Evaluation conclusions · Deployment · Goals · Goals during different phases of development · Best practices · Hierarchy of needs · Architecture · Ancillary Tooling · Build versus buy · Monitoring · Logging and tracing · Continuous integration and continuous delivery · Experiment tracking · Hyperparameter tuning · Deploying multiple agents · Deploying policies · Safety, Security, and Ethics · Safe RL · Secure RL · Ethical RL · Summary · Further Reading
11. Conclusions and the Future: Tips and Tricks · Framing the Problem · Your Data · Training · Evaluation · Deployment · Debugging · ${ALGORITHM_NAME} Can’t Solve ${ENVIRONMENT}! · Monitoring for Debugging · The Future of Reinforcement Learning · RL Market Opportunities · Future RL and Research Directions · Research in industry · Research in academia · Ethical standards · Concluding Remarks · Next Steps · Now It’s Your Turn · Further Reading
A. The Gradient of a Logistic Policy for Two Actions
B. The Gradient of a Softmax Policy
Glossary: Acronyms and Common Terms · Symbols and Notation
Index
How do I download the source code?
1. Go to https://www.oreilly.com/
2. Search for the book title: Reinforcement Learning: Industrial Applications of Intelligent Agents. If you don't get any results, search for the main title only.
3. Click the book title in the search results.
4. In the Publisher Resources section, click Download Example Code.
To use the download links on this site:
1. Disable the AdBlock plugin; otherwise, you may not see any links.
2. Solve the CAPTCHA.
3. Click the download link.
4. You will be redirected to the download server to complete the download.