Reinforcement Learning Papers
Papers with Code RL section: Provides access to research papers along with the corresponding code.
Key Papers
- Q-learning (1992): Introduces the Q-learning algorithm, one of the fundamental algorithms in RL.
- Policy invariance under reward transformations: Theory and application to reward shaping (1999): Discusses the invariance of policies under reward transformations and the concept of reward shaping.
- Learning to Predict by the Methods of Temporal Differences (1988): Introduces temporal difference (TD) learning, a model-free method for learning value functions in RL.
- Actor-Critic Algorithms (2003): Introduces the actor-critic architecture, a model-free approach that combines a learned value function (the critic) with a learned policy (the actor).
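As a concrete anchor for the papers above, here is a minimal sketch of the tabular Q-learning update rule (states and actions are just array indices; all values are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning update: move Q(s, a) toward the bootstrapped target."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Tiny 2-state, 2-action illustration: one reward of 1.0 raises Q(0, 1) by alpha.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

The `max` over next-state actions is what makes Q-learning off-policy: it learns about the greedy policy regardless of how actions were actually chosen.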
Deep Reinforcement Learning
Model-Free RL
Deep Q-Learning
- Playing Atari with Deep Reinforcement Learning (2013): Presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.
- Deep Recurrent Q-Learning for Partially Observable MDPs (2015): Proposes a deep recurrent Q-learning algorithm for partially observable Markov decision processes.
- Dueling Network Architectures for Deep Reinforcement Learning (2015): Introduces a dueling network architecture that separates the estimation of state values and state-dependent action advantages.
- Deep Reinforcement Learning with Double Q-learning (2015): Proposes a double Q-learning algorithm for deep reinforcement learning that reduces overestimation of action values.
- Prioritized Experience Replay (2015): Introduces a prioritized experience replay mechanism that replays transitions with large TD errors more often, improving sample efficiency and learning speed.
- Rainbow: Combining Improvements in Deep Reinforcement Learning (2017): Combines several improvements to deep reinforcement learning, including dueling networks, double Q-learning, and prioritized experience replay, to achieve state-of-the-art performance on Atari games.
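The double Q-learning idea above reduces to a one-line change in how the bootstrap target is built. A sketch with arrays standing in for network outputs (names are illustrative):

```python
import numpy as np

def dqn_target(r, q_target_next, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates
    the next action, which is prone to overestimation."""
    return r + gamma * np.max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online network selects the action, the target
    network evaluates it, decoupling selection from evaluation."""
    a = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a]

q_online = np.array([1.0, 2.0])   # online net prefers action 1
q_target = np.array([5.0, 0.0])   # target net overestimates action 0
```

When the two networks disagree, the double-Q target avoids taking the max over a noisy estimate, which is exactly the overestimation mechanism the paper analyzes.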
Policy Gradients
- Asynchronous Methods for Deep Reinforcement Learning (2016): Proposes asynchronous parallel actor-learners (including A3C) that stabilize training without a replay buffer and reduce wall-clock training time.
- Trust Region Policy Optimization (2015): Introduces a trust-region method for policy optimization that improves stability and sample efficiency.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation (2015): Proposes generalized advantage estimation (GAE), which trades off bias and variance in advantage estimates for continuous control tasks.
- Proximal Policy Optimization Algorithms (2017): Introduces a family of proximal policy optimization algorithms that retain much of TRPO's stability with a simpler first-order objective.
- Emergence of Locomotion Behaviours in Rich Environments (2017): Demonstrates the emergence of diverse locomotion behaviors in simulated environments using deep reinforcement learning.
- Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (2017): Proposes ACKTR, a scalable trust-region method that uses a Kronecker-factored approximation of the curvature to improve sample efficiency.
- Sample Efficient Actor-Critic with Experience Replay (2016): Introduces ACER, an actor-critic algorithm with experience replay and off-policy corrections that improves sample efficiency.
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (2018): Proposes a soft actor-critic algorithm that maximizes entropy alongside reward, improving exploration and robustness.
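The clipped surrogate at the heart of PPO is small enough to show directly. A minimal per-sample sketch (the full algorithm also includes value and entropy terms, omitted here):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective (maximized; negate for a loss).
    ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

Taking the minimum means the objective never rewards pushing the ratio further outside the clip range, which is what keeps updates close to the old policy without an explicit trust-region constraint.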
Deterministic Policy Gradients
- Deterministic Policy Gradient Algorithms (2014): Introduces deterministic policy gradients, which enable efficient off-policy policy learning in continuous action spaces.
- Continuous Control With Deep Reinforcement Learning (2015): Introduces DDPG and demonstrates the effectiveness of deep reinforcement learning for continuous control tasks.
- Addressing Function Approximation Error in Actor-Critic Methods (2018): Introduces TD3, which curbs overestimation in actor-critic methods with clipped double-Q learning, delayed policy updates, and target policy smoothing.
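Two of the mechanisms these papers rely on fit in a few lines: TD3's clipped double-Q target and the Polyak-averaged target networks shared by DDPG and TD3. A sketch (scalar Q-values and plain lists stand in for networks):

```python
def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target from TD3: take the minimum of two critics
    to curb the overestimation the paper analyzes."""
    q_next = min(q1_next, q2_next)
    return r + (0.0 if done else gamma * q_next)

def polyak_update(target, online, tau=0.005):
    """Soft target-network update used by DDPG/TD3: the target parameters
    slowly track the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]
```

The small `tau` makes the bootstrap target a slowly moving average, which trades a little bias for much more stable learning.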
Distributional RL
- A Distributional Perspective on Reinforcement Learning (2017): Argues for learning the full distribution of returns rather than only its expectation, and introduces the C51 algorithm.
- Distributional Reinforcement Learning with Quantile Regression (2017): Proposes QR-DQN, a distributional reinforcement learning algorithm that uses quantile regression to estimate value distributions.
- Implicit Quantile Networks for Distributional Reinforcement Learning (2018): Introduces implicit quantile networks, which learn the full quantile function of the return distribution.
- Dopamine: A Research Framework for Deep Reinforcement Learning (2018) (code): Provides a research framework for deep reinforcement learning that includes a suite of environments and baselines.
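The quantile-regression loss used by QR-DQN is compact enough to sketch. This is a simplified per-element version (the full algorithm applies it pairwise between predicted and target quantiles):

```python
import numpy as np

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    """Quantile-regression loss from QR-DQN: a Huber loss with asymmetric
    weights that push each estimate toward its target quantile tau."""
    u = np.abs(td_errors)
    huber = np.where(u <= kappa, 0.5 * u ** 2, kappa * (u - 0.5 * kappa))
    asym = np.abs(taus - (td_errors < 0).astype(float))
    return float(np.mean(asym * huber / kappa))
```

The asymmetry is the point: for a high quantile (tau near 1), positive errors are weighted more than negative ones, so the estimate settles at that quantile of the return distribution rather than its mean.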
Policy Gradients with Action-Dependent Baselines
- Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (2016): Proposes a policy gradient algorithm that uses an off-policy critic as a control variate to improve sample efficiency.
- Action-dependent Control Variates for Policy Optimization via Stein's Identity (2017): Proposes a control variate method for policy optimization that improves sample efficiency and stability.
- The Mirage of Action-Dependent Baselines in Reinforcement Learning (2018): Critiques the use of action-dependent baselines in reinforcement learning and analyzes where the reported variance reduction actually comes from.
Path-Consistency Learning
- Bridging the Gap Between Value and Policy Based Reinforcement Learning (2017): Proposes a method for bridging the gap between value-based and policy-based reinforcement learning.
- Trust-PCL: An Off-Policy Trust Region Method for Continuous Control (2017): Introduces an off-policy trust region method for continuous control in reinforcement learning that improves sample efficiency and stability.
Other Directions for Combining Policy-Learning & Q-Learning
- Combining Policy Gradient and Q-learning (2016): Combines policy gradient and Q-learning methods (PGQL) to improve sample efficiency and stability.
- The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning (2017): Introduces the Reactor, an actor-critic agent that combines distributional learning with off-policy corrections for fast, sample-efficient training.
- Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning (2017): Proposes an interpolated policy gradient algorithm that combines on-policy and off-policy gradient estimation.
- Equivalence Between Policy Gradients and Soft Q-Learning (2017): Shows the equivalence between policy gradients and soft Q-learning in entropy-regularized reinforcement learning.
Evolutionary Algorithms
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017): Explores the use of evolution strategies, a class of black-box optimization algorithms, as an alternative to popular reinforcement learning techniques.
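The core ES estimator in that paper is just a smoothed finite-difference gradient. A minimal sketch (a quadratic stands in for an RL return; the real method adds antithetic sampling, rank normalization, and massive parallelism):

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n=1000, seed=0):
    """Evolution-strategies gradient estimate: perturb parameters with Gaussian
    noise and weight each noise vector by the return it produced."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(n, theta.size))
    returns = np.array([f(theta + sigma * e) for e in eps])
    return (returns[:, None] * eps).mean(axis=0) / sigma

# On f(x) = -x^2 the true gradient at x = 1 is -2; the estimate should be close.
g = es_gradient(lambda x: -float(x @ x), np.array([1.0]))
```

Because only returns (not per-step gradients) are needed, each perturbation can be evaluated by an independent worker, which is what makes the approach so easy to scale.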
Exploration
Intrinsic Motivation
- VIME: Variational Information Maximizing Exploration (2016): Proposes an exploration method that rewards information gain about the agent's belief over the environment dynamics.
- Unifying Count-Based Exploration and Intrinsic Motivation (2016): Unifies count-based exploration and intrinsic motivation through pseudo-counts derived from density models.
- Count-Based Exploration with Neural Density Models (2017): Extends pseudo-count exploration with neural density models (PixelCNN) to improve exploration.
- #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning (2016): Studies count-based exploration with hashed state counts for deep reinforcement learning.
- EX2: Exploration with Exemplar Models for Deep Reinforcement Learning (2017): Proposes an exploration method that estimates novelty with discriminatively trained exemplar models.
- Curiosity-driven Exploration by Self-supervised Prediction (2017): Proposes a curiosity-driven exploration method that uses the prediction error of a self-supervised forward model as an intrinsic reward.
- Large-Scale Study of Curiosity-Driven Learning (2018): Conducts a large-scale study of curiosity-driven learning across many environments.
- Exploration by Random Network Distillation (2018): Proposes an exploration bonus based on the prediction error of a network trained to match a fixed, randomly initialized network.
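The random network distillation bonus is particularly simple to sketch. Here linear maps stand in for the target and predictor networks (the real method uses convolutional networks and normalizes observations and rewards):

```python
import numpy as np

rng = np.random.default_rng(0)
W_target = rng.normal(size=(4, 8))     # fixed, randomly initialized "target net"
W_predictor = np.zeros((4, 8))         # trained online to match the target

def rnd_bonus(obs, W_predictor):
    """RND intrinsic reward: squared prediction error against a frozen random
    network. Frequently visited states become predictable, so the bonus decays."""
    err = obs @ W_target - obs @ W_predictor
    return float(err @ err)

obs = np.ones(4)
```

Once the predictor has fit a state's target features, the bonus vanishes there, so the agent is continually pushed toward states it has not yet learned to predict.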
Unsupervised RL
- Variational Intrinsic Control (2016): Proposes learning a set of intrinsic options by maximizing the mutual information between options and the states they reach.
- Diversity is All You Need: Learning Skills without a Reward Function (2018): Proposes learning diverse skills without a reward function by maximizing the mutual information between skills and the states they visit.
- Variational Option Discovery Algorithms (2018): Unifies and extends variational option discovery algorithms for learning skills without external reward.
Transfer and Multitask RL
- Progressive Neural Networks (2016): Proposes an architecture that transfers features across tasks via lateral connections while avoiding catastrophic forgetting.
- Universal Value Function Approximators (2015): Proposes value functions that generalize over both states and goals.
- The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously (2017): Proposes a method for learning to solve multiple continuous control tasks simultaneously.
- PathNet: Evolution Channels Gradient Descent in Super Neural Networks (2017): Proposes combining evolution and gradient descent to reuse network components across tasks.
- Mutual Alignment Transfer Learning (2017): Proposes a mutual alignment method for transferring policies between simulation and the real world.
- Learning an Embedding Space for Transferable Robot Skills (2018): Proposes learning an embedding space for transferable robot skills.
- Hindsight Experience Replay (2017): Proposes relabeling transitions with goals that were actually achieved, enabling learning from failures in sparse-reward, goal-conditioned tasks.
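Hindsight experience replay reduces to a relabeling step over stored transitions. A minimal sketch with integer states standing in for observations and goals (the real method samples achieved goals with several strategies; this shows only the "final" strategy):

```python
def her_relabel(episode, achieved_goal):
    """Hindsight relabeling: rewrite an episode's transitions as if the goal
    actually reached had been the intended goal, turning a failed episode
    into a successful one under sparse 0/-1 rewards."""
    relabeled = []
    for obs, action, _, next_obs, _ in episode:
        reward = 0.0 if next_obs == achieved_goal else -1.0
        relabeled.append((obs, action, reward, next_obs, achieved_goal))
    return relabeled

# Original goal 9 was never reached; the episode ended in state 2.
episode = [(0, 1, -1.0, 1, 9), (1, 1, -1.0, 2, 9)]
relabeled = her_relabel(episode, achieved_goal=2)
```

An off-policy learner trained on both the original and relabeled transitions receives a learning signal even when the intended goal is essentially never reached by chance.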
Hierarchy
- Strategic Attentive Writer for Learning Macro-Actions (2016): Proposes an architecture that learns temporally extended macro-actions end to end.
- FeUdal Networks for Hierarchical Reinforcement Learning (2017): Proposes a feudal manager/worker architecture in which a manager sets goals that a worker learns to achieve.
- Data-Efficient Hierarchical Reinforcement Learning (2018): Proposes HIRO, an off-policy hierarchical method with goal relabeling for data efficiency.
Memory
- Model-Free Episodic Control (2016): Proposes storing and reusing high-return episodic memories to act without learned value functions.
- Neural Episodic Control (2017): Proposes a differentiable episodic memory that enables rapid learning from few experiences.
- Neural Map: Structured Memory for Deep Reinforcement Learning (2017): Proposes a spatially structured memory architecture for agents navigating 3D environments.
- Unsupervised Predictive Memory in a Goal-Directed Agent (2018): Proposes MERLIN, an agent with unsupervised predictive memory for goal-directed behavior under partial observability.
- Relational Recurrent Neural Networks (2018): Proposes a relational memory core that lets recurrent networks relate stored memories to one another.
Model-Based RL
- Imagination-Augmented Agents for Deep Reinforcement Learning (2017): Proposes agents that use rollouts from a learned model as additional context for a model-free policy.
- Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning (2017): Proposes learning neural network dynamics models for model-based control, then fine-tuning with model-free methods.
- Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning (2018): Proposes using short rollouts of a learned model to improve value targets for model-free learning.
- Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion (2018): Extends model-based value expansion with ensembles that weight rollout horizons by model uncertainty.
- Model-Ensemble Trust-Region Policy Optimization (2018): Proposes training a policy with trust-region updates inside an ensemble of learned dynamics models.
- Model-Based Reinforcement Learning via Meta-Policy Optimization (2018): Proposes meta-learning a policy over an ensemble of learned models so it remains robust to model errors.
- Recurrent World Models Facilitate Policy Evolution (2018): Trains a recurrent world model of the environment and evolves compact policies inside the learned model.
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (2017): Presents AlphaZero, a general self-play reinforcement learning algorithm that achieves superhuman performance in chess and shogi.
- Thinking Fast and Slow with Deep Learning and Tree Search (2017): Proposes expert iteration, which combines a fast neural policy with slower tree-search planning that improves it.
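The value-expansion idea shared by several of these papers can be sketched in a few lines: unroll a learned model for a short horizon to lengthen the TD target, then bootstrap. The toy model, reward, and value functions below are placeholders:

```python
def mve_target(r0, s1, model, reward_fn, value_fn, horizon=2, gamma=0.5):
    """Model-based value expansion (sketch): roll a learned dynamics model
    forward a few steps to build a longer TD target, then bootstrap with
    the value function at the final imagined state."""
    target, discount, s = r0, gamma, s1
    for _ in range(horizon):
        s_next = model(s)
        target += discount * reward_fn(s, s_next)
        discount *= gamma
        s = s_next
    return target + discount * value_fn(s)

# Toy chain: deterministic model, reward 1 per step, zero terminal value.
t = mve_target(1.0, 0, model=lambda s: s + 1,
               reward_fn=lambda s, sn: 1.0, value_fn=lambda s: 0.0)
```

The horizon controls a bias/variance trade-off: longer rollouts lean more on the model (and its errors), shorter ones lean more on the learned value function.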
Meta-RL
- RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (2016): Trains a recurrent policy across many tasks so that a slow outer RL loop produces a fast inner learning algorithm encoded in the network's activations.
- Learning to Reinforcement Learn (2016): Shows that a recurrent agent trained across related tasks learns to adapt like a reinforcement learning algorithm at test time.
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017): Proposes MAML, which learns an initialization from which a few gradient steps adapt a network to a new task.
- A Simple Neural Attentive Meta-Learner (2018): Proposes SNAIL, a meta-learner that combines temporal convolutions with attention.
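The MAML inner/outer structure can be shown on a toy problem where the gradients are analytic. Here each task is a scalar quadratic loss, a deliberately simplified stand-in for a task's RL objective:

```python
def maml_outer_step(theta, task_targets, alpha=0.1, beta=0.01):
    """MAML (sketch) on scalar quadratic tasks: loss_t(x) = 0.5 * (x - target)^2.
    Inner loop: one gradient step per task. Outer loop: differentiate the
    post-adaptation loss through the inner step (done analytically here)."""
    outer_grad = 0.0
    for target in task_targets:
        theta_adapted = theta - alpha * (theta - target)        # inner update
        outer_grad += (theta_adapted - target) * (1.0 - alpha)  # chain rule
    return theta - beta * outer_grad / len(task_targets)

theta_new = maml_outer_step(0.5, [-1.0, 1.0])
```

With tasks at -1 and +1, the meta-gradient pulls the initialization toward 0, the point from which one inner step makes the most progress on either task.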
Scaling RL
- Accelerated Methods for Deep Reinforcement Learning (2018): Proposes accelerated, highly parallel implementations of deep reinforcement learning algorithms.
- IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures (2018): Proposes an importance-weighted actor-learner architecture (with the V-trace off-policy correction) for scalable distributed deep reinforcement learning.
- Distributed Prioritized Experience Replay (2018): Proposes Ape-X, which decouples acting from learning and shares a prioritized replay buffer across many actors.
- Recurrent Experience Replay in Distributed Reinforcement Learning (2018): Proposes R2D2, which adapts recurrent networks and experience replay to the distributed setting.
- RLlib: Abstractions for Distributed Reinforcement Learning (2017): Proposes RLlib, a library of composable abstractions for distributed reinforcement learning (docs).
RL in the Real World
- Benchmarking Reinforcement Learning Algorithms on Real-World Robots (2018): Conducts a benchmarking study of reinforcement learning algorithms on real-world robots.
- Learning Dexterous In-Hand Manipulation (2018): Trains dexterous in-hand manipulation in simulation with domain randomization and transfers the policy to a physical robot hand.
- QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation (2018): Proposes a scalable deep reinforcement learning method for vision-based robotic grasping, trained on real-robot data at scale.
- Horizon: Facebook's Open Source Applied Reinforcement Learning Platform (2018): Introduces Horizon, Facebook's open-source applied reinforcement learning platform.
Safety
- Concrete Problems in AI Safety (2016): Discusses concrete problems in AI safety, several of which are framed in terms of reinforcement learning.
- Constrained Policy Optimization (2017): Proposes a trust-region method for policy optimization that satisfies safety constraints throughout training.
- Safe Exploration in Continuous Action Spaces (2018): Proposes a safety-layer approach that corrects actions to keep exploration safe in continuous action spaces.
- Trial without Error: Towards Safe Reinforcement Learning via Human Intervention (2017): Proposes using human oversight, and a learned imitator of it, to block catastrophic actions during training.
- Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning (2017): Proposes jointly learning a reset policy so the agent avoids irreversible states and can train autonomously.
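For context on constrained RL, a common baseline (distinct from CPO's trust-region approach) is a Lagrangian relaxation with a learned penalty weight. A sketch of the dual-variable update, with all names illustrative:

```python
def dual_update(lmbda, avg_cost, cost_limit, lr=0.1):
    """Dual-variable step from a Lagrangian relaxation of constrained RL:
    raise the cost-penalty weight when the policy's estimated cost exceeds
    the limit, and project it back to be non-negative otherwise."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))
```

The policy is then trained on reward minus `lmbda` times cost, so the penalty automatically tightens while the constraint is violated and relaxes once it is satisfied.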
Imitation Learning and Inverse Reinforcement Learning
- Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy (2010): Proposes the principle of maximum causal entropy, a foundation for modern inverse reinforcement learning.
- Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization (2016): Proposes a guided cost learning method for deep inverse optimal control via policy optimization.
- Generative Adversarial Imitation Learning (2016): Proposes GAIL, which casts imitation learning as an adversarial game between a policy and a discriminator.
- DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills (2018): Combines motion-capture reference clips with reinforcement learning to produce physics-based character skills.
- Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow (2018): Proposes a variational discriminator bottleneck that stabilizes adversarial learning by constraining information flow through the discriminator.
- One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL (2018): Proposes a one-shot, high-fidelity imitation method that trains large-scale deep networks with reinforcement learning.
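In the adversarial imitation setting above, the policy's reward comes from the discriminator. A sketch of one commonly used reward form (the discriminator itself is assumed trained elsewhere; the logit here is just a number):

```python
import numpy as np

def gail_reward(d_logit):
    """GAIL-style imitation reward, one common form: r = -log(1 - D(s, a)),
    where D = sigmoid(discriminator logit). The reward grows as the
    discriminator judges the state-action pair more expert-like."""
    d = 1.0 / (1.0 + np.exp(-d_logit))
    return float(-np.log(1.0 - d + 1e-8))
```

The policy maximizes this reward with an ordinary RL algorithm while the discriminator is retrained to tell policy samples from expert demonstrations, forming the adversarial loop.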
Reproducibility, Analysis, and Critique
- Benchmarking Deep Reinforcement Learning for Continuous Control (2016): Conducts a benchmarking study of deep reinforcement learning algorithms for continuous control.
- Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control (2017): Conducts a reproducibility study of benchmarked deep reinforcement learning tasks for continuous control.
- Deep Reinforcement Learning that Matters (2017): Shows that reported deep reinforcement learning results are highly sensitive to hyperparameters, random seeds, and implementation details, and argues for stronger experimental practices.
- Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods (2018): Conducts an empirical analysis of gradient descent optimization in policy gradient methods.
- Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? (2018): Empirically examines whether the behavior of deep policy gradient methods matches the conceptual framework that motivates them.
- Simple Random Search Provides a Competitive Approach to Reinforcement Learning (2018): Shows that simple random search over linear policies is competitive on standard continuous-control benchmarks.
- Benchmarking Model-Based Reinforcement Learning (2019): Provides a benchmarking library of model-based reinforcement learning (MBRL) algorithms and environments to facilitate research and comparison of MBRL methods.
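The random-search baseline from the critique above is striking partly because it is so small. A simplified sketch in its spirit (the published method adds state normalization and keeps only the top-performing directions):

```python
import numpy as np

def random_search_step(f, theta, sigma=0.03, alpha=0.02, n=8, seed=0):
    """One step of basic random search: antithetic parameter perturbations,
    a finite-difference gradient estimate, and a plain gradient ascent step."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(size=(n, theta.size))
    r_plus = np.array([f(theta + sigma * d) for d in deltas])
    r_minus = np.array([f(theta - sigma * d) for d in deltas])
    grad = ((r_plus - r_minus)[:, None] * deltas).mean(axis=0)
    return theta + alpha * grad

# On f(x) = -x^2, one step should move theta from 1.0 toward the optimum at 0.
theta = random_search_step(lambda x: -float(x @ x), np.array([1.0]))
```

That such a derivative-free baseline matches far more elaborate algorithms on standard benchmarks is the paper's reproducibility point in miniature.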
Classic Papers in RL Theory or Review
- Policy Gradient Methods for Reinforcement Learning with Function Approximation (2000): Establishes the policy gradient theorem and convergence results for policy gradient methods with compatible function approximation.
- An Analysis of Temporal-Difference Learning with Function Approximation (1997): Analyzes the convergence of temporal-difference learning with function approximation.
- Reinforcement Learning of Motor Skills with Policy Gradients (2008): Surveys and develops policy gradient methods for reinforcement learning of motor skills.
- Approximately Optimal Approximate Reinforcement Learning (2002): Introduces conservative policy iteration with performance-improvement guarantees.
- A Natural Policy Gradient (2002): Proposes the natural policy gradient, which preconditions the gradient with the Fisher information matrix.
- Algorithms for Reinforcement Learning (2009): Provides an overview of reinforcement learning algorithms, including model-based and model-free methods, and their applications.
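For reference, the policy gradient theorem established in the first paper above can be stated in its standard form:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
```

Here $d^{\pi_\theta}$ is the discounted state-visitation distribution under the policy; the natural policy gradient paper then premultiplies this gradient by the inverse Fisher information matrix of $\pi_\theta$.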