Reinforcement Learning Papers
Papers with Code RL section: Provides access to research papers along with the corresponding code.
Key Papers
- Q-learning (1992): Introduces the Q-learning algorithm, one of the fundamental algorithms in RL.
- Policy invariance under reward transformations: Theory and application to reward shaping (1999): Discusses the invariance of policies under reward transformations and the concept of reward shaping.
- Learning to Predict by the Methods of Temporal Differences (1988): Introduces temporal difference (TD) learning, a model-free method for learning value functions in RL.
- Actor-Critic Algorithms (2003): Introduces the actor-critic architecture, a model-free approach that combines a learned value function (the critic) with a learned policy (the actor).
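As a concrete anchor for the papers above, here is a minimal sketch of the tabular Q-learning update rule (states and actions are just array indices; all values are illustrative):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning update: move Q(s, a) toward the bootstrapped target."""
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Tiny 2-state, 2-action illustration: one reward of 1.0 raises Q(0, 1) by alpha.
Q = np.zeros((2, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

The `max` over next-state actions is what makes Q-learning off-policy: it learns about the greedy policy regardless of how actions were actually chosen.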
Deep Reinforcement Learning
Model-Free RL
Deep Q-Learning
- Playing Atari with Deep Reinforcement Learning (2013): Presents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning.
- Deep Recurrent Q-Learning for Partially Observable MDPs (2015): Proposes a deep recurrent Q-learning algorithm for partially observable Markov decision processes.
- Dueling Network Architectures for Deep Reinforcement Learning (2015): Introduces a dueling network architecture that separates the estimation of state values and state-dependent action advantages.
- Deep Reinforcement Learning with Double Q-learning (2015): Proposes a double Q-learning algorithm for deep reinforcement learning that reduces overestimation of action values.
- Prioritized Experience Replay (2015): Introduces a prioritized experience replay mechanism that replays transitions with large TD errors more often, improving sample efficiency and learning speed.
- Rainbow: Combining Improvements in Deep Reinforcement Learning (2017): Combines several improvements to deep reinforcement learning, including dueling networks, double Q-learning, and prioritized experience replay, to achieve state-of-the-art performance on Atari games.
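The double Q-learning idea above reduces to a one-line change in how the bootstrap target is built. A sketch with arrays standing in for network outputs (names are illustrative):

```python
import numpy as np

def dqn_target(r, q_target_next, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates
    the next action, which is prone to overestimation."""
    return r + gamma * np.max(q_target_next)

def double_dqn_target(r, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online network selects the action, the target
    network evaluates it, decoupling selection from evaluation."""
    a = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a]

q_online = np.array([1.0, 2.0])   # online net prefers action 1
q_target = np.array([5.0, 0.0])   # target net overestimates action 0
```

When the two networks disagree, the double-Q target avoids taking the max over a noisy estimate, which is exactly the overestimation mechanism the paper analyzes.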
Policy Gradients
- Asynchronous Methods for Deep Reinforcement Learning (2016): Proposes asynchronous parallel actor-learners (including A3C) that stabilize training without a replay buffer and reduce wall-clock training time.
- Trust Region Policy Optimization (2015): Introduces a trust-region method for policy optimization that improves stability and sample efficiency.
- High-Dimensional Continuous Control Using Generalized Advantage Estimation (2015): Proposes generalized advantage estimation (GAE), which trades off bias and variance in advantage estimates for continuous control tasks.
- Proximal Policy Optimization Algorithms (2017): Introduces a family of proximal policy optimization algorithms that retain much of TRPO's stability with a simpler first-order objective.
- Emergence of Locomotion Behaviours in Rich Environments (2017): Demonstrates the emergence of diverse locomotion behaviors in simulated environments using deep reinforcement learning.
- Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (2017): Proposes ACKTR, a scalable trust-region method that uses a Kronecker-factored approximation of the curvature to improve sample efficiency.
- Sample Efficient Actor-Critic with Experience Replay (2016): Introduces ACER, an actor-critic algorithm with experience replay and off-policy corrections that improves sample efficiency.
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (2018): Proposes a soft actor-critic algorithm that maximizes entropy alongside reward, improving exploration and robustness.
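The clipped surrogate at the heart of PPO is small enough to show directly. A minimal per-sample sketch (the full algorithm also includes value and entropy terms, omitted here):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate objective (maximized; negate for a loss).
    ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```

Taking the minimum means the objective never rewards pushing the ratio further outside the clip range, which is what keeps updates close to the old policy without an explicit trust-region constraint.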
Deterministic Policy Gradients
- Deterministic Policy Gradient Algorithms (2014): Introduces deterministic policy gradients, which enable efficient off-policy policy learning in continuous action spaces.
- Continuous Control With Deep Reinforcement Learning (2015): Introduces DDPG and demonstrates the effectiveness of deep reinforcement learning for continuous control tasks.
- Addressing Function Approximation Error in Actor-Critic Methods (2018): Introduces TD3, which curbs overestimation in actor-critic methods with clipped double-Q learning, delayed policy updates, and target policy smoothing.
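Two of the mechanisms these papers rely on fit in a few lines: TD3's clipped double-Q target and the Polyak-averaged target networks shared by DDPG and TD3. A sketch (scalar Q-values and plain lists stand in for networks):

```python
def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """Clipped double-Q target from TD3: take the minimum of two critics
    to curb the overestimation the paper analyzes."""
    q_next = min(q1_next, q2_next)
    return r + (0.0 if done else gamma * q_next)

def polyak_update(target, online, tau=0.005):
    """Soft target-network update used by DDPG/TD3: the target parameters
    slowly track the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]
```

The small `tau` makes the bootstrap target a slowly moving average, which trades a little bias for much more stable learning.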
Distributional RL
- A Distributional Perspective on Reinforcement Learning (2017): Argues for learning the full distribution of returns rather than only its expectation, and introduces the C51 algorithm.
- Distributional Reinforcement Learning with Quantile Regression (2017): Proposes QR-DQN, a distributional reinforcement learning algorithm that uses quantile regression to estimate value distributions.
- Implicit Quantile Networks for Distributional Reinforcement Learning (2018): Introduces implicit quantile networks, which learn the full quantile function of the return distribution.
- Dopamine: A Research Framework for Deep Reinforcement Learning (2018) (code): Provides a research framework for deep reinforcement learning that includes a suite of environments and baselines.
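The quantile-regression loss used by QR-DQN is compact enough to sketch. This is a simplified per-element version (the full algorithm applies it pairwise between predicted and target quantiles):

```python
import numpy as np

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    """Quantile-regression loss from QR-DQN: a Huber loss with asymmetric
    weights that push each estimate toward its target quantile tau."""
    u = np.abs(td_errors)
    huber = np.where(u <= kappa, 0.5 * u ** 2, kappa * (u - 0.5 * kappa))
    asym = np.abs(taus - (td_errors < 0).astype(float))
    return float(np.mean(asym * huber / kappa))
```

The asymmetry is the point: for a high quantile (tau near 1), positive errors are weighted more than negative ones, so the estimate settles at that quantile of the return distribution rather than its mean.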
Policy Gradients with Action-Dependent Baselines
- Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic (2016): Proposes a policy gradient algorithm that uses an off-policy critic as a control variate to improve sample efficiency.
- Action-dependent Control Variates for Policy Optimization via Stein's Identity (2017): Proposes a control variate method for policy optimization that improves sample efficiency and stability.
- The Mirage of Action-Dependent Baselines in Reinforcement Learning (2018): Critiques the use of action-dependent baselines in reinforcement learning and analyzes where the reported variance reduction actually comes from.
Path-Consistency Learning
- Bridging the Gap Between Value and Policy Based Reinforcement Learning (2017): Proposes a method for bridging the gap between value-based and policy-based reinforcement learning.
- Trust-PCL: An Off-Policy Trust Region Method for Continuous Control (2017): Introduces an off-policy trust region method for continuous control in reinforcement learning that improves sample efficiency and stability.
Other Directions for Combining Policy-Learning & Q-Learning
- Combining Policy Gradient and Q-learning (2016): Combines policy gradient and Q-learning methods (PGQL) to improve sample efficiency and stability.
- The Reactor: A Fast and Sample-Efficient Actor-Critic Agent for Reinforcement Learning (2017): Introduces the Reactor, an actor-critic agent that combines distributional learning with off-policy corrections for fast, sample-efficient training.
- Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning (2017): Proposes an interpolated policy gradient algorithm that combines on-policy and off-policy gradient estimation.
- Equivalence Between Policy Gradients and Soft Q-Learning (2017): Shows the equivalence between policy gradients and soft Q-learning in entropy-regularized reinforcement learning.
Evolutionary Algorithms
- Evolution Strategies as a Scalable Alternative to Reinforcement Learning (2017): Explores the use of evolution strategies, a class of black-box optimization algorithms, as an alternative to popular reinforcement learning techniques.
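The core ES estimator in that paper is just a smoothed finite-difference gradient. A minimal sketch (a quadratic stands in for an RL return; the real method adds antithetic sampling, rank normalization, and massive parallelism):

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n=1000, seed=0):
    """Evolution-strategies gradient estimate: perturb parameters with Gaussian
    noise and weight each noise vector by the return it produced."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=(n, theta.size))
    returns = np.array([f(theta + sigma * e) for e in eps])
    return (returns[:, None] * eps).mean(axis=0) / sigma

# On f(x) = -x^2 the true gradient at x = 1 is -2; the estimate should be close.
g = es_gradient(lambda x: -float(x @ x), np.array([1.0]))
```

Because only returns (not per-step gradients) are needed, each perturbation can be evaluated by an independent worker, which is what makes the approach so easy to scale.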
Exploration
Intrinsic Motivation
- VIME: Variational Information Maximizing Exploration (2016): Proposes an exploration method that rewards information gain about the agent's belief over the environment dynamics.
- Unifying Count-Based Exploration and Intrinsic Motivation (2016): Unifies count-based exploration and intrinsic motivation through pseudo-counts derived from density models.
- Count-Based Exploration with Neural Density Models (2017): Extends pseudo-count exploration with neural density models (PixelCNN) to improve exploration.
- #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning (2016): Studies count-based exploration with hashed state counts for deep reinforcement learning.
- EX2: Exploration with Exemplar Models for Deep Reinforcement Learning (2017): Proposes an exploration method that estimates novelty with discriminatively trained exemplar models.
- Curiosity-driven Exploration by Self-supervised Prediction (2017): Proposes a curiosity-driven exploration method that uses the prediction error of a self-supervised forward model as an intrinsic reward.
- Large-Scale Study of Curiosity-Driven Learning (2018): Conducts a large-scale study of curiosity-driven learning across many environments.
- Exploration by Random Network Distillation (2018): Proposes an exploration bonus based on the prediction error of a network trained to match a fixed, randomly initialized network.
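The random network distillation bonus is particularly simple to sketch. Here linear maps stand in for the target and predictor networks (the real method uses convolutional networks and normalizes observations and rewards):

```python
import numpy as np

rng = np.random.default_rng(0)
W_target = rng.normal(size=(4, 8))     # fixed, randomly initialized "target net"
W_predictor = np.zeros((4, 8))         # trained online to match the target

def rnd_bonus(obs, W_predictor):
    """RND intrinsic reward: squared prediction error against a frozen random
    network. Frequently visited states become predictable, so the bonus decays."""
    err = obs @ W_target - obs @ W_predictor
    return float(err @ err)

obs = np.ones(4)
```

Once the predictor has fit a state's target features, the bonus vanishes there, so the agent is continually pushed toward states it has not yet learned to predict.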
Unsupervised RL
- Variational Intrinsic Control (2016): Proposes learning a set of intrinsic options by maximizing the mutual information between options and the states they reach.
- Diversity is All You Need: Learning Skills without a Reward Function (2018): Proposes learning diverse skills without a reward function by maximizing the mutual information between skills and the states they visit.
- Variational Option Discovery Algorithms (2018): Unifies and extends variational option discovery algorithms for learning skills without external reward.
Transfer and Multitask RL
- Progressive Neural Networks (2016): Proposes an architecture that transfers features across tasks via lateral connections while avoiding catastrophic forgetting.
- Universal Value Function Approximators (2015): Proposes value functions that generalize over both states and goals.
- The Intentional Unintentional Agent: Learning to Solve Many Continuous Control Tasks Simultaneously (2017): Proposes a method for learning to solve multiple continuous control tasks simultaneously.
- PathNet: Evolution Channels Gradient Descent in Super Neural Networks (2017): Proposes combining evolution and gradient descent to reuse network components across tasks.
- Mutual Alignment Transfer Learning (2017): Proposes a mutual alignment method for transferring policies between simulation and the real world.
- Learning an Embedding Space for Transferable Robot Skills (2018): Proposes learning an embedding space for transferable robot skills.
- Hindsight Experience Replay (2017): Proposes relabeling transitions with goals that were actually achieved, enabling learning from failures in sparse-reward, goal-conditioned tasks.
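Hindsight experience replay reduces to a relabeling step over stored transitions. A minimal sketch with integer states standing in for observations and goals (the real method samples achieved goals with several strategies; this shows only the "final" strategy):

```python
def her_relabel(episode, achieved_goal):
    """Hindsight relabeling: rewrite an episode's transitions as if the goal
    actually reached had been the intended goal, turning a failed episode
    into a successful one under sparse 0/-1 rewards."""
    relabeled = []
    for obs, action, _, next_obs, _ in episode:
        reward = 0.0 if next_obs == achieved_goal else -1.0
        relabeled.append((obs, action, reward, next_obs, achieved_goal))
    return relabeled

# Original goal 9 was never reached; the episode ended in state 2.
episode = [(0, 1, -1.0, 1, 9), (1, 1, -1.0, 2, 9)]
relabeled = her_relabel(episode, achieved_goal=2)
```

An off-policy learner trained on both the original and relabeled transitions receives a learning signal even when the intended goal is essentially never reached by chance.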
Hierarchy
- Strategic Attentive Writer for Learning Macro-Actions (2016): Proposes an architecture that learns temporally extended macro-actions end to end.
- FeUdal Networks for Hierarchical Reinforcement Learning (2017): Proposes a feudal manager/worker architecture in which a manager sets goals that a worker learns to achieve.
- Data-Efficient Hierarchical Reinforcement Learning (2018): Proposes HIRO, an off-policy hierarchical method with goal relabeling for data efficiency.
Memory
- Model-Free Episodic Control (2016): Proposes storing and reusing high-return episodic memories to act without learned value functions.
- Neural Episodic Control (2017): Proposes a differentiable episodic memory that enables rapid learning from few experiences.
- Neural Map: Structured Memory for Deep Reinforcement Learning (2017): Proposes a spatially structured memory architecture for agents navigating 3D environments.
- Unsupervised Predictive Memory in a Goal-Directed Agent (2018): Proposes MERLIN, an agent with unsupervised predictive memory for goal-directed behavior under partial observability.
- Relational Recurrent Neural Networks (2018): Proposes a relational memory core that lets recurrent networks relate stored memories to one another.
Model-Based RL
- Imagination-Augmented Agents for Deep Reinforcement Learning (2017): Proposes agents that use rollouts from a learned model as additional context for a model-free policy.
- Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning (2017): Proposes learning neural network dynamics models for model-based control, then fine-tuning with model-free methods.
- Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning (2018): Proposes using short rollouts of a learned model to improve value targets for model-free learning.
- Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion (2018): Extends model-based value expansion with ensembles that weight rollout horizons by model uncertainty.
- Model-Ensemble Trust-Region Policy Optimization (2018): Proposes training a policy with trust-region updates inside an ensemble of learned dynamics models.
- Model-Based Reinforcement Learning via Meta-Policy Optimization (2018): Proposes meta-learning a policy over an ensemble of learned models so it remains robust to model errors.
- Recurrent World Models Facilitate Policy Evolution (2018): Trains a recurrent world model of the environment and evolves compact policies inside the learned model.
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (2017): Presents AlphaZero, a general self-play reinforcement learning algorithm that achieves superhuman performance in chess and shogi.
- Thinking Fast and Slow with Deep Learning and Tree Search (2017): Proposes expert iteration, which combines a fast neural policy with slower tree-search planning that improves it.
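The value-expansion idea shared by several of these papers can be sketched in a few lines: unroll a learned model for a short horizon to lengthen the TD target, then bootstrap. The toy model, reward, and value functions below are placeholders:

```python
def mve_target(r0, s1, model, reward_fn, value_fn, horizon=2, gamma=0.5):
    """Model-based value expansion (sketch): roll a learned dynamics model
    forward a few steps to build a longer TD target, then bootstrap with
    the value function at the final imagined state."""
    target, discount, s = r0, gamma, s1
    for _ in range(horizon):
        s_next = model(s)
        target += discount * reward_fn(s, s_next)
        discount *= gamma
        s = s_next
    return target + discount * value_fn(s)

# Toy chain: deterministic model, reward 1 per step, zero terminal value.
t = mve_target(1.0, 0, model=lambda s: s + 1,
               reward_fn=lambda s, sn: 1.0, value_fn=lambda s: 0.0)
```

The horizon controls a bias/variance trade-off: longer rollouts lean more on the model (and its errors), shorter ones lean more on the learned value function.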
Meta-RL
- RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (2016): Trains a recurrent policy across many tasks so that a slow outer RL loop produces a fast inner learning algorithm encoded in the network's activations.
- Learning to Reinforcement Learn (2016): Shows that a recurrent agent trained across related tasks learns to adapt like a reinforcement learning algorithm at test time.
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017): Proposes MAML, which learns an initialization from which a few gradient steps adapt a network to a new task.
- A Simple Neural Attentive Meta-Learner (2018): Proposes SNAIL, a meta-learner that combines temporal convolutions with attention.
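The MAML inner/outer structure can be shown on a toy problem where the gradients are analytic. Here each task is a scalar quadratic loss, a deliberately simplified stand-in for a task's RL objective:

```python
def maml_outer_step(theta, task_targets, alpha=0.1, beta=0.01):
    """MAML (sketch) on scalar quadratic tasks: loss_t(x) = 0.5 * (x - target)^2.
    Inner loop: one gradient step per task. Outer loop: differentiate the
    post-adaptation loss through the inner step (done analytically here)."""
    outer_grad = 0.0
    for target in task_targets:
        theta_adapted = theta - alpha * (theta - target)        # inner update
        outer_grad += (theta_adapted - target) * (1.0 - alpha)  # chain rule
    return theta - beta * outer_grad / len(task_targets)

theta_new = maml_outer_step(0.5, [-1.0, 1.0])
```

With tasks at -1 and +1, the meta-gradient pulls the initialization toward 0, the point from which one inner step makes the most progress on either task.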
Scaling RL
- Accelerated Methods for Deep Reinforcement Learning (2018): Proposes accelerated, highly parallel implementations of deep reinforcement learning algorithms.
- IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures (2018): Proposes an importance-weighted actor-learner architecture (with the V-trace off-policy correction) for scalable distributed deep reinforcement learning.
- Distributed Prioritized Experience Replay (2018): Proposes Ape-X, which decouples acting from learning and shares a prioritized replay buffer across many actors.
- Recurrent Experience Replay in Distributed Reinforcement Learning (2018): Proposes R2D2, which adapts recurrent networks and experience replay to the distributed setting.
- RLlib: Abstractions for Distributed Reinforcement Learning (2017): Proposes RLlib, a library of composable abstractions for distributed reinforcement learning (docs).
RL in the Real World
- Benchmarking Reinforcement Learning Algorithms on Real-World Robots (2018): Conducts a benchmarking study of reinforcement learning algorithms on real-world robots.
- Learning Dexterous In-Hand Manipulation (2018): Trains dexterous in-hand manipulation in simulation with domain randomization and transfers the policy to a physical robot hand.
- QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation (2018): Proposes a scalable deep reinforcement learning method for vision-based robotic grasping, trained on real-robot data at scale.
- Horizon: Facebook's Open Source Applied Reinforcement Learning Platform (2018): Introduces Horizon, Facebook's open-source applied reinforcement learning platform.
Safety
- Concrete Problems in AI Safety (2016): Discusses concrete problems in AI safety, several of which are framed in terms of reinforcement learning.
- Constrained Policy Optimization (2017): Proposes a trust-region method for policy optimization that satisfies safety constraints throughout training.
- Safe Exploration in Continuous Action Spaces (2018): Proposes a safety-layer approach that corrects actions to keep exploration safe in continuous action spaces.
- Trial without Error: Towards Safe Reinforcement Learning via Human Intervention (2017): Proposes using human oversight, and a learned imitator of it, to block catastrophic actions during training.
- Leave No Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning (2017): Proposes jointly learning a reset policy so the agent avoids irreversible states and can train autonomously.
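For context on constrained RL, a common baseline (distinct from CPO's trust-region approach) is a Lagrangian relaxation with a learned penalty weight. A sketch of the dual-variable update, with all names illustrative:

```python
def dual_update(lmbda, avg_cost, cost_limit, lr=0.1):
    """Dual-variable step from a Lagrangian relaxation of constrained RL:
    raise the cost-penalty weight when the policy's estimated cost exceeds
    the limit, and project it back to be non-negative otherwise."""
    return max(0.0, lmbda + lr * (avg_cost - cost_limit))
```

The policy is then trained on reward minus `lmbda` times cost, so the penalty automatically tightens while the constraint is violated and relaxes once it is satisfied.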
Imitation Learning and Inverse Reinforcement Learning
- Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy (2010): Proposes the principle of maximum causal entropy, a foundation for modern inverse reinforcement learning.
- Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization (2016): Proposes a guided cost learning method for deep inverse optimal control via policy optimization.
- Generative Adversarial Imitation Learning (2016): Proposes GAIL, which casts imitation learning as an adversarial game between a policy and a discriminator.
- DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills (2018): Combines motion-capture reference clips with reinforcement learning to produce physics-based character skills.
- Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow (2018): Proposes a variational discriminator bottleneck that stabilizes adversarial learning by constraining information flow through the discriminator.
- One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL (2018): Proposes a one-shot, high-fidelity imitation method that trains large-scale deep networks with reinforcement learning.
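In the adversarial imitation setting above, the policy's reward comes from the discriminator. A sketch of one commonly used reward form (the discriminator itself is assumed trained elsewhere; the logit here is just a number):

```python
import numpy as np

def gail_reward(d_logit):
    """GAIL-style imitation reward, one common form: r = -log(1 - D(s, a)),
    where D = sigmoid(discriminator logit). The reward grows as the
    discriminator judges the state-action pair more expert-like."""
    d = 1.0 / (1.0 + np.exp(-d_logit))
    return float(-np.log(1.0 - d + 1e-8))
```

The policy maximizes this reward with an ordinary RL algorithm while the discriminator is retrained to tell policy samples from expert demonstrations, forming the adversarial loop.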
Reproducibility, Analysis, and Critique
- Benchmarking Deep Reinforcement Learning for Continuous Control (2016): Conducts a benchmarking study of deep reinforcement learning algorithms for continuous control.
- Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control (2017): Conducts a reproducibility study of benchmarked deep reinforcement learning tasks for continuous control.
- Deep Reinforcement Learning that Matters (2017): Shows that reported deep reinforcement learning results are highly sensitive to hyperparameters, random seeds, and implementation details, and argues for stronger experimental practices.
- Where Did My Optimum Go?: An Empirical Analysis of Gradient Descent Optimization in Policy Gradient Methods (2018): Conducts an empirical analysis of gradient descent optimization in policy gradient methods.
- Are Deep Policy Gradient Algorithms Truly Policy Gradient Algorithms? (2018): Empirically examines whether the behavior of deep policy gradient methods matches the conceptual framework that motivates them.
- Simple Random Search Provides a Competitive Approach to Reinforcement Learning (2018): Shows that simple random search over linear policies is competitive on standard continuous-control benchmarks.
- Benchmarking Model-Based Reinforcement Learning (2019): Provides a benchmarking library of model-based reinforcement learning (MBRL) algorithms and environments to facilitate research and comparison of MBRL methods.
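The random-search baseline from the critique above is striking partly because it is so small. A simplified sketch in its spirit (the published method adds state normalization and keeps only the top-performing directions):

```python
import numpy as np

def random_search_step(f, theta, sigma=0.03, alpha=0.02, n=8, seed=0):
    """One step of basic random search: antithetic parameter perturbations,
    a finite-difference gradient estimate, and a plain gradient ascent step."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(size=(n, theta.size))
    r_plus = np.array([f(theta + sigma * d) for d in deltas])
    r_minus = np.array([f(theta - sigma * d) for d in deltas])
    grad = ((r_plus - r_minus)[:, None] * deltas).mean(axis=0)
    return theta + alpha * grad

# On f(x) = -x^2, one step should move theta from 1.0 toward the optimum at 0.
theta = random_search_step(lambda x: -float(x @ x), np.array([1.0]))
```

That such a derivative-free baseline matches far more elaborate algorithms on standard benchmarks is the paper's reproducibility point in miniature.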
Classic Papers in RL Theory or Review
- Policy Gradient Methods for Reinforcement Learning with Function Approximation (2000): Establishes the policy gradient theorem and convergence results for policy gradient methods with compatible function approximation.
- An Analysis of Temporal-Difference Learning with Function Approximation (1997): Analyzes the convergence of temporal-difference learning with function approximation.
- Reinforcement Learning of Motor Skills with Policy Gradients (2008): Surveys and develops policy gradient methods for reinforcement learning of motor skills.
- Approximately Optimal Approximate Reinforcement Learning (2002): Introduces conservative policy iteration with performance-improvement guarantees.
- A Natural Policy Gradient (2002): Proposes the natural policy gradient, which preconditions the gradient with the Fisher information matrix.
- Algorithms for Reinforcement Learning (2009): Provides an overview of reinforcement learning algorithms, including model-based and model-free methods, and their applications.
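For reference, the policy gradient theorem established in the first paper above can be stated in its standard form:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]
```

Here $d^{\pi_\theta}$ is the discounted state-visitation distribution under the policy; the natural policy gradient paper then premultiplies this gradient by the inverse Fisher information matrix of $\pi_\theta$.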