Reinforcement Learning algorithms — an intuitive overview

SmartLab AI
Feb 18, 2019

Author: Robert Moni

This article aims to highlight, in a non-exhaustive manner, the main types of algorithms used for reinforcement learning (RL). The goal is to provide an intuitive overview of existing RL methods without diving deep into the models or the math behind them.

When it comes to explaining machine learning to people outside the field, reinforcement learning is probably the easiest sub-field to start with. RL is like teaching your dog (or cat, if you like to live your life in a challenging way) to do tricks: you provide goodies as a reward if your pet performs the trick you desire; otherwise, you punish it by withholding the treat, or by providing lemons. Dogs really hate lemons.


Of course, RL is more complex and challenging to put into practice, but at its core it deals with learning via interaction and feedback: in other words, learning to solve a task by trial and error, or, in other-other words, acting in an environment and receiving rewards for it. Essentially, an agent (or several) is built that can perceive and interpret the environment in which it is placed; furthermore, it can take actions and interact with that environment.


To begin, let's go through the terminology used in the field of RL.

Agent-environment interaction [Source]
  1. Agent — the learner and the decision maker.
  2. Environment — where the agent learns and decides what actions to perform.
  3. Action — one of the set of moves which the agent can perform.
  4. State — the state of the agent in the environment.
  5. Reward — for each action selected by the agent the environment provides a reward. Usually a scalar value.
  6. Policy — the decision-making function (control strategy) of the agent, which represents a mapping from situations to actions.
  7. Value function — mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state, and executing a particular policy.
  8. Function approximator — a method for inducing a function from training examples. Standard approximators include decision trees, neural networks, and nearest-neighbor methods.
  9. Markov decision process (MDP) — A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. Essentially, the outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
  10. Dynamic programming (DP) — a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.
  11. Monte Carlo methods — A class of methods for learning of value functions, which estimates the value of a state by running many trials starting at that state, then averages the total rewards received on those trials.
  12. Temporal Difference (TD) algorithms — A class of learning methods, based on the idea of comparing temporally successive predictions. Possibly the single most fundamental idea in all of reinforcement learning.
  13. Model — The agent’s view of the environment, which maps state-action pairs to probability distributions over states. Note that not every reinforcement learning agent uses a model of its environment.
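
To make these terms concrete, here is a minimal Python sketch of the agent-environment loop. The `Corridor` environment, its reward of 1 at the right end, and the discount factor of 0.9 are all made up for illustration and do not come from any library. The last lines estimate the value of the start state the Monte Carlo way, by averaging the discounted returns of many trials.

```python
import random


class Corridor:
    """Hypothetical toy environment: states 0..4, reaching state 4 ends the episode."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clip the movement.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # scalar reward, only at the goal
        return self.state, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, gamma=0.9, max_steps=1000):
    """Run one trial and return the discounted sum of rewards."""
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return


rng = random.Random(0)
env = Corridor()
returns = [run_episode(env, random_policy, rng) for _ in range(2000)]
mc_value_of_start = sum(returns) / len(returns)
print(f"Monte Carlo estimate of V(start) under the random policy: {mc_value_of_start:.3f}")
```

Swapping `random_policy` for a policy that always moves right gives the best possible return in this toy setup, since the goal is then reached in the minimum number of steps.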

OpenAI, a non-profit AI research company with the mission to build and share safe Artificial General Intelligence (AGI), launched a program to “spin up” deep RL. The website provides a comprehensive introduction to the main RL algorithms. This post mainly follows that overview, with additional explanations.

Reinforcement Learning taxonomy as defined by OpenAI [Source]

Model-Free vs Model-Based Reinforcement Learning

Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model.

Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state.

Model-free methods are statistically less efficient than model-based methods, because information from the environment is combined with previous, and possibly erroneous, estimates or beliefs about state values, rather than being used directly.

(Peter Dayan and Yael Niv, Reinforcement learning: The Good, The Bad and The Ugly, 2008)

Well, that should’ve explained it. In general, model-based learning attempts to model the environment and then chooses the optimal policy based on its learned model; in model-free learning, the agent relies on trial-and-error experience to arrive at the optimal policy.

I. Model-free RL

The two main approaches to representing and training agents with model-free reinforcement learning are policy optimization and Q-learning.

I.1. Policy optimization or policy-iteration methods

In policy optimization methods, the agent directly learns the policy function that maps states to actions. The policy is represented explicitly rather than derived from a value function.

It is important to mention that there are two types of policies: deterministic and stochastic. A deterministic policy maps a state to an action without uncertainty. This is suitable when you have a deterministic environment, like a chess board. A stochastic policy outputs a probability distribution over actions in a given state. Stochastic policies are useful when the environment is uncertain or only partially observable; the latter setting is formalized as a Partially Observable Markov Decision Process (POMDP).
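
The difference between the two policy types can be sketched in a few lines of Python. The preference scores and the softmax used to turn them into probabilities are assumptions made for this example; other parameterizations are possible.

```python
import math
import random


def softmax(preferences):
    """Turn arbitrary scores into a probability distribution."""
    highest = max(preferences)
    exps = [math.exp(p - highest) for p in preferences]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def deterministic_policy(preferences):
    """Always returns the single highest-scoring action."""
    return max(range(len(preferences)), key=lambda a: preferences[a])


def stochastic_policy(preferences, rng):
    """Samples an action from the softmax distribution over the scores."""
    probs = softmax(preferences)
    action = rng.choices(range(len(preferences)), weights=probs, k=1)[0]
    return action, probs


scores = [0.1, 2.0, -1.0]  # hypothetical action scores for one state
rng = random.Random(0)
fixed_action = deterministic_policy(scores)
sampled_action, action_probs = stochastic_policy(scores, rng)
print(fixed_action, sampled_action, [round(p, 3) for p in action_probs])
```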

I.1.1. Policy Gradient (PG)

In this method, we have a policy π with parameters θ. This π outputs a probability distribution over actions.

Probability of taking action a given state s with parameters theta. [Source]

Then we must find the best parameters (θ) to maximize (optimize) a score function J(θ), given the discount factor γ and the reward r.

Policy score function [Source]
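
The two formulas captioned above were shown as images in the original post. Written out in the notation usually found in introductory PG material, they would look roughly like this; the exact form in the original figures is an assumption.

```latex
\pi_\theta(a \mid s) = P(a \mid s;\, \theta)
\qquad
J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} \gamma^{t} r_{t} \right]
```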

Main steps:

  • Measure the quality of a policy with the policy score function.
  • Use policy gradient ascent to find the parameters that improve the policy.
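
These two steps can be sketched on the smallest possible problem: a two-armed bandit with a single decision per episode, so discounting plays no role. This is a REINFORCE-style update, one member of the PG family, with a softmax policy and made-up rewards; it is an illustration rather than the method from any of the linked posts.

```python
import math
import random


def softmax(theta):
    highest = max(theta)
    exps = [math.exp(t - highest) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=500, lr=0.1, seed=0):
    """Gradient ascent on expected reward for a softmax policy over bandit arms."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one policy parameter per action
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        reward = rewards[action]  # the sampled score of the current policy
        for k in range(len(theta)):
            # d/d theta[k] of log pi(action) for a softmax policy
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_prob  # ascent step
    return softmax(theta)


# Hypothetical rewards: arm 0 pays nothing, arm 1 pays 1.
final_probs = reinforce_bandit(rewards=[0.0, 1.0])
print([round(p, 3) for p in final_probs])
```

Actions that led to a reward have their probability pushed up in proportion to that reward, which is the whole idea behind the gradient of the score function.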

A great, detailed explanation of policy gradients, with all the math included, can be found on Jonathan Hui’s blog or in Thomas Simonini’s introductory post on PG with examples in TensorFlow.

I.1.2. Asynchronous Advantage Actor-Critic (A3C)

This method was published by Google’s DeepMind group and covers the following key concepts embedded in its name:

  • Asynchronous: Several agents are each trained in their own copy of the environment, and the models from these agents are gathered in a master agent. The reasoning behind this idea is that the experience of each agent is independent of the experience of the others, so the overall experience available for training becomes more diverse.
  • Advantage: As in PG, the update rule uses the discounted returns from a set of experiences to tell the agent which actions were “good” or “bad”. Instead of the raw returns, however, it uses the advantage: how much better the outcome of an action was than the critic’s value estimate predicted.
  • Actor-critic: combines the benefits of both approaches: policy-iteration methods such as PG and value-iteration methods such as Q-learning (see below). The network estimates both a value function V(s) (how good a certain state is to be in) and a policy π(s).
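
The “advantage” part can be sketched as follows: compute the discounted return from each step of an episode and subtract the critic’s value estimate for that step. This is one simple advantage estimator; the rewards and value estimates below are made-up numbers, and real A3C implementations differ in details such as bootstrapping from the last state.

```python
def discounted_returns(rewards, gamma):
    """Discounted sum of future rewards from every time step of one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, gamma=0.99):
    """Advantage = observed return minus the critic's estimate V(s)."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


episode_rewards = [1.0, 1.0, 1.0]  # hypothetical rewards
critic_values = [1.0, 1.0, 1.0]    # hypothetical critic outputs V(s_t)
print(advantages(episode_rewards, critic_values, gamma=0.5))
```

A positive advantage means the action turned out better than the critic expected, so the actor is nudged towards it; a negative one nudges the actor away.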

A simple but thorough explanation with code implemented in TensorFlow can be found on Arthur Juliani’s blog.

I.1.3. Trust Region Policy Optimization (TRPO)

An on-policy algorithm that can be used for environments with either discrete or continuous action spaces. TRPO updates policies by taking the largest step possible to improve performance, while satisfying a special constraint on how close the new and old policies are allowed to be.
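
The “special constraint” is usually expressed as a bound on the average KL divergence between the old and the new policy. As a sketch in commonly used notation (the symbols A for the advantage and δ for the trust-region size are assumptions here, not taken from this post):

```latex
\max_{\theta} \;
\mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \,
A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
\quad \text{subject to} \quad
\mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s)
\,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta
```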

A comprehensive introduction to TRPO is provided in this and this blog post, and a great repo provides TensorFlow and OpenAI Gym based solutions.

I.1.4. Proximal Policy Optimization (PPO)

Also an on-policy algorithm which, like TRPO, works with discrete or continuous action spaces. PPO shares its motivation with TRPO: how can we take the biggest possible policy improvement step without risking performance collapse? The idea is that PPO improves the stability of actor training by limiting the policy update at each training step.
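
One common way of limiting the update is the clipped surrogate objective of the PPO-Clip variant, sketched here for a single sample. The clipping range of 0.2 is a frequently used default and is an assumption of this example.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one (state, action) sample.

    ratio is pi_new(a|s) / pi_old(a|s); advantage says how good the action was.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clip_objective(ratio=1.5, advantage=1.0))   # gain is capped
print(ppo_clip_objective(ratio=0.5, advantage=-1.0))  # gain is capped here as well
```

Once the new policy has moved far enough from the old one, the objective stops rewarding further movement in that direction, which is what keeps each step small.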

PPO became popular when OpenAI made a breakthrough in deep RL by releasing a system trained to play Dota 2 that won against some of the best players in the world. See the description on this page.

For a deep dive into PPO, visit this blog.

I.2. Q-learning or value-iteration methods

Q-learning learns the action-value function Q(s, a): how good it is to take an action in a particular state. Basically, a scalar value is assigned to an action a given the state s. The following chart provides a good representation of the algorithm.

Q-learning steps [Source]
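
A minimal tabular version of these steps, sketched in Python on a made-up five-state corridor with a reward of 1 at the right end. The environment, the ε-greedy exploration and all hyperparameters are assumptions chosen for illustration.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (0, 1)       # 0 = left, 1 = right


def step(state, action):
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q-table: one row per state
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: mostly exploit, sometimes explore; ties broken at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Move Q(s, a) towards reward + discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_actions = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_actions)
```

Reading the greedy action out of each row of the table gives the learned policy.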

I.2.1 Deep Q-Network (DQN)

DQN is Q-learning with neural networks. The motivation is simply that in environments with a large state space, defining a Q-table would be a very complex, challenging and time-consuming task. Instead of a Q-table, a neural network approximates the Q-values for each action based on the state.
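
As a rough sketch of the idea, the table lookup is replaced by a function that takes a state vector and returns one Q-value per action. The tiny two-layer network below uses random weights, no biases and no training loop; layer sizes and names are made up, and a real DQN would add experience replay, a target network and gradient-based training with a deep learning framework.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weight matrix with n_out rows of n_in weights each."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, inputs, activation=None):
    outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    return [activation(o) for o in outputs] if activation else outputs


def q_network(state, w_hidden, w_out):
    """Maps a state vector to a list of Q-values, one per action."""
    hidden = dense(w_hidden, state, activation=lambda v: max(0.0, v))  # ReLU
    return dense(w_out, hidden)


def td_target(reward, next_q_values, done, gamma=0.99):
    """The value the network output for the taken action is regressed towards."""
    return reward if done else reward + gamma * max(next_q_values)


rng = random.Random(0)
w_hidden = init_layer(n_in=4, n_out=8, rng=rng)  # 4 state features, 8 hidden units
w_out = init_layer(n_in=8, n_out=3, rng=rng)     # 3 possible actions
q_values = q_network([0.1, -0.2, 0.3, 0.0], w_hidden, w_out)
print(q_values)
```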

For a deep dive into DQN, visit this course and play Doom meanwhile.

I.2.2 C51

C51 is a feasible algorithm proposed by Bellemare et al. to perform iterative approximation of the value distribution Z using the distributional Bellman equation. The number 51 represents the use of 51 discrete values to parameterize the value distribution Z(s, a). See the original paper here, and for a deep dive follow this exploratory tutorial with an implementation in Keras.
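
The 51 discrete values form a fixed, evenly spaced grid of possible returns, and the network predicts a probability for each grid point. A small sketch of that parameterization follows; the value range of -10 to 10 is an assumption of this example, and the projection step of the distributional Bellman update is left out.

```python
N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0  # assumed value range for the support

delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)
support = [V_MIN + i * delta_z for i in range(N_ATOMS)]  # the 51 possible return values


def expected_q(probabilities):
    """Collapse a value distribution Z(s, a) into the usual scalar Q(s, a)."""
    return sum(p * z for p, z in zip(probabilities, support))


uniform = [1.0 / N_ATOMS] * N_ATOMS  # a hypothetical, uninformed distribution
print(len(support), round(delta_z, 3), round(expected_q(uniform), 6))
```

Actions are still chosen by comparing the expected values, but the network is trained on the full distribution.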

I.2.3 Distributional Reinforcement Learning with Quantile Regression (QR-DQN)

In QR-DQN, a distribution of values is learned for each state-action pair instead of a single value estimate. Knowing the distribution of the values, rather than just the average, can improve the policy. Concretely, quantiles are learned: threshold values attached to certain probabilities in the cumulative distribution function. See the paper for the method here and an easy implementation using PyTorch here.
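
Learning a quantile relies on an asymmetric loss that penalizes over- and under-estimation differently depending on the target probability τ. Below is a sketch of the plain quantile (pinball) loss with evenly spaced quantile midpoints; QR-DQN itself uses a smoothed Huber variant of this loss, which is omitted here.

```python
def quantile_midpoints(n):
    """Target probabilities for n evenly spaced quantiles."""
    return [(2 * i + 1) / (2 * n) for i in range(n)]


def pinball_loss(prediction, target, tau):
    """Quantile regression loss for a single prediction/target pair."""
    error = target - prediction
    # Under-estimates are weighted by tau, over-estimates by (1 - tau).
    return error * (tau - (1.0 if error < 0 else 0.0))


taus = quantile_midpoints(4)
print(taus)
print(pinball_loss(prediction=0.0, target=1.0, tau=0.9))  # under-estimate, penalized more
print(pinball_loss(prediction=1.0, target=0.0, tau=0.9))  # over-estimate, penalized less
```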

I.2.4 Hindsight Experience Replay (HER)

In the Hindsight Experience Replay method, basically a DQN is supplied with a state and a desired end-state, or in other words a goal. This allows it to learn quickly when rewards are sparse, in other words when the rewards are uniform most of the time, with only a few rare reward values that really stand out.
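
The “hindsight” trick is to replay a failed episode as if the state that was actually reached had been the goal all along, so the agent gets a useful reward signal anyway. Here is a sketch of that relabeling with integer states and a -1/0 sparse reward; the tuple layout and the choice of the final state as the substitute goal are assumptions of this example.

```python
def sparse_reward(achieved, goal):
    """0 when the goal is reached, -1 everywhere else."""
    return 0.0 if achieved == goal else -1.0


def her_relabel(episode):
    """episode: list of (state, action, next_state, goal) tuples.

    Returns copies of the transitions where the goal is replaced by the
    state the episode actually ended in, with rewards recomputed.
    """
    hindsight_goal = episode[-1][2]  # the final state that was really reached
    relabeled = []
    for state, action, next_state, _original_goal in episode:
        reward = sparse_reward(next_state, hindsight_goal)
        relabeled.append((state, action, next_state, hindsight_goal, reward))
    return relabeled


# Hypothetical failed episode: the goal was state 5, but the agent only got to state 2.
failed_episode = [(0, 1, 1, 5), (1, 1, 2, 5)]
replay_extra = her_relabel(failed_episode)
print(replay_extra)
```

Both the original and the relabeled transitions would go into the replay buffer of the off-policy learner.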

For a better understanding, besides the paper, check out this blog post; for code, see this GitHub repository.

I.3 Hybrid

As simple as it sounds, these methods combine the strengths of Q-learning and policy gradients: both the policy function that maps states to actions and the action-value function that provides a value for each action are learned.

Some hybrid model-free algorithms are:

  • Deep Deterministic Policy Gradients (DDPG): paper and code.
  • Soft Actor-Critic (SAC): paper and code.
  • Twin Delayed Deep Deterministic Policy Gradients (TD3): paper and code.

II. Model-based RL

Model-based RL is strongly influenced by control theory, and the goal is to plan through a control function f(s, a) to choose the optimal actions. Think of it as the RL field where the laws of physics are provided by the creator. The drawback of model-based methods is that, because they make more assumptions and approximations about a given task, they may be limited to those specific types of tasks. There are two main approaches: learning the model, or learning given the model.

II.1. Learn the Model

To learn the model, a base policy (a random policy or any educated guess) is run while the trajectory is observed. The model is then fitted to the sampled data. The steps below describe the procedure:

Supervised learning is used to train a model that minimizes the least-squares error on the sampled data for the control function. In step three, the model and a cost function are used to plan an optimal trajectory. The cost function can measure, for example, how far we are from the target location and how much effort has been spent. [source]
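
The model-fitting step can be sketched with ordinary least squares on a one-dimensional toy system. The “true” dynamics, the random base policy and the linear model class are all assumptions made up for this example; the planning step is not shown.

```python
import random


def true_dynamics(state, action):
    """Hypothetical environment dynamics, unknown to the learner."""
    return 0.9 * state + 0.5 * action


def collect_data(n_samples, rng):
    """Run a random base policy and record (state, action, next_state) triples."""
    data, state = [], 1.0
    for _ in range(n_samples):
        action = rng.uniform(-1.0, 1.0)
        next_state = true_dynamics(state, action)
        data.append((state, action, next_state))
        state = next_state
    return data


def fit_linear_model(data):
    """Least-squares fit of next_state ≈ a * state + b * action (normal equations)."""
    sss = sum(s * s for s, u, _ in data)
    ssu = sum(s * u for s, u, _ in data)
    suu = sum(u * u for s, u, _ in data)
    s_y = sum(s * y for s, _, y in data)
    u_y = sum(u * y for _, u, y in data)
    det = sss * suu - ssu * ssu
    a = (s_y * suu - u_y * ssu) / det
    b = (u_y * sss - s_y * ssu) / det
    return a, b


rng = random.Random(0)
a_hat, b_hat = fit_linear_model(collect_data(200, rng))
print(round(a_hat, 4), round(b_hat, 4))
```

With the fitted coefficients in hand, a planner can simulate candidate action sequences in the learned model instead of in the real environment.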

  • World models: one of my favorite approaches, in which the agent can learn from its own “dreams” thanks to Variational Auto-encoders. See paper and code.
  • Imagination-Augmented Agents (I2A): learns to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. Basically, it’s a hybrid learning method because it combines model-based and model-free methods. Paper and implementation.
  • Model-Based Priors for Model-Free Reinforcement Learning (MBMF): aims to bridge the gap between model-free and model-based reinforcement learning. See paper and code.
  • Model-Based Value Expansion (MBVE): the authors of the paper state that this method controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, the method improves value estimation, which, in turn, reduces the sample complexity of learning.

II.2. Given the Model

I would say this approach had the “hypest” hype in recent times, when AlphaGo defeated some of the best Go players in the world and its successor AlphaGo Zero then outperformed it. You can find anything you want about them on DeepMind’s website.

The original post can be found on Robert’s personal medium page:



SmartLab AI

Deep Learning and AI solutions from Budapest University of Technology and Economics.