Reinforcement Learning algorithms — an intuitive overview



Agent-environment interaction [Source]
  1. Agent — the learner and the decision maker.
  2. Environment — where the agent learns and decides what actions to perform.
  3. Action — a set of actions which the agent can perform.
  4. State — the state of the agent in the environment.
  5. Reward — for each action selected by the agent the environment provides a reward. Usually a scalar value.
  6. Policy — the decision-making function (control strategy) of the agent, which represents a mapping from situations to actions.
  7. Value function — mapping from states to real numbers, where the value of a state represents the long-term reward achieved starting from that state, and executing a particular policy.
  8. Function approximator — refers to the problem of inducing a function from training examples. Standard approximators include decision trees, neural networks, and nearest-neighbor methods.
  9. Markov decision process (MDP) — A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. Essentially, the outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).
  10. Dynamic programming (DP) — is a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.
  11. Monte Carlo methods — A class of methods for learning of value functions, which estimates the value of a state by running many trials starting at that state, then averages the total rewards received on those trials.
  12. Temporal Difference (TD) algorithms — A class of learning methods, based on the idea of comparing temporally successive predictions. Possibly the single most fundamental idea in all of reinforcement learning.
  13. Model — The agent’s view of the environment, which maps state-action pairs to probability distributions over states. Note that not every reinforcement learning agent uses a model of its environment.
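The interaction loop implied by the terms above can be sketched in a few lines. The environment here is a made-up one-dimensional corridor (the states, actions, and reward are invented purely for illustration; a real setup would use an environment library):

```python
import random

class CorridorEnv:
    """Toy 1-D corridor: states 0..4, reward +1 on reaching state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])         # a (bad) random policy: state -> action
    state, reward, done = env.step(action)  # environment returns next state and reward
    total_reward += reward
```

Every piece of the vocabulary appears here: the loop body is the agent, `CorridorEnv` is the environment, and the `random.choice` line is a (uniform, state-independent) policy; learning algorithms differ mainly in how they improve that one line.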
Reinforcement Learning taxonomy as defined by OpenAI [Source]

Model-Free vs Model-Based Reinforcement Learning

I. Model-free RL

Probability of taking action a given state s with parameters theta. [Source]
Policy score function [Source]
  • Measure the quality of a policy with the policy score function.
  • Use policy gradient ascent to find the best parameter that improves the policy.
  • Asynchronous: Several agents are trained in parallel, each in its own copy of the environment, and the models from these agents are gathered into a master agent. The reason behind this idea is that the experience of each agent is independent of the experience of the others, so the overall experience available for training becomes more diverse.
  • Advantage: Similar to PG, where the update rule uses the discounted returns from a set of experiences to tell the agent which actions were “good” or “bad”.
  • Actor-critic: combines the benefits of both approaches: policy-based methods such as PG and value-based methods such as Q-learning (see below). The network estimates both a value function V(s) (how good a certain state is to be in) and a policy π(s).
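Policy gradient ascent on the score function can be sketched with the simplest possible case, a softmax policy over two actions (a two-armed bandit; the arm rewards, learning rate, and iteration count below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up two-armed bandit: arm 1 pays more on average.
true_means = [0.2, 0.8]

theta = np.zeros(2)  # one logit per action; pi(a) = softmax(theta)[a]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

alpha = 0.1  # step size for gradient *ascent* on the score function
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample an action from the policy
    reward = rng.normal(true_means[a], 0.1)    # observe a noisy reward
    grad_log_pi = -probs                       # grad of log pi(a): one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += alpha * reward * grad_log_pi      # REINFORCE-style ascent step

probs = softmax(theta)  # the policy should now prefer the better arm
```

This is the bare REINFORCE estimator with no baseline; the "Advantage" bullet above corresponds to subtracting a learned value estimate from `reward` to reduce the variance of this update.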
Q-learning steps [Source]
  • Deep Deterministic Policy Gradients (DDPG): paper and code.
  • Soft Actor-Critic (SAC): paper and code.
  • Twin Delayed Deep Deterministic Policy Gradients (TD3): paper and code.
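The tabular form of the Q-learning update pictured above fits in a short sketch. The corridor environment and hyperparameters are invented for illustration; deep variants such as DDPG replace the table with a neural network:

```python
import random

random.seed(0)

# Toy 1-D corridor: states 0..4, reaching state 4 yields reward +1 and ends the episode.
n_states, actions = 5, [-1, 1]
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # step size, discount, exploration rate

for _ in range(200):  # episodes
    s = 0
    while s != 4:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a: Q[(s, a)])
        s2 = max(0, min(4, s + a))
        r = 1.0 if s2 == 4 else 0.0
        # Q-learning update: bootstrap from the best next action (off-policy)
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
```

After training, the greedy policy `max(actions, key=lambda a: Q[(s, a)])` moves right in every non-terminal state, and the Q-values decay by a factor of `gamma` per step away from the goal.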

II. Model-based RL

  • World models: one of my favorite approaches, in which the agent can learn from its own “dreams” thanks to Variational Autoencoders. See paper and code.
  • Imagination-Augmented Agents (I2A): learns to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. Basically it’s a hybrid learning method, because it combines model-based and model-free methods. Paper and implementation.
  • Model-Based Priors for Model-Free Reinforcement Learning (MBMF): aims to bridge the gap between model-free and model-based reinforcement learning. See paper and code.
  • Model-Based Value Expansion (MBVE): The authors of the paper state that this method controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, they improve value estimation, which, in turn, reduces the sample complexity of learning.
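The fixed-depth idea behind MBVE can be illustrated with a minimal sketch: imagine H steps with a dynamics model, sum the imagined rewards, and bootstrap with a value function only at the horizon. Everything here (the dynamics model, value function, policy, and numbers) is invented; in the real method both the model and the value function are learned from data:

```python
gamma, H = 0.9, 3  # discount and imagination depth (made-up values)

def model(s, a):
    """Stand-in for a learned dynamics model: returns (next_state, reward)."""
    return s + a, (1.0 if s + a == 4 else 0.0)

def value_fn(s):
    """Stand-in for a learned value function, used only beyond the horizon."""
    return 0.5 * s

def policy(s):
    return 1  # placeholder policy: always move right

def expanded_value(s):
    total, discount = 0.0, 1.0
    for _ in range(H):           # imagine exactly H steps with the model
        a = policy(s)
        s, r = model(s, a)
        total += discount * r
        discount *= gamma
    # bootstrap at fixed depth H instead of trusting the model further
    return total + discount * value_fn(s)
```

Capping the rollout at H steps is what "controls for uncertainty in the model": model errors compound with depth, so beyond the horizon the estimate defers to the value function instead.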




Deep Learning and AI solutions from Budapest University of Technology and Economics.
