Please refer to the RFC for my general thoughts about Reinforcement Learning Digest format and potential forms.

Research papers

Evolution Strategies as a Scalable Alternative to Reinforcement Learning

by Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever

from OpenAI


We explore the use of Evolution Strategies, a class of black box optimization algorithms, as an alternative to popular RL techniques such as Q-learning and Policy Gradients. Experiments on MuJoCo and Atari show that ES is a viable solution strategy that scales extremely well with the number of CPUs available: By using hundreds to thousands of parallel workers, ES can solve 3D humanoid walking in 10 minutes and obtain competitive results on most Atari games after one hour of training time. In addition, we highlight several advantages of ES as a black box optimization technique: it is invariant to action frequency and delayed rewards, tolerant of extremely long horizons, and does not need temporal discounting or value function approximation.

Key Points

  • Evolution Strategies, seemingly forgotten (and partially overshadowed by Q-Learning and Policy Gradients), prove to be exceedingly efficient compared to state-of-the-art Reinforcement Learning algorithms.
  • Parallelization of Evolution Strategies gives a linear performance boost while being fairly trivial.
  • ES prove to be better than Policy Gradients in environments with long action sequences and delayed rewards.
  • ES allow better environment exploration compared to Policy Gradient methods.
  • The introduced algorithm has only 2 hyperparameters, while the original DQN algorithm has 16.
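The points above about simplicity and trivial parallelization can be illustrated with a minimal sketch. It is not the authors' implementation: the toy fitness function, the function names, the default values and the standardisation of returns are all assumptions made for illustration, and the "workers" are simulated in a single process. The idea it tries to show is that each worker only needs to report one scalar return, because every perturbation can be rebuilt from its random seed.

```python
import random


def noise_from_seed(seed, dim):
    """Rebuild a Gaussian perturbation from its seed alone."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def worker_return(fitness, theta, sigma, seed):
    """One (simulated) worker: perturb the parameters, report a single scalar."""
    eps = noise_from_seed(seed, len(theta))
    return fitness([t + sigma * e for t, e in zip(theta, eps)])


def evolution_strategy(fitness, theta, sigma=0.1, alpha=0.001,
                       workers=50, iterations=300):
    """Black-box optimisation: only fitness values are used, no gradients."""
    theta = list(theta)
    dim = len(theta)
    for it in range(iterations):
        seeds = [it * workers + w for w in range(workers)]
        # On a cluster these evaluations would run on separate CPUs.
        returns = [worker_return(fitness, theta, sigma, s) for s in seeds]
        # Standardising returns is an assumption of this sketch.
        mean = sum(returns) / workers
        std = (sum((r - mean) ** 2 for r in returns) / workers) ** 0.5 or 1.0
        for seed, ret in zip(seeds, returns):
            eps = noise_from_seed(seed, dim)  # regenerated, never transmitted
            weight = alpha * (ret - mean) / (std * workers * sigma)
            for j in range(dim):
                theta[j] += weight * eps[j]
    return theta


# Hypothetical toy problem: move the parameters towards TARGET.
TARGET = [0.5, 0.1, -0.3]


def toy_fitness(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


start = [0.0, 0.0, 0.0]
solution = evolution_strategy(toy_fitness, start)
print(toy_fitness(start), toy_fitness(solution))
```

Here `sigma` (noise scale) and `alpha` (step size) are the main knobs; the number of workers mostly trades wall-clock time for the quality of each update.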


This paper is a striking reminder of how far we are from actual general-purpose AI: Q-Learning algorithms, such as the aforementioned DQN, and Policy Gradient-based algorithms, both of which are state-of-the-art in RL, turned out to be outperformed (in a certain way) by Evolutionary Algorithms, which were explored a long while ago. Evolutionary Algorithms are an extremely straightforward form of directed search, one so natural that it does not seem capable of solving complicated problems, and yet it managed to match the performance reported in state-of-the-art papers without using (relatively) extensive computational power.

This probably means that the algorithms we have explored so far are still not good enough for decent generalization. Evolution Strategies also did not achieve good results on Supervised Learning problems: for example, ES gradient estimation on the MNIST dataset could be 1000 times slower than backpropagation, which makes ES competitive only in Reinforcement Learning tasks.
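A rough illustration of why ES gradient estimation is costly when an exact gradient is available: the sketch below compares an ES estimate with the analytic gradient on a toy quadratic. The objective, the parameter values and the function names are assumptions for illustration (this is not MNIST and not the paper's experiment). It uses mirrored (antithetic) perturbation pairs, and each pair costs two function evaluations, whereas the analytic gradient is obtained in one pass.

```python
import random


def f(theta):
    """Toy objective (assumption): a simple concave quadratic."""
    return -sum(t * t for t in theta)


def true_gradient(theta):
    """Analytic gradient of f, i.e. what backpropagation would return."""
    return [-2.0 * t for t in theta]


def es_gradient(theta, sigma, pairs, rng):
    """Estimate the gradient from function values only, using mirrored noise."""
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        scale = (plus - minus) / (2.0 * sigma * pairs)
        for j in range(dim):
            grad[j] += scale * eps[j]
    return grad


def error(estimate, exact):
    """Euclidean distance between the estimate and the exact gradient."""
    return sum((a - b) ** 2 for a, b in zip(estimate, exact)) ** 0.5


theta = [1.0, -2.0, 0.5]
rng = random.Random(0)
exact = true_gradient(theta)
for pairs in (10, 100, 1000, 10000):
    estimate = es_gradient(theta, 0.1, pairs, rng)
    print(pairs, round(error(estimate, exact), 3))
```

The error should shrink only slowly as the number of perturbations grows, which hints at why ES pays off mainly when exact gradients are unavailable or unreliable, as in many RL settings.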


OpenAI wrote a blog post about ES. It is an amazing explanation of why ES is a major success in Reinforcement Learning, and it provides neat visualizations. They also released supplementary code.

Neural Episodic Control

by Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis Hassabis, Daan Wierstra, Charles Blundell

from DeepMind

Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitudes more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general-purpose deep reinforcement learning agents.

Key Points

  • Neural Episodic Control (NEC) speeds up learning by latching onto highly successful strategies as soon as they are experienced
  • The learning curve of NEC (performance as a function of data processed) is significantly steeper than those of DQN, A3C, Prioritised Replay, etc.: NEC performs much better at the beginning of learning and converges to the optimal policy about 3-4 times faster in a number of benchmarks
  • Model-Free Episodic Control is the closest to NEC in terms of speed, but it still learns 2-3 times slower on average
  • The Prioritised Replay agent reaches a higher final performance than NEC


NEC benefits from three features that enable the learning speed boost. First, the Differentiable Neural Dictionary (DND) is the memory module built into NEC; it provides efficient insert and lookup operations for mapping state embeddings to Q-function estimates. DND is a generalization of the memory modules described in Matching Networks for One Shot Learning and Learning to Remember Rare Events. Second, N-step Q estimates are used. Third, NEC uses a state representation provided by a Convolutional Neural Network, which is then inserted into the DND along with the corresponding Q estimate.
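The memory lookup and the N-step estimate described above can be sketched roughly as follows. This is a simplified, non-differentiable illustration rather than the paper's implementation: the class name, the default values, the exact-match update rule and the plain-Python nearest-neighbour search are assumptions, and the CNN that would produce the keys is left out entirely.

```python
class DictionarySketch:
    """Toy key-value memory: state embedding -> Q-value estimate."""

    def __init__(self, delta=1e-3, neighbours=2, alpha=0.5):
        self.keys = []
        self.values = []
        self.delta = delta            # keeps the kernel finite for exact matches
        self.neighbours = neighbours  # how many nearest entries to average
        self.alpha = alpha            # step size for updating an existing entry

    def _kernel(self, a, b):
        # Inverse squared distance: close keys get large weights.
        return 1.0 / (sum((x - y) ** 2 for x, y in zip(a, b)) + self.delta)

    def lookup(self, key):
        """Kernel-weighted average over the nearest stored values."""
        if not self.keys:
            return 0.0
        scored = sorted(
            ((self._kernel(key, k), v) for k, v in zip(self.keys, self.values)),
            reverse=True,
        )[: self.neighbours]
        total = sum(w for w, _ in scored)
        return sum(w * v for w, v in scored) / total

    def write(self, key, q_estimate):
        """Append a new entry, or move an existing one towards the new estimate."""
        key = tuple(key)
        if key in self.keys:
            i = self.keys.index(key)
            self.values[i] += self.alpha * (q_estimate - self.values[i])
        else:
            self.keys.append(key)
            self.values.append(q_estimate)


def n_step_return(rewards, gamma, bootstrap_value):
    """Discounted sum of N observed rewards plus a discounted bootstrap value."""
    total = sum((gamma ** j) * r for j, r in enumerate(rewards))
    return total + (gamma ** len(rewards)) * bootstrap_value


memory = DictionarySketch()
memory.write((0.0, 0.0), n_step_return([1.0, 1.0, 1.0], 0.5, 4.0))
memory.write((10.0, 10.0), 5.0)
print(memory.lookup((0.1, 0.0)))  # dominated by the nearby first entry
```

A full agent would keep one such memory per action and train the embedding network through the lookup; neither is shown here.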

NEC is reported to perform better in environments with sparse reward signals; however, Atari’s Montezuma’s Revenge, which is an example of such an environment, was not solved.

FeUdal Networks for Hierarchical Reinforcement Learning

by Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, Koray Kavukcuoglu

from DeepMind


We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels - allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits - in addition to facilitating very long timescale credit assignment it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.

Key Points

  • Atari’s “Montezuma’s Revenge” is an environment that was usually excluded from Reinforcement Learning benchmarks because the algorithms failed to achieve a decent result on it (a notable exception is Unifying Count-Based Exploration and Intrinsic Motivation). FuNs are able to solve the first room in <200 epochs and score up to 2600 points overall.
  • FuN applies a novel RNN architecture for the Manager module; this RNN is a dilated LSTM, which allows efficient back-prop through hundreds of steps
  • Through its Manager and Worker modules, FuN tries to decompose the agent’s strategy into meaningful primitives, which is achieved by formulating sub-goals as directions in latent state space
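The idea of sub-goals as directions in latent state space can be sketched as an intrinsic reward for the Worker: it is rewarded when the latent state moves in the direction the Manager asked for, measured by cosine similarity. The function names, the toy 2-D latent states and the `horizon` parameter below are assumptions for illustration, not the paper's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def intrinsic_reward(states, goals, t, horizon):
    """Average agreement between recent latent movement and past goals.

    states[k] is the latent state at step k and goals[k] is the direction
    the Manager emitted at step k. Requires t >= horizon.
    """
    total = 0.0
    for i in range(1, horizon + 1):
        movement = [now - past for now, past in zip(states[t], states[t - i])]
        total += cosine_similarity(movement, goals[t - i])
    return total / horizon


# Toy trajectory moving along the first latent axis.
states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
follow = [[1.0, 0.0]] * 3   # goals pointing the same way as the movement
oppose = [[-1.0, 0.0]] * 3  # goals pointing the opposite way
print(intrinsic_reward(states, follow, t=2, horizon=2))
print(intrinsic_reward(states, oppose, t=2, horizon=2))
```

Because only the direction of the goal matters, the Manager does not have to name a specific target state, only where in latent space the Worker should head.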


The introduced RNN architecture is crucial for the success of the model: using a comparable standard LSTM instead results in a 5x smaller final reward.
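A rough sketch of the dilation idea may help: the recurrent state is split into several cores, and at each time step only one of them is updated, so each core sees every r-th input and gradients have fewer steps to travel through. The class name, the toy `leaky_cell` (a stand-in for a real LSTM cell) and the sum pooling are assumptions made to keep the example short.

```python
class DilatedRecurrenceSketch:
    """Wraps a recurrent cell so that only one of several cores updates per step."""

    def __init__(self, cell, dilation, initial_state=0.0):
        self.cell = cell
        self.cores = [initial_state] * dilation
        self.t = 0

    def step(self, x):
        index = self.t % len(self.cores)  # the core whose turn it is
        self.cores[index] = self.cell(x, self.cores[index])
        self.t += 1
        return sum(self.cores)            # pooled output over all cores


def leaky_cell(x, h):
    """Toy stand-in for an LSTM cell: decays the old state and adds the input."""
    return 0.5 * h + x


rnn = DilatedRecurrenceSketch(leaky_cell, dilation=2)
outputs = [rnn.step(x) for x in (1.0, 2.0, 3.0, 4.0)]
print(outputs, rnn.cores)
```

With a dilation of 2, the first core only ever sees the first and third inputs, and the second core the second and fourth, while the pooled output still changes at every step.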

Attempts to decompose sophisticated tasks into primitives and to learn skills separately in order to combine them later look promising. The authors mention that the modular structure of FuN allows transfer learning, which is especially interesting in light of recent papers exploring the ability to learn primitive skills and transfer them between agents (e.g. Learning Invariant Feature Spaces to Transfer Skills with Reinforcement Learning).


OpenAI implementation of “Evolution Strategies” paper

OpenAI released an official GitHub repository containing an Evolution Strategies implementation.

PyTorch implementation of “Evolution Strategies” paper

Facebook AI Research recently released PyTorch, a multi-purpose deep learning framework inspired by its predecessor, Torch, which FAIR and DeepMind used for a long while. The framework is really user-friendly and is about to become stable, which could make it a good option for implementing Deep Learning models.

The repository contains an implementation of the Evolution Strategies algorithm, and it is really simple, as advertised in the published paper and the blog post.