CartPole_v1 A2C

A solution to the CartPole-v1 environment using the Advantage Actor-Critic (A2C) algorithm.

Code

Running

python Main.py

Dependencies

  • gym
  • numpy
  • tensorflow

Detailed Description

Problem Statement and Environment

The environment is identical to CartPole-v0, except that the maximum episode length is raised from 200 to 500 steps and the environment counts as solved when the average score over 100 consecutive episodes reaches 475.
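For reference, both numbers can be read off the registered environment spec (a quick check, assuming a standard gym install):

```python
import gym

env = gym.make("CartPole-v1")
print(env.spec.max_episode_steps)  # 500   -- episode cap, up from 200 in v0
print(env.spec.reward_threshold)   # 475.0 -- average score needed to solve
```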


A2C Algorithm

We need to approximate two separate functions:

  • Actor's policy
  • Critic's state value function

Each is modeled by its own neural network with a single hidden layer.

Actor - policy

The policy is a mapping from the state space to probability distributions over the action space (discrete in our case). To each state we assign a 2-element probability vector, one entry per action, whose elements sum to 1.
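A minimal sketch of such an actor network in tf.keras (illustrative only; the hidden-layer size is an assumption, not read from Main.py):

```python
import tensorflow as tf

state_dim, n_actions, hidden = 4, 2, 24  # CartPole: 4-dim state, 2 actions; hidden size is a guess

actor = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(n_actions, activation="softmax"),  # 2 probabilities that sum to 1
])
```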

Critic - state value function

The state value function is a mapping from the state space to the real numbers. To each state we assign a real number representing the value of being in that state, i.e. the expected discounted return under the current policy.
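The critic has the same one-hidden-layer shape but a single linear output (again a sketch with assumed sizes, not the repository code):

```python
import tensorflow as tf

state_dim, hidden = 4, 24  # same illustrative sizes as the actor sketch

critic = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(1, activation="linear"),  # scalar estimate of V(s)
])
```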

Training

For the actor we optimize directly in policy space. To do that we use the Policy Gradient Theorem together with the TD(0) estimate of the advantage function below:

A(s_t, a_t) ≈ r_{t+1} + γ V(s_{t+1}) - V(s_t)

This update can be implemented as the minimization of a cross-entropy loss weighted by the advantage.
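One way this could look in code, reusing the actor and critic sketches above (a sketch under assumed hyperparameters, not the code in Main.py):

```python
import numpy as np
import tensorflow as tf

gamma = 0.99  # discount factor; an assumed value

# Compiling the actor with a cross-entropy loss and weighting each sample by
# the advantage makes the per-transition loss  -A(s,a) * log pi(a|s),
# i.e. the policy-gradient objective.
actor.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy")

def actor_update(state, action, reward, next_state, done):
    # TD(0) advantage estimate: A(s,a) ≈ r + γ V(s') - V(s)
    v_s = float(critic(state[None])[0, 0])
    v_next = 0.0 if done else float(critic(next_state[None])[0, 0])
    advantage = reward + gamma * v_next - v_s

    one_hot = np.zeros((1, 2), dtype=np.float32)  # one-hot of the action taken
    one_hot[0, action] = 1.0
    actor.train_on_batch(state[None], one_hot,
                         sample_weight=np.array([advantage], dtype=np.float32))
```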

The critic is always updated based on the TD(0) backup. This can be reformulated as the minimization of a squared-error loss between V(s) and the target r + γ V(s').
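The matching critic step, continuing the same sketch (the TD target is treated as a fixed regression target):

```python
critic.compile(optimizer=tf.keras.optimizers.Adam(5e-3),
               loss="mse")  # squared-error loss against the TD(0) target

def critic_update(state, reward, next_state, done):
    # TD(0) backup: regress V(s) toward  r + γ V(s')
    v_next = 0.0 if done else float(critic(next_state[None])[0, 0])
    td_target = np.array([[reward + gamma * v_next]], dtype=np.float32)
    critic.train_on_batch(state[None], td_target)
```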

Results and discussion

This method seems to converge regardless of the initialization. Below is the evolution of episode scores for one run.

(Figure: evolution of episode scores over one training run.)

Resources and links

  • RLCode - a similar algorithm implemented in Keras with the same hyperparameters
  • David Silver - Policy Gradient Lecture Slides

License

This project is licensed under the MIT License - see the LICENSE.md file for details.