jankrepl/CartPole-v1_A2C

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

CartPole-v1 A2C

Solution to the CartPole-v1 environment using the Advantage Actor Critic (A2C) algorithm

Code

Running

```
python Main.py
```

Dependencies

  • gym
  • numpy
  • tensorflow

Detailed Description

Problem Statement and Environment

The environment is identical to CartPole-v0, except that the maximum episode length is raised from 200 to 500 timesteps and the environment counts as solved once the average score over 100 consecutive episodes reaches 475.


A2C Algorithm

We need to approximate two separate functions:

  • Actor's policy
  • Critic's state value function

Each of them is modeled by a neural network with one hidden layer.
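As a rough illustration of the two one-hidden-layer approximators, here is a minimal numpy sketch (the repository itself uses TensorFlow; the hidden-layer width, initialization, and activation below are assumptions, not the repo's actual hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 4    # CartPole observation: cart position, cart velocity, pole angle, pole angular velocity
N_ACTIONS = 2    # push left, push right
HIDDEN = 24      # hypothetical hidden-layer width

# Actor: state -> probability distribution over the two actions
W1_a = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
W2_a = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))

def policy(state):
    h = np.tanh(state @ W1_a)          # one hidden layer
    logits = h @ W2_a
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Critic: state -> scalar state value
W1_c = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
W2_c = rng.normal(0.0, 0.1, (HIDDEN, 1))

def value(state):
    h = np.tanh(state @ W1_c)
    return (h @ W2_c).item()

state = np.array([0.01, -0.02, 0.03, 0.04])
probs = policy(state)
print(probs)         # a 2-element probability vector summing to 1
print(value(state))  # a single real-valued utility estimate
```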

Actor - policy

The policy is a mapping from the state space to a set of probability distributions over the action space (in our case discrete). To each given state we assign a 2-element probability vector whose elements sum to 1.

Critic - state value function

The state value function is a mapping from the state space to the real numbers. To each given state we assign a real number representing the "value/utility" of being in that state.

Training

For the actor we optimize directly in policy space. To do that we use the Policy Gradient Theorem together with the TD(0) estimate of the advantage function below:

A(s_t, a_t) ≈ r_{t+1} + γ V(s_{t+1}) − V(s_t)

This can be reformulated in terms of a cross entropy loss minimization.

The critic is always updated based on the TD(0) backup. This can be reformulated in terms of a squared-error loss minimization.
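The two update rules can be sketched as follows. This is a tabular stand-in, just to make the TD(0) advantage and the two losses concrete; the repository uses neural-network approximators, and the discount factor, learning rates, and state/action counts here are hypothetical:

```python
import numpy as np

GAMMA = 0.99       # discount factor (assumed value)
LR_ACTOR = 0.01    # assumed learning rates
LR_CRITIC = 0.05

# Tabular stand-ins for the two function approximators
values = np.zeros(8)          # critic: V(s) for each of 8 toy states
logits = np.zeros((8, 2))     # actor: softmax preferences per state

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def a2c_update(s, a, r, s_next, done):
    # TD(0) target and advantage estimate: A(s, a) ≈ r + γ V(s') − V(s)
    target = r + (0.0 if done else GAMMA * values[s_next])
    advantage = target - values[s]

    # Critic: squared-error loss -> move V(s) toward the TD(0) target
    values[s] += LR_CRITIC * advantage

    # Actor: policy-gradient step, i.e. cross-entropy gradient for the
    # taken action weighted by the advantage
    probs = softmax(logits[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0     # gradient of log pi(a|s) for a softmax policy
    logits[s] += LR_ACTOR * advantage * grad_log_pi

a2c_update(s=0, a=1, r=1.0, s_next=1, done=False)
print(values[0], softmax(logits[0]))
```

With all values initialized to zero, a reward of 1 produces a positive advantage, so V(s) rises and the taken action becomes more probable.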

Results and discussion

This method appears to converge regardless of initialization. The repository includes a plot of the score evolution for one run.

Resources and links

  • RLCode - Similar algorithm in Keras and same hyperparameters
  • David Silver - Policy Gradient Lecture Slides

License

This project is licensed under the MIT License - see the LICENSE.md file for details
