Solution to the CartPole-v1 environment using the Advantage Actor-Critic (A2C) algorithm
`python Main.py`
- gym
- numpy
- tensorflow
The environment is identical to CartPole-v0 except that the maximum episode length is increased to 500 steps and the reward threshold for considering the task solved is raised to 475.
We need to approximate two separate functions:
- Actor's policy
- Critic's state value function
Both of them are modeled by a neural network with one hidden layer.
The policy is a mapping from the state space to probability distributions over the action space (discrete in our case). To each state we assign a 2-element probability vector whose elements sum to 1.
The state value function is a mapping from the state space to the real numbers. To each state we assign a real number representing the value (utility) of being in that state.
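For illustration, here is a minimal tf.keras sketch of two such one-hidden-layer networks. The helper names and the hidden-layer width (32) are illustrative choices, not taken from Main.py; only the input and output sizes follow from CartPole-v1 (4-dimensional state, 2 actions).

```python
import tensorflow as tf

def build_actor(state_dim=4, n_actions=2, hidden=32):
    """Policy network: maps a state to a probability distribution over actions."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="softmax"),  # probabilities sum to 1
    ])

def build_critic(state_dim=4, hidden=32):
    """Value network: maps a state to a single real-valued state-value estimate."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(1),  # unbounded real output
    ])
```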
For the actor we optimize directly in the policy space. To do that we use the Policy Gradient Theorem together with the TD(0) estimate of the advantage function given below.
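For reference, the TD(0) advantage estimate takes the standard one-step form (notation is ours: $r_{t+1}$ is the reward received after taking action $a_t$ in state $s_t$, $\gamma$ is the discount factor, and $V$ is the critic's value estimate):

$$A(s_t, a_t) \approx r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$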
This can be reformulated as a cross-entropy loss minimization, as sketched below.
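A minimal sketch of that reformulation, assuming the networks above; the `actor_update` helper is hypothetical, not taken from Main.py. The per-step loss is the cross-entropy between the one-hot encoding of the taken action and the policy output, scaled by the advantage estimate, which is treated as a constant.

```python
import tensorflow as tf

def actor_update(actor, optimizer, state, action, advantage, n_actions=2):
    """One policy-gradient step; `advantage` is the TD(0) estimate, held fixed."""
    state = tf.convert_to_tensor([state], dtype=tf.float32)
    target = tf.one_hot([action], n_actions)  # taken action as a one-hot "label"
    with tf.GradientTape() as tape:
        probs = actor(state)
        # advantage-weighted cross-entropy of the taken action
        log_prob = tf.reduce_sum(target * tf.math.log(probs + 1e-8), axis=1)
        loss = -tf.reduce_mean(advantage * log_prob)
    grads = tape.gradient(loss, actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, actor.trainable_variables))
```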
The critic is always updated with the TD(0) backup. This can be reformulated as a squared-error loss minimization.
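A matching sketch for the critic; the `critic_update` helper and the discount factor of 0.99 are assumptions for illustration. The TD(0) target is held fixed and the squared difference to the current estimate is minimized.

```python
import tensorflow as tf

def critic_update(critic, optimizer, state, reward, next_state, done, gamma=0.99):
    """One TD(0) step: regress V(s) toward r + gamma * V(s')."""
    state = tf.convert_to_tensor([state], dtype=tf.float32)
    next_state = tf.convert_to_tensor([next_state], dtype=tf.float32)
    # Bootstrapped target, computed outside the tape so it is treated as a constant;
    # no bootstrapping on terminal transitions.
    target = reward + gamma * (1.0 - float(done)) * critic(next_state)
    with tf.GradientTape() as tape:
        value = critic(state)
        loss = tf.reduce_mean(tf.square(target - value))  # squared-error loss
    grads = tape.gradient(loss, critic.trainable_variables)
    optimizer.apply_gradients(zip(grads, critic.trainable_variables))
```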
This method seems to converge regardless of the initialization. See below the evolution of scores for one run.
This project is licensed under the MIT License - see the LICENSE.md file for details