This is a repo where I generalize DeepMind's MuZero reinforcement learning algorithm on stochastic environments, and create an algorithm called SigmaZero (stochastic MuZero). For more details on the MuZero algorithm, check out the original paper and my project on applying the MuZero algorithm on the cartpole environment.
- MuZero
- Monte Carlo Tree Search in Stochastic Environments
- SigmaZero
- Environment
- Experiments
- Discussions
- Future Work
- File Descriptions
MuZero contains 3 functions approximated by neural networks, to be learned from the environment:
- A representation function,
$h(o_t) \rightarrow s^0$ , which given an observation$o_t$ from the environment at time step$t$ , outputs the hidden state representation$s^0$ of the observation at hypothetical time step$0$ (this hidden state will be used as the root node in MCTS, so its hypothetical time step is zero)- The representation function is used in tandem with the dynamics function to represent the environment's state in whatever way the algorithm finds useful in order to make accurate predictions for the reward, value and policy
- A dynamics function,
$g(s^k,a^{k+1}) \rightarrow s^{k+1},r^{k+1}$ , which given a hidden state representation$s^k$ at hypothetical time step$k$ and action$a^{k+1}$ at hypothetical time step$k+1$ , outputs the predicted resulting hidden state representation$s^{k+1}$ and transition reward$r^{k+1}$ at hypothetical time step$k+1$ - The dynamics function is the learned transition model, which allows MuZero to utilize MCTS and plan hypothetical future actions on future board states
- A prediction function,
$f(s^k) \rightarrow p^k,v^k$ , which given a hidden state representation$s^k$ , outputs the predicted policy distribution over actions$p^k$ and value$v^k$ at hypothetical time step$k$ - The prediction function is used to limit the search breadth by using the policy output to prioritize MCTS to search for more promising actions, and limit the search depth by using the value output as a substitute for a Monte Carlo rollout
The MuZero algorithm can be summarized as follows:
- loop for a number of episodes:
- at every time step
$t$ of the episode:- perform Monte Carlo tree search (MCTS)
- pass the current observation of the environment
$o_t$ to the representation function,$h(o_t) \rightarrow s^0$ , and get the hidden state representation$s^0$ from the output - pass the hidden state representation
$s^0$ into the prediction function,$f(s^0) \rightarrow p^0,v^0$ , and get the predicted policy distribution over actions$p^0$ and value$v^0$ from the output - for a number of simulations:
- select a leaf node based on maximizing UCB score
- expand the node by passing the hidden state representation of its parent node
$s^k$ and its corresponding action$a^{k+1}$ into the dynamics function,$g(s^k,a^{k+1}) \rightarrow s^{k+1},r^{k+1}$ , and get the predicted resulting hidden state representation$s^{k+1}$ and transition reward$r^{k+1}$ from the output - pass the resulting hidden state representation
$s^{k+1}$ into the prediction function,$f(s^{k+1}) \rightarrow p^{k+1},v^{k+1}$ , and get the predicted policy distribution over actions$p^{k+1}$ and value$v^{k+1}$ from the output - backpropagate the predicted value
$v^{k+1}$ up the search path
- pass the current observation of the environment
- sample an action based on the visit count of each child node of the root node
- apply the sampled action to the environment and observe the resulting transition reward
- perform Monte Carlo tree search (MCTS)
- once the episode is over, save the game trajectory (including the MCTS results) into the replay buffer
- sample a number of game trajectories from the replay buffer:
- pass the first observation of the environment
$o_0$ from the game trajectory to the representation function,$h(o_0) \rightarrow s^0$ and get the hidden state representation$s^0$ from the output - pass the hidden state representation
$s^0$ into the prediction function,$f(s^0) \rightarrow p^0,v^0$ , and get the predicted policy distribution over actions$p^0$ and value$v^0$ from the output - for every time step
$t$ in the game trajectory:- pass the current hidden state representation
$s^t$ and the corresponding action$a^{t+1}$ into the dynamics function,$g(s^t,a^{t+1}) \rightarrow s^{t+1},r^{t+1}$ , and get the predicted resulting hidden state representation$s^{t+1}$ and transition reward$r^{t+1}$ from the output- this predicted transition reward
$r^{t+1}$ is matched to the actual transition reward target received from the environment
- this predicted transition reward
- pass the resulting hidden state representation
$s^{t+1}$ into the prediction function,$f(s^{t+1}) \rightarrow p^{t+1},v^{t+1}$ , and get the predicted policy distribution over actions$p^{t+1}$ and value$v^{t+1}$ from the output- this predicted policy distribution
$p^{t+1}$ is matched to the child node visit count distribution outputted by MCTS at that time step in that game trajectory - this predicted value
$v^{t+1}$ is matched to the value outputted by MCTS at that time step in that game trajectory
- this predicted policy distribution
- update the weights of the representation, dynamics and prediction function based on these three targets
- pass the current hidden state representation
- pass the first observation of the environment
- at every time step
MCTS requires a model of the environment when expanding leaf nodes during its search. The environment model takes in a state and action and outputs the resulting state and transition reward; this is the functional definition of the dynamics function,
In stochastic environments, the functional definition of the environment model changes. Given a state and action, the environment model instead outputs a set of possible resulting states, transition rewards and the corresponding probabilities of those outcomes occurring. To approximate this environment model, we can re-define the dynamics function as:
Given a current state
Instead we define a stochastic branching factor hyperparameter
Below is an animation illustrating how MCTS is executed in stochastic environments (you can find the individual pictures in assets/sigmazero_graph_images/
for closer inspection). In this environment, the action space is size 3 and the stochastic branching factor hyperparameter is set to 2.
Instead of a single node representing a state, we have a node set representing a number of possible stochastic states, that are a result of applying a sequence of actions starting from the state at the root node. For example,
Just like nodes used for MCTS in deterministic environments, node sets also contain a predicted reward, value and policy distribution attribute. Whereas in the former case where the predicted reward would be obtained directly from the dynamics function output and the predicted value and policy distribution would be obtained directly from the prediction function output, the predicted reward, value and policy distribution for node sets are calculated via expectation.
More formally, to obtain the predicted reward
where
Once these three values are computed, MCTS can perform all 4 stages of selection, expansion, simulation and backpropagation on a node set just like it does on a regular node.
The SigmaZero algorithm can be summarized as follows:
- loop for a number of episodes:
- at every time step
$t$ of the episode:- perform Monte Carlo tree search (MCTS)
- pass the current observation of the environment
$o_t$ to the representation function,$h(o_t) \rightarrow s^0$ , and get the hidden state representation$s^0$ from the output - pass the hidden state representation
$s^0$ into the prediction function,$f(s^0) \rightarrow p^0,v^0$ , and get the predicted policy distribution over actions$p^0$ and value$v^0$ from the output - for a number of simulations:
- select a leaf node set based on maximizing UCB score
- expand the node set by passing the hidden state representation
$s^k_i$ for each node$i$ in the parent node set$N^k$ and the corresponding action$a^{k+1}$ into the dynamics function,$g(s^k_i,a^{k+1}) \rightarrow [s^{k+1}_{i1},...,s^{k+1}_{ib}],[r^{k+1}_{i1},...,r^{k+1}_{ib}],[\pi^{k+1}_{i1},...,\pi^{k+1}_{ib}]$ , and get the predicted resulting hidden state representations$[s^{k+1}_{i1},...,s^{k+1}_{ib}]$ , transition rewards$[r^{k+1}_{i1},...,r^{k+1}_{ib}]$ and transition probabilities$[\pi^{k+1}_{i1},...,\pi^{k+1}_{ib}]$ from the output, for each parent node$i$ - create a child node set
$N^{k+1}$ and assign its hidden state as the set of all resulting hidden states from applying action$a^{k+1}$ to all parent nodes - for each node
$j$ in the child node set$N^{k+1}$ , update their transition probability so that it represents the probability of transitioning to the state represented by this node after applying the corresponding actions to the state of the root node:$\pi^{k+1}_j \leftarrow \pi^{k+1}_j \pi^k_{\rho(j)}$ , where$\rho(j)$ is the parent node of node$j$ - assign the transition reward of the child node set
$N^{k+1}$ as:$r_{N^{k+1}} = \sum_{j \in N^{k+1}} \pi_j^{k+1} r_j^{k+1}$ - for each node
$j$ in the child node set$N^{k+1}$ , pass their resulting hidden state representation$s^{k+1}_j$ into the prediction function,$f(s^{k+1}_j) \rightarrow p^{k+1}_j,v^{k+1}_j$ , and get the predicted policy distribution over actions$p^{k+1}_j$ and value$v^{k+1}_j$ from the output - assign the policy distribution over actions of the child node set
$N^{k+1}$ as:$p_{N^{k+1}} = [ \sum_{j \in N^{k+1}} \pi_j^{k+1} p_{j1}^{k+1} , ... , \sum_{j \in N^{k+1}} \pi_j^{k+1} p_{ja}^{k+1} ]$ - assign the value of the child node set
$N^{k+1}$ as:$v_{N^{k+1}} = \sum_{j \in N^{k+1}} \pi_j^{k+1} v_j^{k+1}$ - backpropagate the predicted value
$v_{N^{k+1}}$ up the search path
- pass the current observation of the environment
- sample an action based on the visit count of each child node set of the root node set
- apply the sampled action to the environment and observe the resulting transition reward
- perform Monte Carlo tree search (MCTS)
- once the episode is over, save the game trajectory (including the MCTS results) into the replay buffer
- sample a number of game trajectories from the replay buffer:
- pass the first observation of the environment
$o_0$ from the game trajectory to the representation function,$h(o_0) \rightarrow s^0$ and get the hidden state representation$s^0$ from the output - pass the hidden state representation
$s^0$ into the prediction function,$f(s^0) \rightarrow p^0,v^0$ , and get the predicted policy distribution over actions$p^0$ and value$v^0$ from the output - for every time step
$t$ in the game trajectory:- pass the hidden state representation
$s^t_i$ for each node$i$ in the current node set$N^t$ and the corresponding action$a^{t+1}$ into the dynamics function,$g(s^t_i,a^{t+1}) \rightarrow [s^{t+1}_{i1},...,s^{t+1}_{ib}],[r^{t+1}_{i1},...,r^{tk+1}_{ib}],[\pi^{t+1}_{i1},...,\pi^{t+1}_{ib}]$ , and get the predicted resulting hidden state representations$[s^{t+1}_{i1},...,s^{t+1}_{ib}]$ , transition rewards$[r^{t+1}_{i1},...,r^{t+1}_{ib}]$ and transition probabilities$[\pi^{t+1}_{i1},...,\pi^{t+1}_{ib}]$ from the output, for each node$i$ - create a node set
$N^{t+1}$ and assign its hidden state as the set of all resulting hidden states from applying action$a^{t+1}$ to all nodes of the node set$N^t$ from the previous time step - for each node
$j$ in the node set$N^{t+1}$ , update their transition probability so that it represents the probability of transitioning to the state represented by this node after applying the actions taken in this game trajectory so far to the initial state of this game trajectory:$\pi^{t+1}_j \leftarrow \pi^{t+1}_j \pi^t_{\rho(j)}$ , where$\rho(j)$ is the corresponding node from the previous time step of node$j$ - assign the transition reward of the node set
$N^{t+1}$ as:$r_{N^{t+1}} = \sum_{j \in N^{t+1}} \pi_j^{t+1} r_j^{t+1}$ - this predicted transition reward
$r_{N^{t+1}}$ is matched to the actual transition reward target received from the environment
- this predicted transition reward
- for each node
$j$ in the node set$N^{t+1}$ , pass their resulting hidden state representation$s^{t+1}_j$ into the prediction function,$f(s^{t+1}_j) \rightarrow p^{t+1}_j,v^{t+1}_j$ , and get the predicted policy distribution over actions$p^{t+1}_j$ and value$v^{t+1}_j$ from the output - assign the policy distribution over actions of the node set
$N^{t+1}$ as:$p_{N^{t+1}} = [ \sum_{j \in N^{t+1}} \pi_j^{t+1} p_{j1}^{t+1} , ... , \sum_{j \in N^{t+1}} \pi_j^{t+1} p_{ja}^{t+1} ]$ - this predicted policy distribution
$p_{N^{t+1}}$ is matched to the child node visit count distribution outputted by MCTS at that time step in that game trajectory
- this predicted policy distribution
- assign the value of the node set
$N^{t+1}$ as:$v_{N^{t+1}} = \sum_{j \in N^{t+1}} \pi_j^{t+1} v_j^{t+1}$ - this predicted value
$v_{N^{t+1}}$ is matched to the value outputted by MCTS at that time step in that game trajectory
- this predicted value
- update the weights of the representation, dynamics and prediction function based on these three targets
- pass the hidden state representation
- pass the first observation of the environment
- at every time step
Rather than a list of nodes, the node set is implemented as a single class object; the same as a node in regular MCTS. Comparisons between how the attributes are implemented between a node and node set are detailed in the table below:
Attribute | Node | Node Set |
---|---|---|
Prior | A scalar value obtained from the policy value of the parent node; used to calculate the UCB score | A scalar value obtained from the policy value of the parent node set; used to calculate the UCB score |
Hidden State Representation | A |
An |
Transition Probabilities | The Node class does not have this property, but if it were implemented, this value would be 1 | An |
Transition Reward | A scalar value obtained from the dynamics function, denoting the predicted reward received for transitioning to the state represented by this node, from the state represented by the parent node | A scalar value obtained by taking the expected value of the transition rewards from each node in the node set, denoting the predicted reward received for transitioning to the stochastic state represented by this node set, from the stochastic state represented by the parent node set; we don't need to keep track of the individual transition rewards as only the expected transition reward will be used to match against the target during training |
Policy | An |
An |
Value | A scalar value obtained from the prediction function, denoting the predicted return received if we were to follow the agent's policy at the state represented by this node for the remainder of the episode | A scalar value obtained by taking the expected value of the values from each node in the node set, denoting the predicted return received if we were to follow the agent's policy at the stochastic state represented by this node set for the remainder of the episode; we don’t need to keep track of the individual values as only the expected value will be used to match against the target during training |
Children | A list of child nodes for each corresponding action in the action space | A list of child node sets for each corresponding action in the action space |
Cumulative Value | A scalar value that accumulates values backpropagated up the search path; the average value is used to calculate the UCB score | A scalar value that accumulates values backpropagated up the search path; the average value is used to calculate the UCB score |
Visit Count | The number of times this node has been visited during MCTS; dividing the cumulative value by the visit count gives the average value | The number of times this node set has been visited during MCTS; dividing the cumulative value by the visit count gives the average value |
We now look to test the SigmaZero algorithm on a stochastic environment and compare its performance with MuZero.
A toy example StochasticWorld environment is used for this experiment. The StochasticWorld environment has the following parameters that can be configured:
-
num_states
: the number of states in the environment's state space -
num_actions
: the number of actions in the environment's action space -
timestep_limit
: the number of time steps in an episode before termination -
stochastic_branching_factor
: the number of possible states that can result from applying an action to a state (this value defines the true branching factor of the environment, whereas the hyperparameter$b$ is an approximation used by the model) -
transition_probabilities_stdev
: determines the skewness of the transition probabilities, which are generated as follows:- for every state, for every action, sample a vector of length
stochastic_branching_factor
from a normal distribution with mean 0 andtransition_probabilities_stdev
standard deviation - softmax the sampled values to get the corresponding transition probabilities of reaching one of the
stochastic_branching_factor
states (which are assigned randomly), as a result of applying that particular action to that particular state - if
transition_probabilities_stdev
is 0, all stochastic transitions have uniform probability, whereas the higher the value, the more skewed the probabilities are
- for every state, for every action, sample a vector of length
-
transition_rewards_range
: the range of possible transition reward values; this range is sampled from uniformly when constructing the transition dynamics of the environment
Each episode, the agent starts at a random initial state. At every time step, the agent applies an action which results in a possible new state and receives a transition reward, according to the transition dynamics generated from the parameter configuration. Once the time step limit is reached, the episode terminates and the return is calculated as the total sum of transition rewards received by the agent during the episode.
A StochasticWorld environment with the following parameters was used for experiments:
num_states
: 100num_actions
: 3timestep_limit
: 50stochastic_branching_factor
: 2transition_probabilities_stdev
: 10-2transition_rewards_range
: [-1,1]
For both the MuZero and SigmaZero algorithms, a separate neural network with the following parameters were used to approximate the representation, dynamics and prediction function:
num_layers
: 2num_neurons
: 32activation_function
: ReLUregularizer
: L2(10-3)hidden_state_size
: 32
The stochastic branching factor hyperparameter
Below is a graph illustrating the average reward received by agents using the MuZero and SigmaZero algorithm on the same StochasticWorld environment:
Additional experiments were conducted when the stochastic branching factor
As we can see from the results, SigmaZero (
In this project, we see an example of how SigmaZero can outperform MuZero on stochastic environments. We now discuss and compare possible advantages and disadvantages for using one algorithm over another.
Given that MCTS is set to run
The answer is not clear as to whether increasing the stochastic branching factor of SigmaZero will increase its performance, if the factor is less than the inherent stochastic branching factor of the environment. This may be worth exploring for future work.
From this project's experiments, we can see that if the stochastic branching factor is at least the inherent stochastic branching factor of the environment, SigmaZero performs better than MuZero, at the cost of greater runtime and space complexity.
Theoretically if we make the hidden state representation size equal to the number of states in the environment's state space, then MuZero would be able to represent any possible set of stochastic states for this environment; similar to how SigmaZero can represent a set of stochastic states in its node sets.
For example if the state space is of size 5, then a feature representation of
In environments with large state spaces, explicit representation becomes infeasible, but theoretically increasing the hidden state representation size should increase MuZero's power in modelling stochastic states.
Here is a list of possible ideas for future work in extending this project:
- find a way to make SigmaZero computationally more efficient
- after expanding a node set using the dynamics function, reduce the size of the hidden state representation by:
- merging similar resulting states together
- discarding resulting states with low transition probabilities
- modify the UCB score to factor in the depth level of node sets to control for how fast the size of hidden state representations increase in the search tree
- after expanding a node set using the dynamics function, reduce the size of the hidden state representation by:
- try StochasticWorld environments with different configurations
- higher stochastic branching factors
- more skewed transition probabilities
- sampling transition rewards from a non-uniform distribution
- use a bigger hidden state representation size for MuZero and compare its performance with the results of this project's experiments
- use more simulations for MCTS and see if that affects the results for both MuZero and SigmaZero
muzero / sigmazero
: there are identical Python files in themuzero
andsigmazero
directories; their functions are primarily the same except applying the MuZero and SigmaZero algorithm respectively:classes.py
holds data structure classes used by MuZero / SigmaZerostochastic_world.py
holds the data structure class for the StochasticWorld environmentmain.py
holds functions for self-play, MCTS, training and testingself_play
is the main function to call; it initiates self-play and trains MuZero / SigmaZero
test_rewards/
holds the average rewards received by the agent when it is periodically tested during trainingenv_configs/
holds the specific StochasticWorld environment configuration used by both MuZero and SigmaZero
plot.py
plots and compares the average test rewards received by MuZero and SigmaZero agentsassets/
holds media files used in thisREADME.md
requirements.txt
holds all required dependencies, which can be installed by typingpip install -r requirements.txt
in the command line
For this project, I'm using Python 3.7.4.