Blog Post #2 (Planning)
We left off last week with a defined set of states, actions, and transition probabilities, and a big question of how these become an MDP (Markov Decision Process). It turns out, the math isn't that complicated! It simply takes reawakening long-asleep memories of how to express systems of equations in matrix form and what dot products represent.
First of all, a policy is a mapping from states to actions. In other words, it tells the robot what action to take in every state. A policy can be evaluated by summing the rewards associated with every state the robot passes through until it reaches the goal. We call this the value function for a given policy. An optimal policy should tell the robot which action will maximize the total reward. In our case, we set the goal reward to 10 and the reward for any other state to -1. This motivates the robot to spend as little time in non-goal states as possible, effectively optimizing the path length. In this algorithm, we start with a random policy, which we will iterate on. The value function at a state, V_pi(S), can be computed recursively with:
V_pi(S) = Reward at S + gamma * sum over transition states S' (probability of transition * V_pi(S'))
Where gamma is a number between 0 and 1 that weighs down future states (to encode uncertainty). We can also write this in vector form:
value_function_vector = rewards_vector + gamma * transition_matrix * value_function_vector

V_pi = R + gamma * P_pi * V_pi
Where P_pi is the matrix of transition probabilities under the policy:

P_pi = [ [transition from S1 to S1 with action pi(S1), ..., transition from S1 to Sn with action pi(S1)],
[transition from S2 to S1 with action pi(S2), ..., transition from S2 to Sn with action pi(S2)],
...
[transition from Sn to S1 with action pi(Sn), ..., transition from Sn to Sn with action pi(Sn)] ]
Wielding the power of linear algebra, we can solve this by rearranging the equation as follows:
V_pi = (I - gamma * P_pi)^-1 * R
Where I is the identity matrix of the size that works. And that's it! Since this is just a matrix operation, numpy can do that pretty quickly.
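As a concrete sketch, here is roughly what that evaluation step looks like in numpy (the function and variable names are ours, chosen for illustration):

```python
import numpy as np

def evaluate_policy(P_pi, R, gamma):
    """Solve V_pi = R + gamma * P_pi @ V_pi for the value of a fixed policy.

    P_pi  : (n, n) array, P_pi[i, j] = probability of landing in state j when
            taking the policy's action in state i.
    R     : (n,) array of per-state rewards.
    gamma : discount factor between 0 and 1.
    """
    n = P_pi.shape[0]
    # Rearranged form: (I - gamma * P_pi) V_pi = R. Solving the linear system
    # gives the same result as the inverse formula above.
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```

Using np.linalg.solve instead of explicitly forming the inverse is a small numerical nicety; mathematically it is the same rearranged equation.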
Having calculated the value of the rewards the robot would collect starting from any given state and following this initial policy, how do we know that we have the optimal policy? Let's take an arbitrary state s. Remember that the value at s is the sum of the reward at s and the weighted sum of the values gained in the long term by transitioning to the other states, based on the policy and the probability of transitioning from s to those states. Now, is the action taken at s actually optimal? What if we switched the action for one iteration? The idea of a greedy improvement algorithm is very helpful here. For each state s, we calculate the value we would gain by substituting action a1, a2, a3, ... for the action given by our current policy. Out of the resulting values, we choose the new action to be the one that gives the greatest value (mmm, maximum greed). Cycling through each state, we get a new policy, which we then use to calculate a new value function, which we use to improve the policy again. When do we know we have the optimal policy? When the actions giving the greatest value stop changing from one policy to the next, we have reached a point where we cannot improve anymore!
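Putting the evaluation and greedy-improvement steps together, the whole loop can be sketched like this (a sketch, assuming the (num_actions, num_states, num_states) transition layout described below; names are illustrative):

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=100):
    """Sketch of policy iteration: evaluate the current policy exactly,
    then greedily pick the best action in every state, until nothing changes.

    P : (num_actions, num_states, num_states), P[a, s, s'] = probability of
        ending in s' after taking action a in state s.
    R : (num_states,) per-state rewards.
    """
    num_actions, num_states, _ = P.shape
    policy = np.zeros(num_states, dtype=int)  # arbitrary starting policy

    for _ in range(max_iters):
        # Policy evaluation: pick each state's transition row under its
        # current action, then solve V = R + gamma * P_pi V exactly.
        P_pi = P[policy, np.arange(num_states), :]
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R)

        # Greedy improvement: Q[s, a] = R[s] + gamma * sum_s' P[a, s, s'] * V[s'].
        Q = R[:, None] + gamma * np.einsum('asj,j->sa', P, V)
        new_policy = Q.argmax(axis=1)

        if np.array_equal(new_policy, policy):  # no action changed: done
            break
        policy = new_policy

    return policy, V
```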
When building the roadmap, we very quickly ran into a problem. Simply looping through every state to find the nearest neighbor takes a lot of time. O(n^2) kind of time. So when we ran the Markov model generator with a sufficient number of nodes, it took entirely too long. Luckily, there are data structures for this exact problem! In one dimension, a fan favorite is the binary search tree with a very attractive log(n) search time. The same general principle can be applied in more dimensions using what's called a K-Dimensional Tree (or k-d tree for short). An average query is O(log(n)), which is pretty sweet. However, k-d trees partition the space based on Euclidean distance. This doesn't work if one of the dimensions of each state is the robot's orientation, because Euclidean distance doesn't do well with wraparound. We are using sklearn's implementation of the k-d tree, which does the heavy lifting in C. So if we give it our own distance metric, the k-d tree needs to constantly jump out of C to call the Python distance function, significantly slowing things down. Our solution was to find a few nearest neighbors with the k-d tree, ignoring orientation, and then search for the closest orientation match linearly within that smaller sample. This is still taking way too much time, so we'll have to keep squeezing efficiency out of this.
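The two-stage lookup looks roughly like this (a sketch; the state layout, the choice of k=10, and the function names are our own assumptions):

```python
import numpy as np
from sklearn.neighbors import KDTree

# states: (n, 3) array of (x, y, theta). The tree only indexes (x, y), so
# sklearn's C implementation never has to call back into Python.
def build_position_tree(states):
    return KDTree(states[:, :2])

def nearest_state(tree, states, query, k=10):
    """Grab k position-wise neighbors from the k-d tree, then linearly pick
    the one whose orientation is closest, with wraparound handled."""
    _, idx = tree.query(np.atleast_2d(query[:2]), k=k)
    candidates = idx[0]
    dtheta = np.abs(states[candidates, 2] - query[2])
    dtheta = np.minimum(dtheta, 2 * np.pi - dtheta)  # 0.1 rad and 6.2 rad are close
    return candidates[np.argmin(dtheta)]
```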
Now that we understood how our transition matrix became a policy, we decided to write our own implementation and compare it to the Python package mdptoolbox. We had seen this package earlier, but before understanding what the states, actions, and transitions represented and what format they were expected in, it was not very clear how to start using it. It turns out, mdptoolbox simply requires a transition matrix with a size of (num_actions, num_states, num_states), a reward vector, and a discount factor (representing how much weight to give rewards farther out into the future); see the snippet below the figure. In our own implementation, we tried to transcribe the infinite-horizon policy iteration approach we discussed above. The resulting policies, where a red state should turn left, green should go forward, magenta should turn right, and the green sphere marks the goal state, were as follows:
[Policy visualizations: Homebrew MDP (left) vs. mdptoolbox (right)]
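For reference, a call into mdptoolbox can be roughly this small — a minimal sketch, assuming the pymdptoolbox package and a transition tensor already in the (num_actions, num_states, num_states) layout (the 0.9 discount is just a placeholder):

```python
import numpy as np
import mdptoolbox

def solve_with_mdptoolbox(P, R, gamma=0.9):
    """P: (num_actions, num_states, num_states), with each P[a] row-stochastic.
    R: (num_states,) per-state reward vector. gamma: discount factor."""
    solver = mdptoolbox.mdp.PolicyIteration(P, R, gamma)
    solver.run()
    return np.array(solver.policy), np.array(solver.V)
```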
Right off the bat, we could see that our implementation of the MDP had some problems: the optimal policy was to turn at every state. There is no way you could reach the goal like that! mdptoolbox's method seemed to give a more reasonable policy, with more green go-forward arrows, suggesting that there was hope. However, looking closely at the green poses, we noticed that most of them had orientations pointing AWAY from the goal, often heading off the map! Perhaps we got a particularly curious robit algorithm?
To test our hunch about whether the policy we generated would actually lead the robot to the goal state, we created a simulation of a robot (blue sphere). Given different start states, it would attempt to execute the policy by repeatedly taking the action assigned to its current state and approximating which state it transitioned to (updating its current state). The following simulation was done with 1000 states (100 positions, 10 orientations). Curiously, most start states would generate maybe 2 transitions before staying in the same state forever. This particular start state was a bit more lively, but the robit absolutely ignores the goal state!
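Our simulator executes the real motion and then snaps back to the nearest state; as a simpler stand-in for the same idea, a rollout can be sketched by sampling the next state straight from the transition model (names here are illustrative, not the actual simulator code):

```python
import numpy as np

def rollout(P, policy, start_state, goal_state, max_steps=200, seed=0):
    """Follow the policy from start_state, sampling successors from P,
    until the goal is reached or the step budget runs out.

    P : (num_actions, num_states, num_states) transition probabilities.
    """
    rng = np.random.default_rng(seed)
    num_states = P.shape[1]
    path = [start_state]
    state = start_state
    for _ in range(max_steps):
        if state == goal_state:
            break
        action = policy[state]
        state = rng.choice(num_states, p=P[action, state])  # sample next state
        path.append(state)
    return path
```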
At this point, we were stumped. It seemed like the algorithm was working, generating states and transitions that seemed to make sense, mdptoolbox's policy was converging, and no errors were being thrown. A gremlin had made a nest for itself somewhere in the code and was silently breaking it. We turned to visualizing our transition matrix as a first step. We noticed that no transitions were happening past the second state, as shown by the two colored plots. In the 20-state plot, a red color means no probability of a transition, while a bluer color represents a higher chance of one. In the 50-state plot, purple means no probability, and yellow means a 100% chance. This particular gremlin was nasty: instead of appending the end_state_idx to the transition matrix, we were appending its index in the list of transitions. Since this list was most often of length 1 or 2, only the first couple of states became end_idxs! The bottom two plots show the transition matrix after this bug fix.
[Transition matrix plots: 20 states (left) and 50 states (right), before (top) and after (bottom) the bug fix]
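In code, the gremlin looked roughly like this (a hypothetical reconstruction; the real variable names and data structures differ):

```python
import numpy as np

# Hypothetical reconstruction: `transitions` holds (end_state_idx, probability)
# pairs for one (start_state_idx, action_idx) combination.
num_states, num_actions = 5, 3
transition_matrix = np.zeros((num_states, num_states, num_actions))
start_state_idx, action_idx = 2, 1
transitions = [(3, 0.7), (4, 0.3)]

for i, (end_state_idx, prob) in enumerate(transitions):
    # Buggy version: used the position in the list (0, 1, ...) as the end-state
    # index, so only the first couple of states ever received any probability:
    # transition_matrix[start_state_idx, i, action_idx] += prob

    # Fixed version: index by the actual destination state.
    transition_matrix[start_state_idx, end_state_idx, action_idx] += prob
```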
Unfortunately, evicting this gremlin did not fix the policy simulation. However, our next lead is mdptoolbox's requirement of an input transition matrix with size (num_actions, num_states, num_states). Throughout this project, we've come up against a couple of decisions for which we did not have any basis to lean one way or the other. One of these was how to format our transition matrix (the second being a good look-up structure for our states). We started with a list of [start_state_idx, end_state_idx, action_idx, probability]. When we realized we would need to perform matrix magic on this matrix, we understood that it needed to be 3-dimensional, with each probability corresponding to a point along the 3 axes (start state, end state, action). Arbitrarily, we chose a matrix with size (num_states, num_states, num_actions). However, this may not work out with the matrix math, since the probabilities must add up along the rows instead of the columns.
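One way to bridge the two layouts is to transpose into mdptoolbox's (num_actions, num_states, num_states) order and renormalize each row into a proper probability distribution. A sketch (the self-loop fallback for states with no recorded transitions is our own assumption, not something our pipeline necessarily does):

```python
import numpy as np

def to_mdptoolbox_layout(T):
    """Convert T indexed [start, end, action] into P indexed [action, start, end],
    renormalizing so every row P[a, s, :] sums to 1."""
    P = np.transpose(T, (2, 0, 1)).copy()
    num_actions, num_states, _ = P.shape
    for a in range(num_actions):
        for s in range(num_states):
            total = P[a, s].sum()
            if total == 0:
                P[a, s, s] = 1.0  # no recorded transitions: assume the robot stays put
            else:
                P[a, s] /= total
    return P
```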
We are off to fight the rest of our invisible gremlins. Until the final demo!