Policy Iteration

Please read my blog post on value iteration before reading this one.

Premise

While value iteration is very effective at solving MDPs, it can be computationally expensive to recalculate the entire board after each iteration (especially in the mid-20th century). In comes Ronald Howard with policy iteration. The idea is that instead of performing iterations on the entire board, one only performs iterations on the actual policy (the series of actions the agent takes from each state).

Intuition

In value iteration, we first calculate the expected value of each state (the utility of each state), then we derive our policy for each state based on those utilities; the values of every state are computed before we decide the policy. In policy iteration, we start with a policy, check the expected value/utility derived by following that policy, and make changes to the policy as appropriate. This is much more akin to the reinforcement learning algorithms used today, where we take actions, measure the results, and alter our actions accordingly.

Back to the Star Wars example: in value iteration, Luke draws out a map of Hoth and decides which steps to take only after the map has been finalized. In policy iteration, Luke decides which steps to take first, tries out the path, and modifies his map based on what he encounters; Luke then alters his policy based on his new map. Each individual step takes longer to complete, but the goal is to complete fewer steps, which makes the algorithm less expensive overall than value iteration.

Algorithm

In policy iteration, we first pick an initial policy $\pi_0$, then we calculate $U^{\pi_i}$, the utility of each state if $\pi_i$ were to be executed; this step is known as policy evaluation. We then calculate a new policy $\pi_{i+1}$ using $U^{\pi_i}$; this is the policy improvement step. We perform those two steps iteratively, and we can terminate either when $U$ converges (the difference between the utilities of states after each iteration is below a certain threshold) or once there are no more changes in the policy.

The policy evaluation step is as follows:

$$U^{\pi_i}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U^{\pi_i}(s')$$

This is very similar to the Bellman equation from value iteration; the difference is that we are only calculating the value based on the action determined by the pre-set policy, instead of the max of all possible actions.
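
To make the evaluation step concrete, here is a minimal Python sketch of iterative policy evaluation. It assumes a small MDP represented by a reward array `R`, a transition array `P` holding $P(s' \mid s, a)$, and a `policy` array mapping each state to an action; these names and shapes are illustrative, not from the original post.

```python
import numpy as np

# A minimal sketch of iterative policy evaluation, assuming an MDP given as:
#   R      -- array of shape (n_states,) with the reward R(s) of each state
#   P      -- array of shape (n_states, n_actions, n_states) with P(s' | s, a)
#   policy -- array of shape (n_states,) giving the action chosen in each state
def policy_evaluation(policy, P, R, gamma=0.9, theta=1e-6):
    n_states = len(R)
    U = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            # Back up only the action chosen by the policy -- no max over actions.
            new_u = R[s] + gamma * P[s, a] @ U
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < theta:  # utilities have stopped changing beyond the threshold
            return U
```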

The policy improvement step is as follows:

$$\pi_{i+1}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s')$$

We update the policy for each state to the action that maximizes expected utility; this step is akin to the final step of value iteration, where you determine the policy based on the board with the finalized utility values.
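
The improvement step and the outer loop could then look like the sketch below, reusing the `policy_evaluation` function and the same hypothetical `P` and `R` conventions as above.

```python
import numpy as np

# A sketch of the improvement step plus the outer loop, reusing policy_evaluation
# and the same hypothetical (P, R) conventions as in the previous sketch.
def policy_improvement(U, P):
    # P @ U has shape (n_states, n_actions): one expected utility per (state, action).
    # For each state, pick the action with the highest expected utility under U.
    return np.argmax(P @ U, axis=1)

def policy_iteration(P, R, gamma=0.9):
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy
    while True:
        U = policy_evaluation(policy, P, R, gamma)   # policy evaluation
        new_policy = policy_improvement(U, P)        # policy improvement
        if np.array_equal(new_policy, policy):       # stop once the policy no longer changes
            return policy, U
        policy = new_policy
```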

Conclusion

Policy iteration can be thought of as a rearranged value iteration. Each iteration is more expensive, since evaluating a policy means computing the utility of every state under it, but for a finite MDP the algorithm still converges to an optimal policy, and it will most likely terminate in fewer iterations and use less computation overall.

It's interesting to note how innovation sometimes happens because of constraints (in this case, physical hardware constraints); as our hardware becomes increasingly advanced, especially with the advent of quantum computing, will computational constraints become less relevant? If so, would algorithmic innovation still persist?