Please read my blog post on value iteration before reading this one.
While value iteration is very effective at solving MDPs, it can be computationally expensive to recalculate the entire board on every iteration (especially in the mid-20th century). In comes Ronald Howard with policy iteration. The idea is that instead of iterating over the entire board, we iterate only on the policy itself (the mapping from each state to the action the agent takes there).
In value iteration, we first calculate the expected value of each state (its utility), and only then derive the policy for each state from those utilities. In policy iteration, we start with a policy, evaluate the expected value/utility of following it, and change the policy as appropriate; this is much more akin to the reinforcement learning algorithms used today, where we take actions, measure the results, and adjust our actions accordingly.
Back to the Star Wars example: in value iteration, Luke draws out a map of Hoth and decides which steps to take only after the map has been finalized. In policy iteration, Luke decides which steps to take first, tries out the path, and modifies his map based on what he encounters. Luke then alters his policy based on his new map. Each individual step takes longer to complete, but the goal is to complete fewer steps overall, which can make the algorithm as a whole less expensive than value iteration.
In policy iteration, we first pick an arbitrary policy, then alternate between two steps, policy evaluation and policy improvement, until the policy stops changing.
The policy evaluation step is as follows:
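Written out in the usual notation, with $\pi(s)$ as the action the current policy picks in state $s$, $P(s' \mid s, a)$ as the transition probabilities, $R$ as the reward, and $\gamma$ as the discount factor, the update for each state is:

$$V^{\pi}(s) \leftarrow \sum_{s'} P\big(s' \mid s, \pi(s)\big)\Big[R\big(s, \pi(s), s'\big) + \gamma \, V^{\pi}(s')\Big]$$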
This is very similar to the Bellman equation from value iteration; the difference is that we are only calculating the value based on the action determined by the pre-set policy, instead of the max of all possible actions.
The policy improvement step is as follows:
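Again in the usual notation, for each state $s$ we make the policy greedy with respect to the values we just computed:

$$\pi(s) \leftarrow \arg\max_{a} \sum_{s'} P(s' \mid s, a)\Big[R(s, a, s') + \gamma \, V^{\pi}(s')\Big]$$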
We update the policy for each state to the action that maximizes expected reward; this step is akin to the final step of value iteration, where we determine the policy from the board once its utility values have been finalized.
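To make the two steps concrete, here is a minimal sketch of the full loop in Python. Everything in it is illustrative: the three-state MDP, its transition table `P`, its reward table `R`, and the helper names `policy_evaluation` and `policy_improvement` are assumptions made up for this example, not anything from Howard's original formulation.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9          # discount factor
theta = 1e-8         # convergence threshold for policy evaluation

# P[s, a, s'] = probability of landing in s' after taking action a in state s
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],   # transitions from state 0
    [[0.0, 0.6, 0.4], [0.0, 0.1, 0.9]],   # transitions from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [0.0, 0.0],
    [0.0, 1.0],                            # action 1 in state 1 pays off
    [0.0, 0.0],
])

def policy_evaluation(policy, V):
    """Iteratively compute the value of following the current (deterministic) policy."""
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]
            v_new = R[s, a] + gamma * P[s, a] @ V
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def policy_improvement(policy, V):
    """Greedily switch each state to its best one-step action; report whether the policy changed."""
    stable = True
    for s in range(n_states):
        q = R[s] + gamma * P[s] @ V        # Q-value of every action in state s
        best = int(np.argmax(q))
        if best != policy[s]:
            policy[s] = best
            stable = False
    return stable

policy = np.zeros(n_states, dtype=int)     # start from an arbitrary policy
V = np.zeros(n_states)
while True:
    V = policy_evaluation(policy, V)
    if policy_improvement(policy, V):      # done once the policy stops changing
        break

print("policy:", policy)
print("values:", V)
```

The outer loop stops as soon as an improvement sweep leaves the policy unchanged; at that point the policy is greedy with respect to its own value function, which is the usual termination condition for policy iteration.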
Policy iteration can be thought of as a rearranged value iteration. For finite MDPs it is guaranteed to converge to an optimal policy, and it typically terminates in far fewer iterations and with less total computation than value iteration, even though each individual iteration does more work.
It's interesting to note how innovation sometimes happens because of constraints (in this case, physical hardware constraints); as our hardware becomes increasingly advanced, especially with the advent of quantum computing, will computational constraints become less relevant? If so, would algorithmic innovation still persist?