The goal of reinforcement learning is to maximize the cumulative reward that an agent receives in a given task. From a high-level point of view, there are two main ways to do this.
One is the direct approach: if we have the optimal policy function $\pi^*(a|s)$, the agent can simply sample actions from it and interact with the environment. Policy-based methods refer to approaches that directly learn a policy function and improve it iteratively so that it approaches the optimal policy.
The other is a more indirect approach: instead of learning the policy itself, we learn how good each action is in a given state, which is the optimal action-value function $Q^*(s,a)$. Once we have this, we can obtain a policy by choosing the action with the highest value using a greedy strategy, $a^*=\pi^*(s)=\argmax_a Q^*(s,a)$. Value-based methods refer to approaches that learn an action-value function and iteratively improve it so that it approaches the optimal action-value function.
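To make the greedy strategy concrete, here is a minimal sketch in Python. It assumes a small tabular $Q$ whose entries are filled in arbitrarily for illustration; the names `q_table` and `greedy_action` are illustrative, not part of any library.

```python
import numpy as np

def greedy_action(q_table, state):
    """Pick the action with the highest estimated value Q(s, a)."""
    return int(np.argmax(q_table[state]))

# Example: 3 states, 2 actions; the values are arbitrary and only for illustration.
q_table = np.array([[1.0, 2.0],
                    [0.5, 0.1],
                    [3.0, 2.5]])
print(greedy_action(q_table, state=0))  # -> 1, the higher-valued action in state 0
```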
Let’s consider an intuitive example to understand the TD algorithm. Suppose a model $Q$ can predict the travel time for a driving trip. At first, the model isn’t very accurate and might even behave almost randomly. But as more people use it, we gather more data and continue training, and the model gets better and better over time.

Figure 1
Then how should we train this model?
Before the trip starts, the user provides the model with the starting point $s$ and the destination $d$, and the model makes a prediction $\hat{q}=Q(s,d;\textbf{w})$. When the user finishes the trip, the actual driving time $y$ is fed back to the model. The difference $\hat{q} - y$ reflects whether the model has overestimated or underestimated the driving time. This feedback can then be used to update the model and make its estimates more accurate. We want the estimated value $\hat{q}$ to be as close as possible to the true observed value $y$.
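As a concrete sketch of this update, the snippet below performs one gradient step on a simple linear model, assuming the squared difference $\tfrac{1}{2}(\hat{q}-y)^2$ as the loss. The function name `mc_update`, the feature encoding of $(s, d)$, and the learning rate are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def mc_update(w, features, y, lr=0.01):
    """One gradient step pulling the estimate q_hat toward the observed time y.

    features : feature vector encoding the (start, destination) pair
    y        : actual driving time observed after completing the trip
    """
    q_hat = np.dot(w, features)          # model prediction Q(s, d; w)
    loss_grad = (q_hat - y) * features   # gradient of 0.5 * (q_hat - y)^2 w.r.t. w
    return w - lr * loss_grad            # move w so q_hat gets closer to y
```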
The approach above requires running a complete episode to obtain an actual observation. If we only execute part of an episode, can we still update the model?

Figure 2
Following the trip example, now consider this scenario: before departure, the model estimates that the total travel time will be $\hat{q} = 14$ hours, and the route it recommends passes through Jinan. I set out from Beijing, and after $r = 4.5$ hours I arrive in Jinan. At this point, I ask the model to make another prediction, and the model tells me:
$$ \hat{q}' \triangleq Q(\text{"Jinan"}, \text{"Shanghai"}; \textbf{w}) = 11 . $$
Suppose that at this point my car breaks down and must be repaired in Jinan, so I have to cancel the trip. In other words, I do not complete the journey. Can this piece of data still help train the model? In fact, it can—the algorithm used here is Temporal Difference (TD) learning.
Next, we explain the principle of the TD algorithm. Let us first review the data we already have: the model estimated that traveling from Beijing to Shanghai would take $\hat{q} = 14$ hours in total. In reality, it took $r = 4.5$ hours to reach Jinan, and the model estimated that traveling from Jinan to Shanghai would still require $\hat{q}' = 11$ hours. Upon arriving in Jinan, according to the model’s latest estimate, the total travel time for the entire journey is:
$$ \hat{y} \triangleq r + \hat{q}' = 4.5 + 11 = 15.5 . $$
In the TD algorithm, $\hat{y}=15.5$ is called the TD target. It is more reliable than the initial prediction $\hat{q}=14$. The initial prediction is purely an estimate and contains no factual component. The TD target $\hat{y}=15.5$ is also an estimate, but it incorporates a factual component: $r=4.5$ is an actual observation.
Based on the above discussion, we consider the TD target $\hat{y}=15.5$ to be more reliable than the model’s initial estimate $\hat{q}=14$. Therefore, we can use $\hat{y}$ to ‘correct’ the model. We would like the estimate $\hat{q}$ to be as close as possible to the TD target $\hat{y}$, so we build the loss function from the difference between the two. The quantity $\delta = \hat{q} - \hat{y}$ is called the TD error. Our goal is to make the TD error as close to zero as possible.
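Putting the pieces together, here is a minimal sketch of one TD update for the trip example, again assuming a linear model $Q(s,d;\textbf{w})$; the feature vectors, function name, and learning rate are illustrative assumptions. The TD target is treated as a constant when taking the gradient, which is the usual semi-gradient convention.

```python
import numpy as np

def td_update(w, feat_s, feat_s_next, r, lr=0.01):
    """One TD step for the trip example.

    feat_s      : features of the original pair, e.g. (Beijing, Shanghai)
    feat_s_next : features of the remaining leg, e.g. (Jinan, Shanghai)
    r           : observed time for the completed leg, e.g. 4.5 hours Beijing -> Jinan
    """
    q_hat = np.dot(w, feat_s)              # initial estimate, e.g. 14
    q_hat_next = np.dot(w, feat_s_next)    # estimate for the remaining leg, e.g. 11
    y_hat = r + q_hat_next                 # TD target, e.g. 4.5 + 11 = 15.5
    delta = q_hat - y_hat                  # TD error
    # Gradient of 0.5 * delta^2 w.r.t. w, treating the TD target y_hat as a constant
    return w - lr * delta * feat_s
```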