Reinforcement learning by itself isn't good enough for autonomous driving, because other agents are unpredictable: the environment evolves not only due to our actions, but also due to what the other agents are doing.

It is possible to drop the Markovian assumption and still do policy-gradient learning for such complex systems. Let's define trajectories as sequences of state-action pairs, following a stochastic policy $\pi_\theta$:

$\bar{s} = \{(s_1,a_1),(s_2,a_2),\cdots,(s_T,a_T)\}$, where $\bar{s}_{i:j}$ denotes the subtrajectory from timestep $i$ to $j$.

Let the reward function $R(\bar{s})$ be defined over trajectories; this could be discounted.

Goal: find the stochastic policy that maximizes $\mathbb{E}_{\bar{s}}[R(\bar{s})]$.

Any stochastic policy could lead to many possible trajectories. Let $\mathbf{P}(\bar{s})$ be the probability of a trajectory:

\begin{align*} \mathbf{P}(\bar{s}) &= P(s_1)\cdot \pi_\theta(a_1|s_1) \cdot P(s_2 | \bar{s}_{1:1}) \cdot \pi_\theta(a_2|s_2) \cdot P(s_3 | \bar{s}_{1:2}) \cdot \pi_\theta(a_3 | s_3) \cdots \\ &= \prod_{i=1}^T P(s_i | \bar{s}_{1:i-1}) \cdot \pi_\theta(a_i | s_i) \end{align*}
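As a sanity check, the factorization above can be computed directly in log space. The sketch below uses hypothetical `log_transition` and `log_policy` callables standing in for $\log P(s_i|\bar{s}_{1:i-1})$ and $\log \pi_\theta(a_i|s_i)$; the uniform two-state example is purely illustrative:

```python
import math

def trajectory_log_prob(traj, log_transition, log_policy):
    """Log-probability of a trajectory [(s_1, a_1), ..., (s_T, a_T)]:
    sum_i log P(s_i | \bar{s}_{1:i-1}) + log pi(a_i | s_i)."""
    logp = 0.0
    for i, (s, a) in enumerate(traj):
        history = traj[:i]                   # \bar{s}_{1:i-1}
        logp += log_transition(s, history)   # history-dependent dynamics
        logp += log_policy(a, s)             # policy term
    return logp

# Toy example: two states/actions, uniform dynamics and uniform policy.
uniform = lambda x, _ctx: math.log(0.5)
traj = [(0, 1), (1, 0)]
p = math.exp(trajectory_log_prob(traj, uniform, uniform))
# four factors of 1/2 -> 1/16
```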

The policy gradient turns out to be the same as in the Markovian case:

\begin{align*} \nabla_\theta \mathbb{E}_{\bar{s}}[R(\bar{s})] &= \nabla_\theta \sum_{\bar{s}} \mathbf{P}(\bar{s}) \cdot R(\bar{s}) \\ &= \sum_{\bar{s}} R(\bar{s}) \cdot \nabla_\theta \mathbf{P}(\bar{s}) \mbox{ ; reward doesn't depend on $\theta$} \\ &= \sum_{\bar{s}} R(\bar{s}) \cdot \mathbf{P}(\bar{s}) \cdot \nabla_\theta \log \mathbf{P}(\bar{s}) \mbox{ ; the log trick used in policy gradient} \\ &= \sum_{\bar{s}} R(\bar{s}) \cdot \mathbf{P}(\bar{s}) \cdot \nabla_\theta \bigg[ \log \bigg\{ \prod_{i=1}^T P(s_i | \bar{s}_{1:i-1}) \cdot \pi_\theta(a_i | s_i) \bigg\} \bigg] \\ &= \sum_{\bar{s}} R(\bar{s}) \cdot \mathbf{P}(\bar{s}) \cdot \nabla_\theta \bigg[ \sum_{i=1}^T \log P(s_i | \bar{s}_{1:i-1}) + \sum_{i=1}^T \log \pi_\theta(a_i | s_i) \bigg] \\ &= \sum_{\bar{s}} R(\bar{s}) \cdot \mathbf{P}(\bar{s}) \cdot \bigg[\sum_{i=1}^T \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg] \mbox{ ; $P(s_i | \bar{s}_{1:i-1})$ doesn't depend on $\theta$} \\ &= \mathbb{E}_{\bar{s}} \bigg[ R(\bar{s}) \cdot \bigg\{ \sum_{i=1}^T \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg\} \bigg] = \mathbb{E}_{\bar{s}} \bigg[ \sum_{i=1}^T R(\bar{s}) \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg] \end{align*}
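The final expression is what makes REINFORCE-style estimation possible: sample trajectories, then average $R(\bar{s}) \cdot \nabla_\theta \log \pi_\theta(a_i|s_i)$. Below is a minimal sketch on a one-step problem with a softmax policy (the logits, toy rewards, and all names are assumptions, not anything from the text), comparing the Monte Carlo estimate against the exact gradient:

```python
import math, random

random.seed(0)

theta = [0.2, -0.1]      # softmax logits for two actions (arbitrary)
rewards = [1.0, 0.0]     # toy reward, depends only on the action here

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(a, probs):
    # d/d theta_k log pi(a) = 1[a == k] - pi(k) for a softmax policy
    return [(1.0 if a == k else 0.0) - probs[k] for k in range(len(probs))]

probs = softmax(theta)

# Exact gradient: sum_a pi(a) * R(a) * grad log pi(a)
exact = [sum(probs[a] * rewards[a] * grad_log_pi(a, probs)[k]
             for a in range(2)) for k in range(2)]

# Score-function (REINFORCE) Monte Carlo estimate of the same quantity
n = 200_000
est = [0.0, 0.0]
for _ in range(n):
    a = 0 if random.random() < probs[0] else 1   # sample a ~ pi_theta
    g = grad_log_pi(a, probs)
    for k in range(2):
        est[k] += rewards[a] * g[k] / n
```

With enough samples the estimate matches the exact gradient closely, which is exactly the unbiasedness the derivation establishes.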

Using the policy gradient like this is problematic, because the term inside the expectation can vary wildly from trajectory to trajectory; the resulting high-variance gradient estimates make for poor updates.

This problem can be solved by introducing a baseline term $b_i$:

$\nabla_\theta \mathbb{E}_{\bar{s}}[R(\bar{s})] = \mathbb{E}_{\bar{s}} \bigg[ \sum_{i=1}^T R(\bar{s}) \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg] = \mathbb{E}_{\bar{s}} \bigg[ \sum_{i=1}^T [R(\bar{s}) - b_i] \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg]$

To be able to introduce a baseline term, we need: $$\mathbb{E}_{\bar{s}} \bigg[ \sum_{i=1}^T b_i \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg] = 0$$ To prove that this is indeed the case, we expand the expectation: \begin{align*} \mathbb{E}_{\bar{s}} \bigg[ \sum_{i=1}^T b_i \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg] &= \sum_{\bar{s}} \mathbf{P}(\bar{s}) \cdot \sum_{i=1}^T b_i \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \\ &= \sum_{i=1}^T \sum_{\bar{s}} \mathbf{P}(\bar{s}) \cdot b_i \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \end{align*} The summation over trajectories $\bar{s}$ can be rewritten by breaking each trajectory into the prefix $(s_1,a_1,s_2,\cdots,s_i)$, the action $a_i$, and the suffix $(s_{i+1},a_{i+1},\cdots,s_T,a_T)$. Note that $b_i$ is decided before $a_i$ is taken, and $\nabla_\theta \log \pi_\theta(a_i|s_i)$ doesn't depend on $s_{i+1:T}$, so both can be moved outside the corresponding sums. \begin{align*} \mathbb{E}_{\bar{s}} [\cdots] &= \sum_{i=1}^T \sum_{\bar{s}} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot \pi_\theta(a_i|s_i) \cdot \mathbf{P}(\bar{s}_{i+1:T}) \cdot b_i \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \\ &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot \sum_{a_i} \pi_\theta(a_i|s_i) \cdot \sum_{\bar{s}_{i+1:T}} \mathbf{P}(\bar{s}_{i+1:T}) \cdot b_i \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \\ &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot \sum_{a_i} \pi_\theta(a_i|s_i) \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \cdot \sum_{\bar{s}_{i+1:T}} \mathbf{P}(\bar{s}_{i+1:T}) \end{align*} Here $\mathbf{P}(\bar{s}_{i+1:T})$, implicitly conditioned on the trajectory up to timestep $i$, is the probability of the suffix from timestep $i+1$ to $T$, so it sums to $1$ over all possible $\bar{s}_{i+1:T}$: \begin{align*} \mathbb{E}_{\bar{s}} [\cdots] &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot \sum_{a_i} \pi_\theta(a_i|s_i) \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \\ &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot \sum_{a_i} \pi_\theta(a_i|s_i) \cdot \frac{\nabla_\theta \pi_\theta(a_i | s_i)}{\pi_\theta(a_i | s_i)} \\ &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot \sum_{a_i} \nabla_\theta \pi_\theta(a_i | s_i) \\ &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot \nabla_\theta \sum_{a_i} \pi_\theta(a_i | s_i) \\ &= \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot \nabla_\theta 1 = \sum_{i=1}^T \sum_{\bar{s}_{1:i-1}, s_i} \mathbf{P}(\bar{s}_{1:i-1}, s_i) \cdot b_i \cdot 0 = 0 \end{align*}
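The inner sum $\sum_{a_i} \pi_\theta(a_i|s_i) \cdot \nabla_\theta \log \pi_\theta(a_i|s_i)$ is exactly the term that vanishes, and this can be checked numerically for a single softmax-parameterized decision. A toy sketch (the logits and the baseline value are arbitrary assumptions):

```python
import math

theta = [0.5, -0.5, 0.1]   # arbitrary softmax logits over three actions
b = 3.7                    # any baseline that doesn't depend on the action

m = max(theta)
exps = [math.exp(t - m) for t in theta]
z = sum(exps)
probs = [e / z for e in exps]

# E_a[ b * d/d theta_k log pi(a) ], with the softmax score
# d/d theta_k log pi(a) = 1[a == k] - pi(k); should be 0 for every k.
expected = [sum(probs[a] * b * ((1.0 if a == k else 0.0) - probs[k])
                for a in range(3)) for k in range(3)]
```

Every component comes out to zero (up to floating-point error), regardless of the value of `b`, mirroring the $\nabla_\theta \sum_{a_i} \pi_\theta(a_i|s_i) = \nabla_\theta 1 = 0$ step.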

Let's rewrite the policy gradient in matrix form. Let $R_{1\times T}$ be a row vector of length $T$ with all entries equal to $R(\bar{s})$, $b_{1\times T} = \{ b_i \}_{i=1}^T$, and $F_{T \times 1} = \{ \nabla_\theta \log \pi_\theta(a_i|s_i)\}_{i=1}^T$; then our policy gradient is

$\mathbb{E}_{\bar{s}} \bigg[ \sum_{i=1}^T [R(\bar{s}) - b_i] \cdot \nabla_\theta \log \pi_\theta(a_i | s_i) \bigg] = \mathbb{E}_{\bar{s}} [(R-b)F]$

To minimize the variance of the term $(R-b)F$, we can equivalently minimize $\mathbb{E}_{\bar{s}}[\{(R-b)F\}^2]$, since the baseline doesn't change the mean, as shown in Peters et al. \cite{peters2008reinforcement}. So the problem of finding the baseline coefficients reduces to:

$\frac{\partial}{\partial b} \mathbb{E}_{\bar{s}}\big[\{(R-b)F\}^2\big] = 0$
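Expanding this condition componentwise (a sketch, treating each entry of $F$ as the scalar score for one parameter of $\theta$) gives a linear system for the baseline:

\begin{align*} \frac{\partial}{\partial b_i}\, \mathbb{E}_{\bar{s}}\big[\{(R-b)F\}^2\big] &= -2\, \mathbb{E}_{\bar{s}}\big[\{(R-b)F\} \cdot F_i\big] = 0 \\ \Rightarrow \quad \sum_{j=1}^T \mathbb{E}_{\bar{s}}[F_i F_j] \cdot b_j &= \mathbb{E}_{\bar{s}}\bigg[R(\bar{s}) \cdot F_i \sum_{j=1}^T F_j\bigg] \end{align*}

Solving this system yields reward-weighted, score-weighted baseline coefficients rather than a plain average of returns.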

In practice, even adding baselines doesn't solve the problem. Consider designing an autonomous driving system in which accidents occur with probability $p$. Suppose we penalize the system with a reward of $-r$ when accidents happen, and in any other case our rewards lie between $-1$ and $1$. The expected reward is bounded by: $$\mathbb{E}_{\bar{s}}[R(\bar{s})] \leq -r\cdot p + 1\cdot(1-p)$$ The variance of the reward is: \begin{align*} \mbox{var}[R(\bar{s})] &= \mathbb{E}_{\bar{s}}[R(\bar{s})^2] - \mathbb{E}_{\bar{s}}[R(\bar{s})]^2 \\ &\geq (-r)^2\cdot p + (-1)^2\cdot (1-p) - \{ -r\cdot p + 1\cdot(1-p) \}^2 \\ &= pr^2+(1-p)-\{ 1-p-pr \}^2 \\ &= pr^2+(1-p)-\{1+p^2+p^2r^2-2p+2p^2r-2pr\}\\ &= pr^2+p-p^2-p^2r^2-2p^2r+2pr \\ &= (p-p^2)r^2+p-p^2-2p^2r+2pr = \Omega(r^2) \end{align*} Our choice of $r$ is necessarily large in comparison to the other rewards, since we want to prioritize learning about accidents over other behaviors. But then the variance of our estimates grows as $\Omega(r^2)$, blowing up as we scale $r$. The fundamental problem is that our problem space is too large.
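Plugging the final lower bound into code makes the $\Omega(r^2)$ growth concrete. A quick sketch, where the accident probability $p = 0.01$ is an assumed value for illustration:

```python
p = 0.01   # assumed accident probability, for illustration only

def var_lower_bound(r, p=p):
    # (p - p^2) r^2 + p - p^2 - 2 p^2 r + 2 p r, from the derivation above
    return (p - p * p) * r * r + p - p * p - 2 * p * p * r + 2 * p * r

bounds = [var_lower_bound(r) for r in (1, 10, 100, 1000)]
# each 10x increase in r multiplies the bound roughly 100x for large r
```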

The solution is to use the options framework and subdivide the problem into smaller domains, where each option can learn a separate maneuver.