Behavior planning systems for autonomous driving can be categorized broadly into three types.

First, the system behavior could be rule-based. If the state of the ego vehicle is $\chi_0$ and the states of the other agents are $\chi_{1:K}$, this kind of system produces a decision $\zeta$ that depends only on the current states:

$$\zeta = f(\chi_0, \chi_{1:K})$$

Second, the system behavior could be reactive. In this case, the decision depends on the current states and on predicted trajectories $\tau_{1:K}$ of the other vehicles:

$$\zeta = f(\chi_0, \chi_{1:K}, \tau_{1:K})$$

Third, the system behavior can be modeled interactively. This means the predicted trajectories of the other vehicles are computed under the assumption that they evolve in response to the ego vehicle's decision $\zeta$:

$$\zeta = f(\chi_0, \chi_{1:K}, \tau_{1:K}) \mbox{ and } \tau_{1:K} = g(\zeta)$$
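One way to view this circular dependency is as a fixed-point problem: the decision determines the predictions, and the predictions determine the decision. A minimal, purely illustrative sketch (the functions `best_decision` and `predict_trajectories` are hypothetical stand-ins, not defined in the text) alternates the two until the decision stabilizes:

```python
def interactive_plan(x_ego, x_others, best_decision, predict_trajectories,
                     n_iters=10):
    """Resolve zeta = f(...) and tau = g(zeta) by fixed-point iteration:
    alternate between choosing ego's decision and re-predicting the other
    agents' trajectories until the decision stops changing."""
    tau = predict_trajectories(x_ego, x_others, zeta=None)  # initial guess
    zeta = None
    for _ in range(n_iters):
        new_zeta = best_decision(x_ego, x_others, tau)
        if new_zeta == zeta:  # fixed point reached
            break
        zeta = new_zeta
        tau = predict_trajectories(x_ego, x_others, zeta=zeta)
    return zeta
```

In general such an iteration need not converge, which is one motivation for the belief-based formulation described next.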

If the states are treated as exact values, the interactive case is hard to solve because of the circular dependency between $\zeta$ and $\tau_{1:K}$. However, we can compute a solution by maintaining a probability distribution, a belief, over the possible current states. This kind of planning is called belief state planning.

Because the transition dynamics are stochastic, it is natural to formulate belief state planning as a partially observable Markov decision process (POMDP). This formulation can then be solved using the Adaptive Belief Tree (ABT) algorithm.

Suppose that for vehicle $k$ we can compute a feature vector $f_k$ that depends on its exact state, i.e., its position and velocity. If the candidate next routes are $r^{(i)}$, then using Bayes' rule we can compute a belief over routes given the behavior observed so far:

$$P(r_k = r^{(i)} | f_k) \propto P(f_k | r^{(i)}) \cdot P(r^{(i)})$$

Assuming a uniform prior over routes, i.e., all $P(r^{(i)})$ are equal and sum to $1$,

$$P(r_k = r^{(i)} | f_k) \propto P(f_k | r^{(i)})$$
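Under this uniform-prior assumption, computing the route posterior reduces to normalizing the likelihoods. A minimal sketch:

```python
def route_posterior(likelihoods):
    """Posterior over candidate routes given the feature likelihoods
    P(f_k | r^(i)). With a uniform prior P(r^(i)), the prior cancels and
    the posterior is just the normalized likelihoods."""
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]
```

For example, likelihoods `[0.2, 0.6, 0.2]` yield a posterior that favors the second route with probability $0.6$.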

If the features of $f_k$ are independent and each measures the deviation of vehicle $k$'s state from a reference (a reference path, a reference velocity), then it is reasonable to model each deviation with a zero-mean normal distribution: the vehicle is most likely to stay close to the reference.

So, using a normal distribution for $P(f_k | \cdot)$, we can generate a potential next state by sampling a route from the posterior $P(r_k = r^{(i)} | f_k)$, and repeat this until the posterior becomes sharply peaked or a depth/termination condition is reached.

Since at every step we multiply the belief by a normal likelihood with $\mu = 0$, the posterior eventually concentrates its probability mass on the route whose observed deviations are smallest.
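This concentration effect can be illustrated with a short sketch (the deviation numbers below are made up for illustration): starting from a uniform belief over three routes, each Bayes update multiplies the belief by zero-mean Gaussian likelihoods of the observed deviations and renormalizes, and the belief quickly piles onto the best-matching route.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def update_belief(belief, deviations, sigma=1.0):
    """One Bayes update: weight each route's belief by the likelihood of
    the observed deviation from that route's reference, then renormalize."""
    unnorm = [b * gaussian_pdf(d, 0.0, sigma)
              for b, d in zip(belief, deviations)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical scenario: the vehicle tracks route 0 closely (small
# deviation) and routes 1 and 2 poorly (large deviations).
belief = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5):
    belief = update_belief(belief, deviations=[0.1, 1.5, 2.0])
```

After a handful of updates, nearly all the probability mass sits on route 0.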

Now we can use Monte Carlo sampling at every timestep to make decisions. At each timestep, we start with our current belief, sample a set of possible actions, and produce the beliefs at the next level of the tree. We do this recursively until a fixed depth or a terminal state is reached.

In practice, the belief distribution is first discretized. Some of the current discrete beliefs are then selected, a UCB action is chosen for each of them (or a random action with probability $\epsilon$), and new discrete beliefs are generated from the resulting transitions. To make the computation less intensive, the other vehicles' velocities may be assumed constant beyond some depth in the tree.
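The action-selection step can be sketched as follows (a minimal illustration, not the full ABT algorithm; the exploration constant `c` and the handling of unvisited actions are assumptions, not specified in the text):

```python
import math
import random

def ucb_action(q_totals, counts, total_visits, c=1.4, epsilon=0.1):
    """Select an action index by UCB1, overridden by a uniformly random
    action with probability epsilon, as in the scheme described above.

    q_totals[a] is the sum of returns observed for action a,
    counts[a] the number of times a was tried at this belief node."""
    if random.random() < epsilon:
        return random.randrange(len(q_totals))

    def score(a):
        if counts[a] == 0:
            return float("inf")  # try unvisited actions first
        exploit = q_totals[a] / counts[a]
        explore = c * math.sqrt(math.log(total_visits) / counts[a])
        return exploit + explore

    return max(range(len(q_totals)), key=score)
```

The UCB term balances exploiting actions with high average return against exploring rarely tried ones; the $\epsilon$-random override guarantees every branch of the tree keeps a chance of being revisited.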