In the classical probability system, events and probabilities obey laws of set theory. Events correspond to possibilities in a sample space of all atomic events, and the system is in one of the atomic events.

For example, in a ball and bins experiment with 3 balls and 3 bins, there are 9 atomic states corresponding to the possibility of each ball being in each bin. An event could be the possibility of ball 1 not being in bin 3, which corresponds to two atomic possibilities, it is either in bin 1 or bin 2.

In the quantum probability system, the state of the system is represented as an $n$-dimensional vector in a Hilbert space. Notation wise, people prefer to use the Dirac notation, where $|a\rangle$ is a column vector and $\langle a|$ is a row vector.

For example, the state of ball $i$ could be $|B_i\rangle$ and it would be a superposition of the three states with orthonormal bases $|b_1\rangle, |b_2\rangle, |b_3\rangle$, each one corresponding to a different bin.

$|B_i\rangle = \alpha_1 |b_1\rangle + \alpha_2 |b_2\rangle + \alpha_3 |b_3\rangle$

The state vector for each ball is a unit vector, and hence, $\alpha_ 1^2+\alpha_ 2^2+\alpha_ 3^2=1$. The probability of the ball $i$ being in bin $j$ is the inner product $\langle b_j | B_i \rangle = \alpha_j^2$. The entire state of the system would be $|B_1,B_2,B_3\rangle$.

In quantum information theory, we have qubits, instead of bits. Each qubit's state is represented as a superposition of the orthogonal base states $|0\rangle$ and $|1\rangle$:

$|\psi\rangle = \alpha|0\rangle + \beta|1 \rangle, \mbox{ such that } \alpha^2+\beta^2 = 1$

For a system of $n$ qubits, the system state is the tensor product:

$|\phi\rangle = |\psi_1\rangle \mbox{ }\otimes\mbox{ } \cdots \mbox{ }\otimes\mbox{ } |\psi_n\rangle = \sum_{x\in \{0,1\}^n} \alpha_x |x\rangle, \mbox{ such that } \sum_{x\in \{0,1\}^n} \alpha_x^2 = 1$

To discuss the idea of quantum reinforcement learning, we need to understand two transforms.

First, a Hadamard transform can convert a qubit in $|0\rangle$ to a uniform superposition:

$H | 0 \rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1\\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle$

Second, we need a Grover transform to reinforce good decisions. To understand the effect of this transform, let's first define the states and actions in this quantum framework.

Consider the states as bases of a Hilbert space $\mathcal{H}_1$ and actions as the bases of another Hilbert space $\mathcal{H}_2$. At any step of time, the state could be $|S\rangle$ and the action to be taken could also be $|A\rangle$, which are superpositions:

$|S\rangle = \sum_n \alpha_n |s_n\rangle \mbox{ and } |A\rangle = \sum_m \beta_m |a_m\rangle$

Without loss of generality, assume $s\in \{0,1\}^n$ and $a \in \{0, 1\}^m$:

$|S\rangle = \sum_{s\in \{0,1\}^n} C_s |s\rangle \mbox{ and } |A\rangle = \sum_{a\in \{0,1\}^m} D_a |a\rangle$

Here $C_s, D_a$ are complex numbers and follow:

$\sum_{s\in \{0,1\}^n} |C_s|^2 = \sum_{a\in \{0,1\}^m} |D_a|^2 = 1$

Now, when an action $|A\rangle$ is measured, it will collapse into one of its bases $|a\rangle$ with probability $|D_a|^2$. The goal is to amplify this probability for a "good" action. This can be done using the temporal difference update rule:

$V_{\pi,t} (s) \leftarrow V_{\pi,t-1}(s) + \alpha_t\bigg\{ r_{\pi}(s) + \gamma V_{\pi,t-1}(s') - V_{\pi,t-1}(s) \bigg\}$

Let $|a_0\rangle$ be a uniformly random action. Just like the Hadamard transform can convert a single qubit $|0\rangle$ into a uniform superposition of $|0\rangle$ and $|1\rangle$, if we apply this transform sequentially to each of the qubits, we will end up with a state vector that can collapse to any of the $2^m$ action bases with equal probability.

Now let us define two transformations $U_ a$ and $U_ {a_ 0}$, for our purposes. Let $U_a$ be defined as:

$U_a = I-2|a\rangle \langle a|$

The effect of this transformation on $|a\rangle$ is:

$U_a |a\rangle = (I-2|a\rangle \langle a|)|a\rangle = |a\rangle -2|a\rangle\langle a|a\rangle = - |a\rangle$

And, for any vector orthogonal to $a$, say $|a'\rangle$, the effect is:

$U_a |a'\rangle = (I-2|a\rangle \langle a|)|a'\rangle = |a'\rangle -2|a\rangle\langle a|a'\rangle = |a'\rangle$

Hence, for any vector with $|a\rangle$ as a basis, $U_a$ will create a mirror image of it along the plane orthogonal to $|a\rangle$.

Similarly, let us define the transformation $U_{a_0}$ in an opposite way, so it has the reverse effect, i.e. it inverts orthogonal vectors and doesn't affect vectors along $|a_0'\rangle$:

$U_{a_0} = 2|a_0\rangle \langle a_0| - I$

The Grover transform would be:

$U_{g} = U_{a_0}U_a$

For an initial vector $|a_s\rangle$, $U_g$ flips it along the $|a'\rangle$ axis first, and then along the $|a_0\rangle$:

This is a net rotation of $2\theta$ from the original position, where $\theta$ is the angle between $|a_0\rangle$ and $|a'\rangle$.

From Boyer et al. \cite{boyer1998tight}, if we apply $U_g$ for $t$ times, our final action is:

$U_g |a_0\rangle = \sin\{(2t+1)\theta\}\cdot|a\rangle+\cos\{(2t+1)\theta\}\cdot|a'\rangle$

This means we can change the probability from $p_1 = |\langle a | a_0 \rangle|^2 = 2^{-n}$ to

$p_2 = |\langle a | U_g^{(t)} a_0 \rangle|^2 = \sin^2\{(2t+1)\theta\}$

And this is a function we can maximize. Even in the case when $|a_s\rangle$ is not $|a_0\rangle$, we can control the final value of the action by this kind of update, and in most cases, maximize it.

Algorithm: Just like in the standard TD learning algorithm, we initialize the states $|s\rangle$, the action vectors that are possible from any state $s$, $|a_s\rangle$, and values $V(s)$. Then, for each episode:
1. We observe $|a_s\rangle$ (by collapse) for the given $|s\rangle$ and get $|a\rangle$.
2. We execute action $|a\rangle$, observe next state $|s'\rangle$, a reward $r$ and then do the TD update.
3. We find the appropriate $t$ for the given setting of $|a_s\rangle$ and amplify its probability by applying the Grover transform $t$ times: $$U_g|a_s\rangle \leftarrow U_{a_0}U_a|a_s\rangle$$

There are two significant advantages of doing this type of learning with a quantum framework:

• There is no direct need for exploration. This is because exploration happens by the collapse of $|a_s\rangle$.
• We don't need to update all the states one by one since we can do the value/TD update in parallel because of quantum parallelism.