Quantum TD Learning
In the classical probability system, events and probabilities obey the laws of set theory. Events correspond to subsets of a sample space of atomic events, and the system is in exactly one of the atomic events.
For example, in a ball-and-bins experiment with 3 balls and 3 bins, each ball's position has 3 atomic possibilities (one per bin), so the joint system has $3^3 = 27$ atomic events. An event could be "ball 1 is not in bin 3", which for ball 1 corresponds to two atomic possibilities: it is either in bin 1 or in bin 2.
In the quantum probability system, the state of the system is represented as an $n$-dimensional unit vector in a Hilbert space. In terms of notation, Dirac notation is standard: $|a\rangle$ denotes a column vector and $\langle a|$ the corresponding row vector (its conjugate transpose).
For example, the state of ball $i$ could be $|B_i\rangle$ and it would be a superposition of the three states with orthonormal bases $|b_1\rangle, |b_2\rangle, |b_3\rangle$, each one corresponding to a different bin.
\[ |B_i\rangle = \alpha_1 |b_1\rangle + \alpha_2 |b_2\rangle + \alpha_3 |b_3\rangle \]
The state vector for each ball is a unit vector, and hence $|\alpha_1|^2+|\alpha_2|^2+|\alpha_3|^2=1$. The probability of ball $i$ being in bin $j$ is the squared magnitude of the inner product, $|\langle b_j | B_i \rangle|^2 = |\alpha_j|^2$. The entire state of the system is the tensor product $|B_1\rangle \otimes |B_2\rangle \otimes |B_3\rangle$, written compactly as $|B_1,B_2,B_3\rangle$.
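As a concrete illustration, here is a minimal numerical sketch of the single-ball state above (the amplitude values are made up for the example): the bin probabilities come from $|\alpha_j|^2$, and the three-ball state is the tensor (Kronecker) product of the individual ball states.

```python
import numpy as np

# Minimal sketch (amplitudes are illustrative): the state of one ball as a
# unit vector over the three bin basis states |b_1>, |b_2>, |b_3>.
alpha = np.array([0.6, 0.8, 0.0], dtype=complex)     # amplitudes alpha_1..alpha_3
assert np.isclose(np.sum(np.abs(alpha) ** 2), 1.0)   # unit-vector condition

# Probability of finding the ball in bin j is |<b_j|B_i>|^2 = |alpha_j|^2.
probs = np.abs(alpha) ** 2
print(probs)            # [0.36 0.64 0.  ]

# The joint state of three balls is the tensor (Kronecker) product of the
# individual ball states, written |B_1, B_2, B_3>.
B1 = B2 = B3 = alpha
joint = np.kron(np.kron(B1, B2), B3)   # 27-dimensional state vector
print(joint.shape)      # (27,)
```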
In quantum information theory, we have qubits instead of bits. Each qubit's state is represented as a superposition of the orthonormal basis states $|0\rangle$ and $|1\rangle$:
\[ |\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \mbox{ such that } |\alpha|^2+|\beta|^2 = 1 \]
For a system of $n$ qubits, the system state is the tensor product:
\[ |\phi\rangle = |\psi_1\rangle \otimes \cdots \otimes |\psi_n\rangle = \sum_{x\in \{0,1\}^n} \alpha_x |x\rangle, \mbox{ such that } \sum_{x\in \{0,1\}^n} |\alpha_x|^2 = 1 \]
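The same construction is easy to check numerically. The sketch below (with arbitrarily chosen example qubits) builds an $n$-qubit product state with Kronecker products and verifies that the $2^n$ amplitudes $\alpha_x$ are normalized.

```python
import numpy as np

# A hedged sketch: a single-qubit state alpha|0> + beta|1> as a length-2
# complex vector, and an n-qubit product state built with Kronecker products.
def qubit(alpha, beta):
    psi = np.array([alpha, beta], dtype=complex)
    assert np.isclose(np.linalg.norm(psi), 1.0), "|alpha|^2 + |beta|^2 must be 1"
    return psi

psis = [qubit(1.0, 0.0), qubit(np.sqrt(0.5), np.sqrt(0.5)), qubit(0.0, 1.0)]

# |phi> = |psi_1> (x) ... (x) |psi_n>: 2^n amplitudes alpha_x, one per bitstring x.
phi = psis[0]
for p in psis[1:]:
    phi = np.kron(phi, p)

print(phi.shape)                    # (8,) for n = 3 qubits
print(np.sum(np.abs(phi) ** 2))     # 1.0: the alpha_x are normalized
```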
To discuss the idea of quantum reinforcement learning, we need to understand two transforms.
First, a Hadamard transform can convert a qubit in $|0\rangle$ to a uniform superposition:
\[ H | 0 \rangle = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1\\ 0 \end{pmatrix} = \frac{1}{\sqrt{2}}|0\rangle + \frac{1}{\sqrt{2}}|1\rangle \]
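A quick numerical check of the Hadamard transform, and of the fact that applying $H$ to each of $n$ qubits initialized to $|0\rangle$ yields a uniform superposition over all $2^n$ basis states (a sketch in NumPy):

```python
import numpy as np

# Sketch: the Hadamard gate sends |0> to the uniform superposition
# (|0> + |1>)/sqrt(2); applied to each of n qubits it gives a uniform
# superposition over all 2^n basis states.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
ket0 = np.array([1, 0], dtype=complex)

print(H @ ket0)                     # [0.70710678+0.j 0.70710678+0.j]

n = 3
state = ket0
Hn = H
for _ in range(n - 1):
    state = np.kron(state, ket0)
    Hn = np.kron(Hn, H)
print(np.abs(Hn @ state) ** 2)      # each of the 2^n outcomes has probability 2^-n
```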
Second, we need a Grover transform to reinforce good decisions. To understand the effect of this transform, let's first define the states and actions in this quantum framework.
Consider the states as basis vectors of a Hilbert space $\mathcal{H}_1$ and the actions as basis vectors of another Hilbert space $\mathcal{H}_2$. At any time step, the state $|S\rangle$ and the action to be taken $|A\rangle$ are superpositions:
\[ |S\rangle = \sum_n \alpha_n |s_n\rangle \mbox{ and } |A\rangle = \sum_m \beta_m |a_m\rangle \]
Without loss of generality, assume $s\in \{0,1\}^n$ and $a \in \{0, 1\}^m$:
\[ |S\rangle = \sum_{s\in \{0,1\}^n} C_s |s\rangle \mbox{ and } |A\rangle = \sum_{a\in \{0,1\}^m} D_a |a\rangle \]
Here $C_s$ and $D_a$ are complex amplitudes satisfying:
\[ \sum_{s\in \{0,1\}^n} |C_s|^2 = \sum_{a\in \{0,1\}^m} |D_a|^2 = 1 \]
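As a small sketch (with hypothetical dimensions and randomly chosen amplitudes), the state and action registers can be represented as normalized complex vectors in their respective spaces, and combined with a tensor product:

```python
import numpy as np

# Sketch (dimensions are illustrative): the state register |S> has amplitudes
# C_s over 2^n basis states, the action register |A> has amplitudes D_a over
# 2^m basis actions, and each is normalized so the |.|^2 sum to 1.
n, m = 2, 2
rng = np.random.default_rng(1)

def random_state(dim):
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)   # complex amplitudes
    return v / np.linalg.norm(v)                            # enforce sum |.|^2 = 1

S = random_state(2 ** n)     # amplitudes C_s
A = random_state(2 ** m)     # amplitudes D_a
print(np.sum(np.abs(S) ** 2), np.sum(np.abs(A) ** 2))   # both ~1.0

joint = np.kron(S, A)        # combined state-action register |S> (x) |A>
print(joint.shape)           # (16,)
```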
Now, when the action state $|A\rangle$ is measured, it collapses onto one of its basis actions $|a\rangle$ with probability $|D_a|^2$. The goal is to amplify this probability for a "good" action; which actions to reinforce, and by how much, is guided by the temporal difference update rule:
\[ V_{\pi,t} (s) \leftarrow V_{\pi,t-1}(s) + \alpha_t\bigg\{ r_{\pi}(s) + \gamma V_{\pi,t-1}(s') - V_{\pi,t-1}(s) \bigg\} \]
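The update itself is the classical tabular TD(0) rule. A minimal sketch (the state indices, reward, and step-size values below are placeholders for illustration):

```python
import numpy as np

# Tabular TD(0) sketch of the update rule above; the particular transition
# (s, r, s_next) and the hyperparameters are illustrative assumptions.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V[s] <- V[s] + alpha * (r + gamma * V[s_next] - V[s])."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V

V = np.zeros(4)                     # value estimates for 4 states
V = td_update(V, s=0, r=1.0, s_next=1)
print(V)                            # [0.1 0.  0.  0. ]
```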
Now let us define two transformations $U_a$ and $U_{a_0}$ for our purposes. Let $U_a$ be defined as:
\[ U_a = I-2|a\rangle \langle a| \]
The effect of this transformation on $|a\rangle$ is:
\[ U_a |a\rangle = (I-2|a\rangle \langle a|)|a\rangle = |a\rangle -2|a\rangle\langle a|a\rangle = - |a\rangle \]
And, for any vector orthogonal to $|a\rangle$, say $|a'\rangle$, the effect is:
\[ U_a |a'\rangle = (I-2|a\rangle \langle a|)|a'\rangle = |a'\rangle -2|a\rangle\langle a|a'\rangle = |a'\rangle \]
Hence, for any vector with a component along $|a\rangle$, $U_a$ produces its mirror image about the hyperplane orthogonal to $|a\rangle$.
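This reflection can be verified directly on a small example (the action-space dimension and the choice of $|a\rangle$ below are arbitrary):

```python
import numpy as np

# Sketch: U_a = I - 2|a><a| on a small action space; it flips the sign of the
# |a> component and leaves everything orthogonal to |a> untouched.
dim = 4
a = np.zeros(dim, dtype=complex); a[2] = 1.0        # target action |a>

U_a = np.eye(dim) - 2.0 * np.outer(a, a.conj())

print(U_a @ a)                       # -|a>
a_perp = np.zeros(dim, dtype=complex); a_perp[0] = 1.0
print(U_a @ a_perp)                  # unchanged: |a'> with <a|a'> = 0
```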
Similarly, let us define the transformation $U_{a_0}$ in the opposite way, so that it has the reverse effect: it inverts vectors orthogonal to $|a_0\rangle$ and leaves $|a_0\rangle$ itself unchanged:
\[ U_{a_0} = 2|a_0\rangle \langle a_0| - I \]
The Grover transform would be:
\[ U_{g} = U_{a_0}U_a \]
For an initial vector $|a_s\rangle$, $U_g$ first reflects it about the hyperplane orthogonal to $|a\rangle$ (i.e. about the $|a'\rangle$ axis) and then reflects the result about $|a_0\rangle$. The net effect is a rotation by $2\theta$ towards $|a\rangle$, where $\theta$ is the angle between $|a_0\rangle$ and $|a'\rangle$.
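The amplification effect is easy to see numerically. In the sketch below (assuming $|a_0\rangle$ is the uniform superposition over $2^m$ actions, with $m$ and the "good" action chosen arbitrarily), each application of $U_g$ increases the probability of measuring the good action, up to a point:

```python
import numpy as np

# Sketch of the Grover operator U_g = U_{a0} U_a for m = 3 action qubits,
# with |a_0> the uniform superposition over all 2^m actions.
m = 3
dim = 2 ** m
a = np.zeros(dim, dtype=complex); a[5] = 1.0        # the "good" action |a>
a0 = np.ones(dim, dtype=complex) / np.sqrt(dim)     # uniform superposition |a_0>

U_a  = np.eye(dim) - 2.0 * np.outer(a, a.conj())    # reflection about the hyperplane orthogonal to |a>
U_a0 = 2.0 * np.outer(a0, a0.conj()) - np.eye(dim)  # reflection about |a_0>
U_g  = U_a0 @ U_a

state = a0
for t in range(1, 4):
    state = U_g @ state
    p = np.abs(a.conj() @ state) ** 2
    print(t, p)   # probability of measuring the good action grows with t (up to a point)
```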
From Boyer et al. \cite{boyer1998tight}, applying $U_g$ a total of $t$ times to $|a_0\rangle$ gives:
\[ U_g^{t} |a_0\rangle = \sin\{(2t+1)\theta\}\cdot|a\rangle+\cos\{(2t+1)\theta\}\cdot|a'\rangle \]
This means we can boost the probability of measuring the good action from $p_1 = |\langle a | a_0 \rangle|^2 = 2^{-m}$ (when $|a_0\rangle$ is the uniform superposition over the $2^m$ actions) to
\[ p_2 = |\langle a | U_g^{t} | a_0 \rangle|^2 = \sin^2\{(2t+1)\theta\} \]
This is a function of $t$ that we can maximize by choosing the number of Grover iterations appropriately. Even when the initial vector $|a_s\rangle$ is not $|a_0\rangle$, we can still control the probability of measuring the desired action with this kind of update and, in most cases, maximize it.
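Continuing the sketch above: since $\sin\theta = |\langle a|a_0\rangle| = 2^{-m/2}$ under the uniform-superposition assumption, the number of iterations near the first peak of $\sin^2\{(2t+1)\theta\}$ can be computed directly (the estimate $t \approx \pi/(4\theta) - 1/2$ below is the standard Grover iteration count):

```python
import numpy as np

# Sketch: assuming |a_0> is the uniform superposition over 2**m actions,
# sin(theta) = |<a|a_0>| = 2**(-m/2), and the success probability after t
# Grover iterations is sin^2((2t+1)*theta). The first peak is reached near
# t ~ pi/(4*theta) - 1/2.
m = 3
theta = np.arcsin(np.sqrt(2.0 ** (-m)))

def p_after(t):
    return np.sin((2 * t + 1) * theta) ** 2

t_opt = int(np.round(np.pi / (4 * theta) - 0.5))
print(t_opt, p_after(0), p_after(t_opt))   # e.g. t_opt = 2, p grows from 2^-m to ~0.95
```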
There are two significant advantages of doing this type of learning with a quantum framework:
- There is no direct need for an explicit exploration strategy, because exploration happens naturally through the probabilistic collapse of $|a_s\rangle$ upon measurement.
- We do not need to update all the states one by one, since quantum parallelism allows the value/TD update to be applied to many states in parallel.