A few years ago, Geoffrey Hinton outlined several problems with convolutional nets. Two in particular:

• Subsampling (pooling) loses the spatial relationships between components of the image.
• CNNs cannot apply their understanding of one viewpoint to another viewpoint.

## Capsule Theory

A capsule is a group of neurons whose activity vector $\textbf{A}$ represents the parameters of a specific component of the input image.

The magnitude of this activity vector represents the probability that the component exists in the image. The direction of the vector represents the parameters of the component, such as orientation or texture.

Therefore, instances of the same component that are slightly modified (rotated, enlarged, etc.) should have activity vectors $\textbf{A}$ that point in different directions but share the same magnitude $|\textbf{A}|$.
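A toy illustration of this idea, assuming a 2-dimensional activity vector and treating a change in the component as a plain rotation of that vector. Real capsules are higher-dimensional and the mapping from image changes to vector changes is learned, so this is only a sketch of the "same magnitude, different direction" property; the names `rotate`, `a` and `a_rotated` are made up for the example.

```python
import numpy as np

def rotate(v, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle."""
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ v

# Hypothetical activity vector for some component
a = np.array([0.6, 0.3])
# Hypothetical activity vector for the same component, slightly modified
a_rotated = rotate(a, 30.0)

# The existence probability (magnitude) is unchanged...
same_magnitude = np.isclose(np.linalg.norm(a), np.linalg.norm(a_rotated))
# ...while the encoded parameters (direction) differ.
same_direction = np.allclose(a, a_rotated)
```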

## Dynamic Routing

Sabour et al. (2017) give the first implementation of this theory, using dynamic routing to construct the relationship between higher- and lower-level capsules.

First, feature maps are obtained from the input image using 2D convolution. A PrimaryCaps layer then converts these feature maps to 8-dimensional vectors.

As capsule theory requires, each vector must have a magnitude of at most 1, since the magnitude represents a probability. The paper therefore uses a squashing function:
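The grouping of feature maps into 8-dimensional vectors can be sketched as a reshape. The sizes below (32 capsule channels on a 6×6 grid, giving 1152 capsules) are the MNIST configuration as I understand it from Sabour et al. (2017), not something stated in this post, and the random array stands in for the real convolution output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 32 capsule channels, 8 dimensions each, 6x6 spatial grid
num_channels, caps_dim, height, width = 32, 8, 6, 6

# Stand-in for the convolution output feeding PrimaryCaps: (256, 6, 6)
feature_maps = rng.standard_normal((num_channels * caps_dim, height, width))

# Split the 256 maps into 32 groups of 8, then move the 8 values to the last axis
caps = feature_maps.reshape(num_channels, caps_dim, height, width)
caps = caps.transpose(0, 2, 3, 1)            # (32, 6, 6, 8)

# One 8-D vector per (channel, row, column) position
primary_caps = caps.reshape(-1, caps_dim)    # (1152, 8)
```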

$$f(x)=\frac{||x||^2}{1+||x||^2} \frac{x}{||x||}$$

For a vector of large magnitude, $1+||x||^2 \approx ||x||^2$, so $f(x) \rightarrow \hat{x}$ (the unit vector in the direction of $x$) and the output magnitude approaches 1. For a vector of small magnitude, the output stays small. In both cases the direction of $x$ is preserved.
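A minimal NumPy sketch of the squashing function, to make the two regimes concrete. The `eps` term is my addition to avoid dividing by zero for an all-zero vector; it is not part of the formula above.

```python
import numpy as np

def squash(x, axis=-1, eps=1e-9):
    """Scale x so its norm lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(x * x, axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)  # eps guards against division by zero
    return (sq_norm / (1.0 + sq_norm)) * (x / norm)

long_vec = squash(np.array([30.0, 40.0]))    # input norm 50
short_vec = squash(np.array([0.03, 0.04]))   # input norm 0.05

long_norm = np.linalg.norm(long_vec)    # close to, but below, 1
short_norm = np.linalg.norm(short_vec)  # stays small
```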

In the final DigitCaps layer, the squashed output $u_i$ of the previous layer is converted from an $8$-dimensional vector to a $16$-dimensional vector by multiplying it with a transformation matrix $W_{ij}$ of dimensions $(16,8)$.

This gives us $16$D vectors, which we can use to create a parse-tree-like structure with $16$-dimensional vectors in the higher layer. The connections in this structure are then set by dynamic routing.
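A sketch of this transformation, assuming 1152 lower-level capsules and 10 higher-level capsules (one per digit class); those counts and the random values are placeholders for illustration. The name `u_hat` for the result is also my choice. Each pair $(i, j)$ has its own $(16, 8)$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
num_lower, num_upper = 1152, 10   # assumed capsule counts

u = rng.standard_normal((num_lower, 8))                      # squashed outputs u_i
W = rng.standard_normal((num_lower, num_upper, 16, 8)) * 0.01  # one W_ij per pair

# u_hat[i, j] = W[i, j] @ u[i], computed for all pairs at once
u_hat = np.einsum("ijkl,il->ijk", W, u)   # (1152, 10, 16)
```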
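The routing procedure can be sketched as follows. This is my reading of the algorithm in Sabour et al. (2017), written in NumPy for a single input: routing logits start at zero, a softmax over the higher-level capsules turns them into coupling coefficients, and each logit is increased by the agreement (dot product) between a prediction and the resulting higher-level output. The shapes and the names `u_hat`, `b`, `c`, `v` are assumptions for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * (s / np.sqrt(sq_norm + eps))

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: (num_lower, num_upper, dim) predictions from lower capsules."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))           # routing logits
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                     # coupling coefficients
        s = np.einsum("ij,ijk->jk", c, u_hat)      # weighted sum per upper capsule
        v = squash(s)                              # upper capsule outputs
        b = b + np.einsum("ijk,jk->ij", u_hat, v)  # reward agreement
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.standard_normal((1152, 10, 16)) * 0.1  # placeholder predictions
v, c = dynamic_routing(u_hat, num_iterations=3)
```

Here `v` holds one 16D output per higher-level capsule, and `c` holds the coupling coefficients used in the last iteration.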

The loss function is a modified margin loss, and a decoder network is also suggested to provide a pixel reconstruction loss (which acts as a regularizer).
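For reference, a sketch of the margin loss as I understand it from Sabour et al. (2017): the capsule for the correct class is pushed to have length above $m^+$, and all others below $m^-$, with the absent-class term down-weighted by $\lambda$. The constants $m^+=0.9$, $m^-=0.1$ and $\lambda=0.5$ are the paper's values as far as I recall, not something verified in our runs, and the reconstruction term is left out.

```python
import numpy as np

def margin_loss(v_norms, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_norms: capsule lengths per class; targets: one-hot labels."""
    present = targets * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent, axis=-1)

targets = np.array([1.0, 0.0, 0.0])

# Correct class is long, the others are short: no penalty
good = margin_loss(np.array([0.95, 0.05, 0.05]), targets)

# Correct class is short and a wrong class is long: both terms fire
bad = margin_loss(np.array([0.10, 0.90, 0.05]), targets)
```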

## Experiments

We got $99.7\%$ accuracy on MNIST in $50$ epochs, each of which took $10$ minutes, with $3$ routing iterations. This is very close to the $99.75\%$ accuracy the paper claims.

Then, we took the trained model and applied simple transformations to the test images to see how well the model would perform on them. Accuracy dropped sharply for vertically flipped and $90^\circ$-rotated images ($45.96\%$ and $16.14\%$ respectively), so rotation is not something this capsule implementation handles well.

Next, we tried the same architecture on FashionMNIST, which is another dataset similar to MNIST. We got about $92$ to $93\%$ accuracy with some effort.
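A sketch of how such transformed test sets could be produced with NumPy, assuming the images are stored as a `(N, height, width)` array. "Vertical flip" is taken here to mean flipping top to bottom, and the rotation is counterclockwise; both are assumptions about the original setup. The `accuracy` helper and the commented-out `model.predict` call are hypothetical.

```python
import numpy as np

def flip_vertical(images):
    """Flip each image top to bottom; images has shape (N, H, W)."""
    return images[:, ::-1, :]

def rotate_90(images):
    """Rotate each image 90 degrees counterclockwise."""
    return np.rot90(images, k=1, axes=(1, 2))

def accuracy(predicted, labels):
    """Fraction of predictions that match the labels."""
    return float(np.mean(predicted == labels))

# Tiny stand-in for a batch of test images
batch = np.array([[[1, 2],
                   [3, 4]]])
flipped = flip_vertical(batch)
rotated = rotate_90(batch)

# With a trained model one would then compare, for example:
# accuracy(model.predict(rotated), test_labels)   # hypothetical API
```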

Here is a reconstruction using the decoder network on FashionMNIST:

## References

1. Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. “Dynamic routing between capsules.” Advances in Neural Information Processing Systems. 2017.
2. Hinton, Geoffrey. “What is wrong with convolutional neural nets?” Talk, YouTube.
3. Capsule theory slides
4. Xiao, Han, Kashif Rasul, and Roland Vollgraf. “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.” arXiv preprint arXiv:1708.07747 (2017).