Deep learning and neural networks have been among the most sought-after and robust machine learning paradigms of the last few years. One such network is the convolutional neural network (CNN), used mainly in image recognition, image classification, and computer vision. Another is the capsule network (CapsNet), which adds structure to CNNs.
This article will look at CapsNet, its relation to neural networks, and how it stacks up against CNNs.
Before CNNs, features were extracted from images manually, which was both time-consuming and inefficient. CNNs changed this by leveraging the principles of linear algebra and matrix multiplication to automate and scale the laborious task of extracting features and patterns from images.
That said, CNNs require substantial hardware investment (especially in GPUs) to run efficiently. The cost is worth it: the technology has revolutionized fields such as facial recognition, autonomous cars, object detection, natural language processing, and cancer detection.
However, as with any technology, regular and rigorous use of CNNs has led researchers to discover their Achilles' heel.
To understand the problem with CNN, we need to know what goes on under the hood.
A CNN extracts features and patterns, which in turn help an algorithm recognize the object in an image. A computer perceives an image as a matrix whose values are the RGB intensities at each point. In other words, images are essentially just matrices.
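For instance, a tiny 2×2 RGB image can be written out directly as a matrix of intensity values. Here is a minimal sketch in NumPy, with made-up pixel values:

```python
import numpy as np

# A tiny 2x2 RGB image: each pixel holds three intensity values (R, G, B),
# each ranging from 0 (darkest) to 255 (brightest).
image = np.array([
    [[255, 0, 0], [0, 255, 0]],   # red pixel,  green pixel
    [[0, 0, 255], [40, 40, 40]],  # blue pixel, dark gray pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): height x width x color channels
print(image[0, 0])  # [255   0   0] -> the top-left pixel is pure red
```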
When training a CNN, different parts of the network are triggered by different features in the image. A convolutional layer, followed by an activation function, produces feature maps for the different characteristics of the image.
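Here is a minimal sketch of that idea in NumPy. The 3×3 filter below is an assumed vertical-edge detector chosen for illustration, not one learned by a trained network:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, recording one activation per position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(6, 6)          # toy 6x6 grayscale image
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])  # responds strongly to vertical edges

# ReLU activation on top of the convolution gives one feature map.
feature_map = np.maximum(convolve2d(image, edge_filter), 0)
print(feature_map.shape)              # (4, 4)
```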
These feature maps are then downsampled using a pooling operation called max pooling, which keeps only the maximum value from each patch of the feature map as a rough summary of its high-level content.
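A small sketch of 2×2 max pooling with stride 2, on a feature map whose values are invented for illustration:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size patch."""
    h, w = feature_map.shape
    patches = feature_map[:h - h % size, :w - w % size]
    patches = patches.reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

feature_map = np.array([[1., 5., 2., 0.],
                        [3., 4., 1., 1.],
                        [0., 0., 8., 6.],
                        [2., 1., 7., 9.]])

print(max_pool(feature_map))
# [[5. 2.]
#  [2. 9.]] -> 12 of the original 16 values never make it downstream
```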
As the example shows, only one value in every four survives, and the exact positions of the discarded values are lost entirely. Geoffrey E. Hinton, one of the pioneers of deep learning and artificial neural networks, pointed out this flaw when asked about his most controversial opinion in machine learning:
“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”
Let us consider the other weakness of CNNs with an example. A typical human face consists of two eyes, one nose, and one mouth.
Different parts of the CNN will be activated by different features of the face, leading the model to correctly conclude that the image is indeed a face. But consider what happens if we feed the CNN an image in which those same features have been rearranged - say, with the eyes, nose, and mouth in the wrong places.
To a human, this is clearly not a real face. But from the machine's perspective it is, since all the necessary features are present (i.e., two eyes, one nose, and one mouth).
A CNN is built to extract features from an image, not their relative positions or spatial relationships, which makes it hard for the network to differentiate between a real face and a jumbled one.
To tackle this problem, Hinton, together with his colleagues Sara Sabour and Nicholas Frosst, introduced the concept of capsule networks (CapsNet) and a method for training them in their 2017 paper, Dynamic Routing Between Capsules.
An artificial neural network is inspired by the individual neurons of the human brain. Hinton and Sabour also took their inspiration from the brain, but not from individual neurons: instead, they looked to its distinct regions or modules, which they treated as 'capsules'.
By combining this concept of capsules with a dynamic routing algorithm, they were able to estimate and preserve spatial properties such as size, orientation, and relative position, thereby overcoming the information loss caused by pooling operations.
The key difference between CapsNet and a traditional CNN is that capsules represent features as vectors rather than scalars, allowing for a more detailed representation. The difference becomes apparent when we compare how the two paradigms process their inputs.
CNNs:
1. Take scalar inputs
2. Multiply them by scalar weights
3. Compute the weighted sum
4. Pass the sum through a non-linear activation function such as sigmoid

CapsNet follows a similar process, but with vectors:
1. Multiply input vectors by weight matrices
2. Multiply the results by scalar weights
3. Compute the sum of the weighted vectors
4. Apply the non-linear 'squash' function
Let’s take a deeper look at capsule networks to understand how the technology functions.
1. Multiplying input vectors by weight matrices
Here, the initial inputs are in vector form and therefore carry more information than their scalar CNN counterparts.
First, the vectors are multiplied by weight matrices. These matrices encode spatial information that is generally not captured by CNNs.
The resulting product is a prediction of a high-level feature of the image. If the predictions from all the low-level features agree - for example, if the nose, eyes, and mouth in an image predict roughly the same outline for the face - the image is successfully classified as a face.
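A minimal NumPy sketch of this step; the capsule counts and dimensions are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

num_low, dim_low = 3, 8     # e.g., "eye", "nose", "mouth" capsules
num_high, dim_high = 2, 16  # e.g., "face", "not a face" capsules

# One learned weight matrix per (low-level, high-level) capsule pair.
W = rng.normal(size=(num_low, num_high, dim_high, dim_low))

# Output vectors of the low-level capsules.
u = rng.normal(size=(num_low, dim_low))

# Prediction vectors u_hat[i, j]: capsule i's guess at capsule j's pose.
u_hat = np.einsum('ijkl,il->ijk', W, u)
print(u_hat.shape)  # (3, 2, 16)
```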
2. Multiplying results by scalar weights
Capsule networks use dynamic routing to determine these scalar weights, known as coupling coefficients. Unlike the weight matrices of the previous step, which are learned by backpropagation, the coupling coefficients are computed on the fly by a routing-by-agreement procedure.
Below is a summary of the algorithm:
1. Initialize a routing logit b_ij to zero for every pair of low-level capsule i and high-level capsule j
2. Compute the coupling coefficients c_ij as a softmax over each low-level capsule's logits
3. Form each high-level capsule's total input as the weighted sum of the prediction vectors, and squash it to obtain the output v_j
4. Increase each b_ij by the agreement û_j|i · v_j, then repeat from step 2 for a fixed number of iterations

These coupling coefficients determine which higher-level capsule receives the current capsule's output.
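Here is a compact NumPy sketch of that loop, continuing the shapes from the previous step. The three routing iterations follow the paper; everything else is an illustrative assumption (the squash helper is explained in step 4 below):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Shrink a vector's length into (0, 1) while preserving its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing by agreement over prediction vectors.

    u_hat: (num_low, num_high, dim_high) prediction vectors from step 1.
    Returns the high-level capsule outputs v: (num_high, dim_high).
    """
    num_low, num_high, _ = u_hat.shape
    b = np.zeros((num_low, num_high))  # routing logits, start neutral
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('ij,ijk->jk', c, u_hat)   # weighted sum per high capsule
        v = squash(s)                           # non-linear squash
        b += np.einsum('ijk,jk->ij', u_hat, v)  # reward predictions that agree
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(3, 2, 16))  # toy prediction vectors
print(dynamic_routing(u_hat).shape)  # (2, 16)
```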
3. Computing the sum of weighted vectors
All the outputs from the previous step are summed up (nothing fancy here).
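In the paper's notation, this step is simply

```latex
s_j = \sum_i c_{ij}\,\hat{u}_{j|i}
```

where the c_ij are the coupling coefficients from the routing step and the û_j|i are the prediction vectors from step 1.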
4. Applying the non-linear 'squash' function
Non-linearity is introduced by an activation function called 'squash'. It rescales a vector so that its length falls between 0 and 1 while preserving its direction.
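The paper defines the squash function for a capsule's total input s_j as:

```latex
v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2}\,\frac{s_j}{\lVert s_j \rVert}
```

The first factor shrinks the vector's length into the interval (0, 1), so it can be read as the probability that the feature is present; the second factor preserves the direction. This is the squash helper used in the routing sketch above.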
CapsNet has clear advantages over traditional CNNs, such as preserving spatial information, minimizing the loss of features caused by pooling, and needing less data to train.
It has achieved state-of-the-art results on simple datasets like MNIST. However, on more complex datasets - for example, ImageNet and CIFAR-10 - capsules struggle to handle the denser data and the model underperforms. Like any other new technology, capsule networks will need continued research and development to become more robust, versatile, and computationally efficient.