Deep learning and neural networks have been among the most sought-after and robust machine learning paradigms of the last few years. One such network is the convolutional neural network (CNN), used mainly in image recognition, image classification, and computer vision. Another is the capsule network (CapsNet), which adds structure to CNNs.
This article will look at CapsNet, its relation to neural networks, and how it stacks up against CNNs.
Before CNNs, features were extracted from images manually, which was both time-consuming and inefficient. CNNs changed this by leveraging the principles of linear algebra and matrix multiplication to automate and scale the laborious task of extracting features and patterns from images.
That said, CNNs require substantial hardware investment (especially in GPUs) to run efficiently. The cost is worth it: the technology has revolutionized fields such as facial recognition, autonomous cars, object detection, natural language processing, and cancer detection.
However, as with any technology, regular and rigorous use of CNNs has led researchers to discover their Achilles' heel.
To understand the problem with CNN, we need to know what goes on under the hood.
A CNN extracts features and patterns, which in turn help an algorithm recognize the object in an image. A computer perceives an image as a matrix whose values are the RGB intensities at each point. In other words, images are essentially just matrices.
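For instance, a tiny 2×2 RGB image can be written out directly as a matrix of intensity values. Here is a minimal sketch in NumPy, with made-up pixel values:

```python
import numpy as np

# A tiny 2x2 RGB image: each pixel holds three intensity values (R, G, B),
# each ranging from 0 (darkest) to 255 (brightest).
image = np.array([
    [[255, 0, 0], [0, 255, 0]],   # red pixel,  green pixel
    [[0, 0, 255], [40, 40, 40]],  # blue pixel, dark gray pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): height x width x color channels
print(image[0, 0])  # [255   0   0] -> the top-left pixel is pure red
```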
When training a CNN, different parts of the network are triggered by different features in the image. A convolutional layer, followed by an activation function, produces feature maps for the different characteristics of the image.
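Here is a minimal sketch of that idea in NumPy. The 3×3 filter below is an assumed vertical-edge detector chosen for illustration, not one learned by a trained network:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, recording one activation per position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(6, 6)          # toy 6x6 grayscale image
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])  # responds strongly to vertical edges

# ReLU activation on top of the convolution gives one feature map.
feature_map = np.maximum(convolve2d(image, edge_filter), 0)
print(feature_map.shape)              # (4, 4)
```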
These feature maps are then downsampled using a pooling operation called max pooling, which keeps only the maximum value from each patch of the feature map as a rough summary of its high-level content.
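A small sketch of 2×2 max pooling with stride 2, on a feature map whose values are invented for illustration:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size patch."""
    h, w = feature_map.shape
    patches = feature_map[:h - h % size, :w - w % size]
    patches = patches.reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

feature_map = np.array([[1., 5., 2., 0.],
                        [3., 4., 1., 1.],
                        [0., 0., 8., 6.],
                        [2., 1., 7., 9.]])

print(max_pool(feature_map))
# [[5. 2.]
#  [2. 9.]] -> 12 of the original 16 values never make it downstream
```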
As the example shows, only one value in every four survives, and the exact positions of the discarded values are lost entirely. Geoffrey E. Hinton, one of the pioneers of deep learning and artificial neural networks, pointed out this flaw when asked about his most controversial opinion in machine learning:
“The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.”
Let us consider the other weakness of CNNs with an example. A typical human face consists of two eyes, one nose, and one mouth.
Different parts of the CNN will be activated by different features of the face, leading the model to correctly conclude that the image is indeed a face. But consider what happens if we feed the CNN an image in which those same features have been rearranged - say, with the eyes, nose, and mouth in the wrong places.
To a human, this is clearly not a real face. But from the machine's perspective it is, since all the necessary features are present (i.e., two eyes, one nose, and one mouth).
A CNN is built to extract features from an image, not their relative positions or spatial relationships, which makes it hard for the network to differentiate between a real face and a jumbled one.
To tackle this problem, Hinton, together with his colleagues Sara Sabour and Nicholas Frosst, introduced the concept of capsule networks (CapsNet) and a method for training them in their 2017 paper, Dynamic Routing Between Capsules.
An artificial neural network is inspired by the individual neurons of the human brain. Hinton and Sabour also took their inspiration from the brain, but not from individual neurons: instead, they looked to its distinct regions or modules, which they treated as 'capsules'.
By combining this concept of capsules with a dynamic routing algorithm, they were able to estimate and preserve spatial properties such as size, orientation, and relative position, thereby overcoming the information loss caused by pooling operations.
The key difference between CapsNet and a traditional CNN is that capsules represent features as vectors rather than scalars, allowing for a more detailed representation. The difference becomes apparent when we compare how the two paradigms process their inputs.
CNNs:
1. Take scalar inputs
2. Multiply them by scalar weights
3. Compute the weighted sum
4. Pass the sum through a non-linear activation function such as sigmoid

CapsNet follows a similar process, but with vectors:
1. Multiply input vectors by weight matrices
2. Multiply the results by scalar weights
3. Compute the sum of the weighted vectors
4. Apply the non-linear 'squash' function
Let’s take a deeper look at capsule networks to understand how the technology functions.
1. Multiplying input vectors by weight matrices
Here, the initial inputs are in vector form and therefore carry more information than their scalar CNN counterparts.
First, the vectors are multiplied by weight matrices. These matrices encode spatial information that is generally not captured by CNNs.
The resulting product is a prediction of a high-level feature of the image. If the predictions from all the low-level features agree - for example, if the nose, eyes, and mouth in an image predict roughly the same outline for the face - the image is successfully classified as a face.
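A minimal NumPy sketch of this step; the capsule counts and dimensions are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

num_low, dim_low = 3, 8     # e.g., "eye", "nose", "mouth" capsules
num_high, dim_high = 2, 16  # e.g., "face", "not a face" capsules

# One learned weight matrix per (low-level, high-level) capsule pair.
W = rng.normal(size=(num_low, num_high, dim_high, dim_low))

# Output vectors of the low-level capsules.
u = rng.normal(size=(num_low, dim_low))

# Prediction vectors u_hat[i, j]: capsule i's guess at capsule j's pose.
u_hat = np.einsum('ijkl,il->ijk', W, u)
print(u_hat.shape)  # (3, 2, 16)
```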
2. Multiplying results by scalar weights
Capsule networks use dynamic routing to determine these scalar weights, known as coupling coefficients. Unlike the weight matrices of the previous step, which are learned by backpropagation, the coupling coefficients are computed on the fly by a routing-by-agreement procedure.
Below is a summary of the algorithm:
1. Initialize a routing logit b_ij to zero for every pair of low-level capsule i and high-level capsule j
2. Compute the coupling coefficients c_ij as a softmax over each low-level capsule's logits
3. Form each high-level capsule's total input as the weighted sum of the prediction vectors, and squash it to obtain the output v_j
4. Increase each b_ij by the agreement û_j|i · v_j, then repeat from step 2 for a fixed number of iterations

These coupling coefficients determine which higher-level capsule receives the current capsule's output.
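Here is a compact NumPy sketch of that loop, continuing the shapes from the previous step. The three routing iterations follow the paper; everything else is an illustrative assumption (the squash helper is explained in step 4 below):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Shrink a vector's length into (0, 1) while preserving its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing by agreement over prediction vectors.

    u_hat: (num_low, num_high, dim_high) prediction vectors from step 1.
    Returns the high-level capsule outputs v: (num_high, dim_high).
    """
    num_low, num_high, _ = u_hat.shape
    b = np.zeros((num_low, num_high))  # routing logits, start neutral
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('ij,ijk->jk', c, u_hat)   # weighted sum per high capsule
        v = squash(s)                           # non-linear squash
        b += np.einsum('ijk,jk->ij', u_hat, v)  # reward predictions that agree
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(3, 2, 16))  # toy prediction vectors
print(dynamic_routing(u_hat).shape)  # (2, 16)
```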
3. Computing the sum of weighted vectors
All the outputs from the previous step are summed up (nothing fancy here).
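In the paper's notation, this step is simply

```latex
s_j = \sum_i c_{ij}\,\hat{u}_{j|i}
```

where the c_ij are the coupling coefficients from the routing step and the û_j|i are the prediction vectors from step 1.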
4. Applying the non-linear 'squash' function
Non-linearity is introduced by an activation function called 'squash'. It rescales a vector so that its length falls between 0 and 1 while preserving its direction.
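The paper defines the squash function for a capsule's total input s_j as:

```latex
v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2}\,\frac{s_j}{\lVert s_j \rVert}
```

The first factor shrinks the vector's length into the interval (0, 1), so it can be read as the probability that the feature is present; the second factor preserves the direction. This is the squash helper used in the routing sketch above.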
CapsNet has clear advantages over traditional CNNs, such as preserving spatial information, minimizing the loss of features caused by pooling, and needing less data to train.
It has achieved state-of-the-art results on simple datasets like MNIST. However, on more complex datasets - for example, ImageNet and CIFAR-10 - capsules struggle to handle the denser data and the model underperforms. Like any other new technology, capsule networks will need continued research and development to become more robust, versatile, and computationally efficient.