Classification is one of the core tasks in machine learning. When we try to label observations from an input dataset into categories, there are numerous challenges we have to deal with in the data. An imbalanced dataset is one such challenge: it contains two classes, where one class has far more observations than the other. There are different techniques to deal with it, but SMOTE - Synthetic Minority Oversampling Technique - is one of the most popular.
This article will explore imbalanced data and how it can be handled using the SMOTE algorithm.
An imbalanced dataset contains two classes of observations: the class with the larger number of observations is the majority class, while the class with fewer observations is the minority class.
Let’s understand this with the help of an example:
Suppose we want to build a model that will help us identify whether a given patient has a tumor or not. There are 1000 patients, with 900 being non-cancer patients and the other 100 being cancer patients. Since the non-cancer patients are high in number, they will belong to the majority class while the rest will belong to the minority class.
As the purpose of our model is to predict whether someone has cancer or not, the focus is primarily on the minority class. Here, however, the majority class is nine times larger than the minority class. This makes the dataset imbalanced: a model trained on it will lean towards the majority class and deliver high accuracy simply by predicting non-cancer patients, even though that is not the main objective of building our system.
Even though our job is to build a model that identifies cancer patients, the algorithm tends to incline towards the majority class. Note that a dataset does not need a 9:1 split to be imbalanced; even when the majority class is only two or three times the size of the minority class, we still consider it an imbalanced dataset.
Suppose the model predicts "non-cancer" for all 1000 records. It will still achieve 90% accuracy, since 900 of the patients are indeed non-cancer patients, yet it will fail to identify a single cancer patient. Thus, despite its high accuracy, the model does not give the results we actually need.
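To make this concrete, here is a minimal sketch with made-up labels in the same 900/100 proportion (not the article's dataset), showing how an always-majority predictor scores 90% accuracy while catching zero cancer patients:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 900 non-cancer patients (label 0) and 100 cancer patients (label 1)
y_true = np.array([0] * 900 + [1] * 100)

# a "model" that always predicts the majority class
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))                  # 0.9 -> looks impressive
print(recall_score(y_true, y_pred, zero_division=0))   # 0.0 -> misses every cancer patient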
An imbalanced dataset is a common classification problem. A well-known example is identifying whether an email is spam or not; fraud detection and rare-disease diagnosis are other typical cases where one class heavily outweighs the other.
Before learning about SMOTE’s functionality, it’s important to understand two important terms: undersampling and oversampling.
The purpose of undersampling is to reduce the size of the majority class. We do this by removing some of its observations. There are two ways of doing so: in the first method, we randomly remove records from the majority class, which is known as random undersampling; in the second, we use statistical methods to decide which majority-class records to remove, known as informed undersampling.
These undersampling methods can also use data cleaning techniques to further refine the majority class. Undersampling is generally not preferred because there is a chance of losing valuable information, and removing data to even out the proportion of the two classes can introduce bias.
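As an illustration, the snippet below sketches random undersampling with imblearn's RandomUnderSampler; here X and y stand for a generic feature matrix and label vector, not the dataset used later in this article:

from imblearn.under_sampling import RandomUnderSampler

# randomly drop majority-class rows until both classes are the same size
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_under, y_under = rus.fit_resample(X, y)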
Oversampling is the opposite of undersampling. The objective is to increase the number of samples in the minority class so that the observations of both classes become equal. Unlike undersampling, where we remove observations, in oversampling we add new ones to the dataset. It can be achieved in two ways: random oversampling and synthetic oversampling.
In random oversampling, we replicate existing minority-class records and add them to the dataset to grow the minority class. Synthetic oversampling, on the other hand, generates artificial samples for the minority class; the new samples are created in such a way that they add relevant information while avoiding misclassification. The main drawback of random oversampling is that duplicating the same information can lead to overfitting.
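Random oversampling can be sketched the same way with imblearn's RandomOverSampler (again, X and y are placeholders for any feature matrix and labels):

from imblearn.over_sampling import RandomOverSampler

# duplicate randomly chosen minority-class rows until both classes are the same size
ros = RandomOverSampler(sampling_strategy=1.0, random_state=42)
X_over, y_over = ros.fit_resample(X, y)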
SMOTE is an oversampling method that creates new, synthetic observations from existing samples of the minority class. Rather than simply duplicating existing data, it generates new records whose feature values lie close to those of the minority class, a form of data augmentation. Each new synthetic training record is made by randomly selecting one of the k nearest neighbors of a minority-class observation and interpolating between them. After oversampling is complete, the class imbalance is reduced and we are ready to train and compare different classification models.
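To show the interpolation idea, here is a simplified sketch of how a single synthetic sample could be generated. It mimics SMOTE's core step on a tiny made-up minority class and is not the full imblearn implementation:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# a tiny, made-up minority class with two features
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

# find each minority point's nearest minority neighbors (the first hit is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(X_minority)
_, idx = nn.kneighbors(X_minority)

# synthesize one new sample from the first minority point
x = X_minority[0]
neighbor = X_minority[rng.choice(idx[0][1:])]   # pick one of its true neighbors at random
synthetic = x + rng.random() * (neighbor - x)   # random point on the segment between them
print(synthetic)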
Below are the steps to implement the SMOTE algorithm:
1. The first step is to import all the necessary libraries. We will also install the imbalanced-learn package (imblearn), along with NumPy and Pandas - two important libraries.
# install the libraries (run in a terminal)
pip install imblearn

import numpy as np
import pandas as pd
2. The next task is to load the dataset. read_csv is used to load a CSV file as a Pandas dataframe.
# read csv data - salary.csv
df = pd.read_csv('salary.csv')
3. After the data is successfully loaded, it’s time to analyze class distribution.
# analyze the class distribution
emp_inf = df['Class'].value_counts()
print(emp_inf)
print("\n\tClass 0: {:0.2f}%".format(100 * emp_inf[0] / (emp_inf[0] + emp_inf[1])))
print("\n\tClass 1: {:0.2f}%".format(100 * emp_inf[1] / (emp_inf[0] + emp_inf[1])))

Here, the output is:
Class 0: 98.83%
Class 1: 1.17%
4. Once the analysis is done, we will split the data into train and test sets.
from sklearn.model_selection import train_test_split

X = df.drop(columns=['Emp_Cal', 'Class'])
y = df['Class']

# split into train (diver) and test (runner) sets, preserving the class ratio
X_diver, x_runner, y_diver, y_runner = train_test_split(X, y, random_state=100, test_size=0.5, stratify=y)
5. Finally, we will evaluate the results without SMOTE as well as with SMOTE.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# baseline model trained on the imbalanced data (no SMOTE)
first_model = LogisticRegression()
first_model.fit(X_diver, y_diver)
deter = first_model.predict(x_runner)

print("Prediction_Score", accuracy_score(y_runner, deter))
print(classification_report(y_runner, deter))
sns.heatmap(confusion_matrix(y_runner, deter), annot=True, fmt='.2g')
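When reading these baseline results, keep in mind that overall accuracy will look high simply because class 0 dominates; the recall and F1-score for class 1 in the classification report are the numbers that reveal how poorly the minority class is being predicted.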
Now, we will use the SMOTE module from imblearn.
print("\n\t Pre Oversampling, counts of label '1': {}".format(sum(y_runner == 1))) print("\n\t Pre Oversampling, counts of label '0': {}".format(sum(y_runner == 0))) # import SMOTE for sampling from imblearn.over_sampling import SMOTE sm = SMOTE(sampling_strategy = 0.3, k_neighbors = 5, random_state = 100) X_diver_res, y_diver_res = sm.fit_sample(x_diver, y_diver.ravel()) # Print the oversampling results print(“\n\t Post OverSampling, the shape of diver_X: {}”.format(X_diver_res.shape)) print(“\n\t Post OverSampling, the shape of diver_y: {}”.format(y_diver_res.shape)) print("Post OverSampling, label count '1': {}".format(sum(y_diver_res == 1))) print("Post OverSampling, label count '0': {}".format(sum(y_diver_res == 0)))
Results of SMOTE:
# train a second model on the oversampled training data
lnr = LogisticRegression()
lnr.fit(X_diver_res, y_diver_res.ravel())
predictions = lnr.predict(x_runner)

# print the outcomes of SMOTE
print("Predicting_Score", accuracy_score(y_runner, predictions))
print(classification_report(y_runner, predictions))
sns.heatmap(confusion_matrix(y_runner, predictions), annot=True, fmt='.2g')
As discussed, an imbalanced dataset contains two classes where one class (the majority class) has an excessively higher number of observations than the other (the minority class). Because of this, a model trained on it does not yield the expected results, so imbalanced data needs to be dealt with to ensure that the machine learning model is effective. With the help of SMOTE, we can increase the number of observations in the minority class in a balanced way, helping the model become more effective. Note that SMOTE has its own drawbacks: because the synthetic samples are interpolated from existing minority observations, it can introduce noise near class boundaries and, like any oversampling technique, can contribute to overfitting.