Handling Imbalanced Classes in the Dataset

What Is an Imbalanced Dataset?

A dataset is imbalanced when its classes contain very different numbers of samples (instances). A model trained on such data tends to favor the majority class, and its accuracy score can look far better than the model really is. For example, suppose you have a simple dataset of 100 samples with 4 features and a target feature with 2 classes. If 80 samples belong to category 1 of the target and only 20 belong to category 2, the model sees far more examples of category 1 and becomes biased toward it during both training and prediction. Such a dataset is called an imbalanced dataset.
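To see why plain accuracy can mislead on such data, here is a minimal sketch built on the 80/20 example above. The `always_majority` helper is hypothetical and made up for illustration: it ignores the features entirely and always predicts the most frequent class.

```python
from collections import Counter

# 80 samples of category 1 and 20 of category 2, as in the example above
labels = ["category1"] * 80 + ["category2"] * 20

def always_majority(train_labels, n_predictions):
    """Hypothetical baseline: predict the most common training label every time."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return [majority_label] * n_predictions

predictions = always_majority(labels, len(labels))

# Overall accuracy: share of predictions that match the true label
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
# How many category 2 samples were actually recognised
minority_hits = sum(p == t == "category2" for p, t in zip(predictions, labels))

print("accuracy:", accuracy)                      # 0.8
print("category2 samples found:", minority_hits)  # 0
```

A "model" that has learnt nothing still scores 80% here, while missing every single category 2 sample.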

Let’s get our hands dirty by exploring an imbalanced dataset and the ways to handle imbalanced classes.

First, we take a dataset with 7 features and a target variable, so it has 8 columns in total.

We start by reading the dataset with the pandas “read_csv” function and printing its first few rows:

import pandas as pd
file = pd.read_csv("../input/ecoli.csv")
print(file.head())

Output:

   Mcg   Gvh   Lip  Chg   Aac  Alm1  Alm2     Class
0  0.49  0.29  0.48  0.5  0.56  0.24  0.35  positive
1  0.07  0.40  0.48  0.5  0.54  0.35  0.44  positive
2  0.56  0.40  0.48  0.5  0.49  0.37  0.46  positive
3  0.59  0.49  0.48  0.5  0.52  0.45  0.36  positive
4  0.23  0.32  0.48  0.5  0.55  0.25  0.35  positive

Next, we need to find how many categories the target variable “Class” has:

file['Class'].describe()

Output:

count          220
unique           2
top       positive
freq           143
Name: Class, dtype: object

As you can see, the “Class” feature has two unique categories. To find the exact count of each category:

f = file.groupby("Class")
f.count()

Output:
          Mcg  Gvh  Lip  Chg  Aac  Alm1  Alm2
Class
negative   77   77   77   77   77    77    77
positive  143  143  143  143  143   143   143

Well, it’s pretty straightforward: the target feature in our dataset has many more “positive” samples (143) than “negative” ones (77).
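To put a number on the imbalance, the counts can be turned into proportions. This is a small sketch that uses only the two counts from the output above; with the DataFrame at hand, `file['Class'].value_counts(normalize=True)` should give the same proportions directly.

```python
from collections import Counter

# Class counts taken from the groupby output above
class_counts = Counter({"positive": 143, "negative": 77})
total = sum(class_counts.values())

# Share of each class in the whole dataset
for label, count in class_counts.most_common():
    print(f"{label}: {count} samples ({count / total:.1%})")

# How many majority samples there are per minority sample
imbalance_ratio = class_counts["positive"] / class_counts["negative"]
print(f"majority-to-minority ratio: {imbalance_ratio:.2f}")
```

Roughly 65% of the samples are positive, so a model that always answers “positive” would already be right about two times out of three.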

Now let’s visualize this in a histogram. To do that, we first convert the “object” type of the Class column to int:

file['Class'] = file['Class'].map({'positive': 1, 'negative': 0})
file['Class'].hist()

 

The imbalance is easy to see once you visualize the data like this, isn’t it? Our dataset has more positive samples (1’s) than negative samples (0’s).

Before training our model, we need to find the most important features in our dataset, so that we can discard the features that contribute little or nothing to the model’s predictions. For that, we can use the feature importances of a “RandomForest” classifier.

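from sklearn.ensemble import RandomForestClassifier
feature_labels = ['Mcg','Gvh','Lip','Chg','Aac','Alm1','Alm2']
# The train/test split is only created later, so fit on the full dataset here
clf = RandomForestClassifier()
model = clf.fit(file[feature_labels], file['Class'])
# Pair each feature name with its importance score
for feature in zip(feature_labels,model.feature_importances_):
    print(feature)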

Output:
('Mcg', 0.11586269275979075)
('Gvh', 0.012807906652840087)
('Lip', 0.0)
('Chg', 0.0)
('Aac', 0.0117212198350282)
('Alm1', 0.48041880476655613)
('Alm2', 0.3791893759857849)

As you can see, the features ‘Chg’ and ‘Lip’ have an importance of 0.0, so they contribute nothing to the model. We can drop them and keep only the remaining features.

new_file = file[['Mcg','Gvh','Aac','Alm1','Alm2','Class']]
new_file.head()

Output:
    Mcg   Gvh   Aac  Alm1  Alm2  Class
0  0.49  0.29  0.56  0.24  0.35      1
1  0.07  0.40  0.54  0.35  0.44      1
2  0.56  0.40  0.49  0.37  0.46      1
3  0.59  0.49  0.52  0.45  0.36      1
4  0.23  0.32  0.55  0.25  0.35      1

Now, to make things clear, we split our dataset into train and test sets and evaluate the model to see how it predicts biased results. Let’s dive in:

from sklearn.model_selection import train_test_split
train, test = train_test_split(new_file,test_size=0.2) 
features_train = train[['Mcg','Gvh','Aac','Alm1','Alm2']] 
features_test = test[['Mcg','Gvh','Aac','Alm1','Alm2']] 
labels_train = train.Class 
labels_test = test.Class 
print(train.shape) 
print(test.shape)

Output:
(176, 6) 
(44, 6)

We split our dataset into train (80%) to train our model and test (20%) to evaluate our model. So we train our model with 176 samples and test our model on 44 samples.
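Because the split is random, the class ratio inside the train and test sets can change from run to run. A stratified split keeps the ratio fixed. Below is a hand-rolled sketch of the idea; `stratified_split` is a hypothetical helper written for illustration, and in practice scikit-learn's `train_test_split` accepts a `stratify` argument that serves this purpose.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) with the class ratio preserved."""
    rng = random.Random(seed)

    # Group the row indices by class label
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction out of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Labels with the same class counts as our dataset: 143 positive, 77 negative
labels = [1] * 143 + [0] * 77
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", train_counts)
print("test: ", test_counts)
```

Both parts keep roughly the 65/35 ratio of the full dataset, so the test score is measured on the same class mix every time.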

Now it’s time to train our model with the “RandomForest” classifier:

clf = RandomForestClassifier()
model = clf.fit(features_train, labels_train)
print("Accuracy of Randomforest Classifier:",clf.score(features_test,labels_test))

Output:
Accuracy of Randomforest Classifier: 1.0

As explained previously, the RandomForest classifier reports an accuracy of 100%. That looks perfect, but it is biased because there are more positive samples than negative ones (143 positive and 77 negative), so the score is dominated by the majority class.
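One way to check whether a score is carried by the majority class is to look at each class separately. The sketch below computes per-class recall and their average (often called balanced accuracy). The labels and predictions are invented for illustration and are not results from our model.

```python
def recall_per_class(y_true, y_pred):
    """Fraction of each true class that the predictions got right."""
    recalls = {}
    for label in sorted(set(y_true)):
        # Predictions made for the samples that truly belong to this class
        preds_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls[label] = sum(p == label for p in preds_for_class) / len(preds_for_class)
    return recalls

# Hypothetical test set: 29 positives (1) and 15 negatives (0)
y_true = [1] * 29 + [0] * 15
# Hypothetical predictions leaning toward the majority class:
# every positive is right, but only 6 of the 15 negatives are
y_pred = [1] * 29 + [0] * 6 + [1] * 9

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = recall_per_class(y_true, y_pred)
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(f"accuracy: {accuracy:.3f}")
print(f"recall per class: {recalls}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

In this made-up case the accuracy is close to 80%, yet the negative class is recognised only 40% of the time, which the single accuracy number hides.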

So, to handle this, we have two approaches:

  1. Over Sampling
  2. Under Sampling

Over Sampling:

Over sampling adds samples to the minority class until it has as many samples as the majority class.

Ex:

before sampling: Counter({1: 111, 0: 65})

after sampling: Counter({1: 111, 0: 111})

Note the counts of 1’s and 0’s before and after sampling.
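The simplest way to get those counts is random over sampling, which just duplicates minority samples picked at random. This is a minimal sketch of that idea, not of SMOTE; `random_oversample` and the placeholder feature rows are assumptions made for illustration.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class

    new_samples, new_labels = list(samples), list(labels)
    for label, count in counts.items():
        # All original samples of this class, used as the pool to draw copies from
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            new_samples.append(rng.choice(pool))
            new_labels.append(label)
    return new_samples, new_labels

# Same class counts as in the example above; the feature rows are placeholders
labels = [1] * 111 + [0] * 65
samples = [[float(i)] for i in range(len(labels))]

X_over, y_over = random_oversample(samples, labels)
print("before sampling:", Counter(labels))
print("after sampling:", Counter(y_over))
```

The majority class is left untouched and the minority class grows from 65 to 111 samples, all of them copies of existing rows.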

Under Sampling:

Under sampling removes samples from the majority class until it has as many samples as the minority class.

Ex:

before sampling: Counter({1: 111, 0: 65})

after sampling: Counter({0: 65, 1: 65})
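Random under sampling can be sketched in a few lines as well. The `random_undersample` helper and the placeholder feature rows below are assumptions made for illustration; the helper keeps a random subset of each class that is as large as the smallest class.

```python
import random
from collections import Counter

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of every class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())  # size of the minority class

    kept = []
    for label in counts:
        indices = [i for i, l in enumerate(labels) if l == label]
        # Draw without replacement so no sample is kept twice
        kept.extend(rng.sample(indices, target))
    kept.sort()  # preserve the original row order
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Same class counts as in the example above; the feature rows are placeholders
labels = [1] * 111 + [0] * 65
samples = [[float(i)] for i in range(len(labels))]

X_under, y_under = random_undersample(samples, labels)
print("before sampling:", Counter(labels))
print("after sampling:", Counter(y_under))
```

The price of this approach is visible in the counts: 46 majority samples are thrown away, along with whatever information they carried.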

There are several algorithms for over sampling and under sampling. The ones covered here are:

Over Sampling Algorithm:

  1. SMOTE – “Synthetic Minority Over Sampling Technique”. It takes samples from the minority class and creates new, similar synthetic instances by interpolating between a sample and its nearest minority class neighbors. These synthetic instances are then added to the original dataset, and the resulting dataset is used to train the classification models.
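The core step of SMOTE, creating a synthetic instance between a minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration and not the imbalanced-learn implementation; `smote_sketch`, its parameters, and the sample points are all made up for the example.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the line segment from base to neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority class samples with two features each
minority = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.3, 0.3], [0.25, 0.15]]
new_points = smote_sketch(minority, n_new=4)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line between two real minority samples, which is why the new instances are similar to the originals without being exact copies.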

Under Sampling Algorithm:

  1. RandomUnderSampler – Random Undersampling aims to balance class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.
  2. NearMiss – selects the majority class samples whose average distance to the three closest minority class samples is the smallest.
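The NearMiss selection rule described above can be sketched like this. It is a simplified illustration, not the imbalanced-learn implementation; `nearmiss_sketch` and the sample points are assumptions made for the example.

```python
import math

def nearmiss_sketch(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points closest, on average, to their k nearest minority points."""
    def avg_distance_to_closest_minority(point):
        closest = sorted(math.dist(point, m) for m in minority)[:k]
        return sum(closest) / len(closest)

    # Smallest average distance first, then cut off after n_keep points
    return sorted(majority, key=avg_distance_to_closest_minority)[:n_keep]

# Hypothetical points with two features each
minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
majority = [[0.2, 0.2], [1.0, 1.0], [0.1, 0.1], [2.0, 2.0], [0.5, 0.5]]

kept = nearmiss_sketch(majority, minority, n_keep=3)
print(kept)  # [[0.1, 0.1], [0.2, 0.2], [0.5, 0.5]]
```

The two majority points far away from the minority class are dropped, so the samples that remain are the ones near the boundary between the classes.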

With these short descriptions, you should have the overall picture of the sampling algorithms. Let’s implement over sampling in our code and check the accuracy of our model:

from collections import Counter
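# Recent imbalanced-learn releases provide the borderline variant as BorderlineSMOTE
# and use fit_resample (older releases: SMOTE(kind='borderline1').fit_sample)
from imblearn.over_sampling import BorderlineSMOTE
X_resampled, y_resampled = BorderlineSMOTE(kind='borderline-1').fit_resample(features_train, labels_train)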
print("before sampling:",format(Counter(labels_train)))
print("after sampling:",format(Counter(y_resampled)))

Output:
before sampling: Counter({1: 115, 0: 61})
after sampling: Counter({1: 115, 0: 115})

Now the counts of Positive and Negative classes are equal, as we over-sampled the Negative class to match the counts of the positive class using the SMOTE sampling algorithm. Now, let’s train our model again with the re-sampled data and evaluate.

clf1 = RandomForestClassifier()
clf1.fit(X_resampled, y_resampled)
print('Accuracy:',clf1.score(features_test,labels_test))

Output:
Accuracy: 0.9545454545454546

Well, the accuracy drops to about 95%, which is more believable than the biased accuracy of 100%.

What are you waiting for? Hurray, we have learnt what imbalanced classes in a dataset are and how to handle them in practice with sampling algorithms. 🙂