AutoKeras – a tough competitor for Google’s AutoML

Auto-Keras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search for architecture and hyperparameters of deep learning models.

We can create a deep learning model in just 4 lines of code:

import autokeras as ak                # automated architecture search built on Keras
clf = ak.ImageClassifier()            # create an image classifier
clf.fit(x_train, y_train)             # search for an architecture/hyperparameters and train
results = clf.predict(x_test)         # predict with the best model found

Simple, right? The preview version has been released, and the final official release is still awaited.

I’m quite sure this will help newcomers to deep learning get their hands dirty by building complex deep learning models with ease.

Handling Imbalanced Classes in the Dataset

What is an Imbalanced Dataset?

A dataset may contain an uneven number of samples/instances per class, which can make an algorithm appear to predict with an accuracy of 1.0 every time you run the model. For example, suppose you have a simple dataset with 4 features, a target feature with 2 classes, and 100 instances in total. Out of those 100, 80 instances belong to category 1 of the target feature and only 20 instances belong to category 2. This obviously biases the model during training and prediction. Such a dataset is called an imbalanced dataset.
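To make that concrete, here is a minimal sketch that generates such an 80/20 dataset with scikit-learn's make_classification (this is just an illustration, not the ecoli dataset used below):

from collections import Counter
from sklearn.datasets import make_classification

# roughly 80 samples of class 0 and 20 of class 1 (the split is approximate)
X, y = make_classification(n_samples=100, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=2, weights=[0.8, 0.2],
                           random_state=42)
print(Counter(y))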

Let’s get our hands dirty by exploring an imbalanced dataset and the measures to handle imbalanced classes.

For this walkthrough, we take a dataset with 7 features plus a target variable, so our dataset contains 8 columns in total.

First, we read the dataset with the “read_csv” method and print the head of the dataset:

import pandas as pd

file = pd.read_csv("../input/ecoli.csv")
print(file.head())

Output:

   Mcg   Gvh   Lip  Chg   Aac  Alm1  Alm2     Class
0  0.49  0.29  0.48  0.5  0.56  0.24  0.35  positive
1  0.07  0.40  0.48  0.5  0.54  0.35  0.44  positive
2  0.56  0.40  0.48  0.5  0.49  0.37  0.46  positive
3  0.59  0.49  0.48  0.5  0.52  0.45  0.36  positive
4  0.23  0.32  0.48  0.5  0.55  0.25  0.35  positive

Next, we need to find how many categories there are in the target variable “Class”. To do that:

file['Class'].describe()

Output:

count          220
unique           2
top       positive
freq           143
Name: Class, dtype: object

As you can see, there are two unique categories in the “Class” feature. Now we need to find the exact counts of the two categories. To do that:

f = file.groupby("Class")
f.count()

Output:
           Mcg Gvh Lip Chg Aac Alm1 Alm2 
Class
negative   77  77  77   77  77   77   77 
positive   143 143 143  143 143 143   143

Well, it’s pretty straightforward: the target feature in our dataset has more “positive” instances than “negative” ones.

Now we can visualize this with a histogram. To do that, we first convert the “object” type of “Class” to int:

file['Class'] = file['Class'].map({'positive': 1, 'negative': 0})
file['Class'].hist()

[Histogram of the Class column: the bar for 1 (positive) is taller than the bar for 0 (negative)]

It’s easy to understand when you visualize your data like this, isn’t it? Our dataset has more positive classes (1’s) and fewer negative classes (0’s).

Before training our model, we need to find the most important features in the dataset. This helps increase the accuracy of the model and lets us discard useless features that do not contribute to it. To do that, we can use the “RandomForest” classifier itself.

from sklearn.ensemble import RandomForestClassifier

feature_labels = ['Mcg','Gvh','Lip','Chg','Aac','Alm1','Alm2']
clf = RandomForestClassifier()
model = clf.fit(file[feature_labels], file['Class'])   # fit on all seven candidate features
for feature in zip(feature_labels, model.feature_importances_):
    print(feature)

Output:
('Mcg', 0.11586269275979075)
('Gvh', 0.012807906652840087)
('Lip', 0.0)
('Chg', 0.0)
('Aac', 0.0117212198350282)
('Alm1', 0.48041880476655613)
('Alm2', 0.3791893759857849)

As you can see, the features ‘Chg‘ and ‘Lip‘ contribute very little. So we can drop them and keep only the useful features.

new_file = file[['Mcg','Gvh','Aac','Alm1','Alm2','Class']]
new_file.head()

Output:
      Mcg  Gvh   Aac  Alm1  Alm2  Class 
0     0.49 0.29  0.56 0.24  0.35  1 
1     0.07 0.40  0.54 0.35  0.44  1
2     0.56 0.40  0.49 0.37  0.46  1 
3     0.59 0.49  0.52 0.45  0.36  1  
4     0.23 0.32  0.55 0.25  0.35  1
       

Now, to make things clear, we split our dataset into train and test sets and evaluate, to witness how our model produces biased predictions. Let’s dive in:

from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in newer scikit-learn
train, test = train_test_split(new_file, test_size=0.2)
features_train = train[['Mcg','Gvh','Aac','Alm1','Alm2']]
features_test = test[['Mcg','Gvh','Aac','Alm1','Alm2']]
labels_train = train.Class
labels_test = test.Class
print(train.shape)
print(test.shape)

Output:
(176, 6) 
(44, 6)

We split our dataset into a training set (80%) to train the model and a test set (20%) to evaluate it. So we train the model on 176 samples and test it on 44 samples.

Now it’s time to train our model using the “RandomForest” classifier:

clf = RandomForestClassifier()
model = clf.fit(features_train, labels_train)
print("Accuracy of Randomforest Classifier:",clf.score(features_test,labels_test))

Output:
Accuracy of Randomforest Classifier: 1.0

As explained previously, the RandomForest classifier produces an accuracy of 100%, which is biased by the fact that there are more positive classes than negative ones (143 positive versus 77 negative instances).
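Rather than relying on plain accuracy alone, it helps to look at per-class metrics. A minimal sketch, using scikit-learn’s classification_report on the same split and the variable names from the code above:

from sklearn.metrics import classification_report

# per-class precision/recall gives a clearer picture than plain accuracy on imbalanced data
predictions = clf.predict(features_test)
print(classification_report(labels_test, predictions, target_names=['negative', 'positive']))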

So, to handle this, we have two approaches:

  1. Over Sampling
  2. Under Sampling

Over Sampling:

It is nothing but sampling the minority class up until its size is equivalent to that of the majority class.

Ex:

before sampling: Counter({1: 111, 0: 65})

after sampling: Counter({1: 111, 0: 111})

Note: Compare the counts of 1’s and 0’s before and after sampling.
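For intuition, the simplest over-sampling method just duplicates minority-class rows at random. A minimal sketch with imbalanced-learn’s RandomOverSampler (assuming a recent release with fit_resample; SMOTE, used later in this post, creates synthetic samples instead of duplicates):

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

# duplicate minority-class rows until both classes have the same count
ros = RandomOverSampler(random_state=42)
X_over, y_over = ros.fit_resample(features_train, labels_train)
print("before:", Counter(labels_train))
print("after: ", Counter(y_over))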

Under Sampling:

It is nothing but sampling the majority class down until its size is equivalent to that of the minority class.

Ex:

before sampling: Counter({1: 111, 0: 65})

after sampling: Counter({0: 65, 1: 65})

There are several algorithms for over sampling and under sampling. The ones we look at here are:

Over Sampling Algorithm:

  1. SMOTE – “Synthetic Minority Over Sampling Technique”. A subset of data is taken from the minority class as an example and then new synthetic similar instances are created. These synthetic instances are then added to the original dataset. The new dataset is used as a sample to train the classification models.

Under Sampling Algorithm:

  1. RandomUnderSampler – Random Undersampling aims to balance class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.
  2. NearMiss – selects the majority class samples whose average distances to three closest minority class samples are the smallest.

I hope these short descriptions give you the overall picture of the sampling algorithms. Now let’s implement them in our code and check the accuracy of our model:

from collections import Counter
from imblearn.over_sampling import SMOTE

# note: in recent imbalanced-learn releases the kind='borderline1' option has moved to a
# separate BorderlineSMOTE class, and the method is fit_resample instead of fit_sample
X_resampled, y_resampled = SMOTE(kind='borderline1').fit_sample(features_train, labels_train)
print("before sampling:", Counter(labels_train))
print("after sampling:", Counter(y_resampled))

Output:
before sampling: Counter({1: 115, 0: 61})
after sampling: Counter({1: 115, 0: 115})

Now the counts of positive and negative classes are equal, since we over-sampled the negative class to match the count of the positive class using SMOTE. Let’s train our model again on the re-sampled data and evaluate it.

clf1 = RandomForestClassifier()
clf1.fit(X_resampled, y_resampled)
print('Accuracy:',clf1.score(features_test,labels_test))

Output:
Accuracy: 0.9545454545454546

Well, the accuracy drops to about 95%, which is more reasonable when compared to the biased accuracy of 100%.
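We only applied over-sampling above. For completeness, here is a minimal sketch of the under-sampling side using imbalanced-learn’s RandomUnderSampler (NearMiss is a drop-in alternative), again assuming a recent release with fit_resample:

from collections import Counter
from imblearn.under_sampling import RandomUnderSampler   # or: from imblearn.under_sampling import NearMiss

# drop majority-class rows at random until both classes have the same count, then retrain
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(features_train, labels_train)
print("after under-sampling:", Counter(y_under))

clf2 = RandomForestClassifier()
clf2.fit(X_under, y_under)
print('Accuracy:', clf2.score(features_test, labels_test))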

Hurray! We have learnt what imbalanced classes in a dataset are and how to handle them with various sampling algorithms, hands-on. 🙂

Data Mining

What is Data Mining?

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

Data Mining Process:

  1. Develop understanding of application, goals
  2. Create dataset for study (often from Data Warehouse)
  3. Data Cleaning and Preprocessing
  4. Data Reduction and projection
  5. Choose Data Mining task
  6. Choose Data Mining algorithms
  7. Use algorithms to perform task
  8. Interpret and iterate through 1-7 if necessary
  9. Deploy: integrate into operational systems.

As you can see, the core steps of data mining are steps 4 through 8. Discussing the data mining process also leads us to an important data mining methodology called “CRISP-DM“.

CRISP-DM:

“Cross Industry Standard Process for Data Mining” – a 6-phase model of the entire data mining process, from start to finish, that is broadly applicable across industries for a wide array of data mining projects.

As there are 6 phases, here is a short description of each:

  1. Business Understanding – Identifying the project objectives
  2. Data Understanding – Collect and review data
  3. Data Preparation – Select and clean data
  4. Modelling – Manipulate data and draw conclusions
  5. Evaluation – Evaluate model
  6. Deployment – Apply conclusions to business

Introduction to Pandas

In this blog, you will get to know how the pandas library in Python works, with hands-on examples.

Pandas is one of the most powerful toolkits for data manipulation and analysis, built on top of NumPy.

In pandas, there are two core data structures:

1. Series

2. DataFrame

Series:

A Series is essentially a one-dimensional (1-D) labelled array.

Example:
import pandas as pd
obj = pd.Series([1,2,3,4,5])
print(obj)

Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64

As you can see, the “obj” variable holds an array of “int64” values. It’s as simple as that to create a Series object.

Now we can do some basic arithmetic operations, like:

Adding two series objects:

x = pd.Series([2, 4, 6, 8, 10])
y = pd.Series([1, 3, 5, 7, 9])
add = x + y
print("Add:")
print(add)

Output:
Add:
0     3
1     7
2    11
3    15
4    19
dtype: int64

In the same way, we can do other arithmetic operations such as subtraction, multiplication, division and modulo.
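For instance, a quick sketch reusing the x and y Series from above:

sub = x - y   # element-wise subtraction
mul = x * y   # element-wise multiplication
div = x / y   # element-wise division
mod = x % y   # element-wise modulo
print(sub, mul, div, mod, sep="\n")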

Another handy feature of Series is that you can easily convert a Python dictionary (dict) into a Series object, as below:

data = {'India': 5000, 'America': 2500, 'Europe': 1000}
seriesobj = pd.Series(data)
print(seriesobj)

output:

India      5000
America    2500
Europe     1000
dtype: int64

We can also check whether any values in the Series object are null using the isnull() function:

seriesobj.isnull()

output:
India      False
America    False
Europe     False
dtype: bool

As you can see, the result of the above operation is of type “Boolean”. Series are super easy and flexible to use.

DataFrame:

A DataFrame, on the other hand, is a two-dimensional structure with rows and columns that represents tabular, spreadsheet-like data.

Creating a data frame is as simple as below:

import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

f = pd.DataFrame(exam_data,index=labels)

print(f)

Output:
attempts name       qualify score 
a 1     Anastasia     yes   12.5
b 3     Dima          no    9.0
c 2     Katherine     yes   16.5
d 3     James         no    NaN
e 2     Emily         no    9.0
f 3     Michael       yes   20.0
g 1     Matthew       yes   14.5
h 1     Laura         no    NaN
i 2     Kevin         no    8.0
j 1     Jonas         yes   19.0

We can play with DataFrames using various functions and methods. For example, to get basic information about a DataFrame, we can use the “info()” method.

f.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
attempts    10 non-null int64
name        10 non-null object
qualify     10 non-null object
score       8 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

Now that you are familiar with creating a DataFrame, we can play with sub-setting / slicing DataFrames.

Subsetting:

Sub-setting is a powerful indexing feature with which we can select and exclude variables / feature columns from a DataFrame. We can subset / slice a DataFrame in various ways:

a. Sub-setting by specifying the number of rows

First 3 rows of the dataframe

f[:3]

Output:
attempts    name       qualify score 
a 1         Anastasia   yes    12.5
b 3         Dima        no     9.0
c 2         Katherine   yes    16.5

b. Sub-setting using the column names

f_new = f[['name','score']]
f_new

Output:
  name       score
a Anastasia   12.5
b Dima        9.0
c Katherine   16.5
d James       NaN
e Emily       9.0
f Michael     20.0
g Matthew     14.5
h Laura       NaN
i Kevin       8.0
j Jonas       19.0

c. Sub-setting specific rows [1, 3, 5, 6] of specific columns from the data frame.

# .ix has been removed from recent pandas versions; select the columns by name and the rows by position
f[['name','score']].iloc[[1,3,5,6]]

Output:
  name     score
b Dima      9.0
d James     NaN
f Michael   20.0
g Matthew   14.5

d. Sub-setting based on some Logical Conditions

Selecting the rows with 'score' values between 15 and 20 (both inclusive)
Example:
f[f['score'].between(15,20)]

Output:
attempts  name       qualify   score 
c 2       Katherine    yes      16.5
f 3       Michael      yes      20.0
j 1       Jonas        yes      19.0
Selecting the rows with 'attempts' < 2 and 'score' > 15
Example:

f[(f['score']>15) & (f['attempts']<2)]

Output:
  attempts    name     qualify    score
j   1         Jonas      yes       19.0

As you can see, the DataFrame is powerful and flexible for working with structured data. We can also explore a few more DataFrame features, such as adding and dropping rows and columns.

a. Adding a new row to the data frame:

f.loc['k'] = [1,"Suresh",'yes',15.5]
f

Output:
attempts  name   qualify   score 
a 1 Anastasia      yes      12.5
b 3 Dima           no       9.0
c 2 Katherine      yes      16.5
d 3 James          no       NaN
e 2 Emily          no       9.0
f 3 Michael        yes      20.0
g 1 Matthew        yes      14.5
h 1 Laura          no       NaN
i 2 Kevin          no       8.0
j 1 Jonas          yes      19.0
k 1 Suresh         yes      15.5

b. Dropping the newly added row in the data frame

f = f.drop('k')
f
Output:
attempts name   qualify   score 
a 1 Anastasia     yes      12.5
b 3 Dima          no       9.0
c 2 Katherine     yes      16.5
d 3 James         no       NaN
e 2 Emily         no       9.0
f 3 Michael       yes      20.0
g 1 Matthew       yes      14.5
h 1 Laura         no       NaN
i 2 Kevin         no       8.0
j 1 Jonas         yes      19.0 

c. Dropping the columns from the data frame.

f = f.drop('attempts', axis=1)   # pass axis=1 explicitly; the bare positional axis argument is deprecated
f

Output:
    name      qualify   score 
a  Anastasia    yes      12.5
b  Dima         no       9.0
c  Katherine    yes      16.5
d  James        no       NaN
e  Emily        no       9.0
f  Michael      yes      20.0
g  Matthew      yes      14.5
h  Laura        no       NaN
i  Kevin        no       8.0
j  Jonas        yes      19.0 

d. Adding a new column to the data frame.

color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']
f['color'] = color
f

Output:
        name qualify  score   color
a  Anastasia     yes   12.5     Red
b       Dima      no    9.0    Blue
c  Katherine     yes   16.5  Orange
d      James      no    NaN     Red
e      Emily      no    9.0   White
f    Michael     yes   20.0   White
g    Matthew     yes   14.5    Blue
h      Laura      no    NaN   Green
i      Kevin      no    8.0   Green
j      Jonas     yes   19.0     Red

So with all this, I hope you have gained a good sense of the pandas library and how it helps data analysts with data manipulation and analysis. This is just the beginning; there is a lot more to come, and you can get your hands dirty by looking at the official documentation for Series (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) and DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).