Getting Started with Docker!

In this post, let's get hands-on with Docker containers: what Docker is, how it differs from virtual machines, and why Docker is often the lighter, more powerful choice. The internal working of Docker is explained in simple terms for easy understanding.

So, let's get started!

What is Docker?

In simple words, Docker is a platform for developing and deploying applications by isolating them in containers.

What is a Virtual Machine?

A virtual machine is like an emulator: it lets us run an operating system in an app window on our desktop that behaves like a full, separate computer, so developers can develop and deploy applications inside it.

Difference between Virtual Machines and Docker

The key difference is that a virtual machine isolates the entire system, whereas a Docker container isolates only the application.

Virtual Machine Architecture

In the architecture diagram above, the layers are:

  1. Infrastructure – the physical hardware (laptops, servers)
  2. Host OS – the operating system installed on that hardware (Linux, Windows, macOS)
  3. Hypervisor – the “managing director” that allocates the host's resources to the virtual machines and controls their access to them
  4. Guest OS – the guest operating system that the developer wishes to run (for example, various flavours of Linux)
  5. Bins/Libs – the binaries and libraries associated with each guest operating system, which take up additional space
  6. App1, App2, App3 – the applications, each running on its own guest operating system

Docker Architecture

  1. Infrastructure, Host OS, Bins/Libs and Apps – the same as in the virtual machine architecture.
  2. Docker Daemon – takes the place of the hypervisor: it provides the interface to the host and isolates the applications from the host operating system. Containers share the host's kernel, so there is no Guest OS layer.

With this, you should have a clear picture of the difference between virtual machines and Docker containers.

Let's make it clearer with a simple “hello-world” example:

Running Docker “hello-world”:

First, install Docker Desktop for the operating system you are using.

The diagram above shows, in simple terms, how Docker works.

After installing, run the command below from your favourite command prompt:

$ docker run hello-world

“hello-world” is an official Docker image available on Docker Hub. Running it is the container equivalent of running a “hello world” program.

When you run this command, Docker first looks for the image locally. On a first run the image is not on your system yet, so Docker pulls it from Docker Hub, starts a container from it and streams the output to the terminal as follows:

Hello from Docker!

With this, you now know what Docker is and how it works internally. In the next tutorial, we will explore the terminology of the Docker world in more detail!

Cheers 🙂

Linear Discriminant Analysis (LDA)

It's Linear Discriminant Analysis (LDA) to start off this new year. To keep the points clear and easy to understand, the definitions used to explain the concepts are kept short.

To understand Linear Discriminant Analysis, we first need to understand the term “Dimensionality Reduction”.

What is Dimensionality Reduction?

Dimensionality reduction is a technique that reduces the number of dimensions by removing redundant and dependent features, transforming the features from a higher-dimensional space to a lower-dimensional space.

There are 2 major types of Dimensionality Reduction:

  1. Supervised Dimensionality Reduction Technique
  2. Unsupervised Dimensionality Reduction Technique

Supervised Dimensionality Reduction Technique

A supervised technique is one where the class labels are taken into account for dimensionality reduction.

Examples – Neural Networks (NN), Mixture Discriminant Analysis (MDA), Linear Discriminant Analysis (LDA)

Unsupervised Dimensionality Reduction Technique

An unsupervised technique is one that needs no class labels for dimensionality reduction.

Examples – Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorisation (NMF).

What is Linear Discriminant Analysis?

Linear Discriminant Analysis (LDA) is a dimensionality reduction technique that transforms the features into a lower-dimensional space which maximises the ratio of between-class variance to within-class variance, thereby giving maximum class separability.

There are 2 types of LDA:

  • Class-dependent LDA
  • Class-Independent LDA

Class-Dependent LDA

In this method, a separate low-dimensional space is created for each class, and that class's data is projected onto it.

Class-Independent LDA

In this method, each class is still treated as a separate class, but a single low-dimensional space is created for all the classes, and all the data points are projected onto it.

Steps to calculate LDA:

  1. Calculate the separability between the different classes (the distance between the class means) – Between-class variance
  2. Calculate the distance between the mean and the samples of each class – Within-class variance
  3. Construct the low-dimensional space which maximises the between-class variance and minimises the within-class variance.
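
The three steps above can be sketched in NumPy as follows. This is only a minimal illustration of class-independent LDA, not a reference implementation: the function name lda_fit, the toy data and the use of the pseudo-inverse are assumptions made for this example.

```python
import numpy as np

def lda_fit(X, y, n_components):
    """Class-independent LDA sketch: one projection matrix shared by all classes."""
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_B = np.zeros((n_features, n_features))  # between-class scatter
    S_W = np.zeros((n_features, n_features))  # within-class scatter
    for c in np.unique(y):
        X_c = X[y == c]
        mean_c = X_c.mean(axis=0)
        # Step 1: spread of the class means around the overall mean
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)
        # Step 2: spread of each sample around its own class mean
        centered = X_c - mean_c
        S_W += centered.T @ centered
    # Step 3: the leading eigenvectors of pinv(S_W) @ S_B span the low-dimensional space
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_components]]

# Toy data (assumed): two well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([3, 3], 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

W = lda_fit(X, y, n_components=1)
projected = X @ W  # 2-D points reduced to 1-D
print(projected.shape)
```

With two classes, the between-class matrix has rank one, so a single projection direction carries all of the class separation.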

By default, Linear Discriminant Analysis (LDA) uses the class-independent method to reduce the dimensionality of the features.

Problems with the Class-Dependent LDA method

  • It requires more CPU time, as a separate low-dimensional space is created for each class.
  • It leads to the Small Sample Size problem.

Problems with LDA

  • Linear Discriminant Analysis (LDA) fails to find the low-dimensional space if the number of dimensions is higher than the number of samples in the data, because the within-class matrix becomes singular (Singularity).

Common ways to work around the singularity problem:

  • Removing the null space of the within-class matrix
  • Using an intermediate subspace – Principal Component Analysis (PCA)
  • Regularisation


Extracting Feature Vectors using VGG16

In this post, I'm going to show you how to extract feature vectors from images and use those feature vectors to build a Random Forest (RF) model.

We are going to use the VGG16 model (you are free to use any pre-trained model that suits your problem statement) to extract the feature vectors of the images.

The extraction begins by specifying the directory of images, using the VGG16 model to predict a feature vector for each image, and appending each feature vector to a list.

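import os
import numpy as np
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16, preprocess_input

# Assumed setup (not shown in the original): VGG16 without its classifier head
model = VGG16(weights='imagenet', include_top=False)

img_path = r'E:\Thesis\Try1\green'
feature_green = []

for each in os.listdir(img_path):
    path = os.path.join(img_path, each)
    img = image.load_img(path, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)   # add the batch dimension
    img_data = preprocess_input(img_data)
    feature = model.predict(img_data)
    # Flatten the feature map into a 1-D vector and store it
    feature_green.append(feature.flatten())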

Since we have 3 classes of images in total, we need to repeat this for each of the three classes and write the feature vectors into a dataframe along with their labels.

After that, it's the usual routine: split the data into train and test sets, build the random forest model and evaluate its accuracy.

Cheers 🙂

Transfer Learning for Image Classification

In this blog post, we are going to explore how the Transfer Learning technique helps us overcome the computational challenges of building a neural network from scratch and training it on images.

Generally, it's computationally hard and expensive to train a network on images, and it usually requires GPU support.

Transfer Learning makes this training far cheaper by reusing a model that has already been trained.

Oxford's Visual Geometry Group developed the VGG16 model and trained it on the ImageNet database, which contains more than a million labelled images.

Let's dive into transfer learning.

Let's begin by importing the VGG16 model and the Keras layers for building the fully connected layers.

from keras.layers import Dense,Conv2D,MaxPooling2D,Dropout,Flatten,Input
from keras.models import Sequential, Model
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.vgg16 import VGG16

In this example, we are going to classify three classes of traffic light signals – Red, Green and Yellow. The input images are expected to have shape (224, 224, 3), with the RGB colour channels last.

image_input = Input(shape=(224, 224, 3))
model = VGG16(input_tensor=image_input, include_top=False, weights= 'imagenet')

As you can see, we are not including the top (fully connected) layers, and we are using the ImageNet weights for our VGG16 model, which reduces the training computation.

Now we need to build our own new layers to append on top of the base VGG16 model for our classification problem.

top_model = Sequential()
top_model.add(Conv2D(32, kernel_size=(3, 3),activation='relu',input_shape=model.output_shape[1:],padding='same'))
top_model.add(MaxPooling2D((2,2), padding='same'))
top_model.add(Conv2D(64, kernel_size= (3,3),activation='relu',padding='same'))
top_model.add(MaxPooling2D((2,2), padding='same'))

Since the base VGG16 model is already trained, it is good at extracting patterns, edges and textures from images, so we don't need to train it again. We freeze the base model and train only the newly appended layers.

# Freeze all but the last 8 layers so that only those are trained
for layer in top_model.layers[:-8]:
    layer.trainable = False


After building the model, it's time to fit it on our training data and evaluate its accuracy on the test data. We are using “accuracy” as our evaluation metric, “Adam” as the optimiser and “categorical_crossentropy” as the loss function.

top_model.compile(optimizer = 'Adam', loss='categorical_crossentropy', metrics=['accuracy'])
from keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_acc', min_delta=0.0001, patience=3,
                          verbose=1, mode='auto')
callbacks_list = [earlystop]


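data_generator = ImageDataGenerator(rescale=1./255)

# The remaining arguments were cut off in the original listing. Values marked
# "assumed" are placeholders, not the author's settings.
data_generator_with_aug = ImageDataGenerator(horizontal_flip=True)

train_generator = data_generator_with_aug.flow_from_directory(r'E:\Thesis\Try1\Train',
                   target_size = (224, 224),
                   class_mode = 'categorical')   # assumed, matches the loss

validation_generator = data_generator.flow_from_directory(r'E:\Thesis\Try1\Test',
                   target_size = (224, 224),
                   class_mode = 'categorical')   # assumed, matches the loss

history = top_model.fit_generator(train_generator,
                   epochs = 10,                  # assumed
                   validation_data = validation_generator,
                   validation_steps = 1,
                   callbacks = callbacks_list)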

“Early Stopping” is a callback that is used to reduce overfitting by monitoring a validation metric, here the validation accuracy (val_acc). If the monitored value doesn't improve by at least min_delta for 3 consecutive epochs (patience=3), Early Stopping stops the training process.
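
To make the patience rule concrete, here is a small plain-Python sketch of the stopping logic. It is not the Keras implementation: the function name early_stopping_epoch and the validation accuracies are made up for illustration, and only the min_delta and patience values are taken from the callback above.

```python
def early_stopping_epoch(val_scores, patience=3, min_delta=0.0001):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0  # epochs since the last real improvement
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one value per epoch
val_acc = [0.70, 0.78, 0.81, 0.81, 0.80, 0.81, 0.85]
stop_epoch = early_stopping_epoch(val_acc)
print(stop_epoch)
```

In this made-up run, training stops before the last value is ever reached, which shows the trade-off of a small patience: it saves time but can cut off a late improvement.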

Hope you have learnt something about Transfer Learning.

Cheers 🙂

AutoKeras – a tough competitor for Google’s AutoML

Auto-Keras is an open source software library for automated machine learning (AutoML). It is developed by DATA Lab at Texas A&M University and community contributors. The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. Auto-Keras provides functions to automatically search for architecture and hyperparameters of deep learning models.

We can create deep learning models in just 4 lines of code:

import autokeras as ak
clf = ak.ImageClassifier()
clf.fit(x_train, y_train)
results = clf.predict(x_test)

Simple, right? A preview version has been released, and the final official release is still awaited.

I'm sure this will help newcomers to deep learning get their feet wet by creating complex deep learning models with ease.

TensorFlow 2.0 !!!!

TensorFlow has become the world’s most widely adopted machine learning framework, catering to a broad spectrum of users and use-cases. In this time, TensorFlow has evolved along with rapid developments in computing hardware, machine learning research, and commercial deployment.

The latest we hear from Martin Wicke is that :

TensorFlow 2.0 is coming with major updates!

Main features of TensorFlow 2.0 include:

  • Eager Execution – which makes TensorFlow easier to learn and apply.
  • Support for more platforms and languages.
  • Removal of deprecated APIs.

Another major change concerns “tf.contrib”, which will no longer be distributed as part of TensorFlow from the 2.0 release onwards.

A preview version is expected later this year (2018).

Handling Imbalanced Classes in the Dataset

What is an Imbalanced Dataset?

A dataset may contain an uneven number of samples (instances) per class, which can push the algorithm towards the majority class and produce a misleadingly high accuracy every time you run the model. For example, take a simple dataset with 4 features, a target (output) feature with 2 classes and 100 instances in total. If 80 of those instances belong to category1 of the target feature and only 20 belong to category2, training and prediction are biased towards category1. Such a dataset is called an imbalanced dataset.
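
A quick way to see why accuracy alone is misleading here is to score a "model" that always predicts the majority class. The sketch below uses the 80/20 split from the example above; the variable names are just for illustration.

```python
# 80 instances of category1 and 20 of category2, as in the example
labels = ["category1"] * 80 + ["category2"] * 20

# A "model" that ignores its input and always predicts the majority class
predictions = ["category1"] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# How many of the minority instances were actually found?
minority_total = labels.count("category2")
minority_found = sum(p == t == "category2" for p, t in zip(predictions, labels))
minority_recall = minority_found / minority_total

print("accuracy:", accuracy)
print("minority recall:", minority_recall)
```

The accuracy looks respectable even though not a single category2 instance is recognised, which is why it helps to look at per-class results on imbalanced data.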

Let's get our hands dirty by exploring an imbalanced dataset and the measures for handling imbalanced classes.

For this example we take a dataset with 7 features and a target variable, so the dataset has 8 columns in total.

First, we read the dataset with the “read_csv” method and print its head:

import pandas as pd

file = pd.read_csv("../input/ecoli.csv")
file.head()


   Mcg   Gvh   Lip  Chg   Aac  Alm1  Alm2     Class
0  0.49  0.29  0.48  0.5  0.56  0.24  0.35  positive
1  0.07  0.40  0.48  0.5  0.54  0.35  0.44  positive
2  0.56  0.40  0.48  0.5  0.49  0.37  0.46  positive
3  0.59  0.49  0.48  0.5  0.52  0.45  0.36  positive
4  0.23  0.32  0.48  0.5  0.55  0.25  0.35  positive

Next, we need to find how many categories there are in the target variable “Class”:

file['Class'].describe()

count          220
unique           2
top       positive
freq           143
Name: Class, dtype: object

As you can see, there are two unique categories in the “Class” feature. Now we need the exact counts of the two categories:

f = file.groupby("Class")
f.count()

          Mcg  Gvh  Lip  Chg  Aac  Alm1  Alm2
Class
negative   77   77   77   77   77    77    77
positive  143  143  143  143  143   143   143

It's pretty clear that the target feature in our dataset has more “positive” instances than “negative” ones.

We can visualize this in a histogram. To do that, we first need to convert the “object” type of Class to int:

file['Class'] = file['Class'].map({'positive': 1, 'negative': 0})


The histogram makes this easy to see: our dataset has more positive instances (1's) than negative instances (0's).

Before training our model, we need to find the most important features in our dataset, so that we can discard the features that do not contribute to the overall accuracy of the model. To do that, we use the “RandomForest” classifier.

from sklearn.ensemble import RandomForestClassifier
feature_labels = ['Mcg','Gvh','Lip','Chg','Aac','Alm1','Alm2']
clf = RandomForestClassifier()
# The fit call was truncated in the original; fitting on all seven feature columns is assumed
model = clf.fit(file[feature_labels], file['Class'])
for feature in zip(feature_labels, model.feature_importances_):
    print(feature)

('Mcg', 0.11586269275979075)
('Gvh', 0.012807906652840087)
('Lip', 0.0)
('Chg', 0.0)
('Aac', 0.0117212198350282)
('Alm1', 0.48041880476655613)
('Alm2', 0.3791893759857849)

As you can see, the features ‘Chg’ and ‘Lip’ contribute nothing to the model (importance 0.0). So we can drop them and keep only the remaining features.

new_file = file[['Mcg','Gvh','Aac','Alm1','Alm2','Class']]

    Mcg   Gvh   Aac  Alm1  Alm2  Class
0  0.49  0.29  0.56  0.24  0.35      1
1  0.07  0.40  0.54  0.35  0.44      1
2  0.56  0.40  0.49  0.37  0.46      1
3  0.59  0.49  0.52  0.45  0.36      1
4  0.23  0.32  0.55  0.25  0.35      1
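
The same selection can also be made programmatically instead of by reading the list. The sketch below reuses the importance values printed earlier; the cut-off of 0.0 is an assumption chosen for this example, not a general rule.

```python
# Feature importances copied from the random forest output above
importances = {
    'Mcg': 0.11586269275979075,
    'Gvh': 0.012807906652840087,
    'Lip': 0.0,
    'Chg': 0.0,
    'Aac': 0.0117212198350282,
    'Alm1': 0.48041880476655613,
    'Alm2': 0.3791893759857849,
}

threshold = 0.0  # assumed cut-off: keep every feature that contributes at all

# Dictionaries keep insertion order, so the original column order is preserved
selected = [name for name, score in importances.items() if score > threshold]
print(selected)
```

A higher threshold would also drop weak features such as ‘Gvh’ and ‘Aac’; whether that helps is something to check on the test set.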

Now, to make things clear, we split our dataset into train and test sets and evaluate the model to see how it produces biased results. Let's dive in:

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
train, test = train_test_split(new_file,test_size=0.2)
features_train = train[['Mcg','Gvh','Aac','Alm1','Alm2']]
features_test = test[['Mcg','Gvh','Aac','Alm1','Alm2']]
labels_train = train.Class
labels_test = test.Class
print(train.shape)
print(test.shape)

(176, 6) 
(44, 6)

We split our dataset into train (80%) to train our model and test (20%) to evaluate it. So we train our model with 176 samples and test it on 44 samples.

Now it's time to train our model using the “RandomForest” classifier:

clf = RandomForestClassifier()
model = clf.fit(features_train, labels_train)
print("Accuracy of Randomforest Classifier:",clf.score(features_test,labels_test))

Accuracy of Randomforest Classifier: 1.0

As explained previously, the RandomForest classifier reports an accuracy of 100%, which is biased by the fact that there are more positive instances than negative ones (143 positive and 77 negative).

To handle this, we have two approaches:

  1. Over Sampling
  2. Under Sampling

Over Sampling:

Over sampling means adding samples to the minority class (repeated or synthetic instances) until it is equal in size to the majority class.


before sampling: Counter({1: 111, 0: 65})

after sampling: Counter({1: 111, 0: 111})

Note: compare the counts of 1's and 0's before and after sampling.

Under Sampling:

Under sampling means sampling the majority class down until it is equal in size to the minority class.


before sampling: Counter({1: 111, 0: 65})

after sampling: Counter({0: 65, 1: 65})
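
Both ideas can be sketched with the standard library alone. This is plain random re-sampling, written only to reproduce the counts shown above; the function names and the integer "samples" are placeholders invented for this example.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Repeat randomly chosen minority samples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of every class, as large as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for cls in counts:
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for s in rng.sample(pool, target):
            out_samples.append(s)
            out_labels.append(cls)
    return out_samples, out_labels

# Placeholder data with the same class counts as above: 111 ones and 65 zeros
labels = [1] * 111 + [0] * 65
samples = list(range(len(labels)))

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print("after over sampling:", Counter(over_labels))
print("after under sampling:", Counter(under_labels))
```

Over sampling keeps every original instance at the cost of duplicates, while under sampling throws majority data away.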

There are several algorithms for over sampling and under sampling. The ones we use here are:

Over Sampling Algorithm:

  1. SMOTE – “Synthetic Minority Over-sampling Technique”. A sample is taken from the minority class, and new synthetic instances similar to it are created. These synthetic instances are then added to the original dataset, and the new dataset is used to train the classification models.
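
The core idea of SMOTE, creating a new point somewhere on the line between a minority sample and one of its nearest minority neighbours, can be sketched in NumPy. This is a simplified illustration, not the imbalanced-learn implementation; the function name smote_sketch, the value k=3 and the toy points are assumptions.

```python
import numpy as np

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))                  # pick a minority sample
        distances = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(distances)[1:k + 1]      # k nearest, skipping the sample itself
        j = rng.choice(neighbours)                       # pick one of those neighbours
        gap = rng.random()                               # random position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority class (assumed): the four corners of the unit square
minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
new_points = smote_sketch(minority, n_new=5)
print(new_points.shape)
```

Because every synthetic point lies between two existing minority points, the new instances stay inside the region the minority class already occupies.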

Under Sampling Algorithm:

  1. RandomUnderSampler – Random under sampling balances the class distribution by randomly eliminating majority class examples. This is done until the majority and minority class instances are balanced out.
  2. NearMiss – selects the majority class samples whose average distance to the three closest minority class samples is the smallest.
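
The NearMiss rule described above can be sketched in a few lines of NumPy. This is a simplified illustration of that one rule, not the imbalanced-learn implementation; the function name near_miss_sketch and the toy points are assumptions.

```python
import numpy as np

def near_miss_sketch(majority, minority, n_keep, k=3):
    """Keep the n_keep majority samples closest, on average, to their k nearest minority samples."""
    majority = np.asarray(majority, dtype=float)
    minority = np.asarray(minority, dtype=float)
    # Distance from every majority sample (rows) to every minority sample (columns)
    distances = np.linalg.norm(majority[:, None, :] - minority[None, :, :], axis=2)
    # Average distance to the k closest minority samples
    avg_k = np.sort(distances, axis=1)[:, :k].mean(axis=1)
    keep = np.argsort(avg_k)[:n_keep]
    return majority[keep]

# Toy data (assumed): two majority points sit close to the minority cluster, two far away
majority = [[0, 0], [5, 5], [1, 1], [9, 9]]
minority = [[0, 1], [1, 0], [0.5, 0.5]]
kept = near_miss_sketch(majority, minority, n_keep=2)
print(kept)
```

Only the majority points near the class boundary survive, so the classifier is trained on the cases that are hardest to tell apart.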

With these short descriptions you should have the overall picture of the sampling algorithms. Let's implement them in our code and check the accuracy of our model:

from collections import Counter
# Recent imbalanced-learn releases expose borderline SMOTE as BorderlineSMOTE
# and use fit_resample; older releases used SMOTE(kind='borderline1').fit_sample
from imblearn.over_sampling import BorderlineSMOTE
X_resampled, y_resampled = BorderlineSMOTE(kind='borderline-1').fit_resample(features_train, labels_train)
print("before sampling:", Counter(labels_train))
print("after sampling:", Counter(y_resampled))

before sampling: Counter({1: 115, 0: 61})
after sampling: Counter({1: 115, 0: 115})

Now the counts of Positive and Negative classes are equal, as we over-sampled the Negative class to match the counts of the positive class using the SMOTE sampling algorithm. Now, let’s train our model again with the re-sampled data and evaluate.

clf1 = RandomForestClassifier()
clf1.fit(X_resampled, y_resampled)
print("Accuracy:", clf1.score(features_test, labels_test))

Accuracy: 0.9545454545454546

The accuracy has dropped to about 95%, which is more reasonable than the biased accuracy of 100%.

And that's it! We have learnt what imbalanced classes in a dataset are and how to handle them in practice using various sampling algorithms. 🙂

Data Mining

What is Data Mining?

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

Data Mining Process:

  1. Develop understanding of application, goals
  2. Create dataset for study (often from Data Warehouse)
  3. Data Cleaning and Preprocessing
  4. Data Reduction and projection
  5. Choose Data Mining task
  6. Choose Data Mining algorithms
  7. Use algorithms to perform task
  8. Interpret and iterate through 1-7 if necessary
  9. Deploy: integrate into operational systems.

As you can see, the core steps of data mining are steps 4 to 8. Discussing the data mining process leads us to an important data mining methodology called “CRISP-DM”.

CRISP-DM stands for “Cross Industry Standard Process for Data Mining”, a 6-phase model of the entire data mining process, from start to finish, that is broadly applicable across industries for a wide array of data mining projects.

Here is a short description of each of the 6 phases:

  1. Business Understanding – Identifying the project objectives
  2. Data Understanding – Collect and review data
  3. Data Preparation – Select and clean data
  4. Modelling – Manipulate data and draw conclusions
  5. Evaluation – Evaluate model
  6. Deployment – Apply conclusions to business

Introduction to Pandas

In this blog, you will get to know how the pandas library in Python works, with hands-on examples.

Pandas is one of the most powerful toolkits for data manipulation and analysis, built on top of NumPy.

Pandas has two main data structures:

  1. Series
  2. DataFrame

Series

A Series is a 1-dimensional (1-D) labelled array.

import pandas as pd
obj = pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

As you can see, “obj” holds the values together with their index, and the dtype of the values is “int64”. It's as simple as that to create a Series object.

Now we can do some basic arithmetic operations, like:

Adding two series objects:

x = pd.Series([2, 4, 6, 8, 10])
y = pd.Series([1, 3, 5, 7, 9])
add = x + y

0     3
1     7
2    11
3    15
4    19
dtype: int64

In the same way, we can do other arithmetic operations such as subtraction, multiplication, division and modulo.
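
For completeness, here is a short sketch of those other operations on the same two Series. The variable names are just for illustration; every operation works element by element, matched on the index.

```python
import pandas as pd

x = pd.Series([2, 4, 6, 8, 10])
y = pd.Series([1, 3, 5, 7, 9])

difference = x - y   # element-wise subtraction
product = x * y      # element-wise multiplication
quotient = x / y     # element-wise division (gives float64 values)
remainder = x % y    # element-wise modulo

print(difference.tolist())
print(product.tolist())
print(remainder.tolist())
```

Note that division returns floating-point values even when both inputs are integers.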

Another exciting feature of Series is that you can easily convert a Python dictionary (dict) into a Series object, as below:

data = {'India': 5000, 'America': 2500, 'Europe': 1000}
seriesobj = pd.Series(data)


India      5000
America    2500
Europe     1000
dtype: int64

We can also check whether any value in the Series object is null using the isnull() function:

seriesobj.isnull()

India      False
America    False
Europe     False
dtype: bool

As you can see, the result of the above operation is a Series of Boolean values. Series is super easy and flexible to use.

DataFrame

A DataFrame, on the other hand, is a 2-dimensional structure with rows and columns that represents tabular, spreadsheet-like data.

Creating a data frame is as simple as below:

import numpy as np

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

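# List the columns explicitly so their order matches the output below on every pandas version
f = pd.DataFrame(exam_data, index=labels, columns=['attempts', 'name', 'qualify', 'score'])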


   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
d         3      James      no    NaN
e         2      Emily      no    9.0
f         3    Michael     yes   20.0
g         1    Matthew     yes   14.5
h         1      Laura      no    NaN
i         2      Kevin      no    8.0
j         1      Jonas     yes   19.0

We can play with dataframes using different functions and methods. For example, to get the basic information about a dataframe, we can use the “info()” method:

f.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
attempts    10 non-null int64
name        10 non-null object
qualify     10 non-null object
score       8 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

Now that you are familiar with creating a data frame, we can play with “sub-setting / slicing” it.

Sub-setting is a powerful indexing feature with which we can select and exclude rows and variables (feature columns) of the data frame. We can subset / slice a data frame in various ways:

a. Sub-setting by specifying the number of rows

First 3 rows of the dataframe:

f.head(3)

   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5

b. Sub-setting using the column names

f_new = f[['name','score']]

  name       score
a Anastasia   12.5
b Dima        9.0
c Katherine   16.5
d James       NaN
e Emily       9.0
f Michael     20.0
g Matthew     14.5
h Laura       NaN
i Kevin       8.0
j Jonas       19.0

c. Sub-setting only the rows [1, 3, 5, 6] of specific columns from the data frame.

f.iloc[[1, 3, 5, 6]][['name', 'score']]

  name     score
b Dima      9.0
d James     NaN
f Michael   20.0
g Matthew   14.5

d. Sub-setting based on logical conditions

Selecting the rows with 'score' values between 15 and 20 (both inclusive):

f[f['score'].between(15, 20)]

   attempts       name qualify  score
c         2  Katherine     yes   16.5
f         3    Michael     yes   20.0
j         1      Jonas     yes   19.0

Selecting the rows with 'attempts' < 2 and 'score' > 15:

f[(f['score']>15) & (f['attempts']<2)]

   attempts   name qualify  score
j         1  Jonas     yes   19.0

As you can see, the data frame is a powerful and flexible way to work with structured data. We can also explore some more features of data frames, like adding and dropping rows and columns.

a. Adding a new row to the data frame:

f.loc['k'] = [1,"Suresh",'yes',15.5]

   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
d         3      James      no    NaN
e         2      Emily      no    9.0
f         3    Michael     yes   20.0
g         1    Matthew     yes   14.5
h         1      Laura      no    NaN
i         2      Kevin      no    8.0
j         1      Jonas     yes   19.0
k         1     Suresh     yes   15.5

b. Dropping the newly added row in the data frame

f = f.drop('k')

   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
d         3      James      no    NaN
e         2      Emily      no    9.0
f         3    Michael     yes   20.0
g         1    Matthew     yes   14.5
h         1      Laura      no    NaN
i         2      Kevin      no    8.0
j         1      Jonas     yes   19.0

c. Dropping the columns from the data frame.

f = f.drop('attempts', axis=1)

    name      qualify   score 
a  Anastasia    yes      12.5
b  Dima         no       9.0
c  Katherine    yes      16.5
d  James        no       NaN
e  Emily        no       9.0
f  Michael      yes      20.0
g  Matthew      yes      14.5
h  Laura        no       NaN
i  Kevin        no       8.0
j  Jonas        yes      19.0 

d. Adding a new column to the data frame.

color = ['Red','Blue','Orange','Red','White','White','Blue','Green','Green','Red']
f['color'] = color

        name qualify  score   color
a  Anastasia     yes   12.5     Red
b       Dima      no    9.0    Blue
c  Katherine     yes   16.5  Orange
d      James      no    NaN     Red
e      Emily      no    9.0   White
f    Michael     yes   20.0   White
g    Matthew     yes   14.5    Blue
h      Laura      no    NaN   Green
i      Kevin      no    8.0   Green
j      Jonas     yes   19.0     Red

With all this, I hope you have learnt something about the Pandas library and how it helps data analysts with data manipulation and analysis. This is just the beginning and there is lots more to come. You can get your hands dirty by looking at the official documentation for Series and DataFrame.