## Introduction to Pandas

In this blog, you will learn how the pandas library in Python works, with practical examples.

Pandas is one of the most powerful toolkits for data manipulation and analysis, and it is built on top of NumPy.

Pandas has two main data structures:

1. Series

2. DataFrame

### Series:

A Series is a 1-dimensional (1-D) array of values, where each value has an index label.

Example:

```
import pandas as pd

obj = pd.Series([1, 2, 3, 4, 5])
print(obj)

Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
```

As you can see, the “obj” variable holds an array of “int64” values, and each value gets a default integer index starting at 0. Creating a Series object is as simple as that.

Now we can do some basic arithmetic operations, like:

Adding two series objects:

```
x = pd.Series([2, 4, 6, 8, 10])
y = pd.Series([1, 3, 5, 7, 9])
add = x + y
print("Add:")
print(add)

Output:
Add:
0     3
1     7
2    11
3    15
4    19
dtype: int64
```

In the same way, we can perform other arithmetic operations such as subtraction, multiplication, division, and modulo.
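As a minimal sketch of those other operations, the example below reuses the same two Series. The variable names `diff`, `prod`, `quot`, and `rem` are only illustrative.

```python
import pandas as pd

x = pd.Series([2, 4, 6, 8, 10])
y = pd.Series([1, 3, 5, 7, 9])

# Each operator works element by element, pairing values by index label.
diff = x - y   # subtraction
prod = x * y   # multiplication
quot = x / y   # true division, so the result dtype is float64
rem = x % y    # modulo

print(diff.tolist())           # [1, 1, 1, 1, 1]
print(prod.tolist())           # [2, 12, 30, 56, 90]
print(quot.round(2).tolist())  # [2.0, 1.33, 1.2, 1.14, 1.11]
print(rem.tolist())            # [0, 1, 1, 1, 1]
```

Note that division returns floats even when both inputs are integers.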

Another useful feature of Series is that you can easily convert a Python dictionary (dict) into a Series object, as below:

```
data = {'India': 5000, 'America': 2500, 'Europe': 1000}
seriesobj = pd.Series(data)
print(seriesobj)

Output:
India      5000
America    2500
Europe     1000
dtype: int64
```
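Because the dictionary keys become the index labels, values can be looked up by label much like in the original dict. The sketch below assumes the same `data` dictionary. The threshold of 1500 is an arbitrary value chosen for illustration.

```python
import pandas as pd

data = {'India': 5000, 'America': 2500, 'Europe': 1000}
seriesobj = pd.Series(data)

# Look up a single value by its label.
print(seriesobj['India'])            # 5000

# The index holds the dictionary keys, in insertion order.
print(seriesobj.index.tolist())      # ['India', 'America', 'Europe']

# Boolean filtering keeps the labels of the matching values.
large = seriesobj[seriesobj > 1500]
print(large.index.tolist())          # ['India', 'America']
```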

We can also check whether any values in the Series object are missing (“NULL”) using the isnull() method:

```
seriesobj.isnull()

Output:
India      False
America    False
Europe     False
dtype: bool
```

As you can see, the result of the above operation is a Series of “Boolean” values. Series is easy and flexible to use.
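To see isnull() return True, the Series needs at least one missing value. The sketch below is one way to produce one: it passes an explicit index containing a label, 'Africa', that is not a key of the dictionary. Both that label and the name `with_gap` are hypothetical and used only for illustration.

```python
import pandas as pd

data = {'India': 5000, 'America': 2500, 'Europe': 1000}

# 'Africa' is not a key in `data`, so pandas fills its value with NaN.
with_gap = pd.Series(data, index=['India', 'America', 'Europe', 'Africa'])

print(with_gap.isnull().tolist())   # [False, False, False, True]
print(with_gap.isnull().sum())      # 1 missing value
```

Summing the Boolean result is a quick way to count the missing entries.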

### DataFrame:

A DataFrame, on the other hand, is a 2-dimensional data structure with rows and columns that represents tabular, spreadsheet-like data.

Creating a data frame is as simple as below:

```
import numpy as np
import pandas as pd

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
             'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
             'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
             'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# Set the column order explicitly so the output below is reproducible;
# recent pandas versions otherwise keep the dictionary's key order.
f = pd.DataFrame(exam_data, index=labels,
                 columns=['attempts', 'name', 'qualify', 'score'])

print(f)

Output:
   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
d         3      James      no    NaN
e         2      Emily      no    9.0
f         3    Michael     yes   20.0
g         1    Matthew     yes   14.5
h         1      Laura      no    NaN
i         2      Kevin      no    8.0
j         1      Jonas     yes   19.0
```

We can work with data frames using many different functions and methods. For example, to get basic information about a data frame, we can use a method called “info()”. The exact layout of its summary varies between pandas versions.

```
f.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
attempts    10 non-null int64
name        10 non-null object
qualify     10 non-null object
score       8 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
```
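Beyond info(), a few attributes and methods give the same kind of overview piece by piece. The sketch below uses a smaller, hypothetical frame named `f_small` rather than the full exam data, so it runs on its own.

```python
import numpy as np
import pandas as pd

# A small, hypothetical frame in the same style as the exam data.
f_small = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'James'],
     'score': [12.5, 9.0, np.nan],
     'attempts': [1, 3, 3]},
    index=['a', 'b', 'd'])

print(f_small.shape)             # (3, 3): rows, columns
print(f_small.dtypes)            # one dtype per column
print(f_small.isnull().sum())    # number of missing values per column
print(f_small['score'].mean())   # 10.75, because NaN is skipped by default
```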

Now that you are familiar with creating a data frame, we can move on to “sub-setting / slicing” data frames.

### Subsetting:

Sub-setting is a powerful indexing feature that lets us select or exclude rows and columns (variables / features) of the data frame. We can subset / slice a data frame in several ways:

a. Sub-setting by specifying number of rows

```
# First 3 rows of the data frame
f[:3]

Output:
   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
```
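The same kind of row selection can also be written with head(), tail(), and iloc. The sketch below uses a small, hypothetical frame `f_small` with made-up scores, so it is self-contained.

```python
import pandas as pd

# Hypothetical data, used only to illustrate row selection.
f_small = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'Katherine', 'James'],
     'score': [12.5, 9.0, 16.5, 7.0]},
    index=['a', 'b', 'c', 'd'])

first_two = f_small.head(2)      # same rows as f_small[:2]
last_two = f_small.tail(2)       # the final two rows
by_position = f_small.iloc[:2]   # explicit position-based slice

print(first_two.equals(f_small[:2]))   # True
print(first_two.equals(by_position))   # True
print(last_two.index.tolist())         # ['c', 'd']
```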

b. Sub-setting using the column names

```
f_new = f[['name', 'score']]
f_new

Output:
        name  score
a  Anastasia   12.5
b       Dima    9.0
c  Katherine   16.5
d      James    NaN
e      Emily    9.0
f    Michael   20.0
g    Matthew   14.5
h      Laura    NaN
i      Kevin    8.0
j      Jonas   19.0
```

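c. Sub-setting only the rows at positions [1,3,5,6] for specific columns of the data frame.

```
# .ix was removed in pandas 1.0, so look up the row labels by position
# with f.index and pass labels to .loc.
f.loc[f.index[[1, 3, 5, 6]], ['name', 'score']]

Output:
      name  score
b     Dima    9.0
d    James    NaN
f  Michael   20.0
g  Matthew   14.5
```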
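Selecting rows and columns together is usually done with loc (by label) or iloc (by integer position). The sketch below shows both returning the same subset. The frame `f_small` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data, used only to compare the two indexers.
f_small = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'Katherine', 'James'],
     'score': [12.5, 9.0, 16.5, 7.0]},
    index=['a', 'b', 'c', 'd'])

# loc takes index labels and column names.
by_label = f_small.loc[['b', 'd'], ['name', 'score']]

# iloc takes integer positions for both rows and columns.
by_position = f_small.iloc[[1, 3], [0, 1]]

print(by_label.equals(by_position))   # True
```

Keeping labels and positions in separate indexers avoids the ambiguity that the older, mixed-mode indexing had.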

d. Sub-setting based on some Logical Conditions

```
# Select the rows with 'score' values between 15 and 20 (both inclusive)
f[f['score'].between(15, 20)]

Output:
   attempts       name qualify  score
c         2  Katherine     yes   16.5
f         3    Michael     yes   20.0
j         1      Jonas     yes   19.0
```

```
# Select the rows with 'attempts' < 2 and 'score' > 15
f[(f['score'] > 15) & (f['attempts'] < 2)]

Output:
   attempts   name qualify  score
j         1  Jonas     yes   19.0
```
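Conditions can also be combined with `|` (or) and negated with `~` (not). The sketch below uses a hypothetical four-row frame `f_small`. Each condition needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

# Hypothetical subset of exam-style data.
f_small = pd.DataFrame(
    {'name': ['Katherine', 'Michael', 'Kevin', 'Jonas'],
     'score': [16.5, 20.0, 8.0, 19.0],
     'attempts': [2, 3, 2, 1]},
    index=['c', 'f', 'i', 'j'])

# Rows where either condition holds.
either = f_small[(f_small['score'] > 19) | (f_small['attempts'] < 2)]

# Rows where the condition does NOT hold.
negated = f_small[~(f_small['score'] > 15)]

print(either.index.tolist())    # ['f', 'j']
print(negated.index.tolist())   # ['i']
```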

As you can see, the data frame is powerful and flexible for working with structured data. We can also explore some more features of the data frame, like adding and dropping rows and columns.

a. Adding a new row to the data frame:

```
# The values must follow the column order: attempts, name, qualify, score
f.loc['k'] = [1, "Suresh", 'yes', 15.5]
f

Output:
   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
d         3      James      no    NaN
e         2      Emily      no    9.0
f         3    Michael     yes   20.0
g         1    Matthew     yes   14.5
h         1      Laura      no    NaN
i         2      Kevin      no    8.0
j         1      Jonas     yes   19.0
k         1     Suresh     yes   15.5
```

b. Dropping the newly added row in the data frame

```
f = f.drop('k')
f

Output:
   attempts       name qualify  score
a         1  Anastasia     yes   12.5
b         3       Dima      no    9.0
c         2  Katherine     yes   16.5
d         3      James      no    NaN
e         2      Emily      no    9.0
f         3    Michael     yes   20.0
g         1    Matthew     yes   14.5
h         1      Laura      no    NaN
i         2      Kevin      no    8.0
j         1      Jonas     yes   19.0
```

c. Dropping the columns from the data frame.

```
# Pass the axis by keyword; the positional form was removed in pandas 2.0.
f = f.drop('attempts', axis=1)
f

Output:
        name qualify  score
a  Anastasia     yes   12.5
b       Dima      no    9.0
c  Katherine     yes   16.5
d      James      no    NaN
e      Emily      no    9.0
f    Michael     yes   20.0
g    Matthew     yes   14.5
h      Laura      no    NaN
i      Kevin      no    8.0
j      Jonas     yes   19.0
```
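drop() returns a new data frame and leaves the original untouched, which is why the examples above assign the result back to `f`. The sketch below shows this on a hypothetical frame `f_small`, using the `index=` and `columns=` keywords to drop a row and a column in one call.

```python
import pandas as pd

# Hypothetical data, used only to illustrate drop().
f_small = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'Katherine'],
     'score': [12.5, 9.0, 16.5],
     'attempts': [1, 3, 2]},
    index=['a', 'b', 'c'])

# Drop row 'b' and column 'attempts' together.
trimmed = f_small.drop(index=['b'], columns=['attempts'])

print(trimmed.index.tolist())     # ['a', 'c']
print(trimmed.columns.tolist())   # ['name', 'score']
print(f_small.shape)              # (3, 3): the original is unchanged
```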

d. Adding a new column to the data frame.

```
color = ['Red', 'Blue', 'Orange', 'Red', 'White', 'White', 'Blue', 'Green', 'Green', 'Red']
f['color'] = color
f

Output:
        name qualify  score   color
a  Anastasia     yes   12.5     Red
b       Dima      no    9.0    Blue
c  Katherine     yes   16.5  Orange
d      James      no    NaN     Red
e      Emily      no    9.0   White
f    Michael     yes   20.0   White
g    Matthew     yes   14.5    Blue
h      Laura      no    NaN   Green
i      Kevin      no    8.0   Green
j      Jonas     yes   19.0     Red
```
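A new column can also be computed from an existing one instead of being supplied as a list. The sketch below assumes a hypothetical pass mark of 10 and a hypothetical column name `passed`; neither comes from the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only to illustrate a derived column.
f_small = pd.DataFrame(
    {'name': ['Anastasia', 'Dima', 'James'],
     'score': [12.5, 9.0, np.nan]},
    index=['a', 'b', 'd'])

# Hypothetical pass mark of 10. A NaN score compares as False.
f_small['passed'] = f_small['score'] >= 10

print(f_small['passed'].tolist())   # [True, False, False]
```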

With all of this, I hope you have gained some understanding of the Pandas library and how it helps data analysts with data manipulation and analysis. This is just the beginning and there is a lot more to come. You can get your hands dirty by looking at the official documentation of Series (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) and DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).