
Python For Data Science/Machine Learning

- Vidya Kurada

aCubeIT
Python for Data Science/Machine
Learning
Python has become an essential part of the learning process for the data science community. I
am planning to cover this topic as a series of articles, carefully covering all the
essential topics with sufficient examples to understand the respective concepts. Broadly,
the series will cover:

 Why choose Python for data science?
 What are the essential libraries?
 Concepts on the numerical computing library – NumPy
 Concepts on the data processing library – Pandas
 Concepts on the visualization library – Matplotlib

In the present article, I would like to give some insight into how the need to develop these
libraries evolved and why Python is the preferred choice in the data science community.

How did the need to develop Python libraries evolve?

Before machine learning engineer or data scientist became job titles in industries like
retail, banking, and manufacturing, machine learning algorithms were mostly
mathematical research tools. Academicians, statisticians and research scholars used these
algorithms to validate their results. When machine learning was only a laboratory
tool, statistical programming languages like R and numerical computing environments
like MATLAB served the purpose.

These languages are easy to understand and well suited to mathematically intensive work. But when machine
learning algorithms started their journey into real-time business solutions, data
scientists and software developers had no common interface to work with.

The business solution cycle looked like this:

1. Programmer: Got the business data out using query languages
2. Data Scientist: Took the data, did the necessary analysis, found the model
3. Programmer: Deployed the model in real time, did the predictions

Since machine learning models keep learning from data, the cycle repeated:

4. Programmer: Got the data out again
5. Data Scientist: Added the new data to the algorithms, improved the model
6. Programmer: Re-deployed the model in real time

This became a monotonous cycle. There was no one-stop solution, no single
programming language that could do it all. It was then that a few data scientists and programmers
developed powerful statistical and data visualization libraries on top of Python and its then
existing libraries.

They chose Python because it is a high-level language that is:

 Easy to understand for a data scientist.
 Able to handle the different data structures that a developer needs.

These contributions from the data science and programming communities, combined with the
advantage of being open source, made Python a powerful language for machine learning.

Why Python for data science over other languages?

This is a much debated question across industries. As mentioned earlier, Python stands out
as a one stop solution. It can be used across different platforms such as web development,
Windows applications and machine learning, and as a general purpose programming language.

 Availability of data science libraries: The success of Python can be attributed to the
availability of open source data science libraries. A few significant libraries include:

o NumPy: A numerical computing library that performs mathematically intense
operations on multi-dimensional arrays/matrices.

o Pandas: Built on top of NumPy. It is essentially a database-style query
interface, crafted to support advanced numerical computations on datasets.

o Matplotlib: A 2D plotting library. It is a data visualization toolkit which gives
histograms, bar charts, scatter plots and many more. It also appeals to the
scientific community where frequency domain analysis like spectral analysis is
preferred.

These are only a few of the most popular libraries. I plan to discuss these libraries
as part of future articles in the aforementioned series.

 Extreme scalability and speed: Python is a high level language, so being the fastest is
not what it promises. But for the data science community, it is much faster than high end
computing tools like MATLAB or R. More than speed, scalability is its strong suit.
Anyone can develop an end to end application using Python.


 Extensive community support: For any open source language, community support is the
key to its success. It helps new aspirants quickly resolve the problems they face. It also
helps develop more sophisticated libraries with ease.
 Easy to learn for non-computer science graduates: This is primarily because of the
flexibility of the language. The syntax is closer to English semantics than that of most other
programming languages. Apart from that, the community has been instrumental in
creating extensive course materials which are accessible and easy to understand.

With this I conclude the article. There is only so much one can take in at a time, particularly
when we are new to a subject. So hang in there; I will come back with detailed notes on Python
libraries for data science/machine learning in the upcoming articles.

Articles next in the series:

In the immediate future, we will see comprehensive beginner tutorials on:

 NumPy in parts
 Pandas in parts
 Matplotlib in parts

I would like to take this further ahead, but this is our immediate future.

Next! NumPy (Numerical Python Library) for beginners

I have provided a link to install the Anaconda distribution of Python in the next section. Install it if you
do not already have it in place.

Installing Anaconda:

The Anaconda distribution is the most trusted one for data science. It is an open
source distribution.
Anaconda installation is described in detail on its web page. Just make sure that, while installing,
you install Python 3.x; typically, the latest version of Python 3 available to you.

Link for the same:


Anaconda Install
Python for data sciences: NumPy for
Beginners
This is the first set of articles in the Python for data science series. Previously, we discussed
'why Python is a suitable language for learning data science'.

From this article, we start learning Python for data science. We are going to start with the
NumPy library, short for Numerical Python library. NumPy is a library used for performing
arithmetic operations as well as linear algebraic operations and other mathematical
operations on arrays. Since the mathematics involved in machine learning
algorithms is inherently linear algebra, almost all the machine learning packages like SciPy
(Scientific Python) and scikit-learn, and the data pre-processing library Pandas, are built on
top of NumPy.

In this set of articles, keeping in mind the beginners in data science, we will cover
the following. I plan to divide this NumPy for beginners into 3 parts.

PART 1:
1. What is a NumPy array?
2. How to create and inspect NumPy arrays?
3. Array indexing

PART 2:
4. Array manipulations and Operations
5. What is broadcasting?

PART 3:
6. Speed test: Lists Vs NumPy array

The advantages of a NumPy array over a list are its speed and the compact nature of the code. At
the end, in part 3 of the NumPy series, we will do a small speed test and see how much faster an
arithmetic operation is performed on NumPy arrays than on lists or, for that matter, any other data
structure. I preferred to keep it for the last, so that the reader can appreciate it.

In this course, we will be using the Jupyter notebook as our editor. So let’s start!

Installing NumPy:

I assume you already have Python installed on your laptop. If not, check our previous
article to see how we install Anaconda. If you have Anaconda, you can simply install
NumPy from your terminal or command prompt using:

conda install numpy

If you do not have Anaconda on your computer, install NumPy from your terminal using:

pip install numpy

Once you have NumPy installed, launch your Jupyter notebook and get started.
Don't worry; we will keep things easy and simple. The articles are meticulously written for
conceptual understanding together with hands-on practice.
NumPy for Beginners: PART - 1
How to create and inspect NumPy Array
This is the first in set of 3 parts of NumPy tutorials.

PART 1:
1. What is a NumPy array?
2. How to create and inspect NumPy arrays?
3. Array indexing/Slicing

PART 2:
4. Array Operations
5. What is broadcasting?

PART 3:
6. Speed test: Lists Vs NumPy array

Here we start with the very basics: what a NumPy array is. By the end of this
part, we get an idea about:

 What are the different ways of creating a NumPy array?
 How to inspect an array object?
 Why do we inspect an array?

What is a NumPy array?


The most basic object in NumPy is an ndarray, or simply an array. "ndarray" means n-
dimensional array. It is a homogeneous array, which means all the elements of the array
have the same data type. Typically, the data type will be numeric in nature (float or integer).

Before learning how to create a NumPy array, let us see what an array looks like. The most
common arrays are one dimensional (1-D) or two dimensional (2-D). One dimensional (1-D)
arrays are nothing but vectors. Two or more dimensional arrays in the context of linear
algebra are called matrices.

A typical array looks like this. In NumPy terminology, for 2-D arrays:

 axis = 0 refers to the rows
 axis = 1 refers to the columns
Let’s start by importing NumPy into our Jupyter notebooks.

import numpy as np

np is just an alias. One can use any other alias, but np is the standard one.

Creating NumPy array:


One can create NumPy arrays in multiple ways. A NumPy array can be created from:

 Other Python data structures like lists and tuples
 Built-in functions
 Simply by giving the values

There are several built-in functions available to create NumPy arrays to make things easy for
us. Let us have a look at them one by one.

Creating NumPy arrays from Lists and Tuples:


The most frequently used syntax for creating an array is np.array.

In the example, we simply converted a Python list or tuple to a NumPy array object,
and it is a one dimensional (1-D) array. Similarly, to create a 2-D array from a list or tuple, we
should have a list of lists, a tuple of tuples or a tuple of lists.
Only then does NumPy understand that we need a 2-D array. Can we give a
list of tuples as an input? Please do check that.
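The original notebook cells are not reproduced here, so below is a minimal sketch of what they might contain (the exact values are assumptions):

import numpy as np

# 1-D array from a Python list
a = np.array([1, 2, 3])

# 1-D array from a tuple
b = np.array((4, 5, 6))

# 2-D array from a list of lists (or a tuple of tuples)
c = np.array([[1, 2, 3], [4, 5, 6]])
print(c.shape)        # (2, 3)

# A list of tuples also works and gives a 2-D array
d = np.array([(1, 2), (3, 4)])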

Or the syntax might simply be like this for (1-D) and (2-D) arrays.

Creating NumPy arrays using built-in functions: Ex: np.ones() and np.zeros()

There are many built in functions available for creating NumPy arrays. The most common
built-in functions used to initialize are listed here. These functions can only be used when
you know the size of the array.

Try the above syntaxes and see what each one results in. In the initialization of 'a', the
tuple (5,3) passed as a parameter indicates that we want to generate a matrix of all ones with 5
rows and 3 columns.
By default, an array will have a float data type; while initializing 'b', we explicitly
mentioned the data type 'int' as a parameter. np.zeros(), used to initialize 'c', is a similar
function which gives a matrix of all zeros.
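A hedged sketch of what those initializations might look like (the variable names follow the description above):

# 'a': 5 x 3 matrix of all ones (float by default)
a = np.ones((5, 3))

# 'b': all ones with an explicit integer data type
b = np.ones((5, 3), dtype=int)

# 'c': matrix of all zeros
c = np.zeros((5, 3))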

Creating NumPy arrays using built-in functions: np.arange()

np.arange() is similar to Python's built-in range() function. Let us create a NumPy array using
arange(). arange(start, stop, step) typically takes 3 parameters, similar to range(). The third
parameter, step, is optional; if we do not mention any step size, NumPy takes step = 1 as the
step size. But in the present query we specified step = 5, so arange() generated numbers
from 10 to 100 with a difference of 5.

Along with arange(), an additional function, reshape(), is used here. As the name suggests,
reshape() helps in changing the dimensions of an existing array. reshape() is an array
manipulation function.

Note 1: Observe carefully: it created an array with values between 10 and 95, because, as in Python generally,
values run only up to n-1 and the stop value is excluded.

Note 2: reshape() can only convert to dimensions which, when multiplied, equal the total number of
elements. In the example above, numbers contained 18 elements and hence can be reshaped to a 3x6
matrix, whose product is 18. Can we reshape 'numbers' into any other dimensions?
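A minimal sketch of the arange() and reshape() cells described above (assuming the range 10 to 100 in steps of 5):

# 18 values: 10, 15, ..., 95 (the stop value 100 is excluded)
numbers = np.arange(10, 100, 5)

# Reshape the 18 values into a 3 x 6 matrix (3 * 6 == 18)
numbers_2d = numbers.reshape(3, 6)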
Creating NumPy arrays using built-in functions: np.linspace()

The linspace() function returns numbers evenly spaced over a specified interval. Say
we want 15 evenly spaced points from 1 to 3; we can easily use:

This gives us a one dimensional vector. Unlike the arange() function, which takes the third
argument as the step size, linspace() takes the third argument as the number of
data points to be created.
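A sketch of that call:

# 15 evenly spaced points from 1 to 3 (both end points included)
points = np.linspace(1, 3, 15)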

Creating identity matrices using built-in functions: np.eye():

In linear algebra, identity matrices and their properties are widely used. An identity matrix is a square matrix
with diagonal elements equal to one and all the rest zeros. np.eye() usually takes a
single argument. Here's how we create one.

The identity matrix is used everywhere in linear algebra. Whenever you try to matrix-multiply
two dimensionally incompatible arrays, a simple thing to do would be first to transform the
matrices with the help of an identity matrix (np.eye()).
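A sketch of creating one:

# 4 x 4 identity matrix: ones on the diagonal, zeros elsewhere
identity = np.eye(4)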

Creating random number arrays using built-in functions:


The random number generator is a separate package in NumPy. We have to call np.random
before asking for a particular type of random number to be generated. The syntax for
different types of random numbers would be np.random.rand(), np.random.randn() or
np.random.randint(). Each syntax calls for random numbers from a particular distribution.
For example, when we use randn(), it calls for normally distributed random numbers with
mean = 0 and standard deviation (std) = 1. The randint() used in the syntax above
distributes integers uniformly, here from 1 to 9, across the matrix.
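A hedged sketch of these three generators (the shapes are assumptions):

# Uniform random numbers in [0, 1), shape 2 x 3
u = np.random.rand(2, 3)

# Standard normal random numbers (mean 0, std 1)
n = np.random.randn(2, 3)

# Uniformly distributed integers from 1 to 9 (the upper bound 10 is excluded), shape 3 x 3
m = np.random.randint(1, 10, (3, 3))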

Apart from the methods mentioned above, there are a few more NumPy functions that
you can use to create special NumPy arrays:

 np.full(): Create a constant array of any number 'n'
 np.tile(): Create a new array by repeating an existing array a particular number
of times
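For example (a small assumed sketch):

# 2 x 3 array where every element is 7
f = np.full((2, 3), 7)

# Repeat an existing array 3 times
t = np.tile(np.array([1, 2]), 3)     # array([1, 2, 1, 2, 1, 2])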

Inspecting the structure and content of arrays:

Typically, any real-world problem to which we need to apply ML algorithms will have thousands
to lakhs of rows and hundreds of columns. So it is helpful to inspect the structure of arrays.
We cannot make any sense of the data merely by printing it, and doing so is time consuming
too. There are a few built-in functions to quickly inspect arrays.

Let's say you are working with a moderately large array of size 1000 x 300.

Here, we cannot make sense of the data merely by displaying 1000 x 300 random numbers.
We instead use simple functions like:

 shape: Gives an idea of how many (rows, columns) there are in a given array.
 dtype: To get the data type of the array. (Remember, we discussed that an array will
have the same data type for all its elements.)
 ndim: To get the dimensionality of the array.
 itemsize: To get the memory size of each array element, in bytes.
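A short sketch of these inspection attributes on such an array:

arr = np.random.rand(1000, 300)

print(arr.shape)      # (1000, 300) -> rows, columns
print(arr.dtype)      # float64 -> data type shared by all elements
print(arr.ndim)       # 2 -> number of dimensions
print(arr.itemsize)   # 8 -> memory size of each element, in bytes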
These attributes are the basis for the inspection functions we use in the Pandas library, which we
will get into soon.

While pre-processing data in data science projects, it becomes part of the process to inspect the
data every time we make a data transformation. Let me elaborate. We generally get
unclean data, which means some column values are missing or rows are duplicated. Every
time we delete a duplicate row or fill an empty column, we inspect the array object as part of
pre-processing.

Summary: In this part, we learnt 5 to 6 ways of creating an array object and how to inspect
arrays. This extensive look at array creation will help make your work simpler and
faster.

Next, Part 2 of the series talks about 'Array Indexing' and 'Array Manipulations and
Operations'.
NumPy for Beginners: PART - 2
Array Indexing and Operations
This is the second in set of 3 parts of NumPy tutorials.

PART 1:
1. What is a NumPy array?
2. How to create and inspect NumPy arrays?
3. Array indexing/Slicing

PART 2:
4. Array Operations
5. What is broadcasting?

PART 3:
6. Speed test: Lists Vs NumPy array

In this part (Part 2), let us see:

 How to take parts of arrays out
 How to manipulate an array
 What kinds of logical, arithmetic and mathematical operations one can perform

Let us cut to the chase and start working!

Array Indexing/Slicing:

Array slicing is similar to that of other data structures in Python. We simply pass the index we
want and get an element or a group of elements out. As in regular Python, elements of an
array are indexed from 0 to n-1.

1-D Array Slicing:

One dimensional (1-D) array slicing is the same as for Python lists. So try these syntaxes to get
some practice. ' : ' is used to get a range of values, just like in lists. Example: suppose we
have '2:5'; NumPy interprets this as a request to pull out elements from the 2nd up to the (5-1 = 4)th
element.
Observe the notation [2:] used in the second cell in the picture. It asks to print every
element from the third element onwards. Try guessing the results for the remaining cells.
Hints are already given in the cells themselves.
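A sketch of such 1-D slices (the array values are assumptions):

arr_1d = np.arange(10, 20)

print(arr_1d[2:5])   # elements at indices 2, 3 and 4
print(arr_1d[2:])    # every element from index 2 (the third element) onwards
print(arr_1d[:4])    # the first four elements
print(arr_1d[-1])    # the last element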

2-D Array Slicing:

Multidimensional arrays are indexed using as many indices as the number of dimensions or
axes. For instance, to index a 2-D array, you need two indices - array[x, y]. In [x, y], x is for
rows and y is for columns. Each axis has an index starting at 0. The following figure shows the
axes and their indices for a 2-D array.

Let us create a 2-D array and do some slicing. The first cell in the picture creates a 2-D array.
Now say we want a specific element or part of the given array. It is done using square
brackets and a comma, '[ , ]'. The comma (' , ') separates row slicing from column slicing.
Now start guessing the results for the given syntaxes and compare them with the results. ' : '
without a range is used to retrieve all the elements of a particular row or
column.

Guess the outputs for the following:
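Since the original cells are not shown here, a minimal assumed sketch of 2-D slicing:

arr_2d = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

print(arr_2d[1, 2])      # single element: row 1, column 2 -> 7
print(arr_2d[0, :])      # the entire first row
print(arr_2d[:, 2])      # the entire third column
print(arr_2d[1:3, 0:2])  # rows 1-2 and columns 0-1 (a 2 x 2 sub-array)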

Operations on NumPy Arrays:

On NumPy arrays, we can perform almost all the mathematical and logical operations one can
perform on Python data structures like lists and tuples.

On top of that, we can extensively perform linear algebra and trigonometry calculations on
array objects. In fact, the purpose of NumPy is to provide scientific computing ability to
python.

The learning objectives of this part of the article are broadly classified as:

 Manipulating arrays
 Mathematical and logical operations on arrays

Manipulating arrays:

Reshaping arrays: The function np.reshape() was already discussed when creating an array with
arange(), but it is more appropriate to discuss it here. So let us look at reshape() at length. By
definition, reshape() is used to transform an array from one particular dimension to
another dimension.
The only limitation is that the product of the dimensions of the given array should equal the
product of the dimensions of the transformed array. For example, if we have a (5,4) array, we
can transform it into a new array of dimensions such as
(2,10), (10,2), (1,20) or (20,1), because 5*4 equals 2*10 and so on.

Let us see a few syntaxes to understand.

In this example, an array of values 0 to 11 is taken and reshaped in 3 different ways. You may find
the last one really interesting. If we know only the number of rows we want, we can simply
give reshape(4,-1); NumPy understands that we want 4 rows and automatically calculates the
columns using the product rule we discussed earlier. Try the syntax in the third cell, replacing
rows with columns.

Think about what would be the easier way to reshape columns to rows and vice versa.
Connect the dots with matrices.
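An assumed sketch of those three reshapes:

numbers = np.arange(12)        # values 0 to 11

a = numbers.reshape(3, 4)      # 3 rows, 4 columns
b = numbers.reshape(2, 6)      # 2 rows, 6 columns
c = numbers.reshape(4, -1)     # 4 rows; NumPy works out that 3 columns are needed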

Stacking arrays: Stacking is done using the np.hstack() and np.vstack() methods. For
horizontal stacking, the number of rows should be the same, while for vertical stacking,
the number of columns should be the same.
Try these. While vstack() places array_2 below array_1, hstack() places the array in the
second argument to the right of the other. Note: Arrays should be passed as a list or tuple of arrays.
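A small assumed sketch of both kinds of stacking:

array_1 = np.array([[1, 2], [3, 4]])
array_2 = np.array([[5, 6], [7, 8]])

v = np.vstack((array_1, array_2))   # array_2 placed below array_1 -> shape (4, 2)
h = np.hstack((array_1, array_2))   # array_2 placed to the right  -> shape (2, 4)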

Logical operations on arrays: We can also perform conditional and logical selections on
arrays using the & (AND), | (OR), <, > and == operators to compare the values in the array with
a given value.

In the first cell, we have taken values ranging from 5 to 14 in array_logical. When we asked
whether array_logical > 10, the result is a boolean array telling whether each element is
greater than 10. Try the syntaxes in the second cell. But before that, guess what
the result will be, to check your understanding.
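An assumed sketch of that logical selection:

array_logical = np.arange(5, 15)           # values 5 to 14

print(array_logical > 10)                   # boolean array from element-wise comparison
print(array_logical[array_logical > 10])    # only the elements greater than 10
print((array_logical > 6) & (array_logical < 12))   # conditions combined with & (AND)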

Mathematical operations on arrays:

Basic Arithmetic Operations: No introduction is needed for arithmetic operations.
They are simple addition (+), subtraction (-), multiplication (*) and division (/). When two
arrays have the same size, these work just as in regular Python. Try the following.
Note: Do not mistake these for matrix operations. These are element-wise operations, like those we do on
lists or tuples.
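A small sketch of element-wise arithmetic on two equal-sized arrays:

x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

print(x + y)   # element-wise addition
print(y - x)   # element-wise subtraction
print(x * y)   # element-wise multiplication (not matrix multiplication)
print(y / x)   # element-wise division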

Linear Algebraic Operations: NumPy provides the np.linalg package to apply common
linear algebra operations, such as:

 np.linalg.inv: Inverse of a matrix
 np.linalg.det: Determinant of a matrix
 np.linalg.eig: Eigenvalues and eigenvectors of a matrix

These linear algebra functions are largely the basis for most machine learning algorithms.
Access to these kinds of functions is the reason why packages like scikit-learn are built on
top of NumPy. np.dot(a, b) is used to compute matrix multiplication.

Let me give you an example. Linear regression is one of the most cited examples used to explain
machine learning. It says: given the data [X, Y], I can compute the relationship between X and
Y as 'A' in the equation below:

While ML algorithms use statistical techniques to give more precise results, we can simply
compute 'A' by:

Functions like linalg.inv() come in handy here. Not only this; np.linalg.eig(), used to compute
eigenvalues and eigenvectors, is repeatedly used by algorithms to perform principal component
analysis (PCA) or in support vector machines (SVM).
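The original equations are shown as images and are not reproduced here; as a hedged illustration, the classical least-squares estimate of 'A' (assuming Y = XA, so A = (X^T X)^(-1) X^T Y) uses exactly these functions:

# Assumed example data: 100 observations, 3 features
X = np.random.rand(100, 3)
Y = np.dot(X, np.array([2.0, -1.0, 0.5])) + 0.1 * np.random.randn(100)

# A = (X^T X)^(-1) X^T Y, computed with np.linalg.inv() and np.dot()
A = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, Y))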

Apply user defined functions: We can also apply our own functions to arrays, for example
applying the function x/(x+1) to each element of an array. One way to do that is by looping
through the array, which is the non-NumPy way.
We would rather write vectorised code. The simplest way to do that is to
vectorise the function you want and then apply it to the array. NumPy provides the
np.vectorize() method to vectorise functions.

Let's look at how we do it.
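The original cell is not shown here; a minimal sketch of vectorising x/(x+1):

# The function we want to apply to every element
def f(x):
    return x / (x + 1)

vec_f = np.vectorize(f)          # vectorised version of the function

arr = np.array([1, 2, 3, 4])
print(vec_f(arr))                # f applied element-wise, without an explicit loop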

These kinds of functions come in handy when we are doing any new calculation on an array.

Universal Functions:

Like any other programming language, NumPy has access to universal functions.

Among these, functions like sum() and std() are repeatedly used while pre-processing
data in data science projects.

Summary: In this part, we saw simple ways to:

 Slice parts of arrays
 Perform simple to complex mathematical operations using built-in functions

You are doing well. We are almost done.

In the next part (Part 3) we have an interesting topic specific to NumPy: broadcasting.
For people who work extensively with arrays/matrices, broadcasting is a gift. As a finishing touch,
we will do a speed test of lists vs NumPy arrays.
NumPy for Beginners: PART – 3
Broadcasting and Lists Vs Arrays Speed
test
This is the third and final in set of 3 parts of NumPy tutorials.

PART 1:
1. What is a NumPy array?
2. How to create and inspect NumPy arrays?
3. Array indexing/Slicing

PART 2:
4. Array Operations
5. What is broadcasting?

PART 3:
6. Speed test: Lists Vs NumPy array

In this part, we will learn

 What is broadcasting, with a simple example?
 How fast are arithmetic computations on arrays compared to lists?

Broadcasting:
In general, arrays of different dimensions cannot be added or subtracted. NumPy has a
smart way to overcome this problem: it duplicates the smaller array to match
the size of the higher dimensional array and then performs the operation.

Ex: Say we want to add array([3]) to array([1,2,3]). By simply giving array([3]) + array([1,2,3]),
NumPy understands that the idea is to add [3] to every element of [1,2,3]. It
duplicates the value [3] as many times as needed to match the larger array, in this case
array([3,3,3]), and then performs the addition.

To simplify, broadcasting is the name given to the method that NumPy uses to allow array
arithmetic between arrays with different shapes or sizes. To quote the exact definition from
scipy.org:

"The term broadcasting describes how numpy treats arrays with different shapes during arithmetic
operations. Subject to certain constraints, the smaller array is 'broadcast' across the larger array
so that they have compatible shapes."

We can take 3 types of examples to efficiently convey the concept of broadcasting. Let us
see examples first and try to interpret them.
In the first cell, we added a scalar value (a single value) to a vector (a 1-D array).
As discussed earlier, the value of the scalar 'b' is duplicated so that both array dimensions
are equal, and then the addition is performed. Observe the output to see that each value is
incremented by the value b = 2.

Similarly, in the second cell, we added a scalar value to a 2-D array. Until now, the smaller
array is a single value, hence broadcasting works without any limitations. But the actual
purpose of broadcasting is to add 2 arrays each of dimension greater than 1. Let us see how that
works and what the limitations are.

The first example here is row-wise broadcasting. This means each row in 'a_2d' is
added to the 1-D array 'b_1d'. This happened because 'b_1d' is a unit row vector.
Similarly, the second example is column-wise broadcasting.
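The original cells are not reproduced here; an assumed sketch of the cases described:

a_1d = np.array([1, 2, 3])
b = 2
print(a_1d + b)                   # scalar broadcast over a 1-D array -> [3 4 5]

a_2d = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(a_2d + b)                   # scalar broadcast over a 2-D array

b_1d = np.array([10, 20, 30])
print(a_2d + b_1d)                # row-wise broadcasting: b_1d added to every row

b_col = np.array([[100], [200]])
print(a_2d + b_col)               # column-wise broadcasting: b_col added to every column

# Incompatible shapes: neither dimension matches, so NumPy raises a ValueError
# np.ones((2, 3)) + np.ones((4, 5))   -> ValueError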
Broadcasting is an interesting concept, but it comes with its own limitations. Broadcasting
expects at least one dimension (row or column) to be equal in both arrays. Let us
check by giving different dimensions.

If we observe carefully, neither the column nor the row dimensions are equal, and hence the
'ValueError'.

Broadcasting is definitely a handy shortcut but imposes some rules to allow the operations.
The examples we covered here are very basic and are intended for beginners.

You can go ahead and further explore these articles to know more:

Broadcasting Scipy.org
Array broadcasting in NumPy

Speed test: Lists Vs Arrays:

We will keep it extremely simple.
I simply took 10 lakh numbers in two different sequences: 'i' contains the plain numbers and 'j' the
squares of 'i'. I divided 'j' by 'i' element-wise (j/i).

I recorded the time taken to do the same operation on lists and on arrays and computed the
ratio. It shows that NumPy performs about 22 times faster than lists.
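The original timing cells are not shown; a hedged sketch of how such a comparison might be written (the exact ratio will vary by machine):

import time

n = 10**6                                     # 10 lakh numbers

# Lists
list_i = list(range(1, n + 1))
list_j = [x**2 for x in list_i]
t0 = time.time()
list_result = [j / i for i, j in zip(list_i, list_j)]
list_time = time.time() - t0

# NumPy arrays
arr_i = np.arange(1, n + 1)
arr_j = arr_i**2
t0 = time.time()
arr_result = arr_j / arr_i                    # vectorised element-wise division
arr_time = time.time() - t0

print("NumPy is roughly", round(list_time / arr_time), "times faster")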

Now, I leave it to your imagination what it would be like to run computations as heavy as
ML algorithms over lists and not arrays.

Some reasons for such difference in speed are:

 NumPy is written in C, which is what is basically executed behind the scenes
 NumPy arrays are more compact than lists, i.e. they take much less storage
space than lists

Go through these discussions if you want to read further, for some more examples:

Why are numpy arrays so fast?
Why numpy instead of python lists?

Well! This is all the NumPy you need to kick-start your journey through Python for
data science.

Congratulations! You are one step closer.

Next in line is the Pandas library: a data pre-processing library built on top
of NumPy. We will discuss more data science examples when we discuss it.
Python for data Science – Pandas for
beginners
This is the second set of articles in the Python for data science series. Earlier we discussed
'why Python for data science?' and completed a set of tutorials on 'NumPy for beginners'. If
you do not know anything about NumPy, I would recommend learning
NumPy first, but it is not mandatory.

'Pandas' is the most popular Python library for data science. We can think of it as a well-
equipped querying language used for exploratory data analysis (EDA). The Pandas library is
built on top of NumPy. The idea is to make all the arithmetic and complex
mathematical operations one can perform on NumPy arrays accessible through a querying
language.

This set of tutorials/articles is divided into 6 parts.

1. Creating Pandas data structures


2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Concat
5. Arithmetic Operations
6. Grouping and Summarizing

In this course, we will be using Jupyter Notebook as our editor. I particularly suggest using
Jupyter Notebook from here on because, when you start data science projects, it gives easy
access to the results you want to see while doing the analysis.

So let’s start!

Installing Pandas:

I assume you already have Python installed on your laptop. If not, check the previous
article, where I shared a link to install Anaconda. If you have Anaconda, you can simply install
Pandas from your terminal or command prompt using:

conda install pandas

If you do not have Anaconda on your computer, install Pandas from your terminal using:

pip install pandas

Once you have Pandas installed, launch your Jupyter notebook and get started.
Don't worry; we will keep things easy and simple. The articles are meticulously written for
conceptual understanding together with hands-on practice. In between, I will try to show how we
use these concepts while doing data pre-processing in data science projects.

Next! First tutorial: Creating Pandas data structures (Part 1)


Pandas for beginners: Part -1
Creating Pandas data structures
This is the first in the set of 6 parts of Pandas tutorials.

1. Creating Pandas data structures


2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Append
5. Arithmetic Operations
6. Grouping and Summarizing

Here we start from basics and learn through hands-on:

  What are Pandas data structures?


 How do we create one?

What are Pandas data structures?


There are 2 main data structures in Pandas: Series and DataFrames. These are the basic
objects we work with in the Pandas library.

Series: A Series is a one dimensional (1-D) NumPy array (vector) with an index. The main difference is
that a Series carries labels for its elements, which a plain array does not. The index is nothing but an
access label. Simply put, we can name rows as per our interest.

Dataframe: A DataFrame is a two dimensional (2-D) array where each column can have a
different datatype. We saw in NumPy that an array can only take a single datatype for all
its columns or elements. Similar to Series, a DataFrame takes access labels. We can label both rows
and columns here. Typically, one can imagine it as a table.

Now that we know what they are, let us open our Jupyter Notebooks and start by importing
NumPy and Pandas packages.

import pandas as pd
import numpy as np

'pd' and 'np' are the standard aliases used. You can use any other aliases, or just import without an
alias.

We will briefly look at Series object first and then get into dataframes. Eventually, you will
get to work largely on dataframe objects since manipulating dataframes efficiently and
quickly is the most important skill set if you choose python for your data science projects.
Creating Series:
The basic syntax for creating a Series object is: series_1 = pd.Series(data, index)

While calling the Series function, 'S' is always uppercase. 'index' is optional and a
default index is generated when it is unspecified. A series can be created taking input data as a:

 Dictionary, list, tuple (any Python data structure)
 Array (a NumPy object)
 Scalar value

Creating basic series object:


Let us create some series and look at the important properties.

We can observe that, while creating our first Series object, we passed our data as a list; so
it takes another data structure/array object as the argument for 'data'. Try giving a tuple. Since
we did not specify any index, a default index was generated. In this example, we are inspecting the
data type of the variable 's': it is an object of class Series, while each element is an 'int'.
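The original cell is not shown; a minimal assumed sketch:

import pandas as pd

# Series from a list; no index specified, so a default 0..n-1 index is generated
s = pd.Series([10, 20, 30, 40])

print(type(s))    # <class 'pandas.core.series.Series'>
print(s.dtype)    # int64 -> each element is an integer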

Creating series with index:


Let us see some examples by including ‘index’ as an argument.
In the first cell of the example, we give each row its own distinct index. You can
observe that, similar to the 'data' argument, the index is also passed as a list. In the second cell, we
passed an array object as data and the built-in function range() as the index. This shows the level
of flexibility Pandas gives while creating an object.
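A hedged sketch of those two cells (the values and labels are assumptions):

# Every row given a distinct label through the index argument
s1 = pd.Series([85, 90, 78], index=['maths', 'physics', 'chemistry'])

# Data as a NumPy array, index from the built-in range() function
s2 = pd.Series(np.array([1, 2, 3, 4]), index=range(4))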

Creating date series:


Next, we will see an example that might have some interesting applications.

There is a function date_range() to create a series of dates. We simply give the start and end
dates as arguments and the function creates a sequence of dates. Our input argument has
dates in the form 'MM-DD-YYYY', but the output is in the standard form 'YYYY-
MM-DD'. Replace end='11-16-2017' with periods=3, freq='M' and try to understand the
outcome.
In the next part, you can see that I used the same dates as the index for another series: we passed
one series as the index for another series. Try this! Create two different series and pass one as
an index to the other.
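A sketch of that (the start date is an assumption):

# Sequence of dates between two end points; dates come out as YYYY-MM-DD
dates = pd.date_range(start='11-12-2017', end='11-16-2017')

# Using the date sequence as the index of another series
attendance = pd.Series([1, 0, 1, 1, 1], index=dates)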

Note: This example is a simple way to show how custom indexing can help in data analysis.
Suppose you have a student's attendance sheet with no dates. Simply adding a date series as the index
will help you understand the behaviour of any student in attending class.

Creating series from dictionaries:

While creating a series using a dictionary, Pandas takes each 'key' of the dictionary as
a label in the index. Try giving an external index argument as well and see what happens.
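A minimal assumed sketch:

# Dictionary keys become the index labels of the series
marks = pd.Series({'maths': 85, 'physics': 90, 'chemistry': 78})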

Creating DataFrames:
The syntax for creating a dataframe object is: df = pd.DataFrame(data, index, columns, dtype)

While creating a 'DataFrame', 'D' and 'F' are uppercase. We can give one or more
arguments, the basic argument being 'data'. Here, 'index' is the same as in Series, while
'columns' takes a list of values for the column labels. 'columns' also takes default values of 0 to n-1
when unspecified.

A dataframe can be created taking input data:

 From other data structures in Python and NumPy:
   Dictionaries, lists, tuples (Python data structures)
   Lists of lists, tuples of tuples (to create 2-D dataframes)
   1-D or 2-D NumPy arrays
   Series

 From external sources like:
   Excel spreadsheets
   .csv extension files
   .txt extension files
   Direct web sources
   Many others

Since, Pandas is largely used as a data pre-processing tool, it provides the ability to read
data from several file types to dataframes directly.

Creating dataframes from dictionaries:

We see that a DataFrame is nothing but a combination of Series objects, where each column can
take a different data type. What would be the data type of each column in this dataframe?

It looks just like a table with columns and rows. It is important to understand that,
commonly, each row is a different observation and each column is an attribute.
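The original dictionary is not shown; a hedged sketch with illustrative column names:

# Each dictionary key becomes a column; each column can have its own data type
df_dict = pd.DataFrame({
    'Name': ['Asha', 'Ravi', 'John'],
    'Age': [25, 31, 28],
    'Visa-type': ['H1B', 'L1', 'H1B']
})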

Creating dataframes from lists:


Can you observe any difference in syntax between lists and dictionaries? While creating a
dataframe from lists, we need to give the data row-wise, unlike dictionaries, where it is given
column-wise. Here, column names have to be given as a separate argument.

Try creating dataframes from tuples and series. Hint: it is just like creating dataframes from lists.
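A hedged sketch of the row-wise version (same illustrative columns as above):

# With lists, data is given row-wise; column names go in a separate argument
rows = [['Asha', 25, 'H1B'],
        ['Ravi', 31, 'L1']]
df_from_lists = pd.DataFrame(rows, columns=['Name', 'Age', 'Visa-type'])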

Now that we saw how to create a simple dataframe from other data structures, we will go
ahead and see how we read data from other sources such as excel, text or web pages
directly into dataframes.

Creating dataframes from ‘.csv’, ‘.txt’ files:

Typically, at any given time while working on a data science project, we work with data
having lakhs of rows and hundreds of columns. So it is far more likely that we read data from
other files than that we build a dictionary and convert it into a dataframe.

The most commonly used syntax is:

df = pd.read_csv(‘Path/filename.file_extension’, delimiter, index_col, names)

In the given syntax, pd.read_csv() is the most commonly used function, not only for .csv
files but also for reading text files and for reading data from
web sources. We can use the 'index_col' argument to say which column to use as the row
label. 'names' is used to manually give column labels.

The delimiter says where to split the data into columns within a given row. For example, if we
have '5|6' and we give the argument delimiter='|', then, while making the dataframe, Pandas
understands that 5 and 6 should be separated into 2 different columns.

I created 3 small files ‘Read_file.csv’, ‘Read_file.xlsx’, ‘Read_file.txt’ for which I will provide
access to, try to upload files while following the notes to get some practice. Remember!
Practice is the key.

Let us see some examples.


In the first cell, we look at a very basic syntax for reading a file by giving 'path + file name'
in quotes (' '). To re-iterate, read_csv() is used to load data from multiple file types (csv,
txt, ...).

In the second example, we read data from a text file using the same syntax and added two
new arguments. You can go and look at the text file provided; I used the symbol ' | ' to
separate data in the text file. Pandas only understands ' , ' as the delimiter by default, so for
every other delimiter we have to provide the argument. In place of delimiter, we can also
use the alias sep='|'.

We added another argument, 'index_col'. We can very well understand from the name that a
column number should be given to make that column the row label/index of the dataframe. Try making
the column 'Name' the index of the dataframe. Lastly, every empty cell in the loaded data file
will be filled with 'NaN' (not a number) while making the dataframe.
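A hedged sketch of the two reads described above (using the file names mentioned earlier):

# Basic read: path and file name in quotes
df_csv = pd.read_csv('Read_file.csv')

# Text file using '|' as the separator; first column used as the row label/index
df_txt = pd.read_csv('Read_file.txt', delimiter='|', index_col=0)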

Creating dataframes from excel files:


While reading an Excel file, read_csv() is replaced by read_excel(). The column names in
the first output are not informative, so we changed the column names/labels using the
argument 'names'. Sometimes we may not know what the most suitable index column is until
the data has been explored a bit. The function set_index() is used to apply an index to the dataframe at
a later time. 'inplace=True' makes the change to the dataframe permanent. Check what
happens if 'inplace=False'.
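A hedged sketch (the file and column names are illustrative, following the visa-applicant example used later):

# Excel file with more informative column labels supplied through 'names'
df_excel_new = pd.read_excel('Read_file.xlsx',
                             names=['Applicant No', 'Name', 'Age', 'Visa-type'])

# Set a column as the index once the data has been explored a bit
df_excel_new.set_index('Applicant No', inplace=True)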

Similarly, we can read data from a webpage using the syntax: df = pd.read_csv('link address').

In summary, we have seen how to create a pandas ‘series’ and ‘dataframes’ from different
sources. We now have the data available in the form understandable to pandas. Now let’s
get started with different concepts involved in data pre-processing.

Next! Second tutorial: Inspecting Pandas dataframes (Part 2)


Pandas for beginners: Part -2
Inspecting Pandas dataframes
This is the second in the set of 6 parts of Pandas tutorials.

1. Creating Pandas data structures


2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Append
5. Arithmetic Operations
6. Grouping and Summarizing

Pre-processing data is one of the most time consuming tasks of any data science assignment.
Data comes in all different forms. It might have some missing values as well as duplicate
rows. It is desirable to act on such data elements while performing exploratory
data analysis (EDA) and before applying machine learning (ML) algorithms. So every time
we fill a value or delete a row, it is advisable to re-visit the properties of the data to verify
whether the changes are reflected as desired.

In this part, we will look at 5 or 6 fundamental built-in functions that can give an overview
of any given dataframe and its properties. Since the 'Series' object is a subset of the dataframe,
we will work only on dataframes and discuss series only when required.

Inspecting dataframes:
Inspecting a dataframe is the most repeated activity while pre-processing data. Most
important ones are: (Guess the output for each one of them)

 shape: Hint: relates to dimensions
 info(): Hint: talks about columns and their data types
 describe(): Hint: descriptive statistics (aggregate functions)
 head() or tail(): Hint: displays the dataframe
 columns

As a starting point, let’s inspect one of the dataframes we created. Once we get familiar
with what each function does, we will see another more realistic data set to understand the
significance of these functions.
Inspecting dataframes: Part-1 - Shape, info(), columns:

Mostly the functions are self-explanatory, but let us discuss them anyway. 'shape' gives the
dimensions (rows, columns) of a given dataframe. In the case of a Series object, since it is a 1-D
object, shape will give the number of rows. Similarly, 'columns' displays all the column
names/labels. 'df_excel_new.values' is another such attribute; guess the output and then try it.

info() displays a collection of information. It gives us the column names, data types, number of
non-null values and the memory usage of the dataframe. With one look at the output of info(), we can
answer questions like 'How large is the dataframe?', 'Which columns have missing
values?' and 'Which column has the largest number of missing values?'

This kind of information helps us decide whether to drop a column or fill the values
while performing exploratory data analysis (EDA). For example, say we have a column 'X'
where 4 out of 5 values are 'NaN'. In such cases, it is appropriate to drop the column,
since 80% of the data is missing.
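A short sketch of these calls (df_excel_new is the illustrative dataframe name used above):

print(df_excel_new.shape)     # (rows, columns)
print(df_excel_new.columns)   # all column labels
df_excel_new.info()           # column names, dtypes, non-null counts, memory usage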

Inspecting dataframes: describe()


The describe() function by default picks up all the columns with numeric data types (int, float,
...) and summarizes the distribution of each such column. In our dataframe, we have only the
'Age' column with an integer datatype.

Inspecting dataframes: head() or tail():

By default, head() displays the first 5 rows. If we specify an argument like head(2) or
head(20), it displays the requested number of rows from the top. tail() does the same from
the last row. head() is typically the first function that is called after a dataframe is created;
it gives us an idea of what our dataframe looks like.
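A small sketch of both:

df_excel_new.describe()   # summary statistics for the numeric columns (here 'Age')
df_excel_new.head()       # first 5 rows
df_excel_new.tail(2)      # last 2 rows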

Now let us pick a dataset where we can appreciate the significance of these functions.
This is the global superstore data, which is commonly used for learning visualization. We
can download the data
from http://www.tableau.com/sites/default/files/training/global_superstore.zip. Here, I want to
see how big the data is. A simple 'df_superstore.shape' shows it has more than 50
thousand rows, so there is no point in displaying the entire data set.

One look at the columns and we know this data contains the records of all orders made by
customers across different outlets of the same store in many countries.

Try head() and info() and learn:

 Which column has the highest number of missing values, and can it be dropped?
 Guess what could be made the index in place of 'Row ID'. Remember! It is
good practice to have a unique index for each row.

Let us see what ‘describe()’ function tells us.


describe() simply picked up all the integer and float columns and summarized
their statistics. Now what can we infer?

 Statistics for Row ID and Postal Code are useless. (Common sense!)
 In most cases, profit is around USD 9. (Check the 50th percentile of profit.)
 Not many discounts are running in the stores.

Note: 50th percentile is the median of the data.

Likewise, try and infer insights of your own.

In summary, inspecting a dataframe is an important part of pre-processing and its most repeated
activity. A set of simple functions gives a better idea of the data at
hand than the data itself.

Next! Third tutorial: Indexing and Selecting Pandas dataframes (Part 3)


Pandas for beginners: Part -3
Indexing and Selecting data from a dataframe

This is the third in the set of 6 parts of Pandas tutorials.

1. Creating Pandas data structures


2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Append
5. Arithmetic Operations
6. Grouping and Summarizing

In this part, we will learn

 Regular Python way of indexing a dataframe
 Position based indexing (iloc)
 Label based indexing (loc)

'Pandas' recommends the iloc and loc ways of indexing.

Indexing and Selecting data:

Selecting rows: We look at 2 ways of selecting rows from a dataframe. It is similar to NumPy
arrays or other data structures in Python. While using the pythonic way of indexing, the basic
syntax is df[start_index:end_index].

In the first cell, we selected rows using the default indices. For example, when we request
rows '2:5', Python interprets that as a request to pull out rows from 2 to the (5-1 = 4)th row.

In the second cell, we selected rows using row labels. The basic syntax is df[start_label:
end_label]. While using row labels, Pandas interprets that as a request to retrieve data
including both start_label and end_label. For instance, in the above example, since the
given range is 'Applicant 3':'Applicant 5', it selected the rows with row labels 'Applicant 3',
'Applicant 4' and 'Applicant 5'.
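A sketch of both styles (using the illustrative visa-applicant dataframe):

# Pythonic position-based selection: rows 2, 3 and 4
df_excel_new[2:5]

# Label based selection: both end labels are included
df_excel_new['Applicant 3':'Applicant 5']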

Selecting columns: Selecting columns is done simply by calling the column name. There are two
ways of selecting a single column from a dataframe. The syntax is df['column_name'] or
df.column_name.

While retrieving a single column, it should be noted that Pandas makes it a Series object, since
it has only one column. It is a matter of choice which syntax to use while selecting a single
column. Throughout the tutorial, I will use the syntax df['column name'] to avoid
ambiguity.

To select 2 or more columns, the column names should be given as a list inside the brackets, [[ ' ', ' ' ]].
The syntax is df[['column name 1', 'column name 2']]. Check the type of the object retrieved
with multiple columns.

Try these:
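A sketch of single and multiple column selection:

# Single column: the result is a Series
df_excel_new['Name']

# Two or more columns: pass a list of names; the result is a DataFrame
df_excel_new[['Name', 'Visa-type']]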
Position and label based indexing: (using df.loc and df.iloc)

We have seen some ways of selecting rows and columns from dataframes. Let us now see
some other ways of indexing dataframes that Pandas recommends, since they are more
explicit (and less ambiguous). There are two main ways of indexing here:

1. Position based indexing using df.iloc: 'iloc' stands for integer location based
indexing
2. Label based indexing using df.loc: 'loc' is simply label based indexing.

Using both methods, we will do the following indexing operations on a dataframe:

 Selecting single elements/cells
 Selecting single and multiple rows
 Selecting single and multiple columns
 Selecting multiple rows and columns

The syntaxes are similar to what we discussed earlier; we simply add 'iloc' or 'loc'
before the indexing brackets. These functions make it clear to Pandas which style
of indexing we mean.

Position based indexing using 'iloc': Even when we have labels for our rows and columns,
internally, objects are saved using the traditional style of indexing that Python uses. By
calling 'iloc', we are explicitly asking Pandas to access the dataframe using the
default indexing style of Python/Pandas, from 0 to n-1.

Position based indexing ‘iloc’: Selecting single element/cell

In this example, we are simply selecting the element in the second row and third column by
accessing [1, 2]. Guess which element we access using iloc[2, 4].

Position based indexing ‘iloc’: Selecting single and multiple rows


The general syntax for selecting a row or a set of rows using 'iloc' is df.iloc[i:j, :] or
simply df.iloc[i:j]. 'i:j' refers to the range of rows and ' : ' without a range refers to all columns. This
indexing style is exactly the same as we discussed for NumPy arrays; the only difference is the
use of 'iloc'.

Now let us look at the examples: the first cell refers to selecting the 5th row (a single row) of the
dataframe. The second cell refers to selecting the 1st and 2nd rows.

Remember! This syntax selects only up to the (j-1)th row.

Position based indexing ‘iloc’: Selecting single and multiple columns


Selecting columns is similar to selecting rows. Only the ranges are reversed in the syntax.
The general syntax for selecting a row or a set of rows using ‘iloc’ is ‘df.iloc*: , i : j+’. ‘i : j’
refer to range of columns and ‘ : ’ without range refers to all rows. Try to access Name and Visa-
type columns only.

Position based indexing ‘iloc’: Selecting multiple rows and columns

Here, we are simply selecting a subset of the dataframe. The syntax is df.iloc[i:j, x:y]. In
the example, we are selecting the (1st and 2nd) rows and (3rd and 4th) columns. Try selecting the (3rd
and 5th) rows and the (1st and 5th) columns.
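A combined sketch of iloc-based selection (again on the illustrative dataframe):

df_excel_new.iloc[1, 2]       # single cell: second row, third column
df_excel_new.iloc[4]          # fifth row
df_excel_new.iloc[0:2]        # first and second rows (selects up to j-1)
df_excel_new.iloc[:, 1:3]     # all rows, second and third columns
df_excel_new.iloc[0:2, 2:4]   # a sub-dataframe of rows 0-1 and columns 2-3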

Label based indexing, 'loc': This is a more user friendly syntax than the iloc we just discussed,
simply because we can look at the labels in a dataframe but not the integer index.
Whenever we want a column or a row, it is easy to access it simply by calling it by its label.

The general syntax is df.loc['row label range', 'column label range'].

Label based indexing, 'loc': Selecting a single element/cell

Once we know both the column label and the row label, the syntax is actually straightforward. In
the example, we want to know what the 'Visa-type' of 'Applicant 3' is. The general syntax is
df.loc['row label', 'column label']. Try working on the query in the second cell.
Label based indexing, ‘loc’: Selecting single and multiple rows

Selecting a single row using 'loc' is similar to using 'iloc'; the index in iloc is simply replaced by
a label in loc.
There is a small difference while selecting multiple rows. While using iloc[i:j], rows are
selected up to row j-1, but loc['row label i':'row label j'] works slightly differently: here,
rows are selected including the jth row. We can also select by giving a list of labels.

It is similar in all other cases:

 Selecting single and multiple columns: df.loc[:, 'column label x':'column label y']
 Selecting rows and columns: df.loc['row label i':'row label j', 'column label x':'column
label y']
Try these:
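A sketch of loc-based selection (the row and column labels are illustrative):

df_excel_new.loc['Applicant 3', 'Visa-type']                 # single cell by labels
df_excel_new.loc['Applicant 3':'Applicant 5']                # rows; both end labels included
df_excel_new.loc[:, 'Name':'Age']                            # all rows, a range of columns
df_excel_new.loc[['Applicant 1', 'Applicant 4'], ['Name']]   # lists of labels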

Logic based indexing:


This is an extremely powerful syntax that we use while doing exploratory data analysis
(EDA). It simply selects all rows for which the specified condition is true. The general
syntax is: df[df['column name'] <condition> <value>]

Let us see one more example, which gives more of a data science perspective. In the
'superstore data' we saw earlier, let us look at all the orders that gave a profit of more than 5000
USD.
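A hedged sketch of that query (the column name 'Profit' follows the superstore data):

# All orders with profit greater than 5000 USD
df_superstore[df_superstore['Profit'] > 5000]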

When we work on this query, we understand that:

 There are only 3 orders where the profit is higher than 5000 USD.
 All the products are in the 'sub-category' of 'copiers'. This should give us the idea
that these orders are all bulk orders from companies that need stationery.
 One glance at the 'Sales' and 'Profit' columns tells us that these bulk orders are giving 40
-50% profit in all three cases.

If I have to give some concluding remarks on this analysis, I would say:

 It is better for sales people to target this customer segment, since a smaller volume
of customers provides more profit.
 The 'copiers' sub-category has more bulk orders.

I want to repeat that these are the kinds of queries you repeatedly work on when you have a
data science assignment.

In summary, we have seen different ways of selecting subsets of dataframes. Now we go in the
other direction: we will try to merge and concatenate data in the next section.

Next! Fourth tutorial: Merge and Concat (Part 4)


Pandas for beginners: Part -4
Merge and Concat dataframes
This is the forth in the set of 6 parts of Pandas tutorials.

1. Creating Pandas data structures
2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Concat
5. Arithmetic Operations
6. Grouping and Summarizing

Any business problem needs data from multiple departments to be analysed in order to get
some perspective. Each department uses a different database according to its
function. Every time we do some analysis, it is inevitable that we combine
dataframes from different databases.

For example, engineering college 'X' wants to understand the relation between a student's
'grades' and their 'salary 5 years after passing out'. Student grades are available in the
respective department databases, while their current salaries are available in the alumni
database. In order to do this analysis, we have to combine both databases.

In this tutorial, we will learn how to ‘merge’ and ‘concatenate’ multiple dataframes.

Merging dataframes: ‘Merge’


The basic syntax is: df_merged = pd.merge(df_1, df_2, how='inner/outer', on='Column
name')

Merge: ‘inner’

Let us look at an example:


In the example with the visa-applicant data (df_excel_new), we want to add a 'Period of stay' for
each applicant. This data is available in a different dataframe, 'df'. We merged both
dataframes using the argument how='inner'. This works the same as an 'inner join' in MySQL. The
word 'inner' makes sure that only data common to both dataframes appears in the
resultant dataframe.

The other argument, on='Applicant No', gives the common column on which the dataframes
are merged.

Merge: ‘Outer’

With the argument how='outer', we combine the dataframes including all rows from both of them.
Please observe that, if an 'Applicant No' is not present in one of the dataframes, the
values are populated as 'NaN'. Check 'Applicant 6' to understand the result.
Similarly, we can perform how='left' and how='right' merges. Try them and interpret the
results.
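A hedged sketch of both merges (assuming 'df' holds the 'Applicant No' and 'Period of stay' columns):

# If 'Applicant No' was set as the index earlier, bring it back as a column first
df_left = df_excel_new.reset_index()

# Inner merge: only applicants present in both dataframes
df_inner = pd.merge(df_left, df, how='inner', on='Applicant No')

# Outer merge: all applicants from both dataframes; missing values become NaN
df_outer = pd.merge(df_left, df, how='outer', on='Applicant No')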

Concatenating dataframes: ‘concat’


The basic syntax is: pd.concat([df1, df2], axis = 0/1)

pd.concat takes a list of dataframes as its first argument and the axis, defining a row- or column-
wise operation, as the other argument. We can combine more than 2 dataframes at a time.

Note: Repeating row labels is not good practice.

Guess and try the following:

Hint 1: 'axis = 1' performs a column wise operation.
Hint 2: We can concatenate subset dataframes.

Try to find the resemblance of query 1 to 'merge'.
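A small sketch of both axes (df1 and df2 stand for any two dataframes with compatible shapes):

# Row-wise concatenation: df2 stacked below df1
df_rows = pd.concat([df1, df2], axis=0)

# Column-wise concatenation: df2 placed beside df1
df_cols = pd.concat([df1, df2], axis=1)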

In summary, we learnt how to 'merge' and 'concat' two dataframes. 'concat' can be used in
place of 'merge', but the reverse is not true. There is another syntax, df1.append(df2), which
we did not discuss in this tutorial; it works in a similar way to 'concat'.

Next we shall discuss ‘Arithmetic Operations’ on dataframes. This is similar to the


arithmetic operations on NumPy arrays.

Next! Fifth tutorial: Arithmetic Operations (Part 5)


Pandas for beginners: Part -5
Arithmetic Operations
This is the fifth in the set of 6 parts of Pandas tutorials.

1. Creating Pandas data structures


2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Concat
5. Arithmetic Operations
6. Grouping and Summarizing

While doing exploratory data analysis (EDA), we often work with derived metrics. For
example: If we have profit and price, we want to see the profit percentage. If we have cricket
matches won and lost, we might want to look at total matches played. These kinds of
metrics require arithmetic operations to be performed on columns.

Here, I created two simple datasets with electronic product sales; let us see how we can
derive metrics from these dataframes.

Datasets are:

The data is the total sale of 5 important electronic categories in an outlet for the first 2 weeks of
a month.
Note: Observe that we have set 2 columns as labels.

‘add()’ operator:

Suppose we want to calculate the total sales over the 2 weeks. We can simply add the 2
dataframes using df1.add(df2, fill_value=0).

This operation works the same as adding 2 NumPy arrays. The argument 'fill_value' is given to
handle 'NaN' values: if there are any such values, we are asking for them to be replaced with 0. In
our case, there are no such values.
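A sketch of that call (df1 and df2 being the week-1 and week-2 sales dataframes):

# Total sales over the two weeks; any NaN is treated as 0 before adding
df_total = df1.add(df2, fill_value=0)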

div() operator:

Suppose we want to calculate the profit per quantity.

Here, there are 2 new syntaxes to learn. First, we did a column-wise division: we picked up
individual columns from a dataframe and divided one by the other. Second, we added a
new column ('Profit per quantity') to the existing dataframe while performing the operation.
We can also do these kinds of operations on columns from 2 different dataframes. Suppose
we want to see what percentage of the total profit is made in each week:
In this example, we used data from 3 different dataframes to compute the profit percentages.
Try to add this data as additional columns in 'df_sales' without creating a new
dataframe.

Likewise, try to calculate the following:

1. The difference in sales from week 1 to week 2 in each category. Use the 'Total Sale in Lakhs'
column.
2. What is the cost price per quantity? Use the 'Total Sale in Lakhs', 'Profit in lakhs'
and 'Quantity' columns.

While working on machine learning algorithms, these kinds of measurable and meaningful
properties are derived as 'features' to make the algorithms work efficiently.

Apart from these, there are also other operator-equivalent mathematical functions that you
can use on Dataframes. Below is a list of all the functions that you can use to perform
operations on two or more dataframes.

 sub(): -
 mul(): *
 floordiv(): //
 mod(): %
 pow(): **

In summary, we have seen how arithmetic operations are performed on dataframes and
their columns. Next, let us go a bit deeper and see how we do categorical analysis using
groupby and summarization using simple built-in functions.

Next! Sixth tutorial: Grouping and Summarizing (Part 6)


Pandas for beginners: Part -6
Grouping and Summarizing
This is the sixth in the set of 6 parts of Pandas tutorials.

1. Creating Pandas data structures


2. Inspecting dataframes
3. Indexing and Selecting data
4. Merge and Concat
5. Arithmetic Operations
6. Grouping and Summarizing

Grouping and summarizing are some of the most frequently used operations in data
analysis, especially while doing exploratory data analysis (EDA), where comparing
summary statistics across groups of data is common. Grouping together with
summarization is used for answering categorical questions.

For example, in the superstore sales data we are working with, you may want to look at the most
profitable shipping mode or the most sold product category. This kind of information is
captured using 'groupby' and summarization functions like mean(), value_counts(), etc.

Grouping analysis can be thought of as having three parts:

1. Separating the data into groups (e.g. groups of customer segments, product
categories, etc.)
2. Applying a function to each group (e.g. mean or total sales of each customer
segment)
3. Transforming the results into a data structure showing the summary statistics
(Optional, only if we want to further act upon data.)

We will work through this tutorial by answering a few analytical questions. In the
superstore data, let us see which shipping mode has the highest profit.
We answer the question by dividing it into parts. First, I want to know how many shipping
modes there are. unique() helps identify the distinct values in a particular column. Here,
we see there are 4 types of shipping modes. Try to find the distinct categories of
products.

Step 1: Let us divide the data into groups using 'Ship Mode'. 'groupby' does that for
us: it divides the data into categories. It creates a groupby object, which cannot be viewed unless
we apply an aggregate/summary function to it.

Steps 2, 3: Calculate the total amount of profit in each category using sum() and make
a dataframe out of it.

Step 4: Since the question is about the shipping mode with the highest profit, we sorted the values
of the dataframe using sort_values() in descending order. Remember! These are the kinds of
questions we constantly work with while doing EDA in data science projects.
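A hedged sketch of those steps (the column names 'Ship Mode' and 'Profit' follow the superstore data):

# The distinct shipping modes
print(df_superstore['Ship Mode'].unique())

# Group by shipping mode, total the profit in each group, sort highest first
profit_by_mode = (df_superstore.groupby('Ship Mode')['Profit']
                  .sum()
                  .sort_values(ascending=False))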

Aggregating functions are the ones that reduce the dimension of the returned objects.
Some common aggregating functions are tabulated below:

Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
min() Compute min of group values
max() Compute max of group values

Answer the following questions on shipping mode using the aggregate functions
given above.

 Average profit by 'ship mode'
 Get the descriptive statistics of the groups. (Hint: refer to the inspecting dataframes tutorial.)
 Count the values in each group.

Try to interpret the results. Interpreting results develops intuition, which is a much needed skill
while doing these kinds of projects.

Grouping is a very important topic of which we have only covered the basics. Please refer to this
material for an extensive read on the topic.

Well! This is all the Pandas you need to kick-start your journey through Python for data
science.

Congratulations! You are almost there.

Next Series in line is, Matplotlib Library: Data Visualization library.
