Python For Data Science Extended Ebook

Data Science/Machine Learning - Vidya Kurada, aCubeIT

Python for Data Science/Machine Learning
Python has become an essential part of the learning process for the data science community. I am planning to present this topic as a series of articles, carefully covering all the essential topics with sufficient examples to understand the respective concepts. Broadly, the series will cover the core Python libraries for data science and machine learning.
In the present article, I would like to give some insight into how the need to develop these
libraries evolved and why Python is the preferred choice in the data science community.
Before 'machine learning engineer' or 'data scientist' became job titles in industries like retail, banking and manufacturing, machine learning algorithms were mostly mathematical research tools. Academicians, statisticians and research scholars used these algorithms to future-proof their results. When machine learning was only a laboratory tool, statistical programming languages like R and numerical computing environments like MATLAB served the purpose.
These languages are easy to understand but mathematically intensive. But when machine learning algorithms started their journey of providing real-time business solutions, data scientists and software developers had no common interface to work with. Machine learning models, as we know, learn from data.
These contributions from the data science and programming communities, added to the advantage of being open source, made Python a powerful language for machine learning.
As to why Python, this is a much-debated question across industries. As mentioned earlier, Python stands out as a one-stop solution: it can be used across different areas such as web development, Windows applications and machine learning, and as a general-purpose programming language.
These are only a few of the famous libraries in the bucket. I am planning to discuss these libraries as part of the future articles in the aforementioned series.
Extreme scalability and speed: Python is a high-level language, so being the fastest is not what it promises. But for the data science community, it is much faster than high-end computing tools like MATLAB or R. More than speed, scalability is its strong suit: anyone can develop an end-to-end application using Python.
Extensive community support: For any open source language, community support is the
key to its success. It helps new aspirants to quickly resolve the problems they face. It also
helps develop more sophisticated libraries with ease.
Easy to learn for non-computer-science graduates: This is primarily because of the flexibility of the language. Its syntax is closer to English semantics than that of any other programming language. Apart from that, the community has been instrumental in creating extensive course material that is accessible and easy to understand.
With this I conclude the article. There is only so much one can take in at a time, particularly when we are new to a topic. So hang in there; I will come back with detailed notes on Python libraries for data science/machine learning in the upcoming articles.
I have provided a link to install the Anaconda distribution of Python in the next section. Install it if you do not have it already in place.
Installing Anaconda:
The Anaconda distribution is the most trusted one for data science. It is an open source distribution.
Anaconda installation is explained in detail on its web page. Just make sure that, while installing, you install Python 3.x (typically, the latest version of Python 3 available to you).
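Once installed, a quick sanity check from the terminal (assuming the installer added conda and python to your PATH) looks like this:
conda --version     # prints the installed conda version
python --version    # should report Python 3.x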
From this article, we start learning Python for data science. We are going to start with the NumPy library, short for Numerical Python. NumPy is a library used for performing arithmetic, linear algebraic and other mathematical operations on arrays. Since the mathematics involved in machine learning algorithms is inherently linear algebra, almost all the machine learning packages, like SciPy (Scientific Python) and scikit-learn, and the data pre-processing library pandas, are built on top of NumPy.
In this set of articles, keeping in mind beginners in data science, we will cover the following. I plan to divide this NumPy for beginners series into 3 parts.
The advantages of a NumPy array over a list are its speed and the compact nature of the code. At the end, in part 3 of the NumPy series, we will do a small speed test and see how much faster an arithmetic operation is on NumPy arrays than on lists or, for that matter, any other data structure. I prefer to keep it for the last, so that the reader can appreciate it.
In this course, we will be using the Jupyter notebook as our editor. So let’s start!
Installing NumPy:
I assume you have python installed on your laptops already. If not, check our previous
article to see how we install Anaconda. If you have Anaconda, you can simply install
NumPy from your terminal or command prompt using:
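(A typical command, assuming conda is available on your PATH:)
conda install numpy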
If you do not have Anaconda on your computer, install NumPy from your terminal using:
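(Using pip, which ships with most Python installations:)
pip install numpy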
Once you have NumPy installed, launch your Jupyter notebook and get started.
Don't worry; we will take it easy and simple. The articles are meticulously articulated for conceptual understanding together with hands-on practice.
NumPy for Beginners: PART - 1
How to create and inspect NumPy Array
This is the first in a set of 3 parts of NumPy tutorials.
Here we start with the very basics: what a NumPy array is. By the end of this part, we get an idea about:
Before learning how to create a NumPy array, we will see what an array looks like. The most common arrays are one dimensional (1-D) or two dimensional (2-D). One dimensional (1-D) arrays are nothing but vectors; two or more dimensional arrays, in the context of linear algebra, are called matrices.
In NumPy terminology, a typical 2-D array is a grid of values arranged along two axes: axis 0 runs along the rows and axis 1 along the columns.
import numpy as np
np is just an alias. One can use any other alias, but np is the standard one.
There are several built in functions available to create NumPy arrays to make things easy for
us. Let us have a look at them one by one.
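A minimal sketch of the simplest case, converting a plain Python list (the values are illustrative):
list_1 = [1, 2, 3, 4]
array_1d = np.array(list_1)   # converts the Python list into a 1-D NumPy array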
In the above example, we simply converted a Python list to a NumPy array object, and it is a one dimensional (1-D) array. Similarly, to create a 2-D array from lists or tuples, we should pass a list of lists, a tuple of tuples or a tuple of lists.
Only then does NumPy understand that we need a 2-D array. We can see that in the example below. Can we give a list of tuples as an input? Please do check that.
The syntax simply looks like this for (1-D) and (2-D) arrays:
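A sketch of both forms (values are illustrative):
array_1d = np.array([1, 2, 3])                  # 1-D array from a list
array_2d = np.array([[1, 2, 3], [4, 5, 6]])     # 2-D array from a list of lists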
Creating NumPy arrays using built-in functions: Ex: np.ones() and np.zeros()
There are many built in functions available for creating NumPy arrays. The most common
built-in functions used to initialize are listed here. These functions can only be used when
you know the size of the array.
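A sketch of the initializations discussed below (the variable names follow the text):
a = np.ones((5, 3))             # 5 x 3 matrix of ones, float by default
b = np.ones((5, 3), dtype=int)  # same matrix with an explicit integer data type
c = np.zeros((5, 3))            # 5 x 3 matrix of zeros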
Try the above syntaxes and see what each one results in. In the initialization of 'a', the tuple (5, 3) passed as a parameter says that we want to generate a matrix of all ones with 5 rows and 3 columns.
By default, such an array will have a float data type; while initializing 'b', we explicitly passed the data type 'int' as a parameter. np.zeros(), used to initialize 'c', is a similar function which gives a matrix of all zeros.
np.arange() is similar to Python's built-in range() function. Let us create a NumPy array using arange(). arange(start, stop, step) typically takes 3 parameters, just like range(). The third parameter, step, is optional: if we do not mention any step size, NumPy takes step = 1. But in the present query we specified step = 5, so arange() generated numbers from 10 towards 100 with a difference of 5.
Along with arange(), an additional function, reshape(), is used here. As the name suggests, reshape() helps in changing the dimensions of an existing array. reshape() is an array manipulation function.
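A sketch matching the description ('numbers' is the variable name used in the notes below):
numbers = np.arange(10, 100, 5)   # 18 values: 10, 15, ..., 95
numbers = numbers.reshape(3, 6)   # reshape the 18 values into 3 rows and 6 columns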
Note 1: Observe carefully that it created an array with values from 10 to 95, because, like range(), arange() stops one step before the stop value.
Note 2: reshape() can only convert to dimensions whose product equals the total number of elements. In the example above, 'numbers' contains 18 elements and hence can be reshaped to a 3x6 matrix, whose product is 18. Can we reshape 'numbers' into any other dimensions?
Creating NumPy arrays using built-in functions: np.linspace()
The linspace() function returns numbers evenly spaced over a specified interval. Say we want 15 evenly spaced points from 1 to 3; we can easily use something like:
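For example (a one-line sketch):
np.linspace(1, 3, 15)   # 15 evenly spaced points from 1 to 3, inclusive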
This gives us a one dimensional vector. Unlike arange(), whose third argument is the step size, linspace() takes the third argument as the number of data points to be created.
In linear algebra, identity matrices and their properties are widely used. An identity matrix is a square matrix with the diagonal elements equal to one and all the rest zeros. np.eye() usually takes a single argument, the size. Here's how we create one:
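For example (a 3 x 3 identity matrix):
np.eye(3)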
The identity matrix is used religiously in linear algebra. Whenever you try to matrix-multiply two dimensionally incompatible arrays, a simple first step is to transform the matrices with the help of the identity matrix (np.eye()).
Apart from the methods mentioned above, there are a few more NumPy functions that
you can use to create special NumPy arrays:
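A few common examples of such functions (these are illustrative; the original article's list may have differed):
np.full((2, 3), 7)         # constant array filled with a given value
np.random.random((2, 3))   # array of random values between 0 and 1
np.empty((2, 3))           # uninitialized array of the given shape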
Typically, any real-world problem to which we apply ML algorithms will have thousands to lakhs of rows and hundreds of columns, so it is helpful to inspect the structure of arrays. We cannot make sense of the data merely by printing it, and doing so is time consuming too. There are a few built-in functions to quickly inspect arrays.
Let's say you are working with a moderately large array of size 1000 x 300.
Here, we cannot make sense of data merely by displaying a 1000 x 300 random numbers.
Using simple functions like
shape: Gives an idea of how many (rows, columns) are there in a given array.
dtype: To get the data type of the array. (Remember, we discussed that an array will
have same data type for all its elements.)
ndim: To get the dimensionality of the array.
itemsize: To get the memory size of each element of the array, in bytes.
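A sketch of these attributes on the 1000 x 300 array mentioned above (assuming NumPy is imported as np):
large_array = np.random.random((1000, 300))
large_array.shape      # (1000, 300)
large_array.dtype      # float64
large_array.ndim       # 2
large_array.itemsize   # 8, i.e. bytes occupied by each float64 element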
These functions are the basis for the inspection functions we use in the pandas library. We will get to the pandas library soon.
While pre-processing data in data science projects, it becomes part of the process to inspect the data every time we make a transformation. Let me elaborate. We generally get unclean data, which means some column values are missing or rows are duplicated. Every time we delete a duplicate row or fill an empty column, we inspect the array object as part of pre-processing.
Summary: In this part, we learnt 5 to 6 ways of creating an array object and how to inspect it. This extensive look at array creation will help make your work simpler and faster.
Next, Part-2 of the series talks about 'Array Indexing' and 'Array Manipulations and Operations'.
NumPy for Beginners: PART - 2
Array Indexing and Operations
This is the second in a set of 3 parts of NumPy tutorials.
Array Indexing/Slicing:
Array slicing is similar to slicing other data structures in Python. We simply pass the index we want and get an element or group of elements out. As in regular Python, elements of an array are indexed from 0 to n-1.
One dimensional (1-D) array slicing is the same as with Python lists, so try these syntaxes to get some practice. ' : ' is used to get a range of values, just like in lists. Example: suppose we have '2:5'; NumPy interprets this as a request to pull out elements from the 2nd up to the (5 - 1 = 4)th element.
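A few illustrative syntaxes to try (the values are a sketch):
arr = np.arange(10, 20)   # 10, 11, ..., 19
arr[2:5]    # elements at positions 2, 3 and 4
arr[2:]     # every element from the third element onwards
arr[:5]     # first five elements
arr[-1]     # last element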
Observe the notation [2:] used in the second example. It asks for every element from the third element onwards. Try guessing the results for the remaining syntaxes; hints are given in the comments above.
Multidimensional arrays are indexed using as many indices as the number of dimensions or axes. For instance, to index a 2-D array, you need two indices: array[x, y]. In [x, y], x is for rows and y is for columns, and each axis has an index starting at 0.
Let us create a 2-D array and do some slicing. The first line in the sketch below creates a 2-D array. Now say we want a specific element or part of the given array; this is done using square brackets and a comma, '[ , ]'. The comma separates row slicing from column slicing.
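A sketch of a 2-D array and some slices (values are illustrative):
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr_2d[1, 2]      # element in the 2nd row, 3rd column
arr_2d[0, :]      # all elements of the 1st row
arr_2d[:, 1]      # all elements of the 2nd column
arr_2d[0:2, 1:3]  # sub-matrix of the first two rows and last two columns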
Now start guessing the results for the given syntaxes and compare them with the actual outputs. ' : ' without a range retrieves all the elements of a particular row or column.
On NumPy arrays, we can perform almost all the mathematical and logical operations one can perform on Python data structures like lists and tuples.
On top of that, we can extensively perform linear algebra and trigonometry calculations on array objects. In fact, the purpose of NumPy is to provide scientific computing ability to Python.
Manipulating arrays
Mathematical and Logical operations on arrays
Manipulating arrays:
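A sketch of the reshaping described below:
arr = np.arange(12)   # values 0 to 11
arr.reshape(3, 4)     # 3 rows, 4 columns
arr.reshape(6, 2)     # 6 rows, 2 columns
arr.reshape(4, -1)    # 4 rows; NumPy infers the 3 columns automatically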
In this example, an array of values 0 to 11 is taken and reshaped in 3 different ways. You will find the last one really interesting: if we know only the number of rows we want, we can simply give reshape(4, -1), and NumPy understands that we want 4 rows and automatically calculates the number of columns using the product rule we discussed earlier. Try the last syntax replacing rows with columns.
Think about what would be the easiest way to reshape columns into rows and vice versa. Connect the dots with matrices.
Stacking arrays: Stacking is done using the np.hstack() and np.vstack() methods. For horizontal stacking, the number of rows should be the same, while for vertical stacking, the number of columns should be the same.
Try these. While vstack() places array_2 below array_1, hstack() places the array given as the second argument to the right of the first. Note: the arrays should be passed as a list or tuple of arrays.
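A sketch (the array values are illustrative):
array_1 = np.array([[1, 2], [3, 4]])
array_2 = np.array([[5, 6], [7, 8]])
np.vstack((array_1, array_2))   # array_2 is placed below array_1
np.hstack((array_1, array_2))   # array_2 is placed to the right of array_1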
Logical operations on arrays: We can also perform conditional and logical selections on
arrays using &(AND), | (OR), <, > and == operators to compare the values in the array with
the given value.
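A sketch matching the description below:
array_logical = np.arange(5, 15)              # values 5 to 14
array_logical > 10                            # boolean array from an element-wise comparison
(array_logical > 7) & (array_logical < 12)    # combining conditions with & (AND)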
In the first example, we take values ranging from 5 to 14 in array_logical. When we ask whether array_logical > 10, the result is a boolean array in which each element is compared against 10. Try the remaining syntaxes, but before that, guess what the result will be to check your understanding.
Linear Algebraic Operations: NumPy provides the np.linalg package to apply common
linear algebra operations, such as:
Let me give you an example. Linear regression is one of the most talked-about examples used to explain machine learning. It says that, given data [X, Y], I can compute the relationship between X and Y as the coefficients 'A' of a linear equation of the form Y = X·A.
While ML algorithms use statistical techniques to give more precise results, we can simply compute 'A' directly:
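A minimal sketch of computing 'A' with the normal equation, assuming the model Y = X·A (the data below is invented for illustration):
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # hypothetical design matrix: a bias column plus one feature
Y = np.array([2.1, 3.9, 6.2])                        # hypothetical target values
A = np.linalg.inv(X.T @ X) @ X.T @ Y                 # A = (X'X)^-1 X'Y, the least-squares solution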
Functions like np.linalg.inv() come in handy here. Not only this: np.linalg.eig(), used to compute eigenvalues and eigenvectors, is used repeatedly by algorithms to perform principal component analysis (PCA) or in support vector machines (SVM).
Apply User Defined Functions: We can also apply our own functions on arrays. For e.g.
applying the function x/(x+1) to each element of an array. One way to do that is by looping
through the array, which is the non-numpy way.
We would rather prefer to write vectorised code. The simplest way to do that is to
vectorise the function you want, and then apply it on the array. Numpy provides the
np.vectorize() method to vectorise functions.
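A sketch of both approaches:
def f(x):
    return x / (x + 1)

arr = np.array([1, 2, 3, 4])
looped = [f(x) for x in arr]   # the non-NumPy way: looping element by element
f_vec = np.vectorize(f)        # vectorise the function
f_vec(arr)                     # apply it to the whole array at once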
These kinds of functions come in handy when we are doing any new calculation on the array.
Universal Functions:
Like any other programming language, NumPy has access to universal functions.
Among these, functions like sum(), std() and count() are used repeatedly while pre-processing data in data science projects.
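A brief sketch (note that plain NumPy arrays provide sum() and std(); counting is usually done with np.count_nonzero(), or later with pandas' count()):
arr = np.arange(1, 6)
arr.sum()                # 15
arr.std()                # standard deviation of the elements
np.count_nonzero(arr)    # 5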
In the next part (Part-3), we have an interesting topic specific to NumPy: broadcasting. For people who work a lot with arrays/matrices, broadcasting is a gift. As a final touch, we will do a speed test of lists vs NumPy arrays.
NumPy for Beginners: PART – 3
Broadcasting and Lists Vs Arrays Speed
test
This is the third and final part in a set of 3 NumPy tutorials.
Broadcasting:
In general, arrays of different dimensions cannot be added or subtracted. NumPy has a smart way to overcome this problem: it duplicates the smaller array to match the size of the higher-dimensional array and then performs the operation.
To simplify, broadcasting is the name given to the method that NumPy uses to allow array arithmetic between arrays with different shapes or sizes. To quote the exact definition from scipy.org:
"The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is 'broadcast' across the larger array so that they have compatible shapes."
We can take 3 types of examples to efficiently convey the concept of broadcasting. Let us
see examples first and try to interpret them.
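A sketch of the first two examples described below:
a_1d = np.array([1, 2, 3])
b = 2
a_1d + b          # the scalar b is broadcast across the whole vector

a_2d = np.array([[1, 2, 3], [4, 5, 6]])
a_2d + b          # the scalar is broadcast across the whole 2-D array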
In the first example, we add a scalar value (a single value) to a vector (a 1-D array). Like we discussed earlier, the value of the scalar 'b' is duplicated so that both operands have the same shape, and then the addition is performed. Observe the output to see that each value is incremented by b = 2.
Similarly, in the second example, we add a scalar value to a 2-D array. Up to now, the smaller array is a single value, so broadcasting works without any limitations. But the real purpose of broadcasting is to add 2 arrays of different dimensions greater than 1. Let us see how that works and what the limitations are.
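A sketch of row-wise and column-wise broadcasting (b_col is a hypothetical column vector added for the second case):
a_2d = np.array([[1, 2, 3], [4, 5, 6]])
b_1d = np.array([10, 20, 30])
a_2d + b_1d       # row-wise: b_1d is added to every row of a_2d

b_col = np.array([[100], [200]])
a_2d + b_col      # column-wise: b_col is added to every column of a_2d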
The first example here is row-wise broadcasting. This means each row in 'a_2d' is added with the 1-D array 'b_1d'; this happens because 'b_1d' is a single row vector. Similarly, the second example is column-wise broadcasting.
Broadcasting is an interesting concept, but it comes with its own limitations. Broadcasting expects at least one dimension (rows or columns) to match between the two arrays. Let us check by giving different dimensions.
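For instance, adding a (2, 3) array and a (3, 2) array fails:
a = np.ones((2, 3))
b = np.ones((3, 2))
a + b   # raises ValueError: operands could not be broadcast together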
If we observe carefully, neither the column nor the row dimensions are equal, and hence the ValueError.
Broadcasting is definitely a handy shortcut but imposes some rules to allow the operations.
The examples we covered here are very basic and are intended for beginners.
You can go ahead and further explore these articles to know more:
Broadcasting Scipy.org
Array broadcasting in NumPy
Speed test: Lists vs Arrays:
We will keep it extremely simple.
I simply took 10 lakh numbers in two sequences: 'i' holds plain numbers and 'j' holds the squares of 'i'. I divided 'j' by 'i' element-wise (j/i).
I recorded the time taken to do the same operation on lists and on arrays and computed the ratio of the two. On this run, NumPy performed about 22 times faster than lists.
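A minimal sketch of such a timing comparison (the exact ratio depends on the machine; the article reports about 22x):
import time
import numpy as np

size = 1_000_000                       # 10 lakh numbers
i_list = list(range(1, size + 1))
j_list = [x ** 2 for x in i_list]

start = time.time()
result_list = [j / i for i, j in zip(i_list, j_list)]
list_time = time.time() - start

i_arr = np.arange(1, size + 1)
j_arr = i_arr ** 2

start = time.time()
result_arr = j_arr / i_arr
numpy_time = time.time() - start

print(list_time / numpy_time)          # how many times faster NumPy is on this machine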
'Pandas' is the most popular Python library for data science. We can think of it as a well-equipped querying language used for exploratory data analysis (EDA). The pandas library is built on top of NumPy. The idea is to make all the arithmetic and complex mathematical operations one can perform on NumPy arrays accessible to a querying language.
In this course, we will be using Jupyter Notebook as our editor. I particularly suggest using
Jupyter Notebook from here on because when you start data science projects, it gives an
easy access to the results you want to see while doing the analysis.
So let’s start!
Installing Pandas:
I assume you have Python installed on your laptop already. If not, check the previous article, where I shared a link to install Anaconda. If you have Anaconda, you can simply install pandas from your terminal or command prompt using:
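(A typical command, assuming conda is available on your PATH:)
conda install pandas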
If you do not have Anaconda on your computer, install pandas from your terminal using:
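(Using pip:)
pip install pandas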
Once you have Pandas installed, launch your Jupyter notebook and get started.
Don't worry; we will take it easy and simple. The articles are meticulously articulated for conceptual understanding together with hands-on practice. In between, I will try to correlate how we use these concepts while doing data pre-processing in data science projects.
Series: A Series is a one dimensional (1-D) NumPy array (vector) with an index. The main difference is that a Series carries explicit index labels, while a plain array is indexed only by position. An index is nothing but an access label; simply put, we can name rows as per our interest.
Dataframe: A Dataframe is a two dimensional (2-D) array where each column can have a different datatype. We saw in NumPy that an array can only take a single datatype for all its columns or elements. Similar to a Series, it takes access labels, and we can label both rows and columns. Typically, one can imagine it as a table.
Now that we know what they are, let us open our Jupyter Notebooks and start by importing
NumPy and Pandas packages.
import pandas as pd
import numpy as np
‘pd’ and ‘np’ are standard aliases used. You can use any other aliases or just import without
alias.
We will briefly look at Series object first and then get into dataframes. Eventually, you will
get to work largely on dataframe objects since manipulating dataframes efficiently and
quickly is the most important skill set if you choose python for your data science projects.
Creating Series:
The basic syntax for creating a Series object is: series_1 = pd.Series(data, index)
While calling the Series function, 'S' is always uppercase. 'index' is optional, and a default index is generated when it is unspecified. A Series can be created taking the input data as a list or tuple, a NumPy array, a dictionary or a scalar value.
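A sketch of our first Series object (the values are illustrative; 's' is the variable inspected below):
s = pd.Series([10, 20, 30, 40])   # data passed as a list, no index specified
type(s)                           # pandas.core.series.Series
s.dtype                           # int64: each element is an integer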
We can observe that, while creating our first Series object, we passed our data as a list; so the 'data' argument takes another data structure/array object. Try giving a tuple. Since we did not specify any index, a default index was generated. In this example, we also inspect the data type of the variable 's': it is an object of class Series, while each element is an 'int'.
There is a function date_range() to create a series of dates. We simply give start and end dates as arguments and the function creates a sequence of dates. Our input argument has dates in the form 'MM-DD-YYYY', but the output is in the standard form 'YYYY-MM-DD'. Replace end = '11-16-2017' with periods = 3, freq = 'M' and try to understand the outcome.
Next, I used the same dates as the index for another series; that is, we passed one series as the index for another series. Try this: create two different series and pass one as the index to the other.
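A sketch of both steps; the start date and the attendance-style values are assumptions made for illustration:
dates = pd.date_range(start='11-14-2017', end='11-16-2017')   # DatetimeIndex of 3 days
s_attendance = pd.Series([45, 48, 42], index=dates)           # the date series used as the index of another series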
Note: This example is a simple way to show how custom indexing can help in data analysis. Suppose you have a student's attendance sheet with no dates; simply adding a date series as an index will help you understand the behaviour of any student in attending the class.
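A sketch of a Series created from a dictionary (the data is hypothetical):
marks = {'Maths': 85, 'Physics': 78, 'Chemistry': 92}
s_marks = pd.Series(marks)   # the dictionary keys become the index labels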
While creating a series from a dictionary, pandas understands each 'key' of the dictionary as a 'label' in the index. Try giving some external index argument and see what happens.
Creating DataFrames:
The syntax for creating a dataframe object is: df = pd.DataFrame(data, index, columns, dtype)
While creating a 'DataFrame', 'D' and 'F' are uppercase. We can give one or more arguments, the basic argument being 'data'. Here, 'index' is the same as in a Series, while 'columns' takes a list of values for the column labels. 'columns' also takes default labels of 0 to n-1 when unspecified.
Since pandas is largely used as a data pre-processing tool, it also provides the ability to read data from several file types directly into dataframes; we will see that shortly.
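A sketch of a small DataFrame built from a dictionary of lists (all values are hypothetical):
data = {'Name': ['Asha', 'Bala', 'Chitra'],
        'Age': [24, 31, 27],
        'Score': [88.5, 92.0, 79.5]}
df = pd.DataFrame(data)   # each key becomes a column; each column keeps its own dtype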
We see that a DataFrame is nothing but a combination of Series objects, where each column can take a different data type. What can be the data type of each column in this dataframe?
It looks just like a table with columns and rows. It is important to understand that, commonly, each row is a different observation and each column is an attribute.
Try creating dataframes from tuples and series. Hint: it is just similar to creating dataframes from lists.
Now that we saw how to create a simple dataframe from other data structures, we will go
ahead and see how we read data from other sources such as excel, text or web pages
directly into dataframes.
Typically, at any given time while working on a data science project, we work with data
having lakhs of rows and hundreds of columns. So it is highly likely that we read data from
other files than make a dictionary and convert it into dataframe.
pd.read_csv() is the most commonly used function, not only for files with a 'csv' extension but also for reading text files and data from web sources. We can use the 'index_col' argument to suggest which column to use as the row label, and 'names' to manually give column labels.
The delimiter suggests where to partition the data for each column in a given row. For example, if we have '5|6' and give the argument delimiter = '|', then while making the dataframe pandas understands that 5 and 6 should be separated into 2 different columns.
I created 3 small files, 'Read_file.csv', 'Read_file.xlsx' and 'Read_file.txt', to which I will provide access; try loading the files while following the notes to get some practice. Remember! Practice is the key.
In the second example, we read data from the text file using the same syntax and added two new arguments. You can observe the text file provided: I used the symbol '|' to separate data in it. Pandas only understands ',' as the delimiter by default, so for any other delimiter we have to provide the argument. In place of delimiter, we can also use the alias sep = '|'.
We added another argument, 'index_col'. As the name suggests, a column number is given to make that column the row label/index of the dataframe. Try making the column 'Name' the index of the dataframe. Lastly, every empty cell in the loaded data file will be filled with 'NaN' (not a number) while making the dataframe.
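A sketch of reading the three files mentioned above (the index_col choice is an assumption):
df_csv = pd.read_csv('Read_file.csv')
df_txt = pd.read_csv('Read_file.txt', delimiter='|', index_col=0)   # '|' separated text file, first column as index
df_excel = pd.read_excel('Read_file.xlsx')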
Similarly, we can read data from a webpage using the syntax: df = pd.read_csv('link address').
In summary, we have seen how to create a pandas ‘series’ and ‘dataframes’ from different
sources. We now have the data available in the form understandable to pandas. Now let’s
get started with different concepts involved in data pre-processing.
Pre-processing data is one of the most time-consuming tasks of any data science assignment. Data comes in all different forms: it might have missing values as well as duplicate rows. It is desirable to act on such data elements while performing exploratory data analysis (EDA) and before applying machine learning (ML) algorithms. So every time we fill a value or delete a row, it is advisable to re-visit the properties of the data to verify that the changes are reflected as desired.
In this part, we will look at 5 or 6 fundamental built-in functions that can give an overview of any given dataframe and its properties. Since a 'Series' object is a subset of a 'DataFrame', we will work only on dataframes and discuss series only when required.
Inspecting dataframes:
Inspecting a dataframe is the most repeated activity while pre-processing data. The most important functions are covered below. (Guess the output for each one of them.)
As a starting point, let’s inspect one of the dataframes we created. Once we get familiar
with what each function does, we will see another more realistic data set to understand the
significance of these functions.
Inspecting dataframes: Part-1 - Shape, info(), columns:
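A sketch of these functions in use (df_excel_new is assumed to be one of the dataframes created earlier):
df_excel_new.shape      # (rows, columns)
df_excel_new.columns    # column names/labels
df_excel_new.values     # the underlying NumPy array
df_excel_new.info()     # column names, dtypes, non-null counts and memory usage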
Mostly the functions are self-explanatory, but let us discuss them anyway. 'shape' gives the dimensions (rows, columns) of a given dataframe; in the case of a Series object, since it is 1-D, shape gives only the number of rows. Similarly, 'columns' displays all the column names/labels. 'df_excel_new.values' is another such attribute; guess the output and then try it.
info() displays a collection of information: the column names, data types, number of non-null values and the memory usage of the dataframe. With one look at the output of info(), we can answer questions like 'How large is the dataframe?', 'Which columns have missing values?' and 'Which column has the most missing values?'
This kind of information helps us decide whether to drop a column or fill in the values while performing exploratory data analysis (EDA). For example, say we have a column 'X' where 4 out of 5 values are 'NaN'. In such a case, it is appropriate to drop the column, since 80% of the data is missing.
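Two more handy functions are head() and tail(); a sketch:
df_excel_new.head()     # first 5 rows by default
df_excel_new.head(2)    # first 2 rows
df_excel_new.tail()     # last 5 rows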
By default, head() displays the first 5 rows. If we specify an argument like head(2) or head(20), it displays the requested number of rows from the top, and tail() does the same from the last row. head() is typically the first function called after a dataframe is created; it gives us an idea of what our dataframe looks like.
Now let us pick a dataset where we understand the significance of these functions.
This is the Global Superstore data, which is generally used for learning visualization. We can download the data from http://www.tableau.com/sites/default/files/training/global_superstore.zip. Here, I want to see how big the data is. A simple df_superstore.shape shows that it has more than 50 thousand rows. We see that there is no point in displaying the entire data set.
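A sketch of loading and sizing up the data (the file name and format after extracting the zip are assumptions):
df_superstore = pd.read_csv('global_superstore.csv')   # hypothetical file name
df_superstore.shape        # more than 50 thousand rows
df_superstore.columns      # order-level attributes
df_superstore.describe()   # summary statistics of the numeric columns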
One look at the columns and we know this data is the record of all orders made by customers across different outlets of the same store in many countries.
What column has highest number of missing values and whether it can be dropped?
Guess what can be made as an index in place of ‘row id’? Remember! It’s
good practice to have unique index for each row.
Statistics for Row ID and Postal code are useless. (Common sense!)
In most cases, profit is USD 9. (check 50th percentile in profits)
Not much discounts are running in the stores.
In summary, inspecting dataframe is an important part and the most repeated activity while
performing pre-processing. A set of simple functions presents a better idea of the data at
hand than data itself.
Selecting rows: We look at 2 ways of selecting rows from a dataframe. It is similar to NumPy
arrays or other data structures in python. While using pythonic way of indexing, the basic
syntax is df[start_index: end_index].
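A sketch of both styles, using the applicant dataframe from the earlier examples:
df[2:5]                            # positional slicing: rows 2, 3 and 4
df['Applicant 3':'Applicant 5']    # label slicing: includes both end labels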
In the first example, we selected rows using default indices. For example, when we request rows '2:5', python interprets that as a request to pull out rows from 2 to (5 - 1 = 4).
In the second example, we selected rows using row labels. The basic syntax is df[start_label:end_label]. While using row labels, pandas interprets that as a request to retrieve data including both start_label and end_label. For instance, in the example above, since the given range is 'Applicant 3':'Applicant 5', it selected the rows with labels 'Applicant 3', 'Applicant 4' and 'Applicant 5'.
Selecting columns: Selecting columns is done simply by calling the column name. There are two ways of selecting a single column from a dataframe. The syntax is df['column_name'] or df.column_name.
While retrieving a single column, it should be noted that pandas makes it a Series object, since it has only one column. It is a matter of choice which syntax to use while selecting a single column. Throughout the tutorial, I will use the syntax df['column name'] to avoid ambiguity.
To select 2 or more columns, the column names should be given as a list inside the selection brackets, [[ , ]]. The syntax is df[['column name 1', 'column name 2']]. Check the type of the object retrieved with multiple columns.
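A sketch, assuming hypothetical column names 'Visa Type' and 'Country':
df['Visa Type']                 # single column: returned as a Series
df[['Visa Type', 'Country']]    # multiple columns: returned as a DataFrame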
Try these:
Position and label based indexing: (using df.loc and df.iloc)
We have seen some ways of selecting rows and columns from dataframes. Let's now see some other ways of indexing dataframes that pandas recommends, since they are more explicit (and less ambiguous). There are two main ways of indexing here:
1. Position based indexing using df.iloc: 'iloc' stands for integer-location based indexing.
2. Label based indexing using df.loc: 'loc' is simply label based indexing.
Using both the methods, we will do the following indexing operations on a dataframe:
The syntaxes are similar to what we discussed earlier, we simply add ‘iloc’ and ‘loc’
functions before indexing. These functions make it easier for pandas to understand the style
of indexing.
Position based indexing using 'iloc': Even when we have labels for our rows and columns, internally the objects are stored using the traditional integer indexing that Python uses. By calling 'iloc', we are explicitly asking pandas to access the dataframe using the default indexing style of python/pandas, from 0 to n-1.
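A sketch of the iloc queries discussed below:
df.iloc[1, 2]       # element in the 2nd row, 3rd column
df.iloc[4]          # the 5th row as a Series
df.iloc[0:2]        # the 1st and 2nd rows
df.iloc[0:2, 2:4]   # 1st-2nd rows and 3rd-4th columns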
In the first example, we are simply selecting the element in the second row and third column by accessing [1, 2]. Guess which element we access using iloc[2, 4].
The next examples show selecting the 5th row (a single row) of a dataframe and then selecting the 1st and 2nd rows.
Finally, we select a subset of a dataframe. The syntax is df.iloc[i:j, x:y]; in the example, we are selecting the (1st and 2nd) rows and the (3rd and 4th) columns. Try selecting the (3rd and 5th) rows and the (1st and 5th) columns.
Label based indexing, 'loc': This is a more user-friendly syntax than the iloc we just discussed, simply because we can look at the labels in a dataframe but not the internal integer index. Whenever we want a column or a row, it is easy to access it simply by calling it with its label.
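A sketch, using the hypothetical row and column labels from the applicant example:
df.loc['Applicant 3', 'Visa Type']   # value at a given row label and column label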
Once we know both the column label and the row label, the syntax is actually straightforward. In the example, we want to know what the 'Visa Type' of 'Applicant 3' is. The general syntax is df.loc['row label', 'column label']. Try working out a similar query on your own.
Label based indexing, 'loc': Selecting single and multiple rows
Selecting a single row using 'loc' is similar to using 'iloc'; the integer index in iloc is simply replaced by a label in loc.
There is a small difference while selecting multiple rows. While using iloc[i:j], rows are selected up to row j-1, but loc['row label i':'row label j'] works slightly differently: here, rows are selected including the jth row. We can also select rows by passing a list of labels.
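A sketch of selecting rows with loc:
df.loc['Applicant 3']                        # a single row by label
df.loc['Applicant 3':'Applicant 5']          # multiple rows; both end labels are included
df.loc[['Applicant 1', 'Applicant 4']]       # selected rows given as a list of labels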
Let us see one more example, which gives more of a data science perspective. In the 'superstore data' we saw earlier, let us look at all the orders that gave a profit of more than 5000 USD.
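A sketch of that query (the 'Profit' column name is taken from the dataset):
df_superstore.loc[df_superstore['Profit'] > 5000]   # all orders with profit above 5000 USD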
I want to repeat that these are the kind of queries you repeatedly work on when you have a
data science assignment.
5. Arithmetic Operations
6. Grouping and Summarizing
Any business problem needs data from multiple departments to be analysed in order to get some perspective. Each department uses a different database according to its functionality. So every time we do some analysis, it is almost inevitable that we combine dataframes from different databases.
For example: ‘X’ engineering college wants to understand the relation between a student’s
‘grades’ and their ‘salary after 5 years of passing out’. Student grades are available in the
respective department databases while their current salaries are available in the alumni
database. In order to do this analysis, we have to combine both the databases.
In this tutorial, we will learn how to ‘merge’ and ‘concatenate’ multiple dataframes.
Merge: ‘inner’
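A sketch, assuming the two dataframes to be combined are df1 and df2:
df_inner = pd.merge(df1, df2, how='inner', on='Applicant No')   # keep only applicants present in both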
With the argument how = 'inner', we keep only the rows whose key appears in both dataframes. The other argument, on = 'Applicant No', is the common column on which the dataframes have to be merged.
Merge: ‘Outer’
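A sketch with the same assumed dataframes:
df_outer = pd.merge(df1, df2, how='outer', on='Applicant No')   # keep all applicants from both dataframes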
With the argument how = 'outer', we combine the dataframes including all rows from both. Please observe that if an 'Applicant No' is not present in one of the dataframes, the corresponding values are populated as 'NaN'. Check 'Applicant 6' to understand the result.
Similarly, we can perform ‘how = left’ and ‘how = right’ merges. Try and interpret the
results.
pd.concat() takes a list of dataframes as its first argument and an axis, to define row-wise or column-wise combination, as the other argument. We can combine more than 2 dataframes at a time.
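A sketch:
pd.concat([df1, df2], axis=0)   # stack rows; axis=1 would place the dataframes side by side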
In summary, we learnt how to 'merge' and 'concat' two dataframes. 'concat' can be used in place of 'merge' but the reverse is not true. There is another method, 'df1.append(df2)', which we did not discuss in this tutorial and which works similarly to 'concat'.
While doing exploratory data analysis (EDA), we often work with derived metrics. For
example: If we have profit and price, we want to see the profit percentage. If we have cricket
matches won and lost, we might want to look at total matches played. These kinds of
metrics require arithmetic operations to be performed on columns.
Here, I created two simple datasets of electronic product sales; let us see how we can derive metrics from these dataframes.
The datasets hold the total sale of 5 important electronic categories in an outlet for the first 2 weeks of a month.
Note: Observe that we have set 2 columns as the label (index).
'add()' operator:
Suppose we want to calculate the total sales over the 2 weeks. We can simply add the 2 dataframes using df1.add(df2, fill_value = 0).
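A sketch, assuming the two weekly dataframes are named df_week1 and df_week2:
df_total = df_week1.add(df_week2, fill_value=0)   # element-wise sum of the two weeks' sales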
This operation works the same as adding 2 NumPy arrays. The argument fill_value is given to handle 'NaN' values: if there are any, we are asking pandas to replace them with 0 before adding. In our case, there are no such values.
div() operator:
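A sketch of the idea described below, using the column names from this example (df_sales is assumed to be one of the sales dataframes created above):
df_sales['Profit per quantity'] = df_sales['Profit in lakhs'].div(df_sales['Quantity'])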
Here, there are 2 new things to learn. First, we did a column-wise division: we picked individual columns from a dataframe and divided one by the other. Second, we added a new column ('Profit per quantity') to the existing dataframe while performing the operation.
We can also do this kind of operation on columns from 2 different dataframes. Suppose we want to see what percentage of the total profit was made in each week:
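A sketch using the two weekly dataframes and their total (names follow the earlier assumptions):
profit_pct = pd.DataFrame({
    'Week 1 %': df_week1['Profit in lakhs'].div(df_total['Profit in lakhs']) * 100,
    'Week 2 %': df_week2['Profit in lakhs'].div(df_total['Profit in lakhs']) * 100,
})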
In this example, we used data from 3 different dataframes to compute the profit percentages. Try to add this data as additional columns to 'df_sales' without creating a new dataframe.
1. Difference in sale from week 1 to 2 in each category. Use ‘Total Sale in Lakhs’
column.
2. What is the cost price per quantity? Use ‘Total Sale in Lakhs’, ‘Profit in lakhs’
and ‘Quantity’ columns.
While working on machine learning algorithms, these kinds of measurable and meaningful properties are derived as 'features' to make the algorithms work efficiently.
Apart from these, there are also other operator-equivalent mathematical functions that you
can use on Dataframes. Below is a list of all the functions that you can use to perform
operations on two or more dataframes.
sub(): -
mul(): *
floordiv(): //
mod(): %
pow(): **
In summary, we have seen how arithmetic operations are performed on dataframes and their columns. Next, let us go a bit deeper and see how we do categorical analysis using groupby and summarizations using simple built-in functions.
Grouping and summarizing are some of the most frequently used operations in data analysis, especially while doing exploratory data analysis (EDA), where comparing summary statistics across groups of data is common. Grouping together with summarization is used for answering categorical questions.
For example, in the superstore sales data we are working with, you may want to look at the most profitable shipping mode or the most sold product category. This kind of information is captured using 'groupby' and summarization functions like mean(), value_counts(), etc.
1. Separating the data into groups (e.g. groups of customer segments, product
categories, etc.)
2. Applying a function to each group (e.g. mean or total sales of each customer
segment)
3. Transforming the results into a data structure showing the summary statistics
(Optional, only if we want to further act upon data.)
We will work through this tutorial by answering a few analytical questions. In the superstore data, let us see: what is the shipping mode with the highest profit?
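A sketch of the whole workflow (the 'Ship Mode' and 'Profit' column names are taken from the dataset):
df_superstore['Ship Mode'].unique()               # how many shipping modes are there?

profit_by_mode = (
    df_superstore.groupby('Ship Mode')['Profit']  # Step 1: split the data by shipping mode
    .sum()                                        # Steps 2, 3: total profit per group
    .reset_index()                                # turn the result into a dataframe
    .sort_values(by='Profit', ascending=False)    # Step 4: highest-profit shipping mode first
)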
We answer the question by dividing it into parts. First, I want to know how many shipping modes there are. unique() helps to identify the distinct values in a particular column; here we understand there are 4 types of shipping modes. Try to find the distinct categories of products.
Step 1: Let us divide the data into groups using 'Ship Mode'. The groupby() function does that for us: it divides the data into categories and creates a groupby object, which cannot be viewed unless we apply an aggregate/summary function to it.
Steps 2, 3: Calculate the total profit in each category using sum() and make a dataframe out of it.
Step 4: Since the question is about the shipping mode with the highest profit, we sort the values of the dataframe using sort_values() in descending order. Remember! These are the kinds of questions we constantly work with while doing EDA in data science projects.
Aggregating functions are the ones that reduce the dimension of the returned objects.
Some common aggregating functions are tabulated below:
Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
min() Compute min of group values
max() Compute max of group values
Answer the following questions on shipping mode using the aggregate functions
given above.
Average profit by ‘ship mode’
Get the descriptive statistics of groups. (Hint: Refer to inspecting dataframe tutorial.)
Count the values in each group.
Try to interpret the results. Interpreting results develops intuition, which is a much-needed skill while doing these kinds of projects.
Grouping is a very important topic of which we have only covered the basics. Please refer to this material for an extensive read on the topic.
Well! This is all the pandas you need to kick-start your journey through Python for data science.