Introduction To Python Libraries
Numpy
Introduction
NumPy stands for ‘Numerical Python’. It is a package for data analysis and scientific computing with
Python. NumPy uses a multidimensional array object, and has functions and tools for working with
these arrays. The powerful n-dimensional array in NumPy speeds up data processing. NumPy can be
easily interfaced with other Python packages and provides tools for integrating with other
programming languages such as C and C++.
Installing NumPy
NumPy can be installed by typing the following command:
pip install numpy
Array
We have learnt about various data types like list, tuple, and dictionary. In this
chapter we will discuss another datatype ‘Array’. An array is a data type used to
store multiple values using a single identifier (variable name). An array contains
an ordered collection of data elements where each element is of the same type and
can be referenced by its index (position).
The important characteristics of an array are:
• Each element of the array is of the same data type, though the values stored in
them may be different.
• The entire array is stored contiguously in memory. This makes operations
on arrays fast.
• Each element of the array is identified or referred to using the name of the
array along with the index of that element, which is unique for each element.
The index of an element is an integral value associated with the element, based on the
element’s position in the array. For example, consider an array with 5 numbers:
[ 10, 9, 99, 71, 90 ]
Here, the 1st value in the array is 10 and has the index value [0] associated with
it; the 2nd value in the array is 9 and has the index value [1] associated with it, and
so on. The last value (in this case the 5th value) in this array has an index [4]. This
is called zero based indexing. This is very similar to the indexing of lists in Python.
The idea of arrays is so important that almost all programming languages support
it in one form or another.
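The zero-based indexing described above can be tried out with a short sketch. It assumes NumPy is installed and imported under the alias np; the variable names (marks, first, second, last) are only illustrative.

```python
import numpy as np

# The five-number array used in the example above
marks = np.array([10, 9, 99, 71, 90])

first = marks[0]    # index 0 refers to the 1st value
second = marks[1]   # index 1 refers to the 2nd value
last = marks[4]     # index 4 refers to the 5th (last) value

print(first, second, last)   # prints: 10 9 90
print(marks.dtype)           # every element shares this one data type
```

Asking for marks[5] would raise an IndexError, because the valid indices of a five-element array run from 0 to 4.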
NumPy Array
NumPy arrays are used to store lists of numerical data, vectors and matrices. The
NumPy library has a large set of routines (built-in functions) for creating,
manipulating, and transforming NumPy arrays. The Python language also has an array
data structure, but it is not as versatile, efficient and useful as the NumPy array.
The NumPy array is officially called ndarray but commonly known as array. In the rest
of the chapter, we will be referring to the NumPy array whenever we use “array”.
Following are a few differences between a list and an array.
Summary
• An array is a data type that holds objects of the same datatype (numeric, textual,
etc.). The elements of an array are stored contiguously in memory.
• Each element of an array has an index or position value.
• NumPy is a Python library for scientific computing which stores data in a
powerful n-dimensional ndarray object for faster calculations.
• Each element of an array is referenced by the array name along with the
index of that element.
• numpy.array() is a function that returns an object of type numpy.ndarray.
• Arithmetic operations can be performed element by element on two arrays when
their shapes are the same.
• NumPy arrays are not expandable or extendable. Once a NumPy array is
defined, the space it occupies in memory is fixed and cannot be changed.
• numpy.split() slices apart an array into multiple sub-arrays along an
axis.
• The numpy.concatenate() function can be used to concatenate arrays.
• numpy.loadtxt() and numpy.genfromtxt() are functions used to load
data from files. The numpy.savetxt() function is used to save a NumPy array to a text file.
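The functions named in this summary can be seen together in the following sketch. The array values, the variable names and the file name numbers.txt are assumptions made for illustration; a temporary folder is used so that no file is left behind.

```python
import os
import tempfile

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(type(a))                     # numpy.array() returns a numpy.ndarray

# Arithmetic works element by element on arrays of the same shape
total = a + b                      # 11, 22, 33, 44

# Join the two arrays end to end, then cut the result into two equal parts
joined = np.concatenate([a, b])
parts = np.split(joined, 2)

# Save the array to a text file and load it back again
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "numbers.txt")   # illustrative file name
    np.savetxt(path, joined, fmt="%d")
    loaded = np.loadtxt(path, dtype=int)

print(total, parts, loaded)
```

Note that numpy.concatenate() and numpy.split() return new arrays; the original arrays a and b keep their fixed size.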
Pandas
PANDAS (PANel DAta) is a high-level data manipulation tool used for analysing
data. It is very easy to import and export data using Pandas library which has
a very rich set of functions. It is built on packages like NumPy and Matplotlib
and gives us a single, convenient place to do most of our data analysis and
visualisation work. Pandas has two main data structures, namely
Series and DataFrame, to make the process of analysing data organised,
effective and efficient. Older versions also provided a third structure, Panel,
which has been removed from recent releases. The Matplotlib library in Python is used for plotting
graphs and visualisation. Using Matplotlib, with just a few lines of code we can
generate publication quality plots, histograms, bar charts, scatterplots, etc. It
is also built on NumPy, and is designed to work well with NumPy and Pandas.
You may wonder why Pandas is needed when NumPy can be used for data
analysis. Following are some of the differences between Pandas and
NumPy:
1. A NumPy array requires homogeneous data, while
a Pandas DataFrame can have different data types
(float, int, string, datetime, etc.).
2. Pandas has a simpler interface for operations like
file loading, plotting, selection, joining and GROUP
BY, which come in very handy in data-processing
applications.
3. Pandas DataFrames (with column names) make it
very easy to keep track of data.
4. Pandas is used when data is in tabular format,
whereas NumPy is used for numeric array based
data manipulation.
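The first difference can be checked with a small sketch. The student names come from this chapter, but the marks, the column names and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type: mixing numbers and text
# turns every element into a string
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)                 # a Unicode string type

# A DataFrame keeps a separate data type for each named column
students = pd.DataFrame({
    "name": ["Arnab", "Samridhi", "Ramit"],   # text column
    "marks": [82.5, 91.0, 77.0],              # illustrative float values
    "passed": [True, True, True],             # boolean column
})
print(students.dtypes)

# Column names make it easy to pick out the data we need
average_marks = students["marks"].mean()
print(average_marks)
```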
Installing Pandas
Installing Pandas is very similar to installing NumPy. To install Pandas from
command line, we need to type in:
pip install pandas
Note that both NumPy and Pandas can be installed only when Python is
already installed on that system. The same is true for other libraries of Python.
Data Structure in Pandas
A data structure is a collection of data values and
operations that can be applied to that data. It enables
efficient storage, retrieval and modification to the data.
For example, we have already worked with a data
structure ndarray in NumPy in Class XI. Recall the ease
with which we can store, access and update data using
a NumPy array. Two commonly used data structures in
Pandas that we will cover in this book are:
• Series
• DataFrame
Series
A Series is a one-dimensional array containing a
sequence of values of any data type (int, float, list,
string, etc.) which by default have numeric data labels
starting from zero. The data label associated with a
particular value is called its index. We can also assign
values of other data types as index. We can imagine a
Pandas Series as a column in a spreadsheet. Example
of a series containing names of students is given below:
Index Value
0 Arnab
1 Samridhi
2 Ramit
3 Divyam
4 Kritika
Creation of Series
There are different ways in which a series can be created
in Pandas. To create or use series, we first need to import
the Pandas library.
(A) Creation of Series from Scalar Values
A Series can be created using scalar values as shown in
the example below:
>>> import pandas as pd #import Pandas with alias pd
>>> series1 = pd.Series([10,20,30]) #create a Series
>>> print(series1) #Display the series
Output:
0 10
1 20
2 30
dtype: int64
Observe that output is shown in two columns - the
index is on the left and the data value is on the right. If
we do not explicitly specify an index for the data values
while creating a series, then by default indices range
from 0 through N – 1. Here N is the number of data
elements.
We can also assign user-defined labels to the index
and use them to access elements of a Series. The
following example has a numeric index in random order.
>>> series2 = pd.Series(["Kavi","Shyam","Ravi"], index=[3,5,1])
>>> print(series2) #Display the series
Output:
3 Kavi
5 Shyam
1 Ravi
dtype: object
Here, data values Kavi, Shyam and Ravi have index
values 3, 5 and 1, respectively. We can also use letters
or strings as indices, for example:
>>> series2 = pd.Series([2,3,4],index=["Feb","Mar","Apr"])
>>> print(series2) #Display the series
Output:
Feb 2
Mar 3
Apr 4
dtype: int64
Here, data values 2,3,4 have index values Feb, Mar
and Apr, respectively.
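User-defined labels can then be used to look up individual values. The sketch below rebuilds the two Series from the examples above under the illustrative names months and names, and uses .loc to make clear that the lookup is by label and not by position.

```python
import pandas as pd

months = pd.Series([2, 3, 4], index=["Feb", "Mar", "Apr"])
names = pd.Series(["Kavi", "Shyam", "Ravi"], index=[3, 5, 1])

# Look up a value by its string label
print(months.loc["Mar"])     # prints: 3

# Look up a value by its numeric label: 5 is a label here, not a position
print(names.loc[5])          # prints: Shyam
```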
(B) Creation of Series from NumPy Arrays
We can create a series from a one-dimensional (1D)
NumPy array, as shown below:
>>> import numpy as np # import NumPy with alias np
>>> import pandas as pd
>>> array1 = np.array([1,2,3,4])
>>> series3 = pd.Series(array1)
>>> print(series3)
Output:
0 1
1 2
2 3
3 4
dtype: int32
The following example shows that we can use letters
or strings as indices:
>>> series4 = pd.Series(array1, index = ["Jan", "Feb", "Mar", "Apr"])
>>> print(series4)
Jan 1
Feb 2
Mar 3
Apr 4
dtype: int32
When index labels are passed with the array, then
the length of the index and array must be of the same
size, else it will result in a ValueError. In the example
shown below, array1 contains 4 values whereas there
are only 3 indices, hence ValueError is displayed.
>>> series5 = pd.Series(array1, index = ["Jan", "Feb", "Mar"])
ValueError: Length of passed values is 4, index implies 3
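In a program, this error can be caught so that execution continues. The following is a minimal sketch; the variable name labels and the printed text are illustrative, and the exact wording of the error message differs between Pandas versions.

```python
import numpy as np
import pandas as pd

array1 = np.array([1, 2, 3, 4])
labels = ["Jan", "Feb", "Mar"]     # one label short of the four values

try:
    series5 = pd.Series(array1, index=labels)
except ValueError as error:
    # Reached because the number of labels does not match the number of values
    print("Could not create the Series:", error)
    series5 = None
```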
(C) Creation of Series from Dictionary
Recall that Python dictionary has key: value pairs and
a value can be quickly retrieved when its key is known.
Dictionary keys can be used to construct an index for a
Series, as shown in the following example. Here, keys of
the dictionary dict1 become indices in the series.
>>> dict1 = {'India': 'NewDelhi', 'UK': 'London', 'Japan': 'Tokyo'}
>>> print(dict1) #Display the dictionary
{'India': 'NewDelhi', 'UK': 'London', 'Japan': 'Tokyo'}
>>> series8 = pd.Series(dict1)
>>> print(series8) #Display the series
India NewDelhi
UK London
Japan Tokyo
dtype: object
DataFrame
Sometimes we need to work on multiple columns at
a time, i.e., we need to process the tabular data. For
example, the result of a class, items in a restaurant’s
menu, reservation chart of a train, etc. Pandas stores
such tabular data using a DataFrame. A DataFrame is
a two-dimensional labelled data structure like a table
of MySQL. It contains rows and columns, and therefore
has both a row and column index. Each column can
have a different type of value such as numeric, string,
boolean, etc., as in tables of a database.
Creation of DataFrame
There are a number of ways to create a DataFrame.
Some of them are listed in this section.
(A) Creation of an empty DataFrame
An empty DataFrame can be created as follows:
>>> import pandas as pd
>>> dFrameEmt = pd.DataFrame()
>>> dFrameEmt
Empty DataFrame
Columns: []
Index: []
(B) Creation of DataFrame from NumPy ndarrays
Consider the following three NumPy ndarrays. Let us
create a simple DataFrame without any column labels,
using a single ndarray:
>>> import numpy as np
>>> array1 = np.array([10,20,30])
>>> array2 = np.array([100,200,300])
>>> array3 = np.array([-10,-20,-30, -40])
>>> dFrame4 = pd.DataFrame(array1)
>>> dFrame4
0
0 10
1 20
2 30
We can create a DataFrame using more than one
ndarray, as shown in the following example. Each array becomes a row, and a
row shorter than the number of columns is filled with NaN:
>>> dFrame5 = pd.DataFrame([array1, array3, array2], columns=[ 'A', 'B', 'C', 'D'])
>>> dFrame5
     A    B    C     D
0   10   20   30   NaN
1  -10  -20  -30 -40.0
2  100  200  300   NaN
(C) Creation of DataFrame from List of Dictionaries
We can create DataFrame from a list of Dictionaries, for
example:
# Create list of dictionaries
>>> listDict = [{'a':10, 'b':20}, {'a':5, 'b':10, 'c':20}]
>>> dFrameListDict = pd.DataFrame(listDict)
>>> dFrameListDict
    a   b     c
0  10  20   NaN
1   5  10  20.0
Here, the dictionary keys are taken as column
labels, and the values corresponding to each key are
taken as rows. There will be as many rows as the
number of dictionaries present in the list. In the above
example there are two dictionaries in the list. So, the
DataFrame consists of two rows.
Summary
• NumPy, Pandas and Matplotlib are Python
libraries for scientific and analytical use.
• pip install pandas is the command to install
Pandas library.
• A data structure is a collection of data values
and the operations that can be applied to that
data. It enables efficient storage, retrieval and
modification to the data.
• Two main data structures in Pandas library
are Series and DataFrame. To use these
data structures, we first need to import the
Pandas library.
• A Series is a one-dimensional array containing a
sequence of values. Each value has a data label
associated with it also called its index.
• The two common ways of accessing the elements
of a series are Indexing and Slicing.
• There are two types of indexes: positional index
and labelled index. Positional index takes an
integer value that corresponds to its position in
the series starting from 0, whereas labelled index
takes any user-defined label as index.
• When positional indices are used for slicing, the
value at end index position is excluded, i.e., only
(end - start) number of data values of the series
are extracted. However with labelled indexes the
value at the end index label is also included in
the output.
• All basic mathematical operations can be
performed on Series either by using the
operator or by using appropriate methods of the
Series object.
• While performing mathematical operations index
matching is implemented and if no matching
indexes are found during alignment, Pandas
returns NaN so that the operation does not fail.
• A DataFrame is a two-dimensional labeled data
structure like a spreadsheet. It contains rows
and columns and therefore has both a row and
column index.
• When using a dictionary to create a DataFrame,
keys of the Dictionary become the column labels
of the DataFrame. A DataFrame can be thought of
as a dictionary of lists/ Series (all Series/columns
sharing the same index label for a row).
• Data can be loaded in a DataFrame from a file on
the disk by using Pandas read_csv function.
• Data in a DataFrame can be written to a text
file on disk by using the pandas.DataFrame.to_csv()
function.
• DataFrame.T gives the transpose of a DataFrame.
• Pandas has a number of methods that support
label based indexing but every label asked for
must be in the index, or a KeyError will be raised.
• DataFrame.loc[ ] is used for label based indexing
of rows in DataFrames.
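• The pandas.concat() function is used to combine two
DataFrames. The older pandas.DataFrame.append() method
did the same job but has been removed from Pandas 2.0 onwards.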
• Pandas supports non-unique index values. Only
if a particular operation that does not support
duplicate index values is attempted, an exception
is raised at that time.
• The basic difference between Pandas Series and
NumPy ndarray is that operations between Series
automatically align the data based on labels. Thus,
we can write computations without considering
whether all Series involved have the same labels or
not, whereas arithmetic on ndarrays of incompatible
shapes raises an error.
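Several points from this summary are brought together in the sketch below. All data values, labels and the file name marks.csv are assumptions chosen for illustration; .iloc and .loc are used to make the positional and label-based cases explicit.

```python
import os
import tempfile

import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# A positional slice excludes the end position; a label slice includes the end label
by_position = s.iloc[1:3]          # values 20, 30
by_label = s.loc["b":"d"]          # values 20, 30, 40

# Index alignment: a label present in only one Series gives NaN
t = pd.Series([1, 2], index=["a", "z"])
aligned = s + t                    # 'a' gives 11.0, every other label gives NaN

# DataFrame from a dictionary of lists: the keys become column labels
df = pd.DataFrame({"maths": [90, 75], "science": [85, 80]},
                  index=["Arnab", "Ramit"])
row = df.loc["Ramit"]              # label-based selection of a row
transposed = df.T                  # rows and columns swapped

# pandas.concat() places one DataFrame below the other
more = pd.DataFrame({"maths": [60], "science": [70]}, index=["Kritika"])
combined = pd.concat([df, more])

# Write the DataFrame to a CSV file and read it back
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "marks.csv")     # illustrative file name
    combined.to_csv(path)
    restored = pd.read_csv(path, index_col=0)

print(aligned)
print(restored)
```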
Matplotlib
Visualisation helps to effectively communicate
information to intended users. Traffic symbols,
ultrasound reports, an atlas of maps, the speedometer
of a vehicle and tuners of instruments are a few examples
of visualisation that we come across in our daily lives.
Visualisation of data is effectively used in fields like
health, finance, science, mathematics, engineering, etc.
In this chapter, we will learn how to visualise data using
Matplotlib library of Python by plotting charts such
as line, bar, scatter with respect to the various types
of data.
Plotting using Matplotlib
The Matplotlib library is used for creating static, animated,
and interactive 2D plots or figures in Python. It can
be installed using the following pip command from the
command prompt:
pip install matplotlib
For plotting using Matplotlib, we need to import its
Pyplot module using the following command:
import matplotlib.pyplot as plt
Here, plt is an alias or an alternative name for
matplotlib.pyplot. We can use any other alias also.
In Program 4-1, plot() is provided with two parameters,
which indicate the values for the x-axis and y-axis, respectively.
The x and y ticks are displayed accordingly. As shown
in Figure 4.2, the plot() function by default plots a line
chart. We can click on the save button on the output
window and save the plot as an image. A figure can also
be saved by using the savefig() function. The name of the
figure is passed to the function as a parameter.
For example: plt.savefig('x.png').
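The following sketch draws a line chart with plot() and saves it with savefig(). It is not the original program; the data values are made up, the image is written to a temporary folder, and the Agg backend is selected only so that the code also runs on a machine without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # assumption: draw without opening a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]                # values for the x-axis (illustrative)
y = [10, 9, 99, 71, 90]            # values for the y-axis (illustrative)

plt.plot(x, y)                     # plot() draws a line chart by default

# Save the figure as an image; the file name is passed as the parameter
folder = tempfile.mkdtemp()
image_path = os.path.join(folder, "x.png")
plt.savefig(image_path)
plt.close()

print("Figure saved to", image_path)
```

In an interactive session, the matplotlib.use("Agg") line can be left out and plt.show() called after plt.plot() to open the output window.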
In the previous example, we used the plot() function
to plot a line graph. There are different types of data
available for analysis.
List of Pyplot functions to plot different charts
plot(*args[, scalex, scaley, data])               Plot x versus y as lines and/or markers.
bar(x, height[, width, bottom, align, data])      Make a bar plot.
boxplot(x[, notch, sym, vert, whis, ...])         Make a box and whisker plot.
hist(x[, bins, range, density, weights, ...])     Plot a histogram.
pie(x[, explode, labels, colors, autopct, ...])   Plot a pie chart.
scatter(x, y[, s, c, marker, cmap, norm, ...])    A scatter plot of x versus y.
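Three of the functions listed above are tried out in the sketch below, each in its own figure. The subjects, marks, heights and point coordinates are made-up values, and the Agg backend is an assumption so that no display is required.

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw without opening a window
import matplotlib.pyplot as plt

# Bar plot: one bar per category
subjects = ["Maths", "Science", "English"]
marks = [90, 85, 70]
plt.figure()
bars = plt.bar(subjects, marks)

# Histogram: the values are counted into 4 bins of equal width
heights = [150, 152, 155, 155, 160, 161, 165, 170]
plt.figure()
counts, edges, patches = plt.hist(heights, bins=4)

# Scatter plot: one marker per (x, y) pair
plt.figure()
points = plt.scatter([1, 2, 3, 4], [2, 4, 1, 3])

plt.close("all")                   # release the three figures
```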
Summary
• A plot is a graphical representation of a data set
which is also interchangeably known as a graph or
chart. It is used to show the relationship between
two or more variables.
• In order to be able to use Python’s Data
Visualisation library, we need to import the
pyplot module from Matplotlib library using the
following statement: import matplotlib.pyplot as
plt, where plt is an alias or an alternative name
for matplotlib.pyplot. You can keep any alias of
your choice.
• The pyplot module houses functions to create a
figure (plot), create a plotting area in a figure, plot
lines, bars, histograms, etc., in a plotting area, decorate
the plot with labels, etc.
• The various components of a plot are: Title,
Legend, Ticks, x label, y label.
• plt.plot() is used to build a plot, where plt is
an alias.
• plt.show() is used to display the figure, where
plt is an alias.
• plt.xlabel() and plt.ylabel() are used to set the x
and y label of the plot.
• plt.title() can be used to display the title of a plot.
• It is possible to plot data directly from the
DataFrame.
• Pandas has a built-in .plot() function as part of
the DataFrame class.
• The general format of plotting a DataFrame
is df.plot(kind = ' ') where df is the name of the
DataFrame and kind can be line, bar, hist,
scatter, box depending upon the type of plot to be
displayed.
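The points about labels, titles and plotting directly from a DataFrame can be combined in one sketch. The DataFrame contents and the label text are illustrative, and the Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")              # assumption: draw without opening a window
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame(
    {"maths": [90, 75, 60], "science": [85, 80, 70]},   # illustrative marks
    index=["Arnab", "Ramit", "Kritika"],
)

# Plot the DataFrame as a bar chart; the call returns the Matplotlib Axes object
axes = df.plot(kind="bar")

# Decorate the plot with a title and axis labels
plt.title("Marks by student")
plt.xlabel("Student")
plt.ylabel("Marks")

# In an interactive session, plt.show() would display the figure here
plt.close("all")
```

Changing kind to "line" gives a line chart of the same data with no other change to the code.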