Python Unit 3
NumPy:
NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. NumPy is an open-source Python library
that facilitates efficient numerical operations on large quantities of data. The main data structure in
this library is the powerful NumPy array, ndarray, which can have any number of dimensions. The
NumPy library contains many useful features for performing mathematical and logical operations
on these special arrays. NumPy is a part of a set of Python libraries that are used for scientific
computing due to its efficient data analysis capabilities.
Features:
1) OPEN SOURCE
2) EASY TO USE
3) PROVIDES HUGE NUMERICAL COMPUTING TOOLS
4) INTEROPERABLE
5) PERFORMANT
6) POWERFUL N-DIMENSIONAL ARRAYS
Installing NumPy:
The only prerequisite for installing NumPy is Python itself. If you don’t have Python yet and
want the simplest way to get started, we recommend you use the Anaconda Distribution.
You can download Anaconda from: https://www.anaconda.com/products/individual
CONDA
If you use conda, you can install NumPy from the defaults or conda-forge channels:
# Best practice, use an environment rather than install in the base env
conda create -n my-env
conda activate my-env
# If you want to install from conda-forge
conda config --env --add channels conda-forge
# The actual install command
conda install numpy
PIP
If you use pip, you can install NumPy with:
pip install numpy
To access NumPy and its functions import it in your Python code like this:
import numpy as np
We shorten the imported name to np for better readability of code using NumPy.
NumPy gives you an enormous range of fast and efficient ways of creating arrays and
manipulating numerical data inside them.
While a Python list can contain different data types within a single list, all of the elements in
a NumPy array should be homogeneous.
The mathematical operations that are meant to be performed on arrays would be extremely
inefficient if the arrays weren’t homogeneous.
NumPy arrays are faster and more compact than Python lists. An array consumes less
memory and is convenient to use.
What is an array?
An array is a grid of values, all of the same type, indexed by a tuple of non-negative integers. One
way to initialize a NumPy array is from a nested Python list:
>>> a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> print(a[0])
[1 2 3 4]
NumPy provides many functions for creating arrays. These functions can be split into roughly three
categories, based on the dimension of the array they create:
1) 1D arrays
2) 2D arrays
3) nD arrays
a) numpy.linspace:
numpy.linspace will create arrays with a specified number of elements, and spaced equally between
the specified beginning and end values. For example:
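>>> np.linspace(1., 4., 6)
array([1. , 1.6, 2.2, 2.8, 3.4, 4. ])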
The advantage of this creation function is that you guarantee the number of elements and the
starting and end point.
b) numpy.arange:
numpy.arange creates arrays with regularly incrementing values. Check the documentation for
complete information and examples. A few examples are shown:
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.arange(2, 10, dtype=float)  # dtype is used for specifying the data type
array([2., 3., 4., 5., 6., 7., 8., 9.])
>>> np.arange(2, 3, 0.1)
array([ 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9])
a) numpy.eye:
np.eye(n, m) defines a 2-D array with ones on the diagonal and zeros elsewhere; np.eye(n) gives the
n x n identity matrix. The elements where i=j (row index and column index are equal) are 1 and the
rest are 0, as such:
>>> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
>>> np.eye(3, 5)
array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])
b) numpy.diag:
numpy.diag can define either a square 2D array with given values along the diagonal, or, if given a
2D array, it returns a 1D array containing only the diagonal elements. For example:
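>>> np.diag([1, 2, 3])
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])
>>> np.diag(np.eye(3))
array([1., 1., 1.])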
a) numpy.zeros will create an array filled with 0 values with the specified shape. The default
dtype is float64:
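>>> np.zeros((2, 3))
array([[0., 0., 0.],
       [0., 0., 0.]])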
Indexing:
Array indexing refers to any use of the square brackets ([]) to index array values. There are many
options to indexing, which give NumPy indexing great power.
Single element indexing for a 1-D array is what one expects: it works exactly like indexing for other
standard Python sequences. It is 0-based, and accepts negative indices for indexing from the end of
the array.
>>> x = np.arange(10)
>>> x[2]
2
>>> x[-2]
8
Unlike lists and tuples, NumPy arrays support multidimensional indexing for multidimensional
arrays. That means that it is not necessary to separate each dimension’s index into its own set of
square brackets.
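For example:
>>> x = np.arange(10).reshape(2, 5)
>>> x[1, 3]
8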
Slicing:
Slicing and striding work exactly the same way as they do for lists and tuples, except that they can
be applied to multiple dimensions as well. A few examples illustrate this best:
>>> x = np.arange(10)
>>> x[2:5]
array([2, 3, 4])
>>> x[:-7]
array([0, 1, 2])
As mentioned, one can select a subset of an array to assign to using a single index, slices, and index
and mask arrays. The value being assigned to the indexed array must be shape consistent (the same
shape or broadcastable to the shape the index produces). For example, it is permitted to assign a
constant to a slice:
>>> x = np.arange(10)
>>> x[2:7] = 1
or an array of the right size:
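>>> x[2:7] = np.arange(5)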
Array mathematics
When standard mathematical operations are used with arrays, they are applied on an element-
by-element basis. This means that the arrays should be the same size during addition,
subtraction, etc.:
>>> a = np.array([1, 2, 3], float)
>>> b = np.array([5, 2, 6], float)
>>> a + b
array([6., 4., 9.])
>>> a - b
array([-4.,  0., -3.])
>>> a * b
array([ 5.,  4., 18.])
For two-dimensional arrays, multiplication remains element wise and does not correspond
to matrix multiplication. There are special functions for matrix math that we will cover later.
>>> a = np.array([[1, 2], [3, 4]], float)
>>> b = np.array([[2, 0], [1, 3]], float)
>>> a * b
array([[ 2.,  0.],
       [ 3., 12.]])
1) flatten():
flatten() can be used for converting a 2-D array into a 1-D array.
Ex:
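import numpy as np
a = np.array([[1, 2], [3, 4]])
a.flatten()   # array([1, 2, 3, 4])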
2) transpose():
transpose() can be used for transforming rows into columns and columns into rows.
Ex:
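a = np.array([[1, 2], [3, 4]])
a.transpose()   # array([[1, 3],
                #        [2, 4]])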
3) concatenate():
concatenate() can be used for adding (combining) two arrays into one array.
Ex:
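a = np.array([1, 2])
b = np.array([3, 4])
np.concatenate((a, b))   # array([1, 2, 3, 4])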
4) reshape():
reshape() can be used for changing the number of rows and columns of an array.
Ex:
import numpy as np
a = np.arange(6)        # array([0, 1, 2, 3, 4, 5])
b = a.reshape((2, 3))   # array([[0, 1, 2],
                        #        [3, 4, 5]])
Generally arange() creates a 1-D array, but after applying reshape((2, 3)) the result is a 2-D array.
Note that reshape() returns the reshaped array; the original array is left unchanged.
Pandas:
Pandas is an easy package to install. Open up your terminal program (for Mac users) or
command line (for PC users) and install it using either of the following commands:
pip install pandas
or
conda install pandas
Then import it in your code:
import pandas as pd
The two primary components of pandas are the Series and the DataFrame.
A Series is essentially a column, and a DataFrame is a multi-dimensional table made up
of a collection of Series.
DataFrames and Series are quite similar in that many operations that you can do with
one you can do with the other, such as filling in null values and calculating the mean.
The dataframe index is just the row count, 0 and 1. If you want to use the student name as the
index, use set_index to do that.
Normally Pandas dataframe operations create a new dataframe. But we can use inplace=True in
some operations to update the existing dataframe without having to make a new one.
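For example, a minimal sketch (assuming the dataframe has a 'name' column holding the student
names):
df.set_index('name', inplace=True)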
A new column can be added just by assigning to dataframe['new column name']; this inserts the
new column into the existing dataframe.
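For example (hypothetical values, assuming a three-row dataframe):
df['grade'] = [80, 90, 75]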
Rows can be filtered using a boolean condition.
Ex: df[df['birthdate']=='13-02-1991']
Here we select one column. This is not called a dataframe, but a series. It’s basically a dataframe of
one column.
Ex: grade=df['grade']
Rows of one dataframe can be appended to another.
Ex: df3=df.append(df2)
(Note: in recent versions of pandas, append() has been removed; df3 = pd.concat([df, df2]) is the
equivalent.)
Rows can be selected by position using iloc.
EX: df3.iloc[0:2]
CSV files contain plain text and are a well-known format that can be read by everyone, including
Pandas. With CSV files all you need is a single line to load in the data:
EX: df = pd.read_csv('purchases.csv')
df
CSVs don't have indexes like our DataFrames, so all we need to do is just designate
the index_col when reading:
df = pd.read_csv('purchases.csv', index_col=0)
df
OUT:
        apples  oranges
June         3        0
Robert       2        3
Lily         0        7
David        1        2
JSON is plain text, but has the format of an object, and is well known in the world of programming,
including Pandas. If you have a JSON file — which is essentially a stored Python dict — pandas
can read this just as easily:
EX: df = pd.read_json('purchases.json')
df
OUT:
        apples  oranges
David        1        2
June         3        0
Lily         0        7
If you’re working with data from a SQL database you need to first establish a connection using
an appropriate Python library, then pass a query to pandas.
import sqlite3
con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)
df
One of the most used methods for getting a quick overview of the DataFrame is the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.
Example
Get a quick overview by printing the first 10 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
Example
Print the last 5 rows of the DataFrame:
print(df.tail())
Matplotlib:
One of the greatest benefits of visualization is that it allows us visual access to huge amounts of
data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram
etc.
Installation:
Windows, Linux and macOS distributions have matplotlib and most of its dependencies as wheel
packages. Run the following command to install the matplotlib package:
python -m pip install -U matplotlib
Importing matplotlib :
from matplotlib import pyplot as plt
or
import matplotlib.pyplot as plt
Matplotlib comes with a wide variety of plots. Plots help to understand trends and patterns, and to
make correlations. They're typically instruments for reasoning about quantitative information.
Some of the sample plots are covered here.
Line plot:
# importing matplotlib module
from matplotlib import pyplot as plt
# x-axis values
x = [5, 2, 9, 4, 7]
# y-axis values
y = [10, 5, 8, 4, 2]
# function to plot the line
plt.plot(x, y)
# function to show the plot
plt.show()
Bar plot:
# importing matplotlib module
from matplotlib import pyplot as plt
# x-axis values
x = [5, 2, 9, 4, 7]
# y-axis values
y = [10, 5, 8, 4, 2]
# function to plot the bar
plt.bar(x, y)
# function to show the plot
plt.show()
Histogram:
# importing matplotlib module
from matplotlib import pyplot as plt
# y-axis values
y = [10, 5, 8, 4, 2]
# function to plot the histogram
plt.hist(y)
# function to show the plot
plt.show()
Scatter plot:
# importing matplotlib module
from matplotlib import pyplot as plt
# x-axis values
x = [5, 2, 9, 4, 7]
# y-axis values
y = [10, 5, 8, 4, 2]
# function to plot the scatter
plt.scatter(x, y)
# function to show the plot
plt.show()
Box Plot
A box plot, also known as a whisker plot, is created to display the summary of a set of data values,
with properties like minimum, first quartile, median, third quartile and maximum. In the box plot,
a box is created from the first quartile to the third quartile; a vertical line through the box marks
the median. Here the x-axis denotes the data to be plotted while the y-axis shows the frequency
distribution.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)
# 200 values drawn from a normal distribution with mean 100 and standard deviation 20
data = np.random.normal(100, 20, 200)
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
Exploratory Data Analysis (EDA):
EX:
# Reading the data
df = pd.read_csv("data.csv")
# To display the top 5 rows
df.head(5)
Ex:
# Checking the data type
df.dtypes
EX:
# Dropping the duplicates
df = df.drop_duplicates()
7) Detecting Outliers
An outlier is a point or set of points that differ from the other points. Sometimes they can be very
high or very low. It's often a good idea to detect and remove the outliers, because outliers are one
of the primary reasons for a less accurate model.
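A common approach is the IQR rule. Below is a minimal sketch, assuming a hypothetical numeric
column 'price':
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1
# keep only the rows whose 'price' lies within 1.5*IQR of the quartiles
df = df[(df['price'] >= q1 - 1.5 * iqr) & (df['price'] <= q3 + 1.5 * iqr)]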
8) Plot different features against one another (scatter), against frequency (histogram)
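A minimal sketch of this step, assuming hypothetical numeric columns 'age' and 'income':
import matplotlib.pyplot as plt
# scatter: plot one feature against another
df.plot.scatter(x='age', y='income')
# histogram: plot one feature against frequency
df['age'].plot.hist()
plt.show()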
Hence, the above are some of the steps involved in exploratory data analysis; these are general
steps that you must follow in order to perform EDA.
Data Science Life Cycle:
There are various steps involved in the data science life cycle.
1) Enterprise Understanding:
The complete cycle revolves around the enterprise goal. You need to understand if the customer
desires to minimize savings loss, or if they prefer to predict the rate of a commodity, etc. This is a
very important and basic action to be performed.
2) Data Understanding:
After enterprise understanding, the subsequent step is data understanding. This step includes
describing the data, their structure, their relevance, and their record types. Explore the information
using graphical plots; basically, extract any information that you can get by simply exploring the
data.
3) Preparation of Data:
Next comes the data preparation stage. This consists of steps like choosing the applicable data,
integrating the data by merging the data sets, cleaning them, treating missing values by either
eliminating or imputing them, treating inaccurate data by eliminating it, and also testing for
outliers using box plots and handling them.
5) Data Modeling:
Data modeling is the heart of data analysis. A model takes the prepared data as input and gives
the desired output. This step includes selecting the suitable kind of model, whether the problem is
a classification problem, a regression problem or a clustering problem.
6) Model Evaluation:
Here the model is evaluated to check if it is ready to be deployed. The model is tested on unseen
data and evaluated on a carefully thought out set of assessment metrics. The model assessment
helps us select and construct an ideal model. If the model is not evaluated properly, it will fail in
the real world.
7) Model Deployment:
After a rigorous assessment, the model is finally deployed in the desired structure and channel.
This is the last step in the data science life cycle.
Each step in the data science life cycle defined above must be performed carefully. If any step is
performed improperly, it will affect the next step, and the complete effort goes to waste.
Central Tendency
Central tendency is a very basic but very useful statistical measure that represents a central point
or typical value of a dataset. It can be found in the following ways:
1. Mean – also known as the average
2. Median – the centermost value of the given dataset
3. Mode – The value which appears most frequently in the given dataset
Depending on what exactly you’re trying to describe, you will use a different measure of central
tendency. Mean and median can only be used for numerical data. The mode can be used with
numerical and nominal data both.
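For example, using Python's built-in statistics module:
import statistics
data = [1, 2, 2, 3, 4]
statistics.mean(data)     # 2.4
statistics.median(data)   # 2
statistics.mode(data)     # 2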
Statistical Dispersion
Dispersion in statistics is a way of describing how spread out a set of data is. When a data set has a
large dispersion value, the values in the set are widely scattered; when it is small, the items in the
set are tightly clustered. Dispersion can be measured in the following ways:
1. Range – Range gives us the understanding of how spread out the given data is
2. Variance – It gives us an understanding of how far the measurements are from the
mean.
3. Standard deviation – Square root of the variance is standard deviation, also the
measurement of how far the data deviate from the mean
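For example, using NumPy:
import numpy as np
data = [1, 2, 2, 3, 4]
np.ptp(data)   # range (max - min): 3
np.var(data)   # variance: 1.04
np.std(data)   # standard deviation: about 1.02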
The Bell Curve – It is a graph of a normal distribution of a variable; it is called a bell curve
because of its shape.
1. Skewness – It is the measure of the asymmetry of a distribution of a variable about its
mean
2. Kurtosis – It is the measure of the “tailedness” of a distribution of a variable. It gives us
the understanding of how closely the data is spread out.
Descriptive statistics is extremely useful in examining the given data. We can get a complete
understanding of the data with the use of descriptive statistics.
Charts, on the other hand, are a representation of datasets with the intent of making the user
understand the information in a better manner. Graphs are a good example of charts used for data
visualization.
There are various types of graphs and charts used in data visualization.
1) Bar Chart/Graph:
A bar chart is a graph represented by spaced rectangular bars that describe the data points in a set
of data. It is usually used to plot discrete and categorical data.
Grouped bar charts are used when the datasets have subgroups that need to be visualized on the
graph. Each subgroup is usually differentiated from the other by shading them with distinct colors.
The stacked bar graphs are also used to show subgroups in a dataset. But in this case, the
rectangular bars defining each group are stacked on top of each other.
Disadvantages:
It does not reveal key assumptions like causes, effects, patterns, etc.
May require further explanation.
2) Pie Chart:
A pie chart is a circular graph used to illustrate numerical proportions in a dataset. This graph is
usually divided into various sectors, where each sector represents the proportion of a particular
numerical element in the set.
The simple pie chart is the most basic type of pie chart and can also be simply called a pie chart.
In an exploded pie chart, one of the sectors of the circle is separated (or exploded) from the chart. It
is used to lay emphasis on a particular element in the data set.
3) Line Graph:
Line graphs are represented by a group of data points joined together by a straight line. Each of
these data points describes the relationship between the horizontal and the vertical axis on the
graph.
When constructing a line chart, you may decide to include the data points or not.
In a simple line graph, only one line is plotted on the graph. One of the axes defines the
independent variables while the other axis contains dependent variables.
Multiple line graphs contain two or more lines representing more than one variable in a dataset.
This type of graph can be used to study two or more variables over the same period of time.
4) Histogram Chart:
Histogram chart visualizes the frequency of discrete and continuous data in a dataset using joined
rectangular bars. Each rectangular bar defines the number of elements that fall into a predefined
class interval.
5) Scatter Plot
Scatter plots are charts used to visualize random variables with dot-like markers that represent
each data point. These markers are usually scattered across the chart area of the plot.
Scatter plots are grouped into different types according to the correlation of the data points. These
correlation types are highlighted below.
Positive Correlation
Two groups of data visualized on a scatter plot are said to be positively correlated if an increase in
one implies an increase in the other. A scatter plot diagram can be said to have a high or low
positive correlation.
Negative Correlation
Two groups of data visualized on a scatter plot are said to be negatively correlated if an increase in
one implies a decrease in the other. A scatter plot diagram can be said to have a high or low
negative correlation.
No Correlation
Two groups of data visualized on a scatter plot are said to have no correlation if there is no clear
correlation between them.
6) Box and Whisker Chart:
A box and whisker chart is a statistical graph for displaying sets of numerical data through their
quartiles. It displays a frequency distribution of the data.
The box and whisker chart helps you to display the spread and skewness for a given set of data
using the five number summary principle: minimum, maximum, median, lower and upper
quartiles. The ‘five-number summary’ principle allows providing a statistical summary for a
particular set of numbers. It shows you the range (minimum and maximum numbers), the spread
(upper and lower quartiles), and the center (median) for the set of data numbers.
When to use a box and whisker chart:
1. When you want to observe the upper and lower quartiles, mean, median, deviations, etc. for a
large set of data.
2. When you want to see a quick view of the dataset distribution.
7) Dot Plot:
Dot plot or dot graph is just one of the many types of graphs and charts to organize statistical
data. It uses dots to represent data. A Dot Plot is used for relatively small sets of data and the values
fall into a number of discrete categories.
If a value appears more than one time, the dots are ordered one above the other. That way the
column height of dots shows the frequency for that value.
8) Heat map:
A heat map is a two-dimensional representation of data in which values are represented by colors.
A simple heat map provides an immediate visual summary of information. More elaborate heat
maps allow the viewer to understand complex data sets.
For example, in an election heat map, the red states are Republican and the blue states are
Democrat.
Summary statistics are used to summarize a set of observations, in order to communicate the
largest amount of information as simply as possible. Statisticians commonly try to describe the
observations in a measure of location (central tendency), a measure of statistical dispersion
(spread), a measure of the shape of the distribution (such as skewness or kurtosis), and, if more
than one variable is measured, a measure of statistical dependence (such as a correlation
coefficient).
A common collection of order statistics used as summary statistics are the five-number summary,
sometimes extended to a seven-number summary.
The five-number summary is a set of descriptive statistics that provides information about a
dataset. It consists of the five most important sample percentiles:
1. The sample minimum
2. The lower quartile (first quartile)
3. The median (middle value)
4. The upper quartile (third quartile)
5. The sample maximum
A heatmap contains values represented by various shades of the same colour. Usually the darker
shades of the chart represent higher values than the lighter shades. For a very different value, a
completely different colour can also be used.
The below example is a two-dimensional plot of values which are mapped to the indices and
columns of the chart.
import matplotlib.pyplot as plt
from pandas import DataFrame

# each inner list is one row of values to plot
data = [[2, 3, 4, 1], [6, 3, 5, 2], [6, 3, 5, 4], [3, 7, 5, 4], [2, 8, 1, 5]]
Index = ['I1', 'I2', 'I3', 'I4', 'I5']
Cols = ['C1', 'C2', 'C3', 'C4']
df = DataFrame(data, index=Index, columns=Cols)
# draw the heatmap with a colour per cell value
plt.pcolor(df)
plt.show()
Philosophy of EDA
EDA is done for some of the same reasons it’s done with smaller datasets, but there are additional
reasons to do it with data that has been generated from logs. There are important reasons anyone
working with data should do EDA.
EDA is done to gain intuition about the data; to make comparisons between distributions; for
sanity checking, to find out where data is missing or if there are outliers; and to summarize the
data. In the context of data generated from logs, EDA also helps with debugging the logging
process.
EDA helps you make sure the product is performing as intended. Although there's lots of
visualization involved in EDA, we distinguish between EDA and data visualization in that EDA is
done toward the beginning of analysis, and data visualization is done toward the end to
communicate one's findings. With EDA, the graphics are solely done for you to understand what's
going on.
With EDA, we can also use the understanding we get to inform and improve the development of
algorithms.
Plotting data and making comparisons can get you extremely far, and is far better to do than getting
a dataset and immediately running a regression just because you know how.
Data Visualization:
1) Scatter plot
Scatterplots use a collection of points placed using Cartesian Coordinates to display values from two
variables. By displaying a variable in each axis, you can detect if a relationship or correlation
between the two variables exists.
Various types of correlation can be interpreted through the patterns displayed on Scatterplots.
These are: positive (values increase together), negative (one value decreases as the other
increases), null (no correlation), linear, exponential and U-shaped. The strength of the correlation
can be determined by how closely packed the points are to each other on the graph. Points that
end up far outside the general cluster of points are known as outliers.
2) Bar Chart:
A bar chart displays categorical data with rectangular bars whose length or height corresponds to
the value of each data point.
Bar charts can be visualized using vertical or horizontal bars. Bar charts are best used to compare a
single category of data or several. When comparing more than one category of data, the bars can
be grouped together to create a grouped bar chart.
Bar charts use volume to demonstrate differences between each bar. Because of this, bar charts
should always start at zero. When bar charts do not start at zero, it risks users misjudging the
difference between data values.
3) Histogram
A histogram is a chart that displays numeric data in ranges, where each bar represents how
frequently numbers fall into a particular range.
Like a bar chart, histograms consist of a series of vertical bars along the x-axis. Histograms are most
commonly used to depict what a set of data looks like in aggregate. At a quick glance, histograms
tell whether a dataset has values that are clustered around a small number of ranges or are more
spread out.
4) Box and Whisker Plot:
A box and whisker plot is a graph that presents the distribution of a category of data.
Typically, box and whisker plots break the data into four or five points. Four point, or quartile
boxplots, present the “box” as defined by the first and third quartile. The median value is also
depicted in the box and the “whiskers” represent the minimum and maximum values in the data.
In a five point, or quintile boxplot, the points divide the data into five equal groups (quintiles).
Presenting data in this way is useful for indicating whether a distribution is skewed and whether
there are potential outliers in the data.
Box and whisker plots are also useful for comparing two or more datasets and for representing a
large number of observations.
Box and whisker plots can be displayed horizontally or vertically and displayed side-by-side for
comparisons.
5) Heat Map:
You can quickly grasp the state and impact of a large number of variables at one time by
displaying your data with a heat map visualization. A heat map visualization is a combination of
nested, colored rectangles, each representing an attribute element. Heat Maps are often used
in the financial services industry to review the status of a portfolio.
The rectangles contain a wide variety and many shadings of colors, which emphasize the weight
of the various components. In a heat map visualization:
1. The size of each rectangle represents its relative weight. The legend provides information about
the minimum and maximum values.
2. The color of each rectangle represents its relative value. The legend provides the range of values
for each color.
3. Data is grouped based on the order of the attributes in the Grouping area of the Editor panel.