02_Appendix_2_Python_Packages (1)
Python Packages
NumPy is a Python library; the name is a shorthand form of Numerical Python. NumPy, along with
the Python packages SciPy and Matplotlib, aims to replace MATLAB, another popular
development environment, for implementing scientific and data science applications.
NumPy provides an array data structure and helps in numerical analysis. NumPy is used to
manipulate arrays. The manipulation includes mathematical and logical operations. It can be used
for a variety of tasks such as shape manipulation, Fourier analysis and linear algebra operations.
Python provides a list data structure that can store both homogeneous and heterogeneous items,
but it has some serious limitations. A core Python list is by default a one-dimensional
array; a multidimensional array needs to be implemented as a nested list. Also, the core Python
list does not provide elementwise operations. Moreover, a list stores references to objects that
may be scattered in memory rather than storing the elements contiguously, which makes it slow.
NumPy is better than the Python list as it reduces coding time, executes faster and uses less
memory, because a NumPy array stores its data in a contiguous manner. It is also very convenient
to use in data applications.
- Data type
- Item size
- Shape – dimensions
- Data
Data type:
The basic data types are integer, float and complex. Other data types are Boolean, string,
datetime and Python objects.
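The attributes listed above can be inspected directly on an array. The following is a minimal sketch; the array values and the choice of int32 are arbitrary illustration data, not taken from the text.

```python
import numpy as np

# A small 2 x 3 integer array; the values are arbitrary illustration data.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(a.dtype)     # data type of every element: int32
print(a.itemsize)  # size of one element in bytes: 4
print(a.shape)     # dimensions of the array: (2, 3)
print(a.data)      # memory buffer that holds the raw data
```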
B.1.1 Installation
Once NumPy is installed, one can check whether it is installed correctly by executing the
following commands in the Python command shell:
>>> import numpy as np
>>> print(np.__version__)
If no error is thrown, one can conclude that NumPy is installed correctly.
Also, it should be noted that NumPy must be installed before other Python packages such as
SciPy, Pandas and Matplotlib.
The elements of an array x, created for example with x = np.array([1, 2, 3, 4, 5]), can be
displayed by typing the name of the array.
>>> x
Immediately, the command would display the elements of the array x as:
array([1, 2, 3, 4, 5])
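The command,
>>> print(type(x))
displays the type of x, which is <class 'numpy.ndarray'>. The command,
>>> x.shape
returns the shape of the array as a tuple. Note that shape is an attribute, so it is written
without parentheses.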
There is another way of creating an array using the command arange. The following command
creates an array with elements in the range 1 to 10 with a step of 5, so only two
elements are produced:
>>> np.arange(1, 10, 5)
array([1, 6])
Another way to create a NumPy array is to use the command linspace. It creates a range from the
starting value to the ending value with exactly the specified number of elements. The following
command creates five evenly spaced values from 1 to 10:
>>> np.linspace(1, 10, 5)
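The difference between the two commands can be seen side by side. This is a small sketch using the same arguments as above; the variable names stepped and spaced are only illustrative.

```python
import numpy as np

# arange: start at 1, stop before 10, advance in steps of 5
stepped = np.arange(1, 10, 5)

# linspace: exactly 5 evenly spaced values, both end points included
spaced = np.linspace(1, 10, 5)

print(stepped)  # the values 1 and 6
print(spaced)   # the values 1.0, 3.25, 5.5, 7.75 and 10.0
```

With arange the third argument is the step size, whereas with linspace it is the number of elements.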
One can also read the values for a NumPy array by loading them from a file with the command
loadtxt. Its arguments specify that the data file is sample.txt, the data type is uint, the
delimiter is a comma, and skiprows indicates that the header needs to be skipped.
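A possible form of such a call is sketched below. The file content and the value skiprows=1 are assumptions made for illustration; the file is written to a temporary directory so that the sketch is self-contained.

```python
import os
import tempfile
import numpy as np

# Write a small comma-separated file with one header row (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("a,b,c\n1,2,3\n4,5,6\n")

# dtype, delimiter and skiprows correspond to the options described above;
# skiprows=1 assumes that the header occupies a single line.
data = np.loadtxt(path, dtype="uint", delimiter=",", skiprows=1)
print(data)
```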
These are some of the commands used to manipulate the created array
x = np.array([1, 2, 3, 4, 5, 6])
Slicing Operations
Indexing and slicing operations are used to access the elements of the array. A slice is
constructed by specifying start, stop and step parameters.
>>> a = np.arange(15)
>>> a[3:9:2]
The command a[3:9:2] slices the array from index 3 up to, but not including, index 9 with step 2.
The result is [3, 5, 7]. One can also specify only the start, as in a[3:]. In this case all the
elements from index 3 to the end, here 14, are returned. a[3:8] returns the elements at
indexes 3 to 7, since the stop index is excluded.
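The slices discussed above can be checked with a short script. This is only a sketch; the names stepped_slice, tail and middle are illustrative.

```python
import numpy as np

a = np.arange(15)          # the integers 0, 1, ..., 14

stepped_slice = a[3:9:2]   # start 3, stop before 9, step 2
tail = a[3:]               # from index 3 to the end
middle = a[3:8]            # indexes 3 to 7; the stop index 8 is excluded

print(stepped_slice)  # [3 5 7]
print(tail)
print(middle)         # [3 4 5 6 7]
```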
One can create arrays and apply the following commands to perform arithmetic and statistical
operations. Let us create an array x = [1,2,3,4,5,6] with the command x = np.array([1,2,3,4,5,6]).
Similarly, let us create another array y = [5,6,7,8,9,10]. Then the elementwise arithmetic
operations and some statistical operations can be done as follows:
>>> print(x+y)
>>> print(x-y)
>>> print(x*y)
>>> print(x/y)
>>> print(np.mean(x))
>>> print(np.median(x))
>>> print(np.max(x))
>>> print(np.min(x))
Or simply by applying the statistical methods directly to the array, as shown in Table A.2.
array([[1, 2, 3],
[2, 2, 2]])
The command,
>>> x.sum(axis=1)
produces the result array([6, 6]) by adding the elements row-wise. It can be observed that
1+2+3 = 6 and 2+2+2 = 6.
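The role of the axis argument can be illustrated on the same 2 x 3 array. The sketch below uses only standard NumPy reductions; the result names are illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [2, 2, 2]])

row_sums = x.sum(axis=1)   # one sum per row: [6 6]
col_sums = x.sum(axis=0)   # one sum per column: [3 4 5]
total = x.sum()            # sum over all elements: 12

print(row_sums, col_sums, total)
```

axis=1 collapses the columns and leaves one value per row, while axis=0 collapses the rows and leaves one value per column.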
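>>> x = array([4, 5, 6])
>>> print(x)
It can be observed that the prefix np. is missing. This works when the names are imported
directly from NumPy, for example with from numpy import array. Similarly, all vector operations
can be done:
>>> x = array([4, 5, 6])
>>> y = array([10, 10, 10])
>>> c = x * y
>>> print(c)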
Printing Arrays in 2D
>>> x = np.array([[1, 0, 0], [0, 1, 0]])
>>> print(x)
>>> y = np.array([[1, 1, 1], [1, 1, 1]])
>>> print(y)
>>>print(c)
>>> x = np.array([[2, 3], [6, 7]])
>>> y = np.array([[5, 6], [8, 7]])
>>> print(np.add(x, y))
The command,
>>> x.sum(axis=0)
adds the elements column-wise and returns one sum per column.
Ravel
The command ravel is used to flatten the data into a one-dimensional array and is very useful in
data science applications. The syntax is np.ravel(a, order). The order is 'C' by default, which
gives row-major ordering. The option 'F' can be used for column-major ordering, and 'A' uses
column-major ordering only if the array is stored that way in memory.
>>> a1 = np.arange(15).reshape(3, 5)
>>> a1
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> a1.ravel()
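The effect of the order option can be seen on a smaller array. This is a minimal sketch; the 2 x 3 array is chosen only to keep the output short.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # [[0 1 2], [3 4 5]]

row_major = a.ravel()             # default order='C': rows are read one after another
col_major = a.ravel(order='F')    # order='F': columns are read one after another

print(row_major)  # [0 1 2 3 4 5]
print(col_major)  # [0 3 1 4 2 5]
```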
One of the most useful packages for data visualization is Matplotlib. It makes use of the NumPy
package of Python. It helps to create charts for machine learning. Pylab is a procedural
interface for Matplotlib. John D. Hunter designed Matplotlib in 2003.
Let us create a simple plot for a sine wave. This can be accomplished as follows:
import numpy as np
import math
import matplotlib.pyplot as plt
This is how pyplot of Matplotlib is imported; plt is an alias created for matplotlib.pyplot. The
values from 0 to 2π are created with the function arange.
x = np.arange(0, math.pi*2, 0.1)
y = np.sin(x)
plt.plot(x, y)
plot is the basic command for plotting. For example, the function y = x^2 can be plotted as below:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 1, 10)
y = x**2
plt.plot(x, y)
Commands      Remarks
plt.xlabel    Creates a label for the x axis
plt.ylabel    Creates a label for the y axis
plt.title     Creates a title for the graph
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-np.pi,np.pi,100)
y=np.sin(x)
plt.plot(x,y)
plt.xlabel('angle')
plt.ylabel('sine function')
plt.title('sine function')
plt.show()
In a Jupyter notebook, the command %matplotlib inline can be used to create the graph in the
notebook itself. Many graphs can be created in a single figure. The command
plt.subplot(nrows, ncols, index)
can be used to create a grid with the specified rows and columns.
For example, the command plt.subplot(211) creates space for the 1st plot and plt.subplot(212)
creates space for the 2nd plot. The command
fig, a = plt.subplots(2, 2)
creates a figure together with a 2 x 2 matrix of axes. A grid can be displayed on an axes
object with
axes.grid(True)
Matplotlib automatically takes care of the spacing of points on the axes. This can be changed by
using the command set_xticks, which takes a list of positions and marks the points as per the
list. The labels for the tick marks can be set by using the command:
ax.set_xticklabels(['one', 'three', 'five', 'seven'])
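The subplot, grid and tick commands can be combined as in the sketch below. The tick positions 1, 3, 5, 7 and the plotted functions are assumptions chosen to match the four labels; the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 3, 5, 7])               # assumed tick positions

fig, (ax1, ax2) = plt.subplots(2, 1)     # two rows, one column of axes
ax1.plot(x, x ** 2)
ax1.grid(True)                           # show grid lines on the first plot

ax2.plot(x, np.sqrt(x))
ax2.set_xticks([1, 3, 5, 7])             # positions of the tick marks
ax2.set_xticklabels(['one', 'three', 'five', 'seven'])  # text shown at each tick

fig.canvas.draw()                        # render; use plt.show() interactively
tick_labels = [t.get_text() for t in ax2.get_xticklabels()]
```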
Commands        Syntax
ax.bar          ax.bar(x, height, width, bottom, align)
ax.hist         ax.hist(x, bins)
ax.pie          ax.pie(x, labels, colors, autopct)
ax.scatter      ax.scatter(x, y, color='r')
ax.boxplot      ax.boxplot(data)
ax.violinplot   ax.violinplot(data)
Seaborn has many built-in data sets. The following commands load the predefined dataset tips:
import seaborn as sb
df = sb.load_dataset('tips')
print(df.head())
Distplot is used to plot a univariate distribution. The following program is useful for plotting
the distribution of sepal-length.
plt.show()
Instead of kde=False, setting hist=False results in a kernel density estimate plot of the
attribute. The corresponding plots are shown in Figure B.5.
Jointplot
sb.jointplot(x='petal-length', y='petal-width', data=df)
Heatmaps are useful as they show different numbers in different colours, which makes the data
easier to visualize. The following code segment is used to display the heatmap for uniformly
distributed random data in the range 10 to 20.
>>> ax = sb.heatmap(uniform_dt)
Pairplots
Pairplot is used to plot multiple pairwise bivariate distributions. The resulting matrix of plots
has univariate plots on the diagonal and pairwise bivariate plots elsewhere.
seaborn.pairplot(df, hue, palette, kind, diag_kind)
Stripplot
Strip plots are useful for showing the distribution of data. Box plots are useful for
representing the five-point summary, and a violin plot combines a box plot with a KDE.
B.3.1 Pandas
The name Pandas is derived from "panel data"; the package was designed by Wes McKinney in 2008.
Pandas is used for data manipulation and analysis. Pandas can be used for:
Pandas provides data structures such as Series, DataFrame and Panel for processing
one-dimensional, two-dimensional and three-dimensional data, respectively. Higher dimensional
data structures are containers of lower dimensional data. Note that Panel has been removed from
recent versions of Pandas.
import pandas as pd
import numpy as np
data=np.array([1,2,3,4,5])
s=pd.Series(data)
print(s)
0 1
1 2
2 3
3 4
4 5
import pandas as pd
df = pd.read_csv("sample.csv")
print(df)
Similar functions exist for reading other formats:
read_excel()
read_json()
read_sql()
df = pd.read_csv("Sample.csv", index_col=['SNO'])
print(df)
df = pd.read_csv("Sample.csv", names=['mark1', 'mark2'])
print(df)
One can skip rows, say 3, using the following command
df = pd.read_csv("Sample.csv", skiprows=3)
print(df)
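The read_csv options above can be tried without a file on disk by reading from an in-memory string. In the sketch below the CSV content, including the column names SNO, mark1 and mark2 and the marks themselves, is hypothetical.

```python
import io
import pandas as pd

# Hypothetical content standing in for Sample.csv.
csv_text = "SNO,mark1,mark2\n1,40,45\n2,50,55\n3,60,65\n"

# Use the SNO column as the row index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='SNO')

# Skip the header line and the first data row, and supply the column names.
df_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=2,
                         names=['SNO', 'mark1', 'mark2'])

print(df_indexed)
print(df_skipped)
```

Note that skiprows counts lines from the top of the file, so the header line is skipped as well unless it is handled separately.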
An element of the created dataframe in Pandas is accessed through the commands iloc
and loc. The rows of the dataframe can be accessed by integer position using the iloc method.
For example, df.iloc[0] returns the first row of the dataframe.
loc is similar to iloc, but loc selects rows and columns by their labels. The details are given
in the following Table B.6.
Command           Results
df.loc[label]     Pass the row label to .loc to select that row
df.iloc[row]      Pass the row number to .iloc to select by integer location
df[2:3]           Select multiple rows using the : operator
df.append(row)    Append a row to the existing data frame df (removed in Pandas 2.0; pd.concat can be used instead)
df.drop(label)    Drop the row with the label
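The row operations of Table B.6 can be sketched as follows. The dataframe, its row labels r1 to r4 and the marks are hypothetical; pd.concat is used for appending because df.append is not available in Pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical dataframe with string row labels.
df = pd.DataFrame({'marks': [37, 40, 42]}, index=['r1', 'r2', 'r3'])

by_label = df.loc['r2']        # select a row by its label
by_position = df.iloc[1]       # select the same row by integer position
two_rows = df[1:3]             # slice rows by position with the : operator

# Append a row (pd.concat replaces the older df.append).
extra = pd.DataFrame({'marks': [57]}, index=['r4'])
df_appended = pd.concat([df, extra])

# Drop the row with the label 'r1'.
df_dropped = df_appended.drop('r1')
print(df_dropped)
```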
head(n) returns the first n rows. This can be used as follows to print the first 5 elements of a series
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10))
print(s.head(5))
s = pd.Series(data, index=[100, 110, 120, 130, 140])
data = {'a': 100, 'b': 200, 'c': 300}
s = pd.Series(data)
Then the output would be as follows, with the dictionary keys used as the index.
a 100
b 200
c 300
s = pd.Series(10, index=['a', 'b', 'c'])
a 10
b 10
c 10
Assuming a series
s = pd.Series([1,2,3,4,5], index=[100,200,300,400,500])
Command              Result
print(s.iloc[0])     Returns the first element of the series
print(s.iloc[:2])    Retrieves the first two elements of the series
print(s.iloc[-2:])   Retrieves the last two elements of the series
print(s[100])        Retrieves the element whose index label is 100
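Because this series has an integer index, it helps to separate access by position from access by label. The sketch below uses iloc for positions and plain brackets or loc for labels; the result names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=[100, 200, 300, 400, 500])

first = s.iloc[0]            # by position: the first element
first_two = s.iloc[:2]       # by position: the first two elements
last_two = s.iloc[-2:]       # by position: the last two elements
at_label = s[100]            # by label: the element whose index is 100
label_range = s.loc[200:400] # by label: both end labels are included

print(first, at_label)
print(label_range)
```

With an integer index, s[0] would be looked up as a label rather than a position, which is why iloc is used for positional access here.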
A data frame is a 2D structure that is useful for data analysis. For example, the following Table B.8
can be created as follows:
Reg.no Marks
100 37
101 40
102 42
103 57
104 60
data = [[100, 37], [101, 40], [102, 42], [103, 57], [104, 60]]
df = pd.DataFrame(data, columns=['reg-no', 'marks'], dtype=float)
print(df)
import pandas as pd
data = [(100, 37), (101, 40), (102, 42), (103, 57), (104, 60)]
df = pd.DataFrame(data)
print(df)
The following table can be constructed as a set of Pandas series as shown in Table B.9.
import pandas as pd
'marks2': pd.Series([40,42,43,58,67],index=[100,101,102,103,104])}
df = pd.DataFrame.from_dict(dict)
print(df['marks1'])
df['total']= df['marks1']+df['marks2']
del df['marks1']
Covariance
import pandas as pd
import numpy as np
s1= pd.Series([1,2,3,4,5])
s2= pd.Series([4,3,6,7,8])
print (s1.cov(s2))
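The value returned by cov can be reproduced by hand. The sketch below computes the sample covariance, that is, the sum of the products of the deviations from the means divided by n - 1, for the same two series; the name manual_cov is illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([4, 3, 6, 7, 8])

# Sample covariance: sum of products of deviations, divided by (n - 1).
deviations = (s1 - s1.mean()) * (s2 - s2.mean())
manual_cov = deviations.sum() / (len(s1) - 1)

print(manual_cov)    # 3.0
print(s1.cov(s2))    # same value from Pandas
```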
Aggregation
Pandas Visualization
Let us assume the marks in three subjects for ten students and create the data using random
numbers. Then the bar plot can be created as below:
df = pd.DataFrame(np.random.rand(10, 3), columns=['mark1', 'mark2', 'mark3'])
df.plot.bar()
The following commands can be used to create plots as shown in Table B.10.
Command                                     Remarks
df.plot.bar(stacked=True)                   Creates a stacked bar chart
df.plot.hist(bins=5)                        Creates a histogram with five bins
df.plot.box()                               Creates a box plot to visualize the distributions
df.plot.area()                              Creates an area plot
df.plot.scatter(x='column1', y='column2')   Creates a scatter plot
df.plot.pie(subplots=True)                  Creates a pie chart
SciPy is another related Python package that is helpful in many linear algebra applications.
For example, it can be used to find the determinant and the inverse of a matrix.
>>> import numpy as np
>>> from scipy import linalg
>>> myarray = np.array([[1, 2], [3, 4]])
>>> linalg.det(myarray)
>>> linalg.inv(myarray)
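The results for this matrix can be verified by hand: the determinant is 1*4 - 2*3 = -2, and multiplying the inverse by the original matrix should give the identity matrix. A small sketch of that check:

```python
import numpy as np
from scipy import linalg

myarray = np.array([[1, 2], [3, 4]])

det = linalg.det(myarray)   # 1*4 - 2*3 = -2 (up to rounding)
inv = linalg.inv(myarray)   # [[-2, 1], [1.5, -0.5]]

# The product of a matrix and its inverse is the identity matrix.
identity_check = inv @ myarray

print(det)
print(inv)
print(identity_check)
```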
The following code illustrates the method of computing the chi-square test using SciPy.
from scipy.stats import chisquare
x = chisquare([40, 10, 20, 30], f_exp=[30, 20, 30, 20])
print(x)
Scikit-learn, or sklearn, is a popular Python package for implementing machine learning
algorithms. Scikit-learn was started by David Cournapeau as a Google Summer of Code project.
Scikit-learn can be installed using the pip command as
or
Scikit-learn is built on NumPy, SciPy, Matplotlib and Pandas. Scikit-learn can implement
supervised learning algorithms and unsupervised learning algorithms such as clustering
algorithms. This entire lab manual is implemented using the Scikit-learn package. Scikit-learn
comes with some ready-made datasets like iris and digits. The dataset can be loaded as follows:
from sklearn.datasets import load_iris
iris = load_iris()
The dataset can be split into training and testing datasets as follows:
joblib.load('sample_model.joblib')
Mean removal
The command MinMaxScaler scales the given input data to a fixed range, by default 0 to 1. The
command is given as follows:
data_scaler = MinMaxScaler()  # imported from sklearn.preprocessing
data_scaled = data_scaler.fit_transform(input_data)
The command normalize rescales each sample of the input data to unit norm. The L1 and L2
normalization can be implemented using the following commands.
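What these scaling operations compute can be sketched with plain NumPy. This is not the scikit-learn implementation, only an illustration of the arithmetic under the usual conventions (min-max scaling per column, L1 and L2 normalization per row); the input values are hypothetical.

```python
import numpy as np

# Hypothetical input: 3 samples (rows) with 3 features (columns).
input_data = np.array([[2.0, -1.0, 4.0],
                       [0.0, 3.0, 1.0],
                       [1.0, 1.0, -2.0]])

# Min-max scaling: map every column to the range 0 to 1.
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)
scaled = (input_data - col_min) / (col_max - col_min)

# L1 normalization: the absolute values in each row sum to 1.
l1 = input_data / np.abs(input_data).sum(axis=1, keepdims=True)

# L2 normalization: the squared values in each row sum to 1.
l2 = input_data / np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))

print(scaled)
print(l1)
print(l2)
```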
Model fitting can be done using the .fit command. For example, the following commands are used
to create a KNN classifier,
The constructed model can be stored with the help of the joblib package,
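from sklearn import svm
X = [[0, 0], [1, 1]]  # two training samples, assumed here because the original line is missing
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)  # returns the fitted estimator, displayed as SVC()
print(clf.predict([[2., 2.]]))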
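The steps described in this section (loading a dataset, splitting it, fitting a KNN classifier and storing the model with joblib) can be put together as in the sketch below. The values test_size=0.3, random_state=0 and n_neighbors=3 are arbitrary choices for illustration, and the model is saved in a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Split into training and testing sets (30% held out; an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Create and fit a KNN classifier (3 neighbours; an arbitrary choice).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(accuracy)

# Store the fitted model and load it back.
model_path = os.path.join(tempfile.mkdtemp(), 'sample_model.joblib')
joblib.dump(knn, model_path)
restored = joblib.load(model_path)
```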
Keras is an API that addresses the lack of support in scikit-learn for creating neural networks
and deep learning networks. Keras helps to create deep neural networks and is built on
frameworks like TensorFlow, CNTK and Theano. Some of the advantages of Keras are as follows:
The input can be given using NumPy or Pandas. The following code shows the way of reading data
using NumPy:
ds = loadtxt('Iris.csv', delimiter=',')
The data read can be split into input and target as follows:
X = ds[:, 0:4]
y = ds[:, 4]
model= Sequential ([
])
Or it can be done by adding layers one by one to form a neural network structure. Keras core API has
a concept of layers. This can be done using a set of
model.add commands
A layer can be visualized as a set of nodes. Some of the layers that are provided by Keras are:
Dense Layer
A dense layer is a fully connected layer: every neuron is connected to all of its inputs. A dense
layer takes the number of neurons or nodes and an activation function as arguments. The
activation function can be sigmoid, tanh or ReLU. The following command creates a dense layer
with 12 neurons and the activation function ReLU
model.add(Dense(12, input_dim=4, activation='relu'))
Let us assume that input data has 4 features and three classes. Then a Keras model can be created as
follows:
One hidden layer of 10 neurons with the activation function tanh for 4 features can be created
as follows:
model.add(Dense(10, activation='tanh', input_dim=4))
In general, the syntax is:
model.add(Dense(units, input_dim=input_shape, activation='activation function'))
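The computation performed by such a dense layer can be sketched in plain NumPy: each output is the activation function applied to a weighted sum of the inputs plus a bias. This is only an illustration of the arithmetic, not Keras code; the random inputs and weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(5, 4))    # 5 samples with 4 features (random illustration data)
W = rng.normal(size=(4, 10))   # weights: 4 inputs connected to 10 neurons
b = np.zeros(10)               # one bias per neuron

# Dense layer with tanh activation: output = tanh(x W + b)
hidden = np.tanh(x @ W + b)

print(hidden.shape)   # (5, 10): 10 outputs for each of the 5 samples
```

In Keras the weights W and b are created by the layer and adjusted during training; here they are fixed random values.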
Keras provides many activation functions such as linear, sigmoid, tanh and ReLU. A sample
Keras tanh activation can be added as follows: model.add(Activation('tanh')). In the case of
convolutional neural networks (CNN), the model.add command can create additional layers like
below:
Convolutional Layer
Convolutional layer has many filters or kernels. Kernels can be of any dimensions. Kernels are
convolved with input image to produce many features.
Pooling Layers
Pooling is another layer and can be either max pooling or average pooling.
Recurrent Layers
Recurrent layers are used to process sequential data like time series data or natural language
constructs.
Model Compile
Model compile requires the specification of additional parameters like the loss function, the
optimizer and metrics. A commonly used optimizer is Adam. Adam is based on stochastic gradient
descent and adapts the learning rate automatically, which often gives good results without
manual tuning. The Keras command can be like
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The loss function may be either binary_crossentropy for two-class problems or
categorical_crossentropy for multi-class problems. Many metrics can be used; accuracy
is one metric that is commonly used.
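The quantity measured by the binary cross-entropy loss can be illustrated with NumPy. The function below is a sketch of the standard formula, not the Keras implementation; the clipping constant eps and the example predictions are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the predictions so that log(0) is never evaluated.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

y_true = [1, 0, 1, 0]
good = binary_crossentropy(y_true, [0.9, 0.1, 0.8, 0.2])   # close to the targets
poor = binary_crossentropy(y_true, [0.4, 0.6, 0.3, 0.7])   # far from the targets
chance = binary_crossentropy(y_true, [0.5, 0.5, 0.5, 0.5])  # equals log(2)

print(good, poor, chance)
```

Predictions that are closer to the true labels give a smaller loss, which is what the optimizer tries to achieve during training.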
Model Fitting
Once the model is created, then the model can be executed. The execution can be done as follows:
model.fit(X,y,epochs=50,batch_size=10)
Once the model is compiled, the data is fit to it. The parameters of this command are as
follows:
The Keras evaluate command can be used to evaluate the model using test data, and the predict
command can be used to predict the outputs for given inputs. This can be done as follows:
_, accuracy = model.evaluate(X, y)
predictions = (model.predict(X) > 0.5).astype(int)  # for a sigmoid output; predict_classes was removed from recent Keras versions
model.summary()
result = model.fit(X, y, epochs=50, batch_size=10, verbose=1, validation_split=0.2, shuffle=False)
plt.plot(result.history['loss'])
plt.plot(result.history['accuracy'])
model = Sequential()
model.add(Dense(12, input_dim=8,activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile model