Python Fundamentals For Machine Learning Version1
My First Program
In [4]:
x = 12.3456789
print('The value of x is %3.2f' %x)
print('The value of x is %3.4f' %x)
In [6]:
table = {'Raju': 9480123526, 'Ravi': 9480123527, 'Rahul': 9480123528}
for name, phone in table.items():
    print('{0:10} ==> {1:10d}'.format(name, phone))
In [7]:
import math
print('The value of PI is approximately %5.3f.' % math.pi)
In [10]:
print("Hello")
Hello
In [11]:
var = -1
if var < 0:
    print(var)
    print("the value of var is negative")

# If there is only a single clause then it may go on the same line as the
# header statement
if (var == -1): print("the value of var is negative")
-1
the value of var is negative
the value of var is negative
In [12]:
var = 1
if var < 0:
    print("the value of var is negative")
    print(var)
else:
    print("the value of var is positive")
    print(var)
In [13]:
score = 95
if score >= 99:
    print('A')
elif score >= 75:
    print('B')
elif score >= 60:
    print('C')
elif score >= 35:
    print('D')
else:
    print('F')
3.0 Iterations
# Usage of For Loop
In [14]:
# First Example
print("First Example")
for item in [1, 2, 3, 4, 5]:
    print('item :', item)
# Second Example
print("Second Example")
letters = ['A', 'B', 'C']
for index in range(len(letters)):
    print('First loop letter :', letters[index])
First Example
item : 1
item : 2
item : 3
item : 4
item : 5
Second Example
First loop letter : A
First loop letter : B
First loop letter : C
# While loop: the while statement repeatedly executes a block of code as long as the condition is true.
In [15]:
#Example code for while loop statement
count = 0
while (count < 3):
    print('The count is:', count)
    count = count + 1
4.1 LISTS
Python’s lists are the most flexible data type. A list can be created by writing comma-separated values between square brackets.
Note that the items in a list need not be of the same data type.
In [16]:
# Example code for accessing lists
# Create lists
list_1 = ['Statistics', 'Programming', 2016, 2017, 2018]
list_2 = ['a', 'b', 1, 2, 3, 4, 5, 6, 7 ]
# Accessing values in lists
print("list_1[0]: ", list_1[0])
print("list2_[1:5]: ", list_2[1:5])
list_1[0]: Statistics
list2_[1:5]: ['b', 1, 2, 3]
In [17]:
#Example code for adding new values to lists
print("list_1 values: ", list_1)
# Adding new value to list
list_1.append(2019)
print("list_1 values post append: ", list_1)
In [18]:
#Example code for updating existing values of lists
print("Values of list_1: ", list_1)
# Updating existing value of list
print("Index 2 value : ", list_1[2])
list_1[2] = 2015
print("Index 2's new value : ", list_1[2])
In [19]:
#Example code for deleting a list element
print("list_1 values: ", list_1)
# Deleting list element
del list_1[5]
print("After deleting value at index 5 : ", list_1)
print("Iteration :")
for x in [1,2,3]: print(x)
# If you dont specify the end explicitly, all elements from the specified
#start index will be printed
print("slicing range: ", list_1[1:])
list_1.extend(list_2)
Length: 5
Concatenation: [1, 2, 3, 4, 5, 6]
Repetition : ['Hello', 'Hello', 'Hello', 'Hello']
Membership : True
Iteration :
1
2
3
slicing : 2017
slicing range: ['Programming', 2015, 2017, 2018]
Max of list: 5
Min of list: 1
Count number of 1 in list: 2
Extended : ['Statistics', 'Programming', 2015, 2017, 2018, 'a', 'b', 1, 2, 3, 4, 5, 6, 7]
Index for Programming: 1
['Statistics', 'Programming', 2015, 2017, 2018, 'a', 'b', 1, 2, 3, 4, 5, 6, 7]
pop last item in list: 7
pop the item with index 2: 2015
removed b from list: ['Statistics', 'Programming', 2017, 2018, 'a', 1, 2, 3, 4, 5, 6]
Reverse: [6, 5, 4, 3, 2, 1, 'a', 2018, 2017, 'Programming', 'Statistics']
Sort ascending: ['a', 'b', 'c']
Sort descending: ['c', 'b', 'a']
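The cell that produced the outputs above is only partially shown; a minimal sketch that reproduces the remaining operations (assuming list_1 and list_2 as created earlier; the small literal lists are assumptions) looks like this:
print("Length: ", len([1, 2, 3, 4, 5]))
print("Concatenation: ", [1, 2, 3] + [4, 5, 6])
print("Repetition : ", ['Hello'] * 4)
print("Membership : ", 3 in [1, 2, 3])
print("slicing : ", list_1[3])                       # 2017
print("Max of list: ", max([1, 2, 3, 4, 5]))
print("Min of list: ", min([1, 2, 3, 4, 5]))
print("Count number of 1 in list: ", [1, 1, 2].count(1))
print("Extended : ", list_1)                         # after list_1.extend(list_2)
print("Index for Programming: ", list_1.index('Programming'))
print("pop last item in list: ", list_1.pop())       # pop() removes the last item
print("pop the item with index 2: ", list_1.pop(2))
list_1.remove('b')                                   # remove by value
print("removed b from list: ", list_1)
list_1.reverse()
print("Reverse: ", list_1)
my_list = ['b', 'c', 'a']
my_list.sort()
print("Sort ascending: ", my_list)
my_list.sort(reverse=True)
print("Sort descending: ", my_list)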
4.2 Tuples
A Python tuple is a sequences or series of immutable Python objects very much similar to the lists. However there exist some
essential differences between lists and tuples, which are the following.
In [21]:
# Example code for creating tuple
# Creating a tuple
Tuple = ()
print("Empty Tuple: ", Tuple)
Tuple = (1,)
print("Tuple with single item: ", Tuple)
Tuple = ('a','b','c','d',1,2,3)
print("Sample Tuple :", Tuple)
Empty Tuple: ()
Tuple with single item: (1,)
Sample Tuple : ('a', 'b', 'c', 'd', 1, 2, 3)
In [23]:
#Example code for deleting tuple
# Deleting tuple
print("Sample Tuple: ", Tuple)
del Tuple
print(Tuple) # Will throw an error message as the tuple does not exist
In [24]:
Length of Tuple: 7
Concatenated Tuple: ('a', 'b', 'c', 'd', 1, 2, 3, 7, 8, 9)
Repetition: (1, 'a', 2, 'b', 1, 'a', 2, 'b', 1, 'a', 2, 'b')
Membership check: True
1
2
3
Negative sign will retrieve item from right: 8
Sliced Tuple [2:] ('c', 'd', 1, 2, 3, 7, 8, 9)
Max of the Tuple (1,2,3,4,5,6,7,8,9,10): 10
Min of the Tuple (1,2,3,4,5,6,7,8,9,10): 1
List [1,2,3,4] converted to tuple: <class 'tuple'>
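The code for these basic tuple operations is not shown; a sketch that matches the outputs above (the operand tuples are read off the outputs themselves):
Tuple = ('a', 'b', 'c', 'd', 1, 2, 3)
print("Length of Tuple: ", len(Tuple))
print("Concatenated Tuple: ", Tuple + (7, 8, 9))
print("Repetition: ", (1, 'a', 2, 'b') * 3)
print("Membership check: ", 3 in Tuple)
for item in (1, 2, 3):        # iteration over a tuple
    print(item)
print("Negative sign will retrieve item from right: ", (7, 8, 9)[-2])
print("Sliced Tuple [2:] ", ('a', 'b', 'c', 'd', 1, 2, 3, 7, 8, 9)[2:])
print("Max of the Tuple (1,2,3,4,5,6,7,8,9,10): ", max((1, 2, 3, 4, 5, 6, 7, 8, 9, 10)))
print("Min of the Tuple (1,2,3,4,5,6,7,8,9,10): ", min((1, 2, 3, 4, 5, 6, 7, 8, 9, 10)))
print("List [1,2,3,4] converted to tuple: ", type(tuple([1, 2, 3, 4])))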
4.3 Dictionary
A Python dictionary holds a key-value pair for each of its items. The items are enclosed in curly braces: each key is separated from its value by a colon (:), and the items are separated from one another by commas (,). Note that keys are unique within a dictionary and must be of an immutable data type such as a string, number, or tuple, whereas values may be duplicated and can be of any type.
In [25]:
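The creation code for this cell is not shown; presumably it builds the dictionary accessed below (note that naming it dict shadows the built-in type):
# Create a sample dictionary (assumed from the accesses that follow)
dict = {'Name': 'Jivin', 'Age': 6, 'Class': 'First'}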
In [26]:
# Example code for accessing dictionary
# Accessing items in dictionary
print("Value of key Name, from sample dictionary:", dict['Name'])
In [29]:
#Example code for basic operations on dictionary
# Basic operations
dict = {'Name': 'Jivin', 'Age': 6, 'Class': 'First'}
print("Length of dict: ", len(dict))
# Concatenate dicts
dict1 = {'Name': 'Jivin', 'Age': 6}
dict2 = {'Sex': 'male' }
dict1.update(dict2)
print("dict1.update(dict2) = ", dict1)
Length of dict: 3
Copy:
{'Name': 'Jivin', 'Age': 6, 'Class': 'First'}
Value for Age: 6
Value for Age: 6
dict items: dict_items([('Name', 'Jivin'), ('Age', 6), ('Class', 'First')])
dict keys: dict_keys(['Name', 'Age', 'Class'])
Value of dict: dict_values(['Jivin', 6, 'First'])
dict1.update(dict2) = {'Name': 'Jivin', 'Age': 6, 'Sex': 'male'}
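The calls that produced the Copy/Value/items/keys/values outputs are not shown; a sketch that reproduces them on the dict defined above:
dict_copy = dict.copy()          # shallow copy of the dictionary
print("Copy: ")
print(dict_copy)
print("Value for Age: ", dict.get('Age'))   # get() returns None if key is absent
print("Value for Age: ", dict['Age'])       # [] raises KeyError if key is absent
print("dict items: ", dict.items())
print("dict keys: ", dict.keys())
print("Value of dict: ", dict.values())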
Syntax for creating functions without arguments:
def function_name():
    1st block line
    2nd block line
    ...
In [30]:
# Example code for creating functions without argument
# Simple function
def someFunction():
    print("Hello World")

# Call the function
someFunction()
Hello World
Syntax for creating functions with arguments:
def function_name(parameters):
    1st block line
    2nd block line
    ...
    return [expression]
In [31]:
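The cell's code is missing; a minimal sketch of a function with arguments and a return value (the names are assumptions):
def sum_two(a, b):
    """Return the sum of two numbers."""
    total = a + b
    return total

print(sum_two(2, 3))   # prints 5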
Scope of Variables
The scope of a variable or identifier determines its availability within the program during and after execution. There are two fundamental variable scopes in Python:
1. Global variables
2. Local variables
In [32]:
20
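The code for this cell is missing; a sketch consistent with the printed value 20 (the variable names are assumptions):
var = 10              # global variable
def my_function():
    var = 20          # local variable shadows the global one
    print(var)
my_function()         # prints the local value: 20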
In [33]:
#Variable Length Arguments
# Example code for passing argumens as *args
# Simple function to loop through arguments and print them
def sample_function(*args):
    for a in args:
        print(a)
In [34]:
def sample_function(**args):
    for a in args:
        print(a, args[a])
# Call the function
sample_function(name='John', age=27)
name John
age 27
In [35]:
#Lambda Function
def add(x, y):
    return x + y
print("FUNCTION ADD:\n",add(3,2))
add = lambda x, y : x + y
print("LAMBDA ADD :\n",add(3,2))
FUNCTION ADD:
5
LAMBDA ADD :
5
Data analysis packages: these are the packages that provide the mathematical and scientific functionality essential for data preprocessing and transformation. Core machine learning packages: these are the packages that provide the machine learning algorithms and functionality that can be applied to a given dataset to extract patterns.
6.1.1: NumPy
6.1.2: SciPy
6.1.3: Matplotlib
6.1.4: Pandas
6.1.1 : NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.
In [36]:
<class 'numpy.ndarray'>
(3,)
0
1
2
[5 1 2]
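The array-creation code is not shown; a sketch that reproduces the outputs above:
import numpy as np
a = np.array([0, 1, 2])   # create a rank 1 array
print(type(a))            # <class 'numpy.ndarray'>
print(a.shape)            # (3,)
print(a[0])
print(a[1])
print(a[2])
a[0] = 5                  # change an element of the array
print(a)                  # [5 1 2]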
In [37]:
(2, 3)
[[0 1 2]
[3 4 5]]
0 1 3
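Similarly for the 2-D case; a sketch matching the outputs above:
b = np.array([[0, 1, 2], [3, 4, 5]])   # create a rank 2 array
print(b.shape)                          # (2, 3)
print(b)
print(b[0, 0], b[0, 1], b[1, 0])        # 0 1 3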
In [38]:
# Create a 3x3 array of all zeros
a = np.zeros((3,3))
print(a)
[[ 0. 0. 0.]
[ 0. 0. 0.]
[ 0. 0. 0.]]
In [39]:
# Create a 2x2 array of all ones
b = np.ones((2,2))
print(b)
[[ 1. 1.]
[ 1. 1.]]
In [40]:
# Create a 3x3 constant array
c = np.full((3,3), 7)
print(c)
[[7 7 7]
[7 7 7]
[7 7 7]]
In [41]:
# Create a 3x3 array filled with random values
d = np.random.random((3,3))
print(d)
In [42]:
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
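The code for this cell is missing; presumably it creates the 3x3 identity matrix shown above:
# Create a 3x3 identity matrix
e = np.eye(3)
print(e)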
In [43]:
[2 3 1 0]
In [44]:
# arange() will create arrays with regularly incrementing values
g = np.arange(2,10)
print(g)
[2 3 4 5 6 7 8 9]
In [45]:
# note mix of tuple and lists
h = np.array([[0,1,2.0],[0,0,0],(1+1j,3.,2.)])
print(h)
In [46]:
# create an array of range with float data type
i = np.arange(1, 8, dtype=np.float64)
print(i)
[ 1. 2. 3. 4. 5. 6. 7.]
In [47]:
# linspace() will create arrays with a specified number of items which are
# spaced equally between the specified beginning and end values
j = np.linspace(2., 4., 5)
print(j)
[ 2. 2.5 3. 3.5 4. ]
In [48]:
# indices() will create a set of arrays stacked as a one-higher
# dimensioned array, one per dimension with each representing variation
# in that dimension
k = np.indices((3,3))
print(k)
[[[0 0 0]
[1 1 1]
[2 2 2]]
[[0 1 2]
[0 1 2]
[0 1 2]]]
In [49]:
#NumPy datatypes
# Let numpy choose the datatype
x = np.array([0, 1])
y = np.array([2.0, 3.0])
# Force a particular datatype
z = np.array([5, 6], dtype=np.int64)
print(x.dtype, y.dtype, z.dtype)
In [50]:
# Basic slicing : The basic slice syntax is i: j: k,
# where i is the starting index,j is the stopping index,
# and k is the step and k is not equal to 0.
x = np.array([5, 6, 7, 8, 9])
print(x[1:7:2])
print(x[-2:5])
print(x[-1:1:-1])
[6 8]
[8 9]
[9 8 7]
In [51]:
#Boolean array indexing
a=np.array([[1,2], [3, 4], [5, 6]])
# Find the elements of a that are bigger than 2
print (a > 2)
# to get the actual value
print (a[a > 2])
[[False False]
[ True True]
[ True True]]
[3 4 5 6]
In [52]:
import numpy as np
x=np.array([[1,2],[3,4],[5,6]])
y=np.array([[7,8],[9,10],[11,12]])
# Elementwise sum; both produce the array
print(x+y)
print(np.add(x, y))
# Elementwise difference; both produce the array
print(x-y)
print(np.subtract(x, y))
[[ 8 10]
[12 14]
[16 18]]
[[ 8 10]
[12 14]
[16 18]]
[[-6 -6]
[-6 -6]
[-6 -6]]
[[-6 -6]
[-6 -6]
[-6 -6]]
In [53]:
# Elementwise product; both produce the array
print(x*y)
print(np.multiply(x, y))
[[ 7 16]
[27 40]
[55 72]]
[[ 7 16]
[27 40]
[55 72]]
In [54]:
print(x/y)
print(np.divide(x, y))
[[ 0.14285714 0.25 ]
[ 0.33333333 0.4 ]
[ 0.45454545 0.5 ]]
[[ 0.14285714 0.25 ]
[ 0.33333333 0.4 ]
[ 0.45454545 0.5 ]]
In [55]:
print(np.sqrt(x))
[[ 1. 1.41421356]
[ 1.73205081 2. ]
[ 2.23606798 2.44948974]]
In [56]:
x=np.array([[1,2],[3,4]])
y=np.array([[5,6],[7,8]])
a=np.array([9,10])
b=np.array([11, 12])
# Inner product of vectors; both produce 219
print(a.dot(b))
print(np.dot(a, b))
219
219
In [57]:
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(a))
print(np.dot(x, a))
[29 67]
[29 67]
In [58]:
# Matrix / matrix product; both produce the rank 2 array
print(x.dot(y))
print(np.dot(x, y))
[[19 22]
[43 50]]
[[19 22]
[43 50]]
In [59]:
# Sum function
x=np.array([[1,2],[3,4]])
# Compute sum of all elements
print (np.sum(x))
# Compute sum of each column
print (np.sum(x, axis=0))
# Compute sum of each row
print (np.sum(x, axis=1))
10
[4 6]
[3 7]
In [60]:
#Transpose function
x=np.array([[1,2], [3,4]])
print(x)
print(x.T)
[[1 2]
[3 4]]
[[1 3]
[2 4]]
In [61]:
# Note that taking the transpose of a rank 1 array does nothing:
v=np.array([1,2,3])
print(v)
print(v.T)
[1 2 3]
[1 2 3]
Broadcasting: broadcasting enables arithmetic operations to be performed between arrays of different shapes.
In [62]:
# Broadcasting using NumPy
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
v = np.array([1, 0, 1])
# Add v to each row of a using broadcasting
b = a + v
print(b)
[[ 2 2 4]
[ 5 5 7]
[ 8 8 10]]
In [63]:
X.T:
[[1 3]
[2 4]
[1 1]]
W:
[1 1]
Final :
[[2 3 2]
[4 5 2]]
In [64]:
# Multiply a matrix by a constant:
# x has shape (2, 3). Numpy treats scalars as arrays of shape ();
# these can be broadcast together to shape (2, 3)
print(x * 2)
[[2 4 2]
[6 8 2]]
6.1.2 : PANDAS
In [65]:
import pandas as pd
Pandas is an open source Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. Pandas is well suited for tabular data with heterogeneously typed columns, as in an
SQL table or Excel spreadsheet.
Data Structures:
Pandas introduces two new data structures to Python, both of which are built on top of NumPy (which means they are fast):
1. Series
2. DataFrame
1. Series: This is a one-dimensional object similar to a column in a spreadsheet or SQL table. By default each item is assigned an index label from 0 to N-1.
In [66]:
#Creating a pandas series
# creating a series by passing a list of values, and a custom index label.
#Note that the labeled index reference for each row and it can have
#duplicate values
s = pd.Series([1,2,3,np.nan,5,6], index=['A','B','C','D','E','F'])
print(s)
A 1.0
B 2.0
C 3.0
D NaN
E 5.0
F 6.0
dtype: float64
2. DataFrame: This is a two-dimensional object similar to a spreadsheet or an SQL table. It is the most commonly used pandas object.
In [67]:
#Creating a pandas dataframe
data = {'Gender': ['F', 'M', 'M'],'Emp_ID': ['E01', 'E02',
'E03'], 'Age': [25, 27, 25]}
# We want to order the columns, so let's specify them in the columns parameter
df = pd.DataFrame(data, columns=['Emp_ID','Gender', 'Age'])
df
Out[67]:
  Emp_ID Gender  Age
0    E01      F   25
1    E02      M   27
2    E03      M   25
# writing
# index = False parameter will not write the index values, default is True
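The read/write cells themselves are missing; a minimal sketch that matches the labeled outputs below (all file names are assumptions):
# CSV
df = pd.read_csv('Data/Height.csv')
print("READ CSV:\n", df)
df.to_csv('Data/Height_out.csv', index=False)
print("WRITE CSV:\n", pd.read_csv('Data/Height_out.csv'))

# plain text
df_txt = pd.read_table('Data/housing.txt')
print("READ TXT:\n", df_txt)

# Excel; sheet_name selects an individual sheet (called sheetname in older pandas)
df_xl = pd.read_excel('Data/Data.xlsx')
print("READ EXCEL:\n", df_xl)
df_xl.to_excel('Data/Data_out.xlsx', index=False)
print("EXCEL SHEET1:\n", pd.read_excel('Data/Data.xlsx', sheet_name='Sheet1'))
print("EXCEL SHEET2:\n", pd.read_excel('Data/Data.xlsx', sheet_name='Sheet2'))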
READ CSV:
Gender Height (in)
0 Male 72
1 Male 72
2 Female 63
3 Female 62
4 Female 62
5 Male 73
6 Female 64
7 Female 63
8 Female 67
9 Male 71
10 Male 72
11 Female 63
12 Male 71
13 Female 67
14 Female 62
15 Female 63
16 Male 66
17 Female 60
18 Female 68
19 Female 65
20 Female 64
WRITE CSV:
Gender Height (in)
0 Male 72
1 Male 72
2 Female 63
3 Female 62
4 Female 62
5 Male 73
6 Female 64
7 Female 63
8 Female 67
9 Male 71
10 Male 72
11 Female 63
12 Male 71
13 Female 67
14 Female 62
15 Female 63
16 Male 66
17 Female 60
18 Female 68
19 Female 65
20 Female 64
READ TXT:
Empty DataFrame
Columns: [ 0.00632 18.00 2.310 0 0.5380 ]
Index: []
WRITE TXT:
Empty DataFrame
Columns: [ 0.00632 18.00 2.310 0 0.5380 ]
Index: []
READ EXCEL:
ATS SSM HBP FH ADM O
0 Yes Abnorm High Yes Yes 0.94
1 Yes Abnorm High Yes No 0.93
2 Yes Abnorm High No Yes 0.92
3 Yes Abnorm High No No 0.91
4 Yes Abnorm Norm Yes No 0.84
5 Yes Abnorm Norm Yes No 0.82
6 Yes Abnorm Norm No Yes 0.80
7 Yes Abnorm Norm No No 0.78
8 Yes Norm High Yes Yes 0.91
9 Yes Norm High Yes No 0.91
10 Yes Norm High No Yes 0.90
11 Yes Norm High No No 0.89
12 Yes Norm Norm Yes Yes 0.80
13 Yes Norm Norm Yes No 0.78
14 Yes Norm Norm No Yes 0.76
15 No Abnorm High Yes Yes 0.79
16 No Abnorm High Yes No 0.77
WRITE EXCEL:
ATS SSM HBP FH ADM O
0 Yes Abnorm High Yes Yes 0.94
1 Yes Abnorm High Yes No 0.93
2 Yes Abnorm High No Yes 0.92
3 Yes Abnorm High No No 0.91
4 Yes Abnorm Norm Yes No 0.84
5 Yes Abnorm Norm Yes No 0.82
6 Yes Abnorm Norm No Yes 0.80
7 Yes Abnorm Norm No No 0.78
8 Yes Norm High Yes Yes 0.91
9 Yes Norm High Yes No 0.91
10 Yes Norm High No Yes 0.90
11 Yes Norm High No No 0.89
12 Yes Norm Norm Yes Yes 0.80
13 Yes Norm Norm Yes No 0.78
14 Yes Norm Norm No Yes 0.76
15 No Abnorm High Yes Yes 0.79
16 No Abnorm High Yes No 0.77
EXCEL SHEET1:
ATS SSM HBP FH ADM O
0 Yes Abnorm High Yes Yes 0.94
1 Yes Abnorm High Yes No 0.93
2 Yes Abnorm High No Yes 0.92
3 Yes Abnorm High No No 0.91
4 Yes Abnorm Norm Yes No 0.84
5 Yes Abnorm Norm Yes No 0.82
6 Yes Abnorm Norm No Yes 0.80
7 Yes Abnorm Norm No No 0.78
8 Yes Norm High Yes Yes 0.91
9 Yes Norm High Yes No 0.91
10 Yes Norm High No Yes 0.90
11 Yes Norm High No No 0.89
12 Yes Norm Norm Yes Yes 0.80
13 Yes Norm Norm Yes No 0.78
14 Yes Norm Norm No Yes 0.76
15 No Abnorm High Yes Yes 0.79
16 No Abnorm High Yes No 0.77
EXCEL SHEET2:
ATS SSM HBP
0 Yes Abnorm High
1 Yes Abnorm High
2 Yes Abnorm High
3 Yes Abnorm High
4 Yes Abnorm Norm
5 Yes Abnorm Norm
6 Yes Abnorm Norm
7 Yes Abnorm Norm
8 Yes Norm High
9 Yes Norm High
10 Yes Norm High
11 Yes Norm High
12 Yes Norm Norm
13 Yes Norm Norm
14 Yes Norm Norm
15 No Abnorm High
16 No Abnorm High
(767, 9)
6 148 72 35 0 33.6 0.627 50 1
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0
Shape:
(150, 4)
Head:
Sepal_Length Sepal_Width Petal_Length Petal_Width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
Tail:
Sepal_Length Sepal_Width Petal_Length Petal_Width
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
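The cell that loaded this data is not shown; presumably it reads the iris dataset from scikit-learn (the column names match those used in the next cell):
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data,
                  columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'])
print("Shape:\n", df.shape)
print("Head:\n", df.head())
print("Tail:\n", df.tail())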
In [72]:
df = pd.DataFrame(iris.data)
df.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
df.describe()
Out[72]:
cov() - Covariance indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means they are inversely related. A drawback of covariance is that it does not tell you the degree to which the variables are positively or negatively related.
corr() - Correlation is the normalized form of covariance: it indicates both the direction and the degree of the relationship, with values between -1 and 1.
In [73]:
df.cov()
Out[73]:
In [74]:
df.corr()
Out[74]:
Grouping
Grouping involves one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
In [75]:
#Grouping operation
df = pd.DataFrame({'Name' : ['jack', 'jane', 'jack', 'jane', 'jack', 'jane',
'jack', 'jane'],'State' : ['SFO', 'SFO', 'NYK', 'CA', 'NYK', 'NYK','SFO', 'CA'],
'Grade':['A','A','B','A','C','B','C','A'],
'Age' : np.random.uniform(24, 50, size=8),
'Salary' : np.random.uniform(3000, 5000, size=8),})
# Note that the columns are ordered automatically in their alphabetic order
print(df)
# for custom order please use below code
# df = pd.DataFrame(data, columns = ['Name', 'State', 'Age','Salary'])
# Find max age and salary by Name / State
# with groupby, we can use all aggregate functions such as min, max, mean,
#count, cumsum
df.groupby(['Name','State']).max()
Out[75]:
Name State
7.0 Matplotlib
In [76]:
import matplotlib.pyplot as plt
Matplotlib is a numerical mathematics extension of NumPy and a great package to view or present data in a pictorial or graphical
format. It enables analysts and decision makers to see analytics presented visually, so they can grasp difficult concepts or identify
new patterns. There are two broad ways of using pyplot: the stateful (MATLAB-style) interface, where pyplot keeps track of the current figure, and the object-oriented interface, where figure and axes objects are handled explicitly.
In [77]:
# Creating plot on variables
# simple bar and scatter plot
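The plotting code and the resulting figures are not shown; a sketch of the kinds of plots the labels below refer to (the data is an assumption):
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)
y = x ** 2

plt.bar(x, y)                       # simple bar plot
plt.show()
plt.scatter(x, y)                   # simple scatter plot
plt.show()
plt.hist(np.random.randn(1000))     # histogram
plt.show()
plt.plot(x, y)                      # line graph
plt.show()
plt.boxplot(np.random.randn(100))   # box plot
plt.show()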
Histogram:
Line Graph:
Box Plot:
Customizing Labels
In [79]:
#Customize labels
# add footnote
plt.figtext(0.5, 0.01, 'Fig1: Sinusoidal', ha='right', va='bottom')
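The rest of this cell is not shown; a sketch the footnote call would fit into (the sine curve and label text are assumptions):
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label='sin(x)')
plt.xlabel('x')                      # axis labels
plt.ylabel('sin(x)')
plt.title('Sinusoidal')              # plot title
plt.legend()
plt.figtext(0.5, 0.01, 'Fig1: Sinusoidal', ha='right', va='bottom')
plt.show()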
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
import seaborn
print('seaborn: {}'.format(seaborn.__version__))
import pgmpy
print('pgmpy: {}'.format(pgmpy.__name__))
import urllib
print('urllib: {}'.format(urllib.__name__))
import csv
print('csv: {}'.format(csv.__version__))
Python: 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 07:29:16) [MSC v.1900 32 bit (Intel)]
scipy: 0.19.1
numpy: 1.13.3
matplotlib: 2.1.0
pandas: 0.20.3
sklearn: 0.19.1
seaborn: 0.8.0
pgmpy: pgmpy
urllib: urllib
csv: 1.0
8.1 Scipy
SciPy, pronounced "Sigh Pie", is an open-source scientific Python library, distributed under the BSD license, for performing mathematical,
scientific, and engineering computations. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional
array manipulation. The SciPy library is built to work with NumPy arrays and provides many user-friendly and efficient numerical
routines, such as those for numerical integration and optimization. Together, they run on all popular operating systems, are quick
to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended on by some of the world's
leading scientists and engineers.
In [81]:
import numpy as np
print(np.linspace(1., 4., 6))
In [82]:
#K-Means Implementation in SciPy
from scipy.cluster.vq import kmeans,vq,whiten
from numpy import vstack,array
from numpy.random import rand
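The data generation and clustering calls are not shown; a sketch under the assumption of three synthetic clusters:
# generate three loose clusters of 3-D points (the data itself is an assumption)
data = vstack((rand(100, 3) + array([.5, .5, .5]),
               rand(100, 3) + array([1.5, 1.5, 1.5]),
               rand(100, 3) + array([2.5, 2.5, 2.5])))
data = whiten(data)                  # normalize each feature to unit variance
centroids, _ = kmeans(data, 3)       # compute 3 centroids
print("Centroids:\n", centroids)
clx, _ = vq(data, centroids)         # assign each point to its nearest centroid
print("Cluster:\n", clx)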
Centroids:
[[ 1.12229616 0.96730452 1.15748527]
[ 1.91765881 2.42546485 1.32970073]
[ 2.80961316 2.57661677 2.82922036]]
Cluster:
[2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 1 0 2 2
2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 1 1 1 2 2
1 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0]
In [83]:
#Fast Fourier Transform
#Importing the fft and inverse fft functions from fftpackage
from scipy.fftpack import fft
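The transform calls are not shown; a sketch that matches the outputs below (this is the standard five-point example):
from scipy.fftpack import ifft
x = np.array([1.0, 2.0, 1.0, -1.0, 1.5])
y = fft(x)                 # forward transform
print("FFT :\n", y)
yinv = ifft(y)             # inverse transform recovers x
print("FFT Inverse:\n", yinv)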
FFT :
[ 4.50000000+0.j 2.08155948-1.65109876j -1.83155948+1.60822041j
-1.83155948-1.60822041j 2.08155948+1.65109876j]
FFT Inverse:
[ 1.0+0.j 2.0+0.j 1.0+0.j -1.0+0.j 1.5+0.j]
In [84]:
#Discrete Cosine Transform
from scipy.fftpack import dct
print ("DCT:\n",dct(np.array([4., 3., 5., 10., 5., 3.])))
DCT:
[ 60. -3.48476592 -13.85640646 11.3137085 6. -6.31319305]
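The inverse-transform call that produced the IDCT output below is not shown; presumably:
from scipy.fftpack import idct
print("IDCT:\n", idct(np.array([4., 3., 5., 10., 5., 3.])))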
IDCT:
[ 39.15085889 -20.14213562 -6.45392043 7.13341236 8.14213562
-3.83035081]
SciPy - Integrate
The general form of quad is scipy.integrate.quad(f, a, b), where 'f' is the name of the function to be integrated and 'a' and 'b'
are the lower and upper limits, respectively. Let us see an example of the Gaussian function, integrated over the range 0 to 1:
f(x) = e^(-x^2)
∫₀¹ e^(-x^2) dx
In [85]:
# Single Integration
import scipy.integrate
from numpy import exp
f= lambda x:exp(-x**2)
i = scipy.integrate.quad(f, 0, 1)
print(i)
(0.7468241328124271, 8.291413475940725e-15)
Linear Algebra
x + 3y + 5z = 10
2x + 5y + z = 8
2x + 3y + 8z = 3
In [86]:
#importing the scipy and numpy packages
from scipy import linalg
import numpy as np
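The cell that solves the system is not shown; a sketch using linalg.solve on the coefficients of the three equations above:
# coefficient matrix and right-hand side of the system
a = np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])
b = np.array([10, 8, 3])
x = linalg.solve(a, b)     # solve a·x = b
print(x)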
Finding a Determinant
In [87]:
#importing the scipy and numpy packages
from scipy import linalg
import numpy as np
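The determinant and eigenvalue calls are not shown; a sketch consistent with the outputs below (A = [[1, 2], [3, 4]]):
A = np.array([[1, 2], [3, 4]])
print(linalg.det(A))                 # determinant: -2.0
l, v = linalg.eig(A)                 # eigenvalues and eigenvectors
print("Eigen Values :\n", l)
print("Eigen Vectors:\n", v)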
-2.0
Eigen Values :
[-0.37228132+0.j 5.37228132+0.j]
Eigen Vectors:
[[-0.82456484 -0.41597356]
[ 0.56576746 -0.90937671]]
Image Processing
In [89]:
from scipy import misc
f = misc.face()
misc.imsave('face.png', f) # uses the Image module (PIL)
In [90]:
# Statistical Information of the image
from scipy import misc
face = misc.face(gray = False)
print(face.mean(), face.max(), face.min())
110.162743886 255 0
In [91]:
# Cropping
from scipy import misc
face = misc.face(gray = True)
lx,ly = face.shape
# Cropping
crop_face = face[lx//4 : -lx//4 , ly//4 : -ly//4]
import matplotlib.pyplot as plt
plt.imshow(crop_face)
plt.show()
In [92]:
# up <-> down flip
from scipy import misc
face = misc.face()
flip_ud_face = np.flipud(face)
In [93]:
# rotation
from scipy import misc,ndimage
face = misc.face()
rotate_face = ndimage.rotate(face, 45)
In [94]:
# Blurring
from scipy import misc, ndimage
face = misc.face()
blurred_face = ndimage.gaussian_filter(face, sigma=3)
import matplotlib.pyplot as plt
plt.imshow(blurred_face)
plt.show()
In [95]:
# Edge Detection
import scipy.ndimage as nd
import numpy as np
im = np.zeros((256, 256))
im[64:-64, 64:-64] = 1
im[90:-90, 90:-90] = 2
im = nd.gaussian_filter(im, 0)   # sigma=0 leaves the image unchanged
In [97]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np
img=mpimg.imread('C:/Users/thyagaragu/Desktop/Data/Image/C1.jpg')
#print(img)
plt.imshow(img)
plt.show()
8.2 sklearn
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed
under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial
use. The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:
NumPy: Base n-dimensional array package
SciPy: Fundamental library for scientific computing
Matplotlib: Comprehensive 2D/3D plotting
IPython: Enhanced interactive console
Sympy: Symbolic mathematics
Pandas: Data structures and analysis
In [98]:
import sklearn
Scikit Learn Loading Dataset
In [99]:
from sklearn import datasets
In [100]:
# Data sets available in sklearn
iris= datasets.load_iris()
houseprice = datasets.load_boston()
diabetes = datasets.load_diabetes()
digits = datasets.load_digits()
linerud= datasets.load_linnerud() #Fitness Club Data Set
wine = datasets.load_wine()
breastcancer = datasets.load_breast_cancer()
In [101]:
print(digits.target_names)
print(digits.data[0])
[0 1 2 3 4 5 6 7 8 9]
[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5.
0. 0. 3. 15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8.
8. 0. 0. 5. 8. 0. 0. 9. 8. 0. 0. 4. 11. 0. 1.
12. 7. 0. 0. 2. 14. 5. 10. 12. 0. 0. 0. 0. 6. 13.
10. 0. 0. 0.]
In [102]:
print(houseprice.feature_names)
print(houseprice.data[0])
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
[ 6.32000000e-03 1.80000000e+01 2.31000000e+00 0.00000000e+00
5.38000000e-01 6.57500000e+00 6.52000000e+01 4.09000000e+00
1.00000000e+00 2.96000000e+02 1.53000000e+01 3.96900000e+02
4.98000000e+00]
In [103]:
print(diabetes.feature_names)
print(diabetes.data[0])
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
[ 0.03807591 0.05068012 0.06169621 0.02187235 -0.0442235 -0.03482076
-0.04340085 -0.00259226 0.01990842 -0.01764613]
In [104]:
print(linerud.data[0])
print(linerud.feature_names)
[ 5. 162. 60.]
['Chins', 'Situps', 'Jumps']
In [105]:
print(wine.feature_names)
print(wine.data[0])
In [106]:
print(breastcancer.feature_names)
print(breastcancer.data[0])
In [107]:
# Print shape of data to confirm data is loaded
print("IRIS:\n",iris.data.shape)
print("HOUSEPRICE:\n",houseprice.data.shape)
print("DIABETES:\n",diabetes.data.shape)
print("DIGITS:\n",digits.data.shape)
print("LINERUD:\n",linerud.data.shape)
print("WINE:\n",wine.data.shape)
print("BREASTCANCER:\n",breastcancer.data.shape)
IRIS:
(150, 4)
HOUSEPRICE:
(506, 13)
DIABETES:
(442, 10)
DIGITS:
(1797, 64)
LINERUD:
(20, 3)
WINE:
(178, 13)
BREASTCANCER:
(569, 30)
In [108]:
# see what’s available in iris:
iris.keys()
print("IRIS KEYS:\n",iris.keys())
n_samples, n_features = iris.data.shape
print ("IRIS # SAMPLES:\n",n_samples)
print ("IRIS # FEATURES:\n",n_features)
print ("IRIS FIRST FEW ROWS:\n",iris.data[0:10])
print("IRIS TARGETS NAMES",iris.target_names)
print("IRIS FEATURE NAMES",iris.feature_names)
print("IRIS TARGET",iris.target)
print("IRIS DESCR",iris.DESCR)
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
IRIS KEYS:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
IRIS # SAMPLES:
150
IRIS # FEATURES:
4
IRIS FIRST FEW ROWS:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]
[ 5.4 3.9 1.7 0.4]
[ 4.6 3.4 1.4 0.3]
[ 5. 3.4 1.5 0.2]
[ 4.4 2.9 1.4 0.2]
[ 4.9 3.1 1.5 0.1]]
IRIS TARGETS NAMES ['setosa' 'versicolor' 'virginica']
IRIS FEATURE NAMES ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm
)']
IRIS TARGET [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
IRIS DESCR Iris Plants Database
====================
Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
Out[108]:
array([0, 1, 2])
In [109]:
# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
print("Predicted :\n",knn.predict(iris_X_test))
print("Actual:\n",iris_y_test)
Predicted :
[1 2 1 0 0 0 2 1 2 0]
Actual:
[1 1 1 0 0 0 2 1 2 0]
Linear regression
LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of parameters in order to make the sum of
the squared residuals of the model as small as possible
In [110]:
diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
In [111]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
print("Regression Coef:\n",regr.coef_)
print("Mean:\n",np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2))
# Explained variance score: 1 is perfect prediction
# and 0 means that there is no linear relationship
# between X and y.
regr.score(diabetes_X_test, diabetes_y_test)
Regression Coef:
[ 3.03499549e-01 -2.37639315e+02 5.10530605e+02 3.27736980e+02
-8.14131709e+02 4.92814588e+02 1.02848452e+02 1.84606489e+02
7.43519617e+02 7.60951722e+01]
Mean squared error:
2004.56760269
Out[111]:
0.58507530226905735
In [112]:
# Sample Decision Tree Classifier
from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
[[50 0 0]
[ 0 50 0]
[ 0 0 50]]
8.3 pgmpy
In [113]:
import pgmpy
In [114]:
# Generate data
import numpy as np
import pandas as pd
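The data-generation code is missing; a sketch consistent with the 30/70 split discussed below:
# 30 zeros followed by 70 ones (30% heads, 70% tails)
raw_data = np.array([0] * 30 + [1] * 70)
data = pd.DataFrame(raw_data, columns=['coin'])
print(data[25:35])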
coin
25 0
26 0
27 0
28 0
29 0
30 1
31 1
32 1
33 1
34 1
In [115]:
# Defining the Bayesian Model
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
model = BayesianModel()
model.add_node('coin')
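The fitting call that produced the table below is not shown; presumably the Maximum Likelihood estimator:
model.fit(data, estimator=MaximumLikelihoodEstimator)
print(model.get_cpds('coin'))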
╒═════════╤═════╕
│ coin(0) │ 0.3 │
├─────────┼─────┤
│ coin(1) │ 0.7 │
╘═════════╧═════╛
In [116]:
# Fitting the data to the model using Bayesian Estimator with Dirichlet prior with equal pseudo counts.
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts={'coin': [50, 50]})
print(model.get_cpds('coin'))
print(model.get_cpds('coin'))
╒═════════╤═════╕
│ coin(0) │ 0.4 │
├─────────┼─────┤
│ coin(1) │ 0.6 │
╘═════════╧═════╛
We can see that we get the results as expected. In the maximum likelihood case we got the probability just based on the data, whereas in the Bayesian case we had a prior of P(H) = 0.5 and P(T) = 0.5; therefore, with 30% heads and 70% tails in the data, we got a posterior of P(H) = 0.4 and P(T) = 0.6. Similarly we can learn in the case of more complex models. Let's take the example of the student model and compare the results of the Maximum Likelihood estimator and the Bayesian estimator.
In [117]:
# Generating radom data with each variable have 2 states and equal probabilities for each state
import numpy as np
import pandas as pd
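The generation code is missing; a sketch (the sample size of 1000 is an assumption):
raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
data = pd.DataFrame(raw_data, columns=['D', 'I', 'G', 'L', 'S'])
print(data[100:111])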
D I G L S
100 1 1 1 0 0
101 1 1 1 1 0
102 1 0 0 1 0
103 1 0 1 1 0
104 1 1 0 1 0
105 0 0 0 1 0
106 0 1 0 0 0
107 1 0 1 0 0
108 0 0 1 1 1
109 0 0 1 1 0
110 1 1 0 1 1
In [118]:
# Defining the model
from pgmpy.models import BayesianModel
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
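The model structure and the Maximum Likelihood fit are not shown; a sketch using the student-model edges implied by the CPDs below (G depends on D and I, L on G, S on I):
model = BayesianModel([('D', 'G'), ('I', 'G'), ('G', 'L'), ('I', 'S')])
model.fit(data, estimator=MaximumLikelihoodEstimator)
for cpd in model.get_cpds():
    print("CPD of {variable}:".format(variable=cpd.variable))
    print(cpd)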
CPD of D:
╒══════╤══════╕
│ D(0) │ 0.48 │
├──────┼──────┤
│ D(1) │ 0.52 │
╘══════╧══════╛
CPD of G:
╒══════╤════════════════════╤═════════════════════╤══════╤═════════════════════╕
│ D │ D(0) │ D(0) │ D(1) │ D(1) │
├──────┼────────────────────┼─────────────────────┼──────┼─────────────────────┤
│ I │ I(0) │ I(1) │ I(0) │ I(1) │
├──────┼────────────────────┼─────────────────────┼──────┼─────────────────────┤
│ G(0) │ 0.4618320610687023 │ 0.46788990825688076 │ 0.5 │ 0.44402985074626866 │
├──────┼────────────────────┼─────────────────────┼──────┼─────────────────────┤
│ G(1) │ 0.5381679389312977 │ 0.5321100917431193 │ 0.5 │ 0.5559701492537313 │
╘══════╧════════════════════╧═════════════════════╧══════╧═════════════════════╛
CPD of I:
╒══════╤═══════╕
│ I(0) │ 0.514 │
├──────┼───────┤
│ I(1) │ 0.486 │
╘══════╧═══════╛
CPD of L:
╒══════╤═════════════════════╤════════════════════╕
│ G │ G(0) │ G(1) │
├──────┼─────────────────────┼────────────────────┤
│ L(0) │ 0.45726495726495725 │ 0.4981203007518797 │
├──────┼─────────────────────┼────────────────────┤
│ L(1) │ 0.5427350427350427 │ 0.5018796992481203 │
╘══════╧═════════════════════╧════════════════════╛
CPD of S:
╒══════╤════════════════════╤════════════════════╕
│ I │ I(0) │ I(1) │
├──────┼────────────────────┼────────────────────┤
│ S(0) │ 0.5038910505836576 │ 0.5164609053497943 │
├──────┼────────────────────┼────────────────────┤
│ S(1) │ 0.4961089494163424 │ 0.4835390946502058 │
╘══════╧════════════════════╧════════════════════╛
As the data was randomly generated with equal probabilities for each state we can see here that all the probability values are close to
0.5 which we expected. Now coming to the Bayesian Estimator:
pseudo_counts = {'D': [300, 700], 'I': [500, 500], 'G': [800, 200], 'L': [500, 500], 'S': [400, 600]}
model.fit(data, estimator=BayesianEstimator, prior_type='dirichlet', pseudo_counts=pseudo_counts)
for cpd in model.get_cpds():
print("CPD of {variable}:".format(variable=cpd.variable))
print(cpd)
CPD of D:
╒══════╤══════╕
│ D(0) │ 0.39 │
├──────┼──────┤
│ D(1) │ 0.61 │
╘══════╧══════╛
CPD of G:
╒══════╤═════════════════════╤════════════════════╤═════════════════════╤════════════════════╕
│ D │ D(0) │ D(0) │ D(1) │ D(1) │
├──────┼─────────────────────┼────────────────────┼─────────────────────┼────────────────────┤
│ I │ I(0) │ I(1) │ I(0) │ I(1) │
├──────┼─────────────────────┼────────────────────┼─────────────────────┼────────────────────┤
│ G(0) │ 0.7297939778129953 │ 0.7405582922824302 │ 0.7396166134185304 │ 0.7247634069400631 │
├──────┼─────────────────────┼────────────────────┼─────────────────────┼────────────────────┤
│ G(1) │ 0.27020602218700474 │ 0.2594417077175698 │ 0.26038338658146964 │ 0.2752365930599369 │
╘══════╧═════════════════════╧════════════════════╧═════════════════════╧════════════════════╛
CPD of I:
╒══════╤═══════╕
│ I(0) │ 0.507 │
├──────┼───────┤
│ I(1) │ 0.493 │
╘══════╧═══════╛
CPD of L:
╒══════╤═════════════════════╤════════════════════╕
│ G │ G(0) │ G(1) │
├──────┼─────────────────────┼────────────────────┤
│ L(0) │ 0.48637602179836514 │ 0.4993472584856397 │
├──────┼─────────────────────┼────────────────────┤
│ L(1) │ 0.5136239782016349 │ 0.5006527415143603 │
╘══════╧═════════════════════╧════════════════════╛
CPD of S:
╒══════╤═════════════════════╤═════════════════════╕
│ I │ I(0) │ I(1) │
├──────┼─────────────────────┼─────────────────────┤
│ S(0) │ 0.43527080581241745 │ 0.43808882907133245 │
├──────┼─────────────────────┼─────────────────────┤
│ S(1) │ 0.5647291941875826 │ 0.5619111709286676 │
╘══════╧═════════════════════╧═════════════════════╛
Since the data was randomly generated with equal probabilities for each state, the data tries to bring the posterior probabilities close
to 0.5. But because of the prior we will get the values in between the prior and 0.5.
In [120]:
print(cnt)
In [122]:
import random
print(random.randint(0, 5))
In [123]:
import random
print(random.random() * 100)
72.49014831732813
In [124]:
print(random.choice( ['red', 'black', 'green'] ))
green
In [125]:
import random
for x in range(10):
    print(random.randint(1, 101))
27
62
99
4
21
26
3
18
41
62
In [126]:
import random
for i in range(3):
    print(random.randrange(0, 101, 5))  # range(start, stop, step)
20
40
80
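The DataFrame used by the next few cells was created in a cell that is not shown; presumably (consistent with the df2 output below):
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])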
In [128]:
df.agg(['sum', 'min'])
Out[128]:
A B C
In [129]:
df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
Out[129]:
A B
In [130]:
def add(x):
    return x
In [131]:
df2 = df.agg({'A':[add,lambda x: x/2]})['A']
In [132]:
print(df2)
add <lambda>
0 1.0 0.5
1 4.0 2.0
2 7.0 3.5
3 NaN NaN
Split Data into Groups: a pandas object can be split into groups in any of the following ways −
1. obj.groupby('key')
2. obj.groupby(['key1','key2'])
3. obj.groupby(key,axis=1)
Let us now see how the grouping objects can be applied to the DataFrame object
Example
In [134]:
# import the pandas library
import pandas as pd
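The DataFrame creation is not shown; a sketch consistent with the groups and values printed below:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)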
print(df.groupby('Team'))
View Groups
In [135]:
print(df.groupby('Team').groups)
{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
Example
Group by with multiple columns −
In [136]:
print(df.groupby(['Team','Year']).groups)
In [137]:
grouped = df.groupby('Year')
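The iteration over the groups is not shown; presumably:
for name, group in grouped:
    print(name)
    print(group)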
2014
Points Rank Team Year
0 876 1 Riders 2014
2 863 2 Devils 2014
4 741 3 Kings 2014
9 701 4 Royals 2014
2015
Points Rank Team Year
1 789 2 Riders 2015
3 673 3 Devils 2015
5 812 4 kings 2015
10 804 1 Royals 2015
2016
Points Rank Team Year
6 756 1 Kings 2016
8 694 2 Riders 2016
2017
Points Rank Team Year
7 788 1 Kings 2017
11 690 2 Riders 2017
Select a Group
Using the get_group() method, we can select a single group.
In [138]:
grouped = df.groupby('Year')
print(grouped.get_group(2014))
Aggregations
An aggregating function returns a single aggregated value for each group. Once the group-by object is created, several aggregation operations can be performed on the grouped data. An obvious one is aggregation via the aggregate (or its alias agg) method −
In [139]:
import numpy as np
grouped = df.groupby('Year')
print(grouped['Points'].agg(np.mean))
Year
2014 795.25
2015 769.50
2016 725.00
2017 739.00
Name: Points, dtype: float64
In [140]:
# Another way to see the size of each group is by applying the size() function −
grouped = df.groupby('Team')
print(grouped.agg(np.size))
In [141]:
grouped = df.groupby('Team')
print(grouped['Points'].agg([np.sum, np.mean, np.std]))
Transformations
A transformation on a group or a column returns an object that is indexed the same (same size) as the one being grouped. Thus, the transform should return a result that is the same size as that of the group chunk.
In [142]:
grouped = df.groupby('Team')
score = lambda x: (x - x.mean()) / x.std()*10
print(grouped.transform(score))
Filtration
Filtration filters the data on a defined criteria and returns the subset of data. The filter() function is used to filter the data.
In [143]:
print(df.groupby('Team').filter(lambda x: len(x) >= 3))
numpy.random.rand() :
In [144]:
import numpy as np
np.random.rand(3,2)
Out[144]:
array([[ 0.68551272, 0.71602392],
[ 0.86216627, 0.50804434],
[ 0.461094 , 0.96511632]])
In [145]:
np.random.rand(10)
Out[145]:
array([ 0.79651226, 0.55873099, 0.33061707, 0.845238 , 0.45543639,
0.09268519, 0.45490427, 0.8719684 , 0.44828215, 0.01434915])
In [146]:
import random
print(random.randint(0,9))
Arbitrary functions can be applied along the axes of a DataFrame or Panel using the apply()
method, which, like the descriptive statistics methods, takes an optional axis argument. By
default, the operation is performed column-wise, taking each column as an array-like.
Example 1
In [147]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print(df)
df.apply(np.mean)
print (df.apply(np.mean))
Example 2:
By passing axis parameter, operations can be performed row wise
In [148]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.apply(np.mean,axis=1)
print(df.apply(np.mean))
col1 0.023225
col2 -0.062539
col3 -0.695665
dtype: float64
Example 3:
In [149]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df.apply(lambda x: x.max() - x.min())
print (df.apply(np.mean))
col1 -0.419467
col2 0.338818
col3 0.321762
dtype: float64
In [150]:
df = pd.DataFrame([[4, 9],] * 3, columns=['A', 'B'])
df
Out[150]:
   A  B
0  4  9
1  4  9
2  4  9
In [151]:
df.apply(np.sqrt)
Out[151]:
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
In [152]:
df.apply(np.sum, axis=0)
Out[152]:
A 12
B 27
dtype: int64
In [153]:
df.apply(np.sum, axis=1)
Out[153]:
0 13
1 13
2 13
dtype: int64
In [154]:
df.apply(lambda x: [1, 2], axis=1)
Out[154]:
A B
0 1 2
1 1 2
2 1 2
In [155]:
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
Out[155]:
foo bar
0 1 2
1 1 2
2 1 2
In [156]:
In [156]:
In [157]:
url = 'http://bit.ly/kaggletrain'
train = pd.read_csv(url)
train.head(3)
Out[157]:
   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
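The cell that created the Sex_num column is not shown; presumably a map() from the string values:
train['Sex_num'] = train.Sex.map({'female': 0, 'male': 1})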
In [159]:
# let's compared Sex and Sex_num columns
# here we can see we map male to 1 and female to 0
train.loc[0:4, ['Sex', 'Sex_num']]
Out[159]:
Sex Sex_num
0 male 1
1 female 0
2 female 0
3 female 0
4 male 1
In [161]:
# the apply() method applies the function to each element
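The column creation itself is not shown; presumably:
train['Name_length'] = train.Name.apply(len)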
train.loc[0:4, ['Name', 'Name_length']]
Out[161]:
Name Name_length
In [162]:
import numpy as np
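The cell that created Fare_ceil is not shown; presumably:
train['Fare_ceil'] = train.Fare.apply(np.ceil)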
In [163]:
train.loc[0:4, ['Fare', 'Fare_ceil']]
Out[163]:
Fare Fare_ceil
0 7.2500 8.0
1 71.2833 72.0
2 7.9250 8.0
3 53.1000 54.0
4 8.0500 9.0
In [164]:
# let's extract last name of each person
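The split call is not shown; presumably:
train.Name.str.split(',').head()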
Out[164]:
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
Name: Name, dtype: object
In [165]:
# we just want the first string from the list
# we create a function to retrieve
def get_element(my_list, position):
return my_list[position]
In [166]:
# use our created get_element function
# we pass position=0
train.Name.str.split(',').apply(get_element, position=0).head()
Out[166]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
Name: Name, dtype: object
Out[167]:
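The cell that created drinks is not shown; presumably it reads the drinks-by-country dataset (the URL is an assumption based on the tutorial this section follows):
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()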
In [168]:
drinks.loc[:, 'beer_servings':'wine_servings'].head()
Out[168]:
   beer_servings  spirit_servings  wine_servings
0              0                0              0
1             89              132             54
2             25                0             14
3            245              138            312
4            217               57             45
In [169]:
# you want apply() method to travel axis=0 (downwards, column)
# apply Python's max() function
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)
Out[169]:
beer_servings 376
spirit_servings 438
wine_servings 370
dtype: int64
In [170]:
# you want apply() method to travel axis=1 (right, row)
# apply Python's max() function
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=1)
Out[170]:
0 0
1 132
2 25
3 312
4 217
5 128
6 221
7 179
8 261
9 279
10 46
11 176
12 63
13 0
14 173
15 373
16 295
17 263
18 34
19 23
20 167
21 173
22 173
23 245
24 31
25 252
26 25
27 88
28 37
29 144
...
163 178
164 90
165 186
166 280
167 35
168 15
169 258
170 106
171 4
172 36
173 36
174 197
175 51
176 51
177 71
178 41
179 45
180 237
181 135
182 219
183 36
184 249
185 220
186 101
187 21
188 333
189 111
190 6
191 32
192 64
Length: 193, dtype: int64
In [171]:
# finding which column is the maximum's category name
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1)
Out[171]:
0 beer_servings
1 spirit_servings
2 beer_servings
3 wine_servings
4 beer_servings
5 spirit_servings
6 wine_servings
7 spirit_servings
8 beer_servings
9 beer_servings
10 spirit_servings
11 spirit_servings
12 spirit_servings
13 beer_servings
14 spirit_servings
15 spirit_servings
16 beer_servings
17 beer_servings
18 beer_servings
19 beer_servings
20 beer_servings
21 spirit_servings
22 beer_servings
23 beer_servings
24 beer_servings
25 spirit_servings
26 beer_servings
27 beer_servings
28 beer_servings
29 beer_servings
...
163 spirit_servings
164 beer_servings
165 wine_servings
166 wine_servings
167 spirit_servings
168 spirit_servings
169 spirit_servings
170 beer_servings
171 wine_servings
172 beer_servings
173 beer_servings
174 beer_servings
175 beer_servings
176 beer_servings
177 spirit_servings
178 spirit_servings
179 beer_servings
180 spirit_servings
181 spirit_servings
182 beer_servings
183 beer_servings
184 beer_servings
185 wine_servings
186 spirit_servings
187 beer_servings
188 beer_servings
189 beer_servings
190 beer_servings
191 beer_servings
192 beer_servings
Length: 193, dtype: object
Iterator
import sys
# build an iterator over a list (the original cell that created `it`
# is not shown; this list is an assumption)
it = iter([1, 2, 3])
while True:
    try:
        print(next(it))
    except StopIteration:
        sys.exit()
SystemExit
C:\Users\thyagaragu\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2870:
UserWarning: To exit: use 'exit', 'quit', or Ctrl-D.
warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)
isinstance()
In [173]:
# Python code for isinstance()
# Syntax : isinstance(object, classinfo)
#The isinstance() takes two parameters:
# object : object to be checked
# classinfo : class, type, or tuple of classes and types
class Test:
    a = 5

TestInstance = Test()
print(isinstance(TestInstance, Test))
True
PPRINT
In [176]:
import pprint
In [177]:
data = {'a':2, 'b':{'x':3, 'y':{'t1': 4, 't2':5}}}
print(data)
pprint.pprint(data)
In [178]:
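The cell's code is not shown; a sketch that reproduces the two outputs below:
data = {'key1': 'value1', 'key2': 'value2',
        'key3': {'key3a': 'value3a'},
        'key4': {'key4a': {'key4aa': 'value4aa',
                           'key4ab': 'value4ab',
                           'key4ac': 'value4ac'},
                 'key4b': 'value4b'}}
print("USING PRINT:")
print(data)
print("USING PPRINT:")
pprint.pprint(data)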
USING PRINT:
{'key1': 'value1', 'key2': 'value2', 'key3': {'key3a': 'value3a'}, 'key4': {'key4a': {'key4aa': '
value4aa', 'key4ab': 'value4ab', 'key4ac': 'value4ac'}, 'key4b': 'value4b'}}
USING PPRINT:
{'key1': 'value1',
'key2': 'value2',
'key3': {'key3a': 'value3a'},
'key4': {'key4a': {'key4aa': 'value4aa',
'key4ab': 'value4ab',
'key4ac': 'value4ac'},
'key4b': 'value4b'}}
math.floor(-23.11) : -24
math.floor(300.16) : 300
math.floor(300.72) : 300
In [180]:
# Python program to demonstrate the use of ceil() method
math.ceil(-23.11) : -23
math.ceil(300.16) : 301
math.ceil(300.72) : 301
In [181]:
from math import ceil
from math import floor
print("ceil(-23.11) : ",ceil(-23.11))
print("ceil(300.16) : ",ceil(300.16))
print("ceil(300.72) : ",ceil(300.72))
print("floor(-23.11): ",floor(-23.11))
print("floor(300.16): ",floor(300.16))
print("floor(300.72): ",floor(300.72))
ceil(-23.11) : -23
ceil(300.16) : 301
ceil(300.72) : 301
floor(-23.11): -24
floor(300.16): 300
floor(300.72): 300
Numpy Sort
In [182]:
import numpy as np
a = np.array([[3,7],[9,1]])
b = np.array([3,7,9,1])
print('Array a is:')
print(a)
print('\n')
print('Array b is:')
print(b)
Array a is:
[[3 7]
[9 1]]
Array b is:
[3 7 9 1]
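The sort calls are not shown; presumably (np.sort sorts along the last axis by default):
print('Sorted Array a is :')
print(np.sort(a))
print('Sorted Array b is :')
print(np.sort(b))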
Sorted Array a is :
[[3 7]
[1 9]]
Sorted Array b is :
[1 3 7 9]
Numpy Clip
In [183]:
import numpy as np
my_array = np.array([[100, 200], [300, 400]],np.uint16)
my_array.clip(0,255) # clip(min, max)
Out[183]:
array([[100, 200],
[255, 255]], dtype=uint16)
In [184]:
x=np.array([1,2,3,4,5])
h = [(np.abs(x - x[i])) for i in range(5)]
print(x)
print(h)
[1 2 3 4 5]
[array([0, 1, 2, 3, 4]), array([1, 0, 1, 2, 3]), array([2, 1, 0, 1, 2]), array([3, 2, 1, 0, 1]),
array([4, 3, 2, 1, 0])]
In [185]:
x=np.array([1,2,3,4,5])
h = [(np.abs(x - x[i]))[3] for i in range(5)]  # element at index 3
print(x)
print(h)
[1 2 3 4 5]
[3, 2, 1, 0, 1]
In [186]:
x=np.array([1,2,3,4,5])
print("x:\n",x)
print("x-x[i]:\n",[abs(x-x[i]) for i in range(5)])
h = [(np.abs(x - x[i]))[4] for i in range(5)]  # distance from each point to x[4]
print("Fourth in h:\n", h)
h = [np.sort(np.abs(x - x[i])) for i in range(5)]
print("sorted h:\n",h)
h = [np.sort(np.abs(x - x[i]))[4] for i in range(5)]
print(" The Largest Distance Points sets :\n",h)
x:
[1 2 3 4 5]
x-x[i]:
[array([0, 1, 2, 3, 4]), array([1, 0, 1, 2, 3]), array([2, 1, 0, 1, 2]), array([3, 2, 1, 0, 1]),
array([4, 3, 2, 1, 0])]
Fourth in h:
[4, 3, 2, 1, 0]
sorted h:
[array([0, 1, 2, 3, 4]), array([0, 1, 1, 2, 3]), array([0, 1, 1, 2, 2]), array([0, 1, 1, 2, 3]),
array([0, 1, 2, 3, 4])]
The Largest Distance Points sets :
[4, 3, 2, 3, 4]
In [187]:
print(x[:, None])
[[1]
[2]
[3]
[4]
[5]]
In [188]:
print(x[None, :])
[[1 2 3 4 5]]
In [189]:
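The expression that produced this output is not shown; presumably the broadcasted difference:
print(x[:, None] - x[None, :])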
[[ 0 -1 -2 -3 -4]
[ 1 0 -1 -2 -3]
[ 2 1 0 -1 -2]
[ 3 2 1 0 -1]
[ 4 3 2 1 0]]
In [190]:
print(np.abs(x[:, None] - x[None, :]))
[[0 1 2 3 4]
[1 0 1 2 3]
[2 1 0 1 2]
[3 2 1 0 1]
[4 3 2 1 0]]
In [191]:
print(h)
[4, 3, 2, 3, 4]
In [192]:
print(np.abs(x[:, None] - x[None, :])/ h)
[[ 0. 0.33333333 1. 1. 1. ]
[ 0.25 0. 0.5 0.66666667 0.75 ]
[ 0.5 0.33333333 0. 0.33333333 0.5 ]
[ 0.75 0.66666667 0.5 0. 0.25 ]
[ 1. 1. 1. 0.33333333 0. ]]
In [193]:
w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
print(w)
[[ 0. 0.33333333 1. 1. 1. ]
[ 0.25 0. 0.5 0.66666667 0.75 ]
[ 0.5 0.33333333 0. 0.33333333 0.5 ]
[ 0.75 0.66666667 0.5 0. 0.25 ]
[ 1. 1. 1. 0.33333333 0. ]]
In [196]:
delta = np.ones(5)
print(delta)
[ 1. 1. 1. 1. 1.]
In [198]:
y= 1 + np.random.randn(5)
print(y)
In [200]:
delta = np.ones(5)
for i in range(5):
    weights = delta * w[:, i]  # assigning weights to each point
    b = np.array([np.sum(weights * y), np.sum(weights * y * x)])  # matrix B
    A = np.array([[np.sum(weights), np.sum(weights * x)],
                  [np.sum(weights * x), np.sum(weights * x * x)]])  # matrix A
print(b)
print(A)
[ 0.25211417 2.06653039]
[[ 2.5 5. ]
[ 5. 12.5]]
Usage of c_ and r_
In [1]:
import numpy as np
from numpy import c_
np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])]
Out[1]:
array([[1, 2, 3, 0, 0, 4, 5, 6]])
In [6]:
x0 = np.linspace(-3, 3, num=10)
print("X0:\n",x0)
x0 = np.r_[1, x0]
print("rX0:\n",x0)
X0:
[-3. -2.33333333 -1.66666667 -1. -0.33333333 0.33333333
1. 1.66666667 2.33333333 3. ]
rX0:
[ 1. -3. -2.33333333 -1.66666667 -1. -0.33333333
0.33333333 1. 1.66666667 2.33333333 3. ]
pinv
In [12]:
import numpy as np
from numpy import linalg
A = np.array([[1,-2],[3,5]])
print("The Matrix A: \n",A)
print("The dimension of A:\n ",A.shape)
print("The inverse of A: \n",linalg.inv(A))
print("The pinverse of A: \n",linalg.pinv(A))
The Matrix A:
[[ 1 -2]
[ 3 5]]
The dimension of A:
(2, 2)
The inverse of A:
[[ 0.45454545 0.18181818]
[-0.27272727 0.09090909]]
The pinverse of A:
[[ 0.45454545 0.18181818]
[-0.27272727 0.09090909]]
Creating Classes
In [2]:
class Employee:
    'Common base class for all employees'
    empCount = 0
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary
        Employee.empCount += 1
    def displayEmployee(self):
        print("Name : ", self.name, ", Salary: ", self.salary)
To create instances of a class, you call the class using the class name and pass in whatever
arguments its __init__ method accepts.
The variable empCount is a class variable whose value is shared among all instances of this class. It can be accessed
as Employee.empCount from inside the class or outside the class.
The first method, __init__(), is a special method called the class constructor or initialization method, which Python calls when
you create a new instance of the class.
You declare other class methods like normal functions, with the exception that the first argument to each method is self.
Python adds the self argument to the list for you; you do not need to include it when you call the methods.
In [3]:
"This would create first object of Employee class"
emp1 = Employee("Zara", 2000)
"This would create second object of Employee class"
emp2 = Employee("Manni", 5000)
Accessing Attributes
You access an object's attributes using the dot operator on the object. A class variable is accessed using the class name, as follows −
In [5]:
emp1.displayEmployee()
emp2.displayEmployee()
print("Total Employee %d" % Employee.empCount)