
Numpy


• Python for Data Science (PDS) (3150713)

Unit-03
Capturing, Preparing and Working with data
 Outline

✓ Basic File IO in Python


✓ NumPy V/S Pandas (what to use?)
✓ NumPy
✓ Pandas
✓ Accessing text, CSV, Excel files using pandas
✓ Accessing SQL Database
✓ Web Scraping using BeautifulSoup
Basic IO operations in Python
• Before we can read or write a file, we have to open it using Python's built-in open() function.
syntax
fileobject = open(filename [, accessmode][, buffering])
• filename is the name of the file we want to open.
• accessmode determines the mode in which the file is opened (possible values are listed below).
• buffering controls buffering: 0 disables buffering (allowed only in binary mode in Python 3), 1 selects line buffering, a value greater than 1 is used as the buffer size in bytes, and a negative value follows the system default buffering behaviour.
Mode   Description
r      Read only (default)
rb     Read only in binary format
r+     Read and write both
rb+    Read and write both in binary format
w      Write only (creates the file if it does not exist)
wb     Write only in binary format (creates the file if it does not exist)
w+     Read and write both (creates the file if it does not exist)
wb+    Read and write both in binary format (creates the file if it does not exist)
a      Opens the file to append; if the file does not exist, creates it for writing
ab     Append in binary format; if the file does not exist, creates it for writing
a+     Append; if the file does not exist, creates it for reading and writing both
ab+    Read and write (append) both in binary format
Example : Read file in Python
• read(size) reads at most size characters from the file (bytes in binary mode); if we don't specify size, it returns the whole file.
readfile.py
f = open('college.txt')
data = f.read()
print(data)
college.txt
Shree Swaminaryan Institute of Technology
Bhat Gandhinagar
Gujarat- INDIA

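• readlines() returns a list of the lines in the file. Each item is a string that keeps its trailing newline character ('\n'); the simplified output below omits the quotes and newlines.
readlines.py
f = open('college.txt')
lines = f.readlines()
print(lines)
OUTPUT
[Shree Swaminaryan Institute of Technology, Bhat Gandhinagar, Gujarat- INDIA]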

• We can use a for loop to get each line separately.
readlinesfor.py
f = open('college.txt')
lines = f.readlines()
for l in lines:
    print(l)
OUTPUT
Shree Swaminaryan Institute of Technology

Bhat Gandhinagar

Gujarat- INDIA
How to write path?
• We can pass a relative path to open(); alternatively, we can specify an absolute path.
• To specify an absolute path,
• In Windows, f=open('D:\\folder\\subfolder\\filename.txt')
• In macOS & Linux, f=open('/user/folder/subfolder/filename.txt')
• We are supposed to close the file using the close() method once we are done using it.
closefile.py
f = open('college.txt')
data = f.read()
print(data)
f.close()
Handling errors using “with” keyword
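• It is possible that we have a typo in the filename, or that the file we specified has been moved or deleted; in such cases open() raises an error (FileNotFoundError) when the program runs.
• An error can also occur after the file is opened and before close() is reached. To make sure the file is always closed, we can open it using the with keyword.
fileusingwith.py
with open('college.txt') as f:
    data = f.read()
    print(data)

• When we open a file using with, we need not close it ourselves; it is closed automatically when the block ends, even if an error occurs inside the block. A missing file still raises an error, which has to be caught with try/except.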
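The with keyword takes care of closing the file, but it does not by itself deal with a wrong or missing filename. A minimal sketch of combining with and try/except is given below; the helper name read_text_or_default and the temporary file names are hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path


def read_text_or_default(path, default=""):
    """Return the text of the file, or `default` if the file does not exist."""
    try:
        # `with` closes the file even if read() fails part-way through.
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # Raised by open() when the name is mistyped or the file was moved/deleted.
        return default


# Demonstrate with a throwaway directory so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "college.txt"
    existing.write_text("Gujarat- INDIA")

    found = read_text_or_default(existing)
    missing = read_text_or_default(Path(tmp) / "no_such_file.txt", default="<missing>")

print(found)
print(missing)
```

Catching only FileNotFoundError (rather than every exception) keeps other problems, such as permission errors, visible.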
Example : Write file in Python
• The write() method writes the specified data to the file.
readdemo.py
with open('college.txt','a') as f:
    f.write('Hello world')

• If we open a file in 'w' mode, writing overwrites the existing contents of the file, or creates a new file if the file does not exist.
• If we open a file in 'a' mode, writing appends the data at the end of the existing file, or creates a new file if the file does not exist.
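The difference between the 'w' and 'a' modes can be checked with a small experiment. The sketch below writes to a temporary file (the name demo.txt is arbitrary) so that no real file is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # 'w' creates the file because it does not exist yet.
    with open(path, "w") as f:
        f.write("first\n")

    # 'a' keeps the existing content and adds to the end.
    with open(path, "a") as f:
        f.write("second\n")
    with open(path) as f:
        after_append = f.read()

    # 'w' on an existing file discards the old content.
    with open(path, "w") as f:
        f.write("replaced\n")
    with open(path) as f:
        after_overwrite = f.read()

print(repr(after_append))
print(repr(after_overwrite))
```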
Reading CSV files without any library functions
• A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.
• Each line of the file is a data record; each record consists of one or more fields, separated by commas.
• Example :
Book1.csv
studentname,enrollment,cpi
abcd,123456,8.5
bcde,456789,2.5
cdef,321654,7.6
readlines.py
with open('Book1.csv') as f:
    rows = f.readlines()
    for r in rows:
        cols = r.split(',')
        print('Student Name = ', cols[0], end=" ")
        print('\tEn. No. = ', cols[1], end=" ")
        print('\tCPI = \t', cols[2])
readlines.py (absolute path, skipping the header line)
with open('C:\\Users\\user1\\Desktop\\Book1.csv') as f:
    rows = f.readlines()
    isFirstLine = True
    for r in rows:
        if isFirstLine:
            isFirstLine = False
            continue
        cols = r.split(',')
        print('Student Name = ', cols[0], end=" ")
        print('\tEn. No. = ', cols[1], end=" ")
        print('\tCPI = \t', cols[2])
• We can use Microsoft Excel to access CSV files.
• In later sessions we will access CSV files using different libraries; we can also access CSV files without any libraries, as shown here, but this is not recommended.
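As one possible alternative to splitting lines by hand, Python's standard-library csv module can parse the records. The slides do not say which library the later sessions use, so this is only an illustrative sketch; the data is the Book1.csv sample, held in a string so that the example runs without a file.

```python
import csv
import io

# Same content as Book1.csv, kept in memory for a self-contained example.
sample = (
    "studentname,enrollment,cpi\n"
    "abcd,123456,8.5\n"
    "bcde,456789,2.5\n"
    "cdef,321654,7.6\n"
)

# DictReader uses the first line as field names, so no manual header skipping is needed.
reader = csv.DictReader(io.StringIO(sample))
students = list(reader)

for row in students:
    # Values arrive as strings; convert explicitly where a number is needed.
    print("Student Name =", row["studentname"],
          "\tEn. No. =", row["enrollment"],
          "\tCPI =", float(row["cpi"]))
```

With a real file, io.StringIO(sample) would be replaced by a file object opened with open('Book1.csv', newline='').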
NumPy v/s Pandas
• Developers built pandas on top of NumPy; as a result, most tasks we perform using pandas also go through NumPy.
• To obtain the benefits of pandas, we pay a performance penalty; some testers report that pandas can be up to 100 times slower than NumPy for a similar task.
• Nowadays computer hardware is powerful enough to absorb this performance cost in most cases, but when speed of execution is essential, NumPy is usually the better choice.
• We can use pandas to make writing code easier and faster; pandas also reduces potential coding errors.
• pandas provides rich time-series functionality, data alignment, NA-friendly statistics, groupby, merge and similar methods; if we use NumPy, we have to implement all of these manually.
• So,
• if we want performance we should use NumPy,
• if we want ease of coding we should use pandas.

Unit-03.01
Let's Learn NumPy
NumPy
• NumPy (Numerical Python) is a Python library for working with arrays.
• Many Python data science libraries rely on NumPy as one of their main building blocks.
• NumPy provides functions for domains such as linear algebra and Fourier transforms.
• NumPy is fast because its array operations run in compiled C code.
Install :
• conda install numpy
• OR
• pip install numpy
NumPy Array
• The most important object defined in NumPy is an N-dimensional array type called ndarray.
• It describes a collection of items of the same type; items in the collection can be accessed using a zero-based index.
• An instance of the ndarray class can be constructed in many different ways; a basic ndarray can be created as below.

syntax
import numpy as np
# pass a list or tuple; a set or dict is not converted element by element
a = np.array(list | tuple)

numpyarray.py
import numpy as np
a = np.array(['swaminarayan','Institute','gandhinagar'])
print(type(a))
print(a)
Output
<class 'numpy.ndarray'>
['swaminarayan' 'Institute' 'gandhinagar']
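To illustrate the "same type" and "zero-based index" points, here is a small sketch; the variable names mixed and grid are arbitrary.

```python
import numpy as np

# An int/float mix is stored with one common type (float64 here).
mixed = np.array([1, 2.5, 3])

# A sequence of tuples gives a 2-dimensional array.
grid = np.array([(1, 2, 3), (4, 5, 6)])

print(mixed.dtype)            # common element type
print(grid.ndim, grid.shape)  # number of dimensions and size along each
print(mixed[0], grid[1, 2])   # zero-based indexing: first item, last item of row 2
```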
NumPy Array (Cont.)
• arange(start,end,step) creates a NumPy array starting from start up to end (not included) with the specified step.
numpyarange.py
import numpy as np
b = np.arange(0,10,1)
print(b)
Output
[0 1 2 3 4 5 6 7 8 9]

• zeros(shape) returns a NumPy array of the given shape, filled with zeros.
numpyzeros.py
import numpy as np
c = np.zeros(3)
print(c)
c1 = np.zeros((3,3)) #have to give as tuple
print(c1)
Output
[0. 0. 0.]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

• ones(shape) returns a NumPy array of the given shape, filled with ones.
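The slide gives no listing for ones(); a minimal sketch in the style of the zeros example follows. The dtype=int argument is an optional extra, shown only to illustrate that the element type can be chosen.

```python
import numpy as np

d = np.ones(3)                   # 1-D array of three floats
d1 = np.ones((2, 3), dtype=int)  # 2-D shape must be given as a tuple

print(d)
print(d1)
```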
NumPy Array (Cont.)
• eye(n) creates a 2-D NumPy array with ones on the diagonal and zeros elsewhere.
numpyeye.py
import numpy as np
b = np.eye(3)
print(b)
Output
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

• linspace(start,stop,num) returns num evenly spaced numbers over the interval from start to stop (stop is included by default).
numpylinspace.py
import numpy as np
c = np.linspace(0,1,11)
print(c)
Output
[0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

• Note: in the arange function we give start, stop & step, whereas in the linspace function we give start, stop & the number of elements we want.
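The difference between the two functions is easiest to see on the same interval. In the sketch below (variable names are arbitrary), arange stops before the end value while linspace includes it.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)   # step of 0.25, end value 1 is excluded
by_count = np.linspace(0, 1, 5)   # exactly 5 values, end value 1 is included

print(by_step, len(by_step))
print(by_count, len(by_count))
```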
Array Shape in NumPy
• We can get the shape of an ndarray using its shape property.
numpyarange.py
import numpy as np
b = np.zeros((3,3))
print(b.shape)
Output
(3, 3)

• We can also reshape the array using the reshape method of ndarray.

numpyarange.py
import numpy as np
re1 = np.random.randint(1,100,10)
re2 = re1.reshape(5,2)
print(re2)
Output (values are random and differ on every run)
[[29 55]
 [44 50]
 [25 53]
 [59  6]
 [93  7]]
• Note: the number of elements in the original array must equal the product of the rows and cols of the new shape.
• Example : here we have an old one-dimensional array of 10 elements and the new shape is (5,2), so 5 * 2 = 10, which means it is a valid reshape.
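The element-count rule can be tried out directly. The sketch below uses arange instead of random data so that the result is repeatable; passing -1 for one dimension, which asks NumPy to work out that size itself, is an extra not covered on the slide.

```python
import numpy as np

a = np.arange(10)            # 10 elements

ok = a.reshape(5, 2)         # 5 * 2 = 10, valid
inferred = a.reshape(2, -1)  # -1 lets NumPy infer the size: 10 / 2 = 5

try:
    a.reshape(3, 4)          # 3 * 4 = 12, does not match 10 elements
    failed = False
except ValueError:
    failed = True

print(ok.shape, inferred.shape, failed)
```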
NumPy Random
• rand(p1,p2,…,pn) creates an n-dimensional array of the given dimensions with random data from a uniform distribution over [0, 1); if we do not specify any parameter, it returns a single random float.

numpyrand.py
import numpy as np
r1 = np.random.rand()
print(r1)
r2 = np.random.rand(3,2) # no tuple
print(r2)
Output (values differ on every run)
0.23937253208490505
[[0.58924723 0.09677878]
 [0.97945337 0.76537675]
 [0.73097381 0.51277276]]

• randint(low,high,num) creates a one-dimensional array of num random integers from low (included) to high (not included).
numpyrandint.py
import numpy as np
r3 = np.random.randint(1,100,10)
print(r3)
Output (values differ on every run)
[78 78 17 98 19 26 81 67 23 24]

• We can reshape the array into any valid shape using the reshape method, which we learned in the previous slide.
NumPy Random (Cont.)
• randn(p1,p2,…,pn) creates an n-dimensional array of the given dimensions with random data from the standard normal distribution (mean 0, standard deviation 1); if we do not specify any parameter, it returns a single random float.
numpyrandn.py
import numpy as np
r1 = np.random.randn()
print(r1)
r2 = np.random.randn(3,2) # no tuple
print(r2)
Output (values differ on every run)
-0.15359861758111037
[[ 0.40967905 -0.21974532]
 [-0.90341482 -0.69779498]
 [ 0.99444948 -1.45308348]]

• Note: the rand function generates random numbers using a uniform distribution, whereas the randn function generates random numbers using the standard normal distribution.
• We are going to learn the difference using a visualization technique (as data scientists, we have to use visualization techniques to convince the audience).
Visualizing the difference between rand & randn
• We are going to use the matplotlib library to visualize the difference.

matplotdemo.py
import numpy as np
from matplotlib import pyplot as plt
# Jupyter notebook magic; remove the next line when running as a plain .py script
%matplotlib inline
samplesize = 100000
uniform = np.random.rand(samplesize)
normal = np.random.randn(samplesize)
plt.hist(uniform,bins=100)
plt.title('rand: uniform')
plt.show()
plt.hist(normal,bins=100)
plt.title('randn: normal')
plt.show()
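The same difference can also be checked numerically, without a plot. The sketch below fixes the seed (the value 0 is an arbitrary choice) so that runs are repeatable, then compares simple statistics of the two samples; np.histogram is used here as a plot-free stand-in for plt.hist.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable
samplesize = 100000
uniform = np.random.rand(samplesize)
normal = np.random.randn(samplesize)

# Uniform samples stay inside [0, 1) with a mean near 0.5.
print(uniform.min(), uniform.max(), uniform.mean())

# Standard normal samples are centred on 0 with a standard deviation near 1.
print(normal.min(), normal.max(), normal.mean(), normal.std())

# Bin counts for the uniform sample: every bin holds roughly the same number of values.
counts, edges = np.histogram(uniform, bins=10)
print(counts)
```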
Aggregations
• min() returns the minimum value of the ndarray; there are two ways in which we can use the min function, and an example of both is given below.
numpymin.py
import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Min way1 = ',a.min())
print('Min way2 = ',np.min(a))
Output
Min way1 = 1
Min way2 = 1
• max() returns the maximum value of the ndarray; there are two ways in which we can use the max function, and an example of both is given below.
numpymax.py
import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Max way1 = ',a.max())
print('Max way2 = ',np.max(a))
Output
Max way1 = 11
Max way2 = 11
Aggregations (Cont.)
• NumPy supports many aggregation functions such as min, max, argmin, argmax, sum, mean, std, etc. argmin and argmax return the index of the (first) minimum and maximum value.
numpymin.py
import numpy as np
l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
a = np.array(l)
print('Min = ',a.min())
print('ArgMin = ',a.argmin())
print('Max = ',a.max())
print('ArgMax = ',a.argmax())
print('Sum = ',a.sum())
print('Mean = ',a.mean())
print('Std = ',a.std())
Output
Min = 1
ArgMin = 3
Max = 11
ArgMax = 8
Sum = 122
Mean = 5.304347826086956
Std = 3.042235771223635
Using axis argument with aggregate functions
• When we apply an aggregate function to a multidimensional ndarray, by default it aggregates over all elements across all dimensions (axes).
numpyaxis.py
import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum = ',array2d.sum())
Output
sum = 45

• If we want the sum of each row or each column, we can use the axis argument with the aggregate functions.
numpyaxis.py
import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum (cols)= ',array2d.sum(axis=0)) #Vertical
print('sum (rows)= ',array2d.sum(axis=1)) #Horizontal
Output
sum (cols) = [12 15 18]
sum (rows) = [6 15 24]
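The axis argument works the same way for the other aggregate functions. A short sketch on the same 3x3 array follows; the keepdims=True argument is an optional extra that keeps the result two-dimensional.

```python
import numpy as np

array2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

col_max = array2d.max(axis=0)    # maximum of each column (vertical)
row_mean = array2d.mean(axis=1)  # mean of each row (horizontal)

# keepdims=True returns shape (3, 1) instead of (3,)
row_sum_2d = array2d.sum(axis=1, keepdims=True)

print(col_max)
print(row_mean)
print(row_sum_2d.shape)
```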
Single V/S Double bracket notations
• There are two ways in which we can access an element of a multi-dimensional array; an example of both methods is given below.
numpybrackets.py
import numpy as np
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print('double = ',arr[2][1]) # double bracket notation
print('single = ',arr[2,1]) # single bracket notation
Output
double = h
single = h

• Both methods are valid and give exactly the same answer, but single bracket notation is recommended: double bracket notation first creates a temporary sub-array for the third row and then fetches the second element from it.
• Single bracket notation is also easier to read and write while programming.
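The two notations agree for single elements, but not once slices are involved, which is another reason to prefer the single bracket form. A small sketch with the same array (variable names are arbitrary):

```python
import numpy as np

arr = np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])

# For a single element both notations agree.
same = arr[2][1] == arr[2, 1]

# With slices they differ:
chained = arr[0:2][0:1]  # rows 0-1, then row 0 of that result -> shape (1, 3)
single = arr[0:2, 0:1]   # rows 0-1 and column 0 -> shape (2, 1)

print(same)
print(chained)
print(single)
```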
Slicing ndarray
• Slicing in python means taking elements from one given index to another given index.
• Similar to Python List, we can use same syntax array[start:end:step] to slice ndarray.
• Default start is 0
• Default end is length of the array
• Default step is 1

numpyslice1d.py
import numpy as np
arr = np.array(['a','b','c','d','e','f','g','h'])
print(arr[2:5])
print(arr[:5])
print(arr[5:])
print(arr[2:7:2])
print(arr[::-1])
Output
['c' 'd' 'e']
['a' 'b' 'c' 'd' 'e']
['f' 'g' 'h']
['c' 'e' 'g']
['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']
Slicing multi-dimensional array
• Slicing a multi-dimensional array works the same way as slicing a single-dimensional array, using the single bracket notation we learned earlier; let's see an example.
numpyslice1d.py
import numpy as np
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print(arr[0:2 , 0:2]) #first two rows and cols
print(arr[::-1]) #reversed rows
print(arr[: , ::-1]) #reversed cols
print(arr[::-1,::-1]) #complete reverse
Output
[['a' 'b']
 ['d' 'e']]
[['g' 'h' 'i']
 ['d' 'e' 'f']
 ['a' 'b' 'c']]
[['c' 'b' 'a']
 ['f' 'e' 'd']
 ['i' 'h' 'g']]
[['i' 'h' 'g']
 ['f' 'e' 'd']
 ['c' 'b' 'a']]
Warning : Array Slicing is mutable !
• When we slice an array, NumPy does not create a copy; the slice is a view of the original data. Any operation that modifies the slice therefore also changes the original array.
• Example,
numpyslice1d.py
import numpy as np
arr = np.array([1,2,3,4,5])
arrsliced = arr[0:3]

arrsliced[:] = 2 # Broadcasting

print('Original Array = ', arr)
print('Sliced Array = ',arrsliced)
Output
Original Array = [2 2 2 4 5]
Sliced Array = [2 2 2]
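If the original array must stay unchanged, one option is to take an explicit copy of the slice. The sketch below uses the copy() method and np.shares_memory to show the difference; the variable name safe is arbitrary.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

safe = arr[0:3].copy()  # independent copy of the slice
safe[:] = 2             # modifies only the copy

print('Original Array = ', arr)
print('Copied Slice = ', safe)

# A plain slice shares memory with the original; the copy does not.
print(np.shares_memory(arr, arr[0:3]), np.shares_memory(arr, safe))
```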
NumPy Arithmetic Operations
numpyop.py
import numpy as np
arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])

arradd1 = arr1 + 2 # addition of matrix with scalar
arradd2 = arr1 + arr2 # addition of two matrices
print('Addition Scalar = ', arradd1)
print('Addition Matrix = ', arradd2)

arrsub1 = arr1 - 2 # subtraction of scalar from matrix
arrsub2 = arr1 - arr2 # subtraction of two matrices
print('Subtraction Scalar = ', arrsub1)
print('Subtraction Matrix = ', arrsub2)
arrdiv1 = arr1 / 2 # division of matrix by scalar
arrdiv2 = arr1 / arr2 # element-wise division of two matrices
print('Division Scalar = ', arrdiv1)
print('Division Matrix = ', arrdiv2)
Output
Addition Scalar = [[3 4 5]
 [3 4 5]
 [3 4 5]]
Addition Matrix = [[5 7 9]
 [5 7 9]
 [5 7 9]]
Subtraction Scalar = [[-1 0 1]
 [-1 0 1]
 [-1 0 1]]
Subtraction Matrix = [[-3 -3 -3]
 [-3 -3 -3]
 [-3 -3 -3]]
Division Scalar = [[0.5 1. 1.5]
 [0.5 1. 1.5]
 [0.5 1. 1.5]]
Division Matrix = [[0.25 0.4 0.5 ]
 [0.25 0.4 0.5 ]
 [0.25 0.4 0.5 ]]
NumPy Arithmetic Operations (Cont.)
numpyop.py
import numpy as np
# arr1 and arr2 are the arrays defined in the previous listing
arrmul1 = arr1 * 2 # multiply matrix with scalar
arrmul2 = arr1 * arr2 # element-wise multiplication of two matrices
print('Multiply Scalar = ', arrmul1)
# Note : this is not matrix multiplication
print('Multiply Matrix = ', arrmul2)
# In order to do matrix multiplication
arrmatmul = np.matmul(arr1,arr2)
print('Matrix Multiplication = ',arrmatmul)
# OR
arrdot = arr1.dot(arr2)
print('Dot = ',arrdot)
# OR
arrpy3dot5plus = arr1 @ arr2
print('Python 3.5+ support = ',arrpy3dot5plus)
Output
Multiply Scalar = [[2 4 6]
 [2 4 6]
 [2 4 6]]
Multiply Matrix = [[ 4 10 18]
 [ 4 10 18]
 [ 4 10 18]]
Matrix Multiplication = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Dot = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Python 3.5+ support = [[24 30 36]
 [24 30 36]
 [24 30 36]]
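With square matrices of the same size, both * and @ run without error, so the difference is easy to miss. The sketch below uses non-square arrays (values chosen arbitrarily) where only matrix multiplication is defined.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])    # shape (2, 3)
b = np.array([[1, 0], [0, 1], [1, 1]])  # shape (3, 2)

# Matrix multiplication: (2, 3) @ (3, 2) gives shape (2, 2)
product = a @ b
print(product)

# Element-wise multiplication needs compatible shapes, so this fails.
try:
    a * b
    elementwise_ok = True
except ValueError:
    elementwise_ok = False
print(elementwise_ok)
```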
Sorting Array
• np.sort(arr) returns a sorted copy of the input array, whereas arr.sort() sorts the array in place.
syntax
import numpy as np
# arr = our ndarray
np.sort(arr,axis,kind,order)
# OR arr.sort()
Parameters
arr = array to sort
axis = axis along which to sort (default = -1, the last axis)
kind = kind of algorithm to use ('quicksort' <- default, 'mergesort', 'heapsort')
order = field(s) on which we want to sort (if the array has multiple fields)
• Example :

numpysort.py
import numpy as np
arr = np.array(['Swaminarayan','Institute','of','Technology'])
print("Before Sorting = ", arr)
arr.sort() # in place; np.sort(arr) would return a sorted copy instead
print("After Sorting = ",arr)
Output
Before Sorting = ['Swaminarayan' 'Institute' 'of' 'Technology']
After Sorting = ['Institute' 'Swaminarayan' 'Technology' 'of']
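The copy versus in-place behaviour, and the effect of the axis parameter, can be seen in a short sketch; the arrays and variable names below are arbitrary.

```python
import numpy as np

arr = np.array([3, 1, 2])

sorted_copy = np.sort(arr)  # returns a new array, arr is untouched
before = arr.tolist()
arr.sort()                  # sorts arr itself and returns None
after = arr.tolist()
print(before, sorted_copy, after)

grid = np.array([[9, 1, 8], [3, 7, 2]])
rows_sorted = np.sort(grid)          # default: last axis, each row sorted
cols_sorted = np.sort(grid, axis=0)  # each column sorted
print(rows_sorted)
print(cols_sorted)
```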
Sort Array Example
numpysort2.py
import numpy as np
dt = np.dtype([('name', 'S10'),('age', int)])
arr2 = np.array([('PQR',200),('ABC',300),('XYZ',100)],dtype=dt)

arr2.sort(order='name')
print(arr2)
Output
[(b'ABC', 300) (b'PQR', 200) (b'XYZ', 100)]
Conditional Selection
• Similar to arithmetic operations, when we apply a comparison operator to a NumPy array, it is applied to each element of the array, and a new boolean NumPy array is created with the values True or False.
numpycond1.py
import numpy as np
arr = np.random.randint(1,100,10)
print(arr)
boolArr = arr > 50
print(boolArr)
Output (values differ on every run)
[25 17 24 15 17 97 42 10 67 22]
[False False False False False True False False True False]
• The boolean array can then be used as an index to select only the elements for which the condition is True.
numpycond2.py
import numpy as np
arr = np.random.randint(1,100,10)
print("All = ",arr)
boolArr = arr > 50
print("Filtered = ", arr[boolArr])
Output (values differ on every run)
All = [31 94 25 70 23 9 11 77 48 11]
Filtered = [94 70 77]
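Conditions can also be combined. The sketch below reuses the sample values from the output above instead of random data, so the result is repeatable; the & and | operators and np.where are extras not shown on the slides. Each condition needs its own parentheses, and Python's and/or keywords do not work element by element.

```python
import numpy as np

arr = np.array([31, 94, 25, 70, 23, 9, 11, 77, 48, 11])

between = arr[(arr > 20) & (arr < 75)]  # both conditions True
outside = arr[(arr < 20) | (arr > 90)]  # at least one condition True

# np.where picks one of two values per element, based on the condition
labels = np.where(arr > 50, 'high', 'low')

print(between)
print(outside)
print(labels)
```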
