
Machine Learning Notes


Date: 01.09.2020

What is machine learning?

It is the process of teaching a machine to recognize patterns in images, pictures, or some other kind of data, using a combination of data and algorithms.

Categories of machine learning

Supervised - we have the input as well as the correct output to check against.

input ---> output

The difference between the predicted and the actual output is called the error margin.

Two types of algorithms (to handle/complete the big main process, divide it into smaller processes):

1. Classification – to differentiate/distinguish between categories (e.g. the difference between cat and dog)
Examples:
o K-NN
o Logistic regression
2. Regression – where we have to predict a continuous numeric value
o Simple linear
o Multiple linear
o Polynomial

Unsupervised - we only have input.

We mostly use clustering (grouping) in unsupervised learning.

Algorithms:

1. Clustering (grouping)
Example:
o K-means clustering
2. Anomaly detection
To understand the pattern

Reinforcement - reward-based learning (a deep learning concept).

Algorithms:

1. Monte Carlo
2. Q learning

Pre-processing - Python pandas library

Tools:
PyCharm
Jupyter Notebook
IDLE

Python should be recognised (on the PATH)

Python 3.7.4

Python install package (pip)

In cmd type: pip install jupyter

After that create a new folder and open cmd from that path

Shift+Enter runs a cell of Jupyter code

NumPy

In cmd: pip install numpy

import numpy/pandas/matplotlib

Commands in cmd

1. python (hit enter)
2. exit()
3. pip --version
4. pip install jupyter
5. pip install numpy
6. pip install pandas
7. pip install matplotlib
Date: 03.09.2020

Numpy
It is a library used in Python for mathematical and scientific calculations.

In Jupyter Notebook:

import numpy as np

print(dir(np))

print(help(np))

In Google: numpy documentation for Python 3 (for guidance)

Array (a collection of the same data type)

An array is different from a matrix, as the operations performed on them are different.

# 1D array

A_1 = np.array([1,2,3,4,5]) -------- the '()' belong to the array method in numpy and the '[]' define the array items

size is used to know the number of elements in the array

print(A_1)

output

[1 2 3 4 5] ----- the shape of this array is (5,)


# 2D array

A_2 = np.array([
[1,2,3],  ----> 1st row
[4,5,6],  ----> 2nd row --------- the shape of this array is (3, 3)
[7,8,9]   ----> 3rd row
])

print(A_2)

Output

[[1 2 3]
 [4 5 6]
 [7 8 9]]
# 3D array, shape 3x3x3 ----> 3 blocks, 3 rows, 3 elements each

A_3 = np.array([
[[1,2,3],[4,5,6],[7,8,9]],
[[10,11,12],[13,14,15],[16,17,18]],
[[19,20,21],[22,23,24],[25,26,27]],
])

Output:

[[[1 2 3]
[4 5 6]
[7 8 9]]

[[10 11 12]
[13 14 15]
[16 17 18]]

[[19 20 21]
[22 23 24]
[25 26 27]]]

Range function
Has three parameters (start, stop, increment/decrement).
Gives integer-type data.

size = range(10)
print(size)
type(size)
for i in size:
    print(i)
    print(type(i))

arange (same as the range function; the method and behaviour are almost the same)

It generates array-type data.

> arange(p1,p2,p3)

p1 = starting position
p2 = ending position
p3 = increment/decrement

arr = np.arange(1,10)
print(arr)
print(type(arr))
output: [1 2 3 4 5 6 7 8 9]
<class 'numpy.ndarray'>

nd means n-dimensional array

ar_1 = np.arange(10,51,2)

print(ar_1)
output: [10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50]

How to create a zero array?

numpy.zeros((no_of_rows, no_of_cols))
numpy.ones((no_of_rows, no_of_cols))

zeros = np.zeros((4,3))
zeros
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

zeros_1 = np.zeros((3,4), dtype=np.int16)
zeros_1
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int16)

The same goes for the ones array.

Methods and keywords (numpy)


Ndim(used to get the dimension of array/provides the dimension of the array)

--------------print(arr_1.ndim)

Size(gives no of elements of array)

--------------print(arr_1.size)

itemsize(gives the size of how much data/bits a single element consume)

-------------print(arr_1.itemsize)

Dtype(gives datatype of the element)

---------------print(arr_1.dtype)

Shape(gives no of rows and cols)

---------------print(arr_1.shape) output=(8,)

Reshape an array (conversion of a 1D array to 2D or 3D, or vice versa)

temp_arr = arr_1.reshape(2,4)

print(temp_arr)

temp_arr.ndim

flatten (converts a 2D/3D array to 1D)


Why does an array consume less memory than a list / why is an array faster than a list?

import numpy as np
import time
import sys

s = range(1000)
print(sys.getsizeof(5)*len(s))
14000

d = np.arange(1000)
print(d.size*d.itemsize)
4000

Matrix multiplication, matrix addition and matrix division explained theoretically.

Date: 04.09.2020

 zip is used to pair up two lists element by element

Array is faster than list
Program

import numpy as np
import time

size = 10000

#list
l1 = range(size)
l2 = range(size)

start = time.time()
print(start)
#result = l1+l2 ---> this would concatenate
result = [(x,y) for x,y in zip(l1,l2)]  #zip pairs up the two lists
print((time.time()-start)*1000)

#array
a1 = np.arange(size)
a2 = np.arange(size)

start = time.time()
result = a1+a2
print((time.time()-start)*1000)

What is a line?

A line is a combination of multiple/infinite points.

Linspace function
linspace is used to get points in between two points of a line.

Syntax: linspace(a,b,c)

a - starting point
b - ending point
c - number of points you want
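
A minimal sketch of linspace (the values shown are what NumPy prints for these arguments):

import numpy as np

# 5 evenly spaced points between 0 and 1, endpoints included
points = np.linspace(0, 1, 5)
print(points)   # [0.   0.25 0.5  0.75 1.  ]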

to get minimum value from an array

array_name.min()

to get maximum value from an array

array_name.max()

to get sum of array element

array_name.sum()

arr_1 =

8  9
10 11
12 13

axis 0 runs down the rows (so sum(axis=0) adds up each column); axis 1 runs across the columns (so sum(axis=1) adds up each row).


#used mostly in pre-processing

arr_1=np.array([

[8,9],

[10,11],

[12,13]

])

arr_1.shape

print(arr_1.sum())

print(arr_1.sum(axis=0))---->gives addition of cols

print(arr_1.sum(axis=1))--->gives addition of rows

Square Root of array


It gives the element-wise square root

Syntax: numpy.sqrt(array_name)

Example: arr_1=np.array([

[8,9],

[10,11],

[12,13]])

print(np.sqrt(arr_1))

output:

[[2.82842712 3. ]
[3.16227766 3.31662479]
[3.46410162 3.60555128]]

Addition, subtraction, multiplication and division of array


Program
arr_1=np.array([[1,2,3],[3,4,5]])

arr_2=np.array([[1,2,3],[3,4,5]])

print("addition \n",arr_1+arr_2)

print("subtraction \n",arr_1-arr_2)

print("multiplication \n",arr_1*arr_2)

print("division\n",arr_1/arr_2)
output
addition
[[ 2 4 6]
[ 6 8 10]]
subtraction
[[0 0 0]
[0 0 0]]
multiplication
[[ 1 4 9]
[ 9 16 25]]
division
[[1. 1. 1.]
[1. 1. 1.]]

Horizontal stacking, vertical stacking

The arrays must be passed as a tuple.
Syntax: vstack(tuple)
        hstack(tuple)
Program
a= np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[3,4,5]])
#vertical stacking
print(np.vstack((a,b)))
print(np.hstack((a,b)))

Matrix
#convert array to matrix
m_1=np.matrix(a)
print(type(m_1))

#creating a matrix 3X3


m_2=np.matrix("1 2 3 ;4 5 6;7 8 9")
print(m_2)
print(np.min(m_2))
print(np.max(m_2))
print(np.sum(m_2))
print(np.diagonal(m_2))

PANDAS
Python has 4 types of number systems:
Decimal

Hexadecimal

Octal

Binary

Dictionary
A key is one of two data types: either numeric (int) or string.

Dataframe (mostly used in data analysis)

It is a table-like structure which looks like an excel sheet.

A key works as a column name in the dataframe.

Program
#creation of a dataframe
import pandas as pd

weather_data = {
    "Day": ['1/1/2020','22/1/2020','3/2/2020','12/4/2020','25/5/2020'],
    "Temp": [31,29,22,35,19],
    "Wind_speed": [7,9,4,5,6],
    "Event": ["sunny","sunny","rain","fog","sunny"]
}

print(weather_data)

#conversion of the dictionary to a dataframe

df = pd.DataFrame(weather_data)

df

Date: 05.09.2020

Shape: gives the dimensions of the table, i.e. the number of rows and columns.

ROW
Head
Syntax: dataframe_name.head(no_of_rows)

Displays the upper/starting rows of the dataframe.

By default: 5

Tail
Syntax: dataframe_name.tail(no_of_rows)

Gives the lower/last rows of the dataframe.

By default: 5

Slicing:
It is used to create a sublist.

Syntax:

list_name[starting_index:ending_index]

Indexing/slicing in a dataframe:
dataframe_name[starting_index:ending_index]

COLUMN
dataframe_name.columns

Gives the names of the columns.

To get the data of a particular column:

---> dataframe_name[column_name]

day = df['Day'].values

print(day)

.values ---> is used to get the values as an array

Program to get the values of two columns:

df[['Day','Temp']].values

METHODS

temp_col = df["Temp"].values

temp_col.max()
Program

df[df["Temp"]>32]  --> queries

df[df["Temp"]==df["Temp"].max()]

df.describe() ------> used to get statistics on the numeric-format data; gives the mean, standard deviation, count and many more things.

Using a list of tuples

weather_data = [('12/2/2020',32,8,"rain"),
                ('29/3/2020',22,8,"rain"),
                ('28/5/2020',19,8,"rain"),
                ('22/7/2020',23,8,"rain")]

df = pd.DataFrame(data=weather_data, columns=["day","temp","wind_speed","events"])

Date: 07.09.2020

CSV file

How to import data from a csv file:

copy the csv data path

read_csv()

It is a method used to read csv data into a dataframe.

#csv data

pd.read_csv(r"path")

#XLS

df = pd.read_excel(r"path")

Pandas is fast because of the dataframe.

Set the first column as index:

df.set_index("column_name", inplace=True)  ---> inplace=True is used to make the operation permanent

df.index

------

df.loc["index_name"]

------ gives the data of the index mentioned

READ and WRITE data operations

CSV and Excel

The skiprows attribute is used to skip upper rows.

Syntax

df = pd.read_csv(r"C:\Users\K\Documents\stock_csv_data.csv", skiprows=no_of_rows)

Header
Makes the mentioned row the header, index-wise (zero-based):
if we pass header=1 it skips the 1st row and makes the 2nd row the header.

nrows stands for the number of rows to read.


To replace the empty values in the table with NaN, so that it becomes easy to perform operations:

Example:

df = pd.read_csv(r"C:\Users\K\Documents\stock_csv_data.csv", na_values=["n.a.","not available"])

df

Write csv
to_csv() = to create a newly written csv file from the dataframe

Example: df1.to_csv("newfile.csv",index=False)

Write excel:
df_1.to_excel("exelfile.xlsx",sheet_name="stock",index=False,startrow=2,startcol=1,header=False)

to_excel: ----- to create a newly written excel file from the dataframe

#write two dataframes to two separate sheets in excel

program

import pandas as pd

df_stock = pd.DataFrame({
    "tikers": ["Google","WMT","MSFT"],
    "price": [30,40,10],
    "pe": [20.5,65.10,35.2],
    "eps": [20.5,20.5,56.1]
})

df_stock

df_weather = pd.DataFrame({
    "day": ["monday","tuesday","friday"],
    "temp": [30,40,10],
    "event": ["rain","sunny","rain"],
    "humidity": [20.5,20.5,56.1]
})

df_weather

#openpyxl ---- needed to write an excel file

with pd.ExcelWriter("stock_weather.xlsx") as writer:
    df_stock.to_excel(writer, sheet_name="stock_file")
    df_weather.to_excel(writer, sheet_name="weather_file")

Handle missing values

df.fillna(attribute) ---- used to fill the missing data for the whole table.

The attribute can be anything you want to replace the missing values with.

Handling the empty values column-wise:

new_df = df.fillna({
    "day": "no date",
    "temperature": 0,
    "windspeed": 0,
    "event": "no events"})


Date: 09.09.2020

Handling missing data

Methods are not performed on the index.

parse_dates=['day']

Used to change the day/date string values into date format / timestamps.

example: df = pd.read_csv("weather_data.csv", parse_dates=['day'])

fillna(value)

Fills all the NaN values of the table with the value passed.

Methods of fillna:

ffill (forward fill)

It is used to fill the current value with the previous value / the cell value above.

bfill (backward fill)

It is used to fill the current value with the next value / the cell value below.

interpolate

It gives meaningful data. By default it uses linear interpolation.

dropna

Used to drop rows which have missing data.

thresh

It controls which rows with NaN values are removed:
dropna(thresh=n) keeps only the rows that have at least n non-NaN values.

A small sketch of these methods follows.
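
A minimal sketch of the fillna/dropna methods above, on a hypothetical one-column table (the column name "temperature" is just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": [32.0, np.nan, 28.0, np.nan, 30.0]})
print(df.ffill())           # forward fill: copies the cell value from above
print(df.bfill())           # backward fill: copies the cell value from below
print(df.interpolate())     # linear interpolation by default
print(df.dropna(thresh=1))  # keeps rows having at least 1 non-NaN value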

# date_range is used to insert the dates into the table

rg = pd.date_range("2017-01-01","2017-01-11")

index = pd.DatetimeIndex(rg)

df.reindex(index)

replace using a #regular expression (regex):

df2 = df.replace({
    "temperature": '[a-z]',
    "windspeed": '[a-z]'
}, '', regex=True)
df2

Grouping / clustering

groupby is used when we have data with a column containing repetitive values of different categories; a small sketch follows.
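
A minimal groupby sketch on a hypothetical table with a repeating "city" column (names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "city": ["mumbai", "delhi", "mumbai", "delhi"],
    "temp": [32, 40, 30, 42],
})
# one group per distinct city; mean() aggregates each group
print(df.groupby("city")["temp"].mean())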

Date: 10.09.2020

Concatenation:
Used for basic concatenation

#conactenate and key

df=pd.concat([india_weather,us_weather],keys=["india","us"])

df

#with the index

df=pd.concat([india_weather,us_weather])

df

#ignore the index

df=pd.concat([india_weather,us_weather],ignore_index=True)

df

Merge
Merge is used to merge two dataframes.

Syntax: pandas.merge(dataframe1, dataframe2, on="columnname", how="inner/outer", indicator=True/False (False by default))

"on" --- it merges the dataframes on the basis of the column name mentioned.

Merge has the same concept as joins in DBMS.

Inner join ----- common things only (intersection)

Outer join ----- whole data (union)

Left join ------ only left data (with common things from the right)

Right join ----- only right data (with common things from the left)

Indicator flag ----- used to indicate which rows are common

Ex: df3=pd.merge(df1,df2,on="city",how="outer",indicator=True)

Suffixes ----- used to keep the values of both dataframes when column names clash

Ex:
df3=pd.merge(df1,df2,on="city",how="outer",suffixes=("_first","_second"),indicator=True)
df3
output:

    city        temp_first  humidity_first  temp_second  humidity_second  _merge
0   new york    22.0        55.0            18.0         68.0             both
1   chicago     15.0        85.0            23.0         65.0             both
2   orlando     35.0        76.0            NaN          NaN              left_only
3   baltimore   40.0        68.0            NaN          NaN              left_only
4   san diego   NaN         NaN             35.0         71.0             right_only

MATPLOTLIB
import matplotlib.pyplot as plt

%matplotlib inline   (always used in jupyter only)

To plot ----- plt.plot(x,y)

plt.title("Weather Graph")

plt.xlabel("Days")

plt.ylabel("Temp")

plt.plot(x,y,color="blue",linewidth=2,linestyle="dotted",marker="*")

attributes of plot:
color

linewidth

linestyle

marker ---- style of the data-point markers

alpha ---- controls opacity

In IDLE Python, to see the graph you have to write

plt.show()

String format

Refer: https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html

Date: 11.09.2020

To get three cities' temperature information in one graph

Program:

plt.title("weather")

plt.xlabel("day")

plt.ylabel("temp")

plt.plot(day,mumbai,"g*-")

plt.plot(day,delhi,"ro-")

plt.plot(day,pune,"b^-")

plt.legend(loc="upper right") ------- used to place the legend (the default value is "best": it fits itself into the empty space)

attributes of legend:

loc, fontsize="large"/"small", shadow --- gives a shadow to the legend box

plt.grid()

Bar Chart:
company=["Reliance","Indian Oil","State Bank Of India","TATA"]

revenue=[82,77,47,65]

plt.bar(company,revenue,color="red")

#if an error comes about integer values

company_position=np.arange(len(company))

plt.bar(company_position,revenue,color="green")

plt.xticks(company_position,company)

plt.show()

#multiple bar vertical

plt.bar(company_position-0.2,revenue,width=0.4,label="revenue")

plt.bar(company_position+0.2,profit,width=0.4,label="profit")

plt.legend(fontsize="large")

plt.xticks(company_position,company)

plt.show()
#multiple bar horizontal

plt.barh(company_position-0.2,revenue,label="revenue")

plt.barh(company_position+0.2,profit,label="profit")

plt.legend(fontsize="large")

plt.yticks(company_position,company)

plt.show()

Histogram
It can be generated even with only one parameter.

The x axis carries the variable.

The y axis shows the frequency accordingly.

people_ages=[12,45,18,8,3,85,75,65,15,95,35,23,44,66,58,62,73,84,92,110]

age_group=[1,10,20,30,40,50,60,70,80,90,100,110]

plt.hist(people_ages,age_group,rwidth=0.8)

plt.yticks(range(0,7))

plt.show()

Pie Chart
exp=[1400,600,300,410,250]

exp_label=["bike","food","phone","internet","others"]

plt.pie(exp,labels=exp_label,shadow=True,autopct="%1.5f%%",explode=[0,0,0,0.4,0])

plt.show()

#plt.axis("equal") --- when you get an oval-shaped pie chart, this is used to make it circular

To save the chart:

plt.savefig("pie.png")

#never call plt.show() before savefig or it will not save the graph

Date: 14.09.2020
ALGORITHMS
Simple linear Regression
---- an algorithm's implementation in ML is called a model

When the value to predict is in number form, we can use regression.

Independent variable --- data that can be controlled directly

Dependent variable --- data that cannot be controlled directly

Used to find the best-fit line with the minimum error margin.

x (years of experience)   y (salary)
1                         20
2                         40
3                         50
4                         40
5                         50

[Plot: the points above plotted as salary (y axis, 10 to 50) against experience (x axis, 1 to 5)]

Formula: y = b0 + b1*x  (or y = c + m*x)

b0 ---> the intercept

b1 ---> the slope coefficient

Main formula to find the correlation: y = b0 + b1*x

b1 = sum((x - x_mean)*(y - y_mean)) / sum((x - x_mean)^2)
b0 = y_mean - b1*x_mean

c ---> constant (intercept)

m ---> slope

y ---> dependent    x ---> independent
Required things to find the best-fit line:

x   y    x-x'  y-y'  (x-x')^2  (x-x')*(y-y')
1   20   -2    -20   4         40
2   40   -1    0     1         0
3   50   0     10    0         0
4   40   1     0     1         0
5   50   2     10    4         20

x_mean=3  y_mean=40  sums: 0, 0, 10, 60
m = sum((x-x')*(y-y')) / sum((x-x')^2) = 60/10 = 6

to find the constant:

y_mean = c + m*x_mean

40 = c + 6*3 = c + 18

c = 22

simple linear regression code in python

import numpy as np
import matplotlib.pyplot as plt

def coe(x,y):
    global m_x,m_y,c
    n=np.size(x)
    print(n)
    #means
    m_x=np.mean(x)
    m_y=np.mean(y)
    #calculating the cross-deviation and deviation about x
    ss_xy=np.sum(y*x)-(n*m_y*m_x)   #for sum((x-x')*(y-y'))
    ss_xx=np.sum(x*x)-(n*m_x*m_x)   #for sum((x-x')^2)
    #calculating the regression coefficients
    m=ss_xy/ss_xx
    c=m_y - m*m_x
    return (c,m)

def plotting_regression_line(x,y,b):
    #plotting data points
    plt.scatter(x,y)
    #predicted values
    y_pred=b[0]+(b[1]*x)
    print(y_pred)
    #plotting the regression line
    plt.plot(x,y_pred)
    #labels
    plt.xlabel("year of experience")
    plt.ylabel("Salary")
    plt.grid()
    plt.show()

def main():
    x=np.array([1,2,3,4,5])
    y=np.array([20,40,50,40,50])
    #call function coe
    b=coe(x,y)
    print(b)
    plotting_regression_line(x,y,b)

main()

sklearn is used to make things easy

code (jupyter)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression

dataframe=pd.read_csv("emp_data.csv")
dataframe
dataframe.head()
#isnull gives the number of null values in the data
dataframe.isnull().sum()

X=dataframe["Year of Experience"].values
Y=dataframe["salary"].values

#reshaping X and Y because fit() only takes 2D arrays or values
X=X.reshape(-1,1)
X.shape
Y=Y.reshape(-1,1)

lr=LinearRegression()
#training the machine according to the data
lr.fit(X,Y)   #--------- used to fit the values in the machine
y_pred=lr.predict(X)
y_pred

#plotting
plt.scatter(X,Y,color="red")
plt.plot(X,y_pred,color="blue")
plt.show()

Date: 15.09.2020

Cross validation for simple linear regression

Main formula to find the best-fit line:

y = b0 + b1*x

b0 = [sum(y)*sum(x^2) - sum(x)*sum(xy)] / [n*sum(x^2) - (sum(x))^2] = 22

b1 = [n*sum(xy) - sum(x)*sum(y)] / [n*sum(x^2) - (sum(x))^2] = 6

x    y    xy   x^2
1    20   20   1
2    40   80   4
3    50   150  9
4    40   160  16
5    50   250  25
15   200  660  55   (column sums)

y = 22 + 6*5 (to get the predicted value)

separate x and y using the iloc method

date: 16.09.2020(absent)

Multiple Linear Regression:

X1     X2     Y (output)
21     31     44
22     36     45
23     32     46
24     35     47
28     34     48
27     38     49
24.16  34.33  46.5   (means)

Simple linear: y = b0 + b1*x

Multiple linear (prediction):

y = b0 + b1*x1 + b2*x2

b1 = sum[(x1 - x1_mean) * (y - y_mean)] / sum[(x1 - x1_mean)^2]

b2 = sum[(x2 - x2_mean) * (y - y_mean)] / sum[(x2 - x2_mean)^2]

x1_mean = 24.16

x2_mean = 34.33

y_mean = 46.5

x1   x2   y    x1-x1m  x2-x2m  y-ym  (x1-x1m)^2  (x2-x2m)^2  (x1-x1m)*(y-ym)  (x2-x2m)*(y-ym)
21   31   44   -3.16   -3.33   -2.5  9.9856      11.0889     7.9              8.325
22   36   45   -2.16   1.67    -1.5  4.6656      2.7889      3.24             -2.505
23   32   46   -1.16   -2.33   -0.5  1.3456      5.4289      0.58             1.165
24   35   47   -0.16   0.67    0.5   0.0256      0.4489      -0.08            0.335
28   34   48   3.84    -0.33   1.5   14.7456     0.1089      5.76             -0.495
27   38   49   2.84    3.67    2.5   8.0656      13.4689     7.1              9.175

means: x1=24.16, x2=34.33, y=46.5;  column sums: 38.8336, 33.3334, 24.5, 16

b1 = sum[(x1 - x1_mean) * (y - y_mean)] / sum[(x1 - x1_mean)^2]

b1 = 24.5 / 38.8336

b1 = 0.63089

b2 = sum[(x2 - x2_mean) * (y - y_mean)] / sum[(x2 - x2_mean)^2]

b2 = 16 / 33.3334
b2 = 0.48

y_mean = b1*x1_mean + b2*x2_mean + b0

b0 = y_mean - b1*x1_mean - b2*x2_mean

b0 = 46.5 - (0.63089)*(24.16) - (0.48)*(34.33)

b0 = 46.5 - 15.24 - 16.48

b0 = 14.78

check the prediction:

y_pred = b1*x1 + b2*x2 + b0

y_pred = 0.6309*22 + 0.48*36 + 14.78

y_pred = 13.88 + 17.28 + 14.78

y_pred = 45.94

Date: 18.09.2020

simple                                   multiple
single input: x, y                       multiple inputs: x1, x2, x3, y
reshape(-1,1) to fit column-wise         reshape(1,-1) to fit row-wise

Get dummies

It converts string (categorical) data into ints by analysing it; a minimal sketch follows the table below.

Refer to the get-dummies excel sheet.

New york   California   Florida
1          0            0
0          1            0
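
A minimal sketch producing the 0/1 table above (the column name "state" is illustrative):

import pandas as pd

df = pd.DataFrame({"state": ["New york", "California", "Florida"]})
print(pd.get_dummies(df["state"]))   # one 0/1 column per distinct string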
We can't do plotting in multiple linear regression because there are multiple inputs and one output.

Score (uses the R-squared method)

Date. 21.09.2020

Underfitting model (does not create much of a problem)

The score of the model is very poor (50 or 60).

Not fully fit.

Training gives 50-60% accuracy.

The actual value and the predicted value have a large difference, which means testing is also not accurate.

Overfitting model

The model tries to cover every point of the data.

Types of data:

Training data - is wholly memorised and 100% accurate.

Testing data - but gives bad predictions.

In this model the training accuracy is 99%, but the testing data is not that accurate.

The underfitting problem is very mild and the overfitting problem is very severe.

Generalized model

It covers almost all the points and gives as low an error as possible.

We should get low error in training as well as in testing.

This is the ideal model.

R-squared method

R² = 1 - RSS/TSS

residual = y - y_pred

total deviation = y - y_mean

RSS = sum((y - y_pred)^2)

TSS = sum((y - y_mean)^2)
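
A small sketch computing R² with these formulas, using the x/y data and the best-fit line y = 22 + 6x worked out earlier:

import numpy as np

y = np.array([20, 40, 50, 40, 50])            # actual values
y_pred = 22 + 6 * np.array([1, 2, 3, 4, 5])   # predictions from y = 22 + 6x

rss = np.sum((y - y_pred) ** 2)       # residual sum of squares = 240
tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares = 600
print(1 - rss / tss)                  # R² = 1 - 240/600 = 0.6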

Polynomial regression
Simple linear has one input, one output:

y = b0 + b1*x

Polynomial regression:

y = b0 + b1*x + b2*x^2 + b3*x^3

The more the data curves, the higher the degree needed.

It is used where we have to predict values continuously; the regression algorithm predicts continuous numbers.

Generalized form (normal equations):

| n         sum(x)     sum(x^2) |   | b0 |   | sum(y)      |
| sum(x)    sum(x^2)   sum(x^3) | * | b1 | = | sum(y*x)    |
| sum(x^2)  sum(x^3)   sum(x^4) |   | b2 |   | sum(y*x^2)  |
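
A hedged sklearn sketch of polynomial regression (PolynomialFeatures builds the x, x^2, ... columns; the degree and data are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([20, 40, 50, 40, 50])

poly = PolynomialFeatures(degree=2)   # adds an x^2 input column
x_poly = poly.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
print(model.predict(poly.transform([[6]])))   # predict for a new x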

Date: 22.09.2020

Date: 23.09.2020

Regression: regression means predicting a continuous value.

Classification: categorising the data; it predicts the output from multiple classes.

(Categorical data) Animal data example

Categorical data

Class1 is cat, class2 is dog

Cat
Dog
Dog
Cat
Cat
Cat
Dog
Dog

Logistic Regression:
(It is a classification algorithm); it is similar to linear regression.

For binary classification the best algorithm to use is logistic regression.

Sigmoid curve

The value is either 0 or 1.

It works on probability.
Formula of the sigmoid curve (simple linear):

y = e^(b0+b1*x) / (1 + e^(b0+b1*x))
or

y = 1 / (1 + e^-(b0+b1*x))

multiple linear:

y = e^(b0+b1*x1+b2*x2+...+bn*xn) / (1 + e^(b0+b1*x1+b2*x2+...+bn*xn))
or

y = 1 / (1 + e^-(b0+b1*x1+b2*x2+...+bn*xn))

Example:

x     y    x-x_mean  y-y_mean  (x-x_mean)^2  (x-x_mean)*(y-y_mean)
21    1    -1.5      0.5       2.25          -0.75
22    0    -0.5      -0.5      0.25          0.25
23    0    0.5       -0.5      0.25          -0.25
24    1    1.5       0.5       2.25          0.75
22.5  0.5  0         0         5             0     (means/sums)

B1 = 0/5 = 0

B0 = y_mean - B1*x_mean = 0.5

if we have a new value x=46, predict y:

y = b0 + b1*x

y = 0.5

y = 1/(1 + e^-(b0+b1*x)) = 1/(1 + e^-0.5)
y = 0.622459

this will go in the 1's category

Code of logistic regression with sklearn for a single variable, i.e. single input

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

df=pd.read_csv("insurance_data.csv")

df.isnull().sum()

X=df["age"].values

y=df["bought_insurance"].values

#plotting the actual data

plt.scatter(X,y)

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

#creating a model

from sklearn.linear_model import LogisticRegression

log_model=LogisticRegression()

x_train=x_train.reshape(-1,1)

y_train=y_train.reshape(-1,1)

x_test=x_test.reshape(-1,1)

y_test=y_test.reshape(-1,1)

log_model.fit(x_train,y_train)

y_pred=log_model.predict(x_test)

log_model.score(x_test,y_test)*100
#prediction for an external value

test=np.array([[25]])

log_model.predict(test)

plt.scatter(X,y)

y_pred=log_model.predict(X)

plt.plot(X,y_pred)

Date:24.09.2020

Logistic regression (titanic project)

 Step: analyse the data by seeing what information and columns are given, and remove the unnecessary data.
 If categorical data is given, use get_dummies to separate it.
 Life cycle of a model, steps:
1. Collecting data
2. Analysing the data --- which variables have the most correlation (between labels and features), i.e. which columns to remove and which to keep
3. Data wrangling --- clean the data, i.e. remove unwanted data, fill or remove null values as needed
4. Training and testing the data --- we have to train the model and separate the training and testing data
5. Validation --- whether the model gives the right output or not; checking the accuracy of the model

Confusion matrix:

                          pred 0 (not survived)   pred 1 (survived)
actual 0 (not survived)   74                      8
actual 1 (survived)       18                      43

Used to check the rightness and wrongness of the predictions.
The correct predictions are always seen on the diagonal.
The off-diagonal values (8 and 18) are the errors of the model.

Date: 25.09.2020

K Nearest Neighbour
We count the distance of all the points from the newly entered data point.

Euclidean distance formula:

root((x1-x2)^2 + (y1-y2)^2)

male=0, female=1

Name   Age   Gender   Sports     Distance
A      32    0        Football   27.01
M      40    0        Neither    35.01
S      16    1        Cricket    11
Z      34    1        Cricket    29
S      55    0        Neither    50.01
R      40    0        Cricket    35.01
A      20    1        Neither    15
A      15    0        Cricket    10.1
P      55    1        Football   50
A      15    0        Football   10
new    5     1        ? === cricket

x1=32  # existing data point        k=3  # k is always an odd number

y1=0   # existing data point

x2=5   # new data point

y2=1   # new data point

math.sqrt((x1-x2)**2 + (y1-y2)**2)

The use case is solved manually, without sklearn (a hedged sklearn sketch follows).
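
A hedged sklearn sketch of the same use case (the table above, encoded as [age, gender]):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[32, 0], [40, 0], [16, 1], [34, 1], [55, 0],
              [40, 0], [20, 1], [15, 0], [55, 1], [15, 0]])
y = ["Football", "Neither", "Cricket", "Cricket", "Neither",
     "Cricket", "Neither", "Cricket", "Football", "Football"]

knn = KNeighborsClassifier(n_neighbors=3)  # k=3, always an odd number
knn.fit(X, y)
print(knn.predict([[5, 1]]))   # the new data point -> 'Cricket'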

Date: 28.09.2020

KNN is mostly used in YouTube and in online shopping apps and websites, for recommending things.

Date. 30.09.2020

SVM (Support Vector Machine)

The hyperplane is nothing but a linear line used for classification (a decision boundary).

Support vectors are the data points which are nearest to the hyperplane.

Opposite data points lie at -d and +d from the hyperplane.

The margin is created using the support vectors.

Maximum margin hyperplane

On this basis we choose how our hyperplane will be formed.

How to decide which margin to pick:

In the example below we choose margin 2 because it has more distance/width.
There are two types of SVM:

1. Linear support vector machine

We can easily separate the data with a linear line.
2. Non-linear support vector machine
Kernel: converts low-dimensional data to high-dimensional data,
i.e. if the data is in 2D then it converts it into 3D.

There are 4 types of kernel:

1. Linear (the default)
2. Polynomial
3. RBF - radial basis function (best for vectors) (non-linear)
4. Sigmoid

Example: maths sum

How the hyperplane is drawn mathematically

Q. Draw the hyperplane for the given data:

(1,1)(2,1)(1,-1)(2,-1)(4,0)(5,1)(5,-1)(6,0)

1 1
2 1
1 -1
2 -1
4 0
5 1
5 -1
6 0

Step1: plot the graph


Support vectors, each with an added bias entry of 1:

s1 = (2, 1, 1)    s2 = (2, -1, 1)    s3 = (4, 0, 1)

We will use the linear method, as the default is linear, and vector separation:

α1 s1·s1 + α2 s1·s2 + α3 s1·s3 = -1   ---- equation for s1
α1 s1·s2 + α2 s2·s2 + α3 s2·s3 = -1   ---- equation for s2
α1 s1·s3 + α2 s2·s3 + α3 s3·s3 = 1    ---- equation for s3

Solving the above equations (substituting the dot products):

for vector s1: 6α1 + 4α2 + 9α3 = -1
for vector s2: 4α1 + 6α2 + 9α3 = -1
for vector s3: 9α1 + 9α2 + 17α3 = 1

α1 = -3.25    α2 = -3.25    α3 = 3.5

Ɯ = Σ αi·si

Ɯ = α1*s1 + α2*s2 + α3*s3

Ɯ = (1, 0, -3)   (as a vector; the last entry is the offset/bias)

Equation of the plane:

weights (1, 0) with offset/bias -3, so the hyperplane is the vertical line x = 3.

Date: 01.10.2020

By default the kernel taken is RBF.

Benefit
We can do parameter tuning to increase the accuracy.

Disadvantage of SVM
SVM cannot handle very large-scale data, because its training time is very long.

To see all the built-in datasets in sklearn:

from sklearn import datasets

dir(datasets)

Homework for the cancer dataset:

convert it into a dataframe

check the 0 and 1 classes (a hedged sketch follows)
Date: 02.10.2020

Decision tree
Helps to take a decision, like whether the answer is yes or no, profit or loss.

Example of a salesman for a loan company

This is called a decision tree.

It does either classification or regression; it is mostly used for classification.

Target: to sell the loan.

The rectangular boxes are nodes.

Two nodes are very important:

1. Starting node: root node

The starting point of the decision.
2. Ending node: leaf node
Where the decision is taken.

Step one: find the target.

Step two: find the root node.

Old = 50 and above, mid = 20 to 50, new = 20 and below

Age Competition Type Profit


Old Yes Software Down
Old No Software Down
Old No Hardware Down
Mid Yes Software Down
Mid Yes Hardware Down
Mid No Hardware Up
Mid No Software Up
New Yes Software Up
New No Hardware Up
New No software Up

Step1: find the target attribute

Target attribute: profit

Step 2: find the information gain of the target attribute

IG (information gain)

If even one of the two counts (P or N) is 0, the IG is zero; if both counts are equal, the IG is 1.

Formula:

IG = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))

P=no of up i.e. profit gain

N=no of down i.e. profit loss

P=5

N=5

IG=1

Step 3: find the gain of each feature attribute

1. Age:
Information gain (IG) for each of old, mid, new

age   down   up
Old   3      0
Mid   2      2
New   0      3

For old:
IG = -(0/3) log2(0/3) - (3/3) log2(3/3)
IG = 0
Similarly:
for mid, IG = 1
for new, IG = 0
Find the entropy --- the weighted sum of the IGs:
E(A) = Σ ((Pi+Ni)/(P+N)) * IG(Pi,Ni)

E(A) = ((3+0)/10)*0 + (4/10)*1 + ((0+3)/10)*0 = 0.4
Gain
Gain = IG(target) - E(A) = 1 - 0.4
Gain = 0.6
2. Competition
Information gain (IG) for yes and no

Comp   Down (N)   Up (P)
Yes    3          1
No     2          4

For yes:
IG = -(P/(P+N)) log2(P/(P+N)) - (N/(P+N)) log2(N/(P+N))
IG(yes) = 0.81127
IG(no) = 0.91829
E(C) = ((1+3)/10)*0.81127 + ((2+4)/10)*0.91829
E(C) = 0.87548
Gain = 1 - 0.87548 = 0.12452
3. Type

Type      Up   Down
Software  3    3
Hardware  2    2

IG(s) = 1
IG(h) = 1
E(T) = (6/10)*1 + (4/10)*1 = 1
Gain = 1 - 1 = 0

Now compare the gains:

Age = 0.6
Competition = 0.12
Type = 0

As the gain of age is the highest, age is the root node.

Decision tree: ID3 algorithm (a hedged sklearn sketch follows)
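
A hedged sklearn sketch of this table with an entropy-based tree (sklearn's tree with criterion="entropy" uses the same information-gain idea as ID3, not ID3 itself):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "Age": ["Old","Old","Old","Mid","Mid","Mid","Mid","New","New","New"],
    "Competition": ["Yes","No","No","Yes","Yes","No","No","Yes","No","No"],
    "Type": ["Software","Software","Hardware","Software","Hardware",
             "Hardware","Software","Software","Hardware","Software"],
    "Profit": ["Down","Down","Down","Down","Down","Up","Up","Up","Up","Up"],
})
X = pd.get_dummies(data[["Age", "Competition", "Type"]])  # one-hot encode
y = data["Profit"]

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(tree.predict(X.iloc[[0]]))   # first row (Old, Yes, Software) -> 'Down'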

Date: 06.10.2020

[Diagram: a dataset fed into a model (e.g. SVM) that produces a prediction]

Ensemble learning

Random Forest algorithm

Uses ensemble learning for choosing the final prediction, and uses only decision trees --- like multiples of a decision tree.

Original dataset:

SR   A   B   C   Target
1                Y
2                N
3                Y
4                Y
5                N

Bootstrap dataset (duplicates are allowed):

Sr   A   B   C   Target
2                N
1                Y
1                Y
3                Y
4                Y
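
A hedged sklearn sketch; bootstrap=True makes every tree train on a bootstrap sample (duplicates allowed), as in the table above. The iris dataset is used only as convenient sample data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=10, bootstrap=True)
rf.fit(X, y)
print(rf.predict(X[:3]))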

Date: 07.10.2020

Absent

Date:08.10.2020

Naive Bayes algorithm

We find the probability.

Used in predicting spam messages.

It uses conditional probability.

It is a classification algorithm but can also perform regression.

It has different variants; in total it has three variants:

1. Bernoulli distribution / naive bayes

For binary classification.
Best for success/failure, no/yes, true/false.
2. Multinomial naive bayes
For multiple classes.
3. Gaussian naive bayes
To predict on continuous values.

Mathematical approach

Sample of fruits:

Fruits (target)   Yellow (feature)   Sweet (feature)   Long (feature)   Total
Mango             350                450               0                650
Banana            400                300               350              400
Others            50                 100               50               150
Total             800                850               400              1200

Predict: (yellow, sweet, long) = which fruit is this?

Formula to find the probability:

p(A|B) = probability of A when B is true

p(A|B) = p(B|A)*p(A)/p(B)

Step 1: Find the probability for mango.

Probability for mango:

(yellow, sweet, long) = x (the considered value, which changes accordingly)

1. Probability of yellow mango, x=yellow

P(yellow|mango) = p(mango|yellow)*p(yellow)/p(mango)
= (350/800)*(800/1200)/(650/1200)
= (0.4375)*(0.6667)/(0.5417)
= 0.5385
2. x=sweet
p(sweet|mango) = (450/850)*(850/1200)/(650/1200)
= 0.692
3. x=long
p(long|mango) = (0/400)*(400/1200)/(650/1200)
= 0
Product of the three = 0

Probability for banana:

1. probability for yellow banana, x=yellow

p(yellow|banana) = (400/800)*(800/1200)/(400/1200) = 1
2. x=sweet
p(sweet|banana) = (300/850)*(850/1200)/(400/1200) = 0.75
3. x=long
p(long|banana) = (350/400)*(400/1200)/(400/1200) = 0.875

Product of the three: 0.6562

Probability for others:

1. probability for yellow others, x=yellow

p(yellow|others) = (50/800)*(800/1200)/(150/1200) = 0.333
2. x=sweet
p(sweet|others) = (100/850)*(850/1200)/(150/1200) = 0.666
3. x=long
p(long|others) = (50/400)*(400/1200)/(150/1200) = 0.333
Product of the three: 0.0738

Therefore the predicted fruit is banana (a small sketch of this calculation follows).
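
A small sketch reproducing the same arithmetic (the counts come straight from the fruit table):

counts = {
    "mango":  {"yellow": 350, "sweet": 450, "long": 0,   "total": 650},
    "banana": {"yellow": 400, "sweet": 300, "long": 350, "total": 400},
    "others": {"yellow": 50,  "sweet": 100, "long": 50,  "total": 150},
}
for fruit, c in counts.items():
    # product of the per-feature probabilities, as computed above
    score = (c["yellow"]/c["total"]) * (c["sweet"]/c["total"]) * (c["long"]/c["total"])
    print(fruit, round(score, 4))   # banana scores highest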

Example 2

colour type origin Stolen


Red Sports Domestic Y
Red Sports Domestic N
Red Sports Domestic Y
Yellow Sports Domestic N
Yellow Sports Imported Y
Yellow SUV Imported N
Yellow SUV Imported Y
Yellow SUV Domestic N
Red SUV Imported N
Red sports Imported Y
Red=5, yellow=5, total=10

Sports=6, SUV=4, total=10

Domestic=5, imported=5, total=10

Y=5, N=5, total=10

(red, SUV, domestic) = whether it is stolen or not?

Prob(red|yes) = (3/5)(5/10)/(5/10) = 0.6

Prob(red|no) = (2/5)(5/10)/(5/10) = 0.4

Prob(yellow|yes) = (2/5)(5/10)/(5/10) = 0.4

Prob(yellow|no) = (3/5)(5/10)/(5/10) = 0.6

For colour --- yes: 0.24, no: 0.24

Prob(sports|yes) = (4/6)(6/10)/(5/10) = 0.8

Prob(sports|no) = (2/6)(6/10)/(5/10) = 0.4

Prob(SUV|yes) = (1/4)(4/10)/(5/10) = 0.2

Date: 13.10.2020

K-means Clustering
Unsupervised machine learning.

Example: Amazon clusters its products into groups such as home, electronics, appliances, sports, swimming --- the Amazon cluster.

Example 1:

Data = {2,3,4,10,11,12,20,25,30}

K = 2

Form 2 clusters.

Step 1: pick any two random values:

m1 = 4 (mid/mean value for c1)    m2 = 12 (mid/mean value for c2)

We assign each value to the nearest mean by calculation. For example, take the value 10: 1st, 10-4 = 6; 2nd, 12-10 = 2; therefore 10 belongs to c2 (always take the positive difference).

c1 = {2,3,4}

c2 = {10,11,12,20,25,30}

Find the mean of c1 and c2:

actual m1 = (2+3+4)/3 = 3

m2 = 18

The same steps are performed with the newly calculated m1 and m2:

c1 = {2,3,4,10}

c2 = {11,12,20,25,30}

m1 = 4.75    m2 = 19.6

c1 = {2,3,4,10,11,12}

c2 = {20,25,30}

m1 = 7    m2 = 25

c1 = {2,3,4,10,11,12}

c2 = {20,25,30}

This is the final cluster, as m1 and m2 are the same as before.
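
A hedged sklearn sketch of the same 1D example:

import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30]).reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10)
print(km.fit_predict(data))    # cluster id for every point
print(km.cluster_centers_)     # final means, e.g. 7 and 25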

Euclidean formula:

√((XH-H1)**2 + (XW-W1)**2)

H -> height, W -> weight

Sr no   Height   Weight
1       185      72
2       170      56
3       168      60
4       179      68
5       182      72
6       188      77
7       180      71
8       180      70
9       183      84
10      180      88
11      180      67
12      177      76

Form 2 clusters.

Initial centroids (rows 1 and 2):

     height   weight   centroid value
C1   185      72       (185,72)
C2   170      56       (170,56)

C1={1}
C2={2}

Row 3: 168,60

For C1:
√((168-185)**2+(60-72)**2) = 20.8086

For C2:
√((168-170)**2+(60-56)**2) = 4.47

     height   weight   centroid value
C1   185      72       (185,72)
C2   170      56       (169,58)

C1={1}
C2={2,3}

Update the centroid value of C2:
mean of rows 2 and 3 = ((170+168)/2, (56+60)/2) = (169,58)

Row 4: 179,68

For C1:
√((179-185)**2+(68-72)**2) = 7.211

For C2:
√((179-169)**2+(68-58)**2) = 14.14

     height   weight   centroid value
C1   185      72       (182,70)
C2   170      56       (169,58)

C1={1,4}
C2={2,3}

Row 5: 182,72

For C1:
√((182-182)**2+(72-70)**2) = 2

For C2:
√((182-169)**2+(72-58)**2) = 19.798

     height   weight   centroid value
C1   185      72       (182,71)
C2   170      56       (169,58)

C1={1,4,5}
C2={2,3}

Row 6: 188,77

For C1:
√((188-182)**2+(77-71)**2) = 8.485

For C2:
√((188-169)**2+(77-58)**2) = 26.8700

     height   weight   centroid value
C1   185      72       (185,74)
C2   170      56       (169,58)

C1={1,4,5,6}
C2={2,3}

Row 7: 180,71

For C1:
√((180-185)**2+(71-74)**2) = 5.830

For C2:
√((180-169)**2+(71-58)**2) = 17.029

     height   weight   centroid value
C1   185      72       (182.5,72.5)
C2   170      56       (169,58)

C1={1,4,5,6,7}
C2={2,3}

Row 8: 180,70

For C1:
√((180-182.5)**2+(70-72.5)**2) = 3.5355

For C2:
√((180-169)**2+(70-58)**2) = 16.2788

     height   weight   centroid value
C1   185      72       (185,74)
C2   170      56       (169,58)

C1={1,4,5,6,7,8}
C2={2,3}

Date: 14.10.2020

Date: 15.10.2020

NLP: Natural Language Processing

It's a part of deep learning and AI.

It is a part of computer science, machine learning and artificial intelligence which deals with human language.

Tokenization
Types:

1. Bigram
2. Trigram
3. N-gram

Library: NLTK (Natural Language Toolkit)

Helps in text analysis.

Applications:

1. Sentiment analysis: it analyses our moods, our words, or politeness

2. Alexa, Siri, chatbots

NLP has two parts:

1. NLU (Natural Language Understanding)

It maps the text or sentence against the database.
2. NLG (Natural Language Generation)
It answers with a meaningful sentence.

Ambiguity names the errors that occur in NLP.

NLU ambiguity:

1. Lexical ambiguity
Error from a word which has two meanings.
Ex: she is looking for a match
Here "match" has two meanings: a games match or a partner match.
2. Syntactic ambiguity
A sentence which has two different meanings because of wrong grammar / sentence malformation.
Ex: chicken ready to eat
3. Referential ambiguity
A sentence with an unclear reference.
Ex: this is that and that is this

Step 1: pip install nltk

Step 2: import nltk in the Python IDLE shell

Step 3: nltk.download()

An NLTK downloader window opens.

Date: 16.10.2020

Tokenization

Takes only the important things by dividing the paragraph into small tokens; stop words are removed.

Stemming

For words which are similar to each other, it finds the unique part of the words and then generates a new (stem) word. Stemming is faster.

There is a possibility that the stemmed word does not have any meaning.

Ex: history and historical -> histori

finally and finalization -> final

Lemmatization

It is similar to stemming but it always gives meaningful words. It takes more time to generate the words.

Ex: history and historical -> history

Stopwords -> (I, me, your, of, them, for, on, to, ...)

Words which are not useful. (A hedged NLTK sketch follows.)
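
A hedged NLTK sketch of tokenization, stopword removal, stemming and lemmatization (the sample sentence is illustrative; punkt/stopwords/wordnet must be downloaded once via nltk.download):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = "History and historical facts finally matter"
tokens = [t for t in word_tokenize(text.lower())
          if t not in stopwords.words("english")]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])          # fast, may give non-words: 'histori'
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # slower, gives real words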

Date: 17.10.2020

Date: 19.10.2020

Bag of Words

It is the function which tells which words are important.

For example: to tell whether a review is good or not.

It is also called binary filtration.

It finds the important words with the help of frequency.

YouTube also uses bag of words.

Ex:

1. He is a nice boy
2. She is a nice girl
3. Boy and girl are nice

After removing stop words, the vocabulary is: nice, boy, girl

     Nice   Boy   Girl
1    1      1     0      -> 110
2    1      0     1      -> 101
3    1      1     1      -> 111
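
A hedged sklearn sketch that builds the same 0/1 table (binary=True gives presence/absence; the columns come out in alphabetical order: boy, girl, nice):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["He is a nice boy",
             "She is a nice girl",
             "Boy and girl are nice"]
cv = CountVectorizer(stop_words="english", binary=True)
bag = cv.fit_transform(sentences)
print(cv.get_feature_names_out())   # ['boy' 'girl' 'nice']
print(bag.toarray())                # [[1 0 1] [0 1 1] [1 1 1]]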

Date: 20.10.2020

Neural Network:

Layers:

1. Input layer
2. Output layer
Date: 22.10.2020

Forward propagation:

Activation function

It converts the input signal to an output signal.

Weights and biases start as some random numbers.

f -> indicates the activation function

v1 = f(u1*wa + u2*wx + b1)
v2 = f(u1*wb + u2*wy + b2)
v3 = f(u1*wc + u2*wz + b3)
O = f(v1*W1 + v2*W2 + v3*W3 + B)   <- the final activation function output

Do neural networks need an activation function?

The answer is both yes and no:

when the data turns non-linear or the complexity increases, an activation function is necessary.

Activation functions

1. Identity function / linear activation function

Computes the activation in place at the node.
Equation: f(x) = x for all x
Generates an output signal the same as the input signal.
Used when the problems are very simple.

2. Heaviside activation function / binary step function

Helps in complex decisions,
for example multiple classes.
Used in single-layer networks to convert the net input to an output signal of 0 or 1.
Equation: f(x) = 1 if x >= t
          f(x) = 0 if x < t
t is the threshold value.

3. Sigmoid activation function

Used for backward propagation.
Equation: f(x) = 1/(1 + e^(-x))

4. Hyperbolic tangent activation function

It is called the tanh activation function.
Optimization is easy; output range -1 to 1.
Equation: f(x) = (1 - e^(-2x))/(1 + e^(-2x))
5. Ramp activation function (ReLU)
Equation: f(x) = x if x > 0
          f(x) = 0 if x <= 0

Length H(X1)   Width W(X2)   Colour of Flower
2              1.5           R(1)
3              1             B(0)
4              1.5           R(1)
2              0.5           ?

Code in python
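
The notes do not include the code itself; below is a minimal sketch, assuming a single sigmoid neuron trained by gradient descent on the flower table above (the learning rate and iteration count are arbitrary):

import numpy as np

# [length, width] -> colour (R=1, B=0), from the table
X = np.array([[2, 1.5], [3, 1.0], [4, 1.5]])
y = np.array([1, 0, 1])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # weights start as random numbers
b = rng.normal()         # bias starts as a random number

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for _ in range(5000):
    pred = sigmoid(X @ w + b)              # forward propagation
    grad = (pred - y) * pred * (1 - pred)  # gradient of the squared-error cost
    w -= 0.1 * (X.T @ grad)
    b -= 0.1 * np.sum(grad)

print(sigmoid(np.array([2, 0.5]) @ w + b))  # predict the unknown flower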

Cost function

The difference between the actual data point and the calculated data point.

1. R2 / squared error
2. MSE - mean squared error
3. RMSE - root mean squared error

Date: 23-10-2020

--------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------

To import and merge many files (e.g. 1 lakh):

import os
import pandas as pd
import glob   # for directories

os.getcwd()
os.chdir(r"path of file where all the csv files are saved")
file_extention = ".csv"
all_file_names = [i for i in glob.glob(f"*{file_extention}")]
# read every csv file and combine them into one dataframe
read_all_data = [pd.read_csv(f) for f in all_file_names]
combine_file = pd.concat(read_all_data)
os.chdir(r"path of file for output dir")
combine_file.to_csv("combine_file.csv")

--------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------
1. R2 / squared error
Formula:
SE = sum((y^i - yi)^2)      (y^i is y-hat of i, the predicted value)

2. MSE - mean squared error

Formula:
MSE = sum((y^i - yi)^2)/n

3. RMSE - root mean squared error

Formula:
RMSE = root(sum((y^i - yi)^2)/n)

The cost function does not give a value in percentage; it gives continuous values.

How to reduce error using the cost function:

Actual data: 2

Predicted value: 1.2

Cost = actual data - predicted value = 2 - 1.2 = 0.8

Date: 24.10.2020

Date: 26.10.2020-last day
