Data Analysis in Python-3
'''
Logistic Regression:
It is a machine learning classification algorithm used to predict
the probability of a categorical dependent variable. Here it is used
to build a classifier model from the available data, but first we
have to convert the categorical target to numbers (0 and 1), because
the algorithm does not work with categorical data directly; after
processing we can map the predictions back to categories.
'''
#imports needed for the steps below
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
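Before the code, a minimal sketch of the idea behind logistic regression: the model combines the input features into a single score and squashes it through the logistic (sigmoid) function to get a probability between 0 and 1. The numbers below are illustrative only, not values from this data set.

```python
import math

def sigmoid(z):
    # logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# a fitted model computes z = intercept + sum(coefficient * feature)
# and sigmoid(z) is the predicted probability of the positive class
print(sigmoid(0.0))  # 0.5: z = 0 is the decision boundary
print(sigmoid(2.0))  # close to 1: strong evidence for the positive class
```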
#reindexing the salary status names to 0,1 using the .map function (integer encoding)
data2['SalStat']=data2['SalStat'].map({' less than or equal to 50,000':0,' greater than 50,000':1})
print(data2['SalStat'])
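A quick sketch of what the .map call does, on a toy Series (note the leading spaces in the dictionary keys: they must match the stored strings exactly, otherwise .map returns NaN):

```python
import pandas as pd

s = pd.Series([' less than or equal to 50,000', ' greater than 50,000'])
mapped = s.map({' less than or equal to 50,000': 0, ' greater than 50,000': 1})
print(list(mapped))  # [0, 1]
```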
#now converting the other categorical values into dummy variables using the pandas get_dummies function,
#which splits each categorical column into one column per category with values 0 and 1.
new_data=pd.get_dummies(data2,drop_first=True)
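To see what get_dummies with drop_first=True does, a small sketch on a made-up frame (the column names here are hypothetical, not from the actual data set):

```python
import pandas as pd

df = pd.DataFrame({'JobType': ['Private', 'Govt', 'Private'],
                   'SalStat': [0, 1, 0]})
dummies = pd.get_dummies(df, drop_first=True)
# 'JobType' has two categories, so it becomes a single 0/1 column;
# drop_first=True drops the first category ('Govt') as redundant
print(list(dummies.columns))  # ['SalStat', 'JobType_Private']
```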
#next step is to store the column names of new_data as a list
columns_list=list(new_data.columns) #storing new_data column names as a list
print(columns_list)
#next step is to separate the input variables from the data
features=list(set(columns_list)-set(['SalStat'])) #features: the input variables, excluding SalStat
print(features)
#next step is to store the output values in y (the dependent variable) using .values,
#which extracts the values from a data frame (new_data)
y=new_data['SalStat'].values #extracting SalStat values from new_data into y (output variable)
print(y)
#similarly
x=new_data[features].values #extracting the values corresponding to features (column names) from new_data
#and storing them in x (input variable)
print(x)
#next step is to split the data into train and test sets using the train_test_split function
train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.3,random_state=0)
#test_size=0.3 is the proportion of the data set used for testing (30%),
#and random_state=0 fixes the random seed so every run produces the same split;
#otherwise each run would take a different random split.
#therefore the test set holds about 30% of the data and the train set about 70%.
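How the split proportions work can be sketched on toy data (10 made-up samples, purely illustrative):

```python
from sklearn.model_selection import train_test_split

x_demo = list(range(10))
y_demo = [0, 1] * 5
tr_x, te_x, tr_y, te_y = train_test_split(x_demo, y_demo,
                                          test_size=0.3, random_state=0)
print(len(tr_x), len(te_x))  # 7 3: 70% of the samples train, 30% test
# rerunning with the same random_state always yields the same split
```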
#next step is to make an instance of the logistic-regression classifier using the LogisticRegression class
logistic=LogisticRegression() #logistic-regression classifier model
#then we fit the model to the training data using the fit method of this instance.
logistic.fit(train_x,train_y)
#now we can extract some attributes from this logistic-regression classifier model
logistic.coef_ #coefficient values of all the variables
logistic.intercept_ #intercept of the logistic-regression model
#now we have built the model to classify the salary status of individuals as
#less than or equal to 50,000 or greater than 50,000,
#and we have to check the performance of this model by making predictions on the test set.
prediction=logistic.predict(test_x) #predicted salary status for the test data
print(prediction)
#now we will use the confusion matrix to evaluate the performance of the above model.
#confusion_matrix is a table used to evaluate the performance of a classification model:
#it gives the number of correct and incorrect predictions, summed up class-wise.
conf_matrix=confusion_matrix(test_y,prediction) #stored under a new name so the confusion_matrix function is not shadowed
#its rows represent the actual classes (less than/equal to 50,000 and greater than 50,000)
#and its columns represent the predicted classes
print(conf_matrix) #the diagonal values are the correctly classified observations and the
#off-diagonal values are the incorrect (misclassified) ones.
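A tiny sketch of how confusion_matrix arranges its counts, on made-up labels (0 standing for "less than or equal to 50,000", 1 for "greater than 50,000"):

```python
from sklearn.metrics import confusion_matrix

actual    = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]
cm = confusion_matrix(actual, predicted)
print(cm)
# row i holds the samples whose actual class is i, column j those predicted as j:
# here cm[0][0]=2 and cm[1][1]=2 sit on the diagonal (correct predictions),
# while cm[0][1]=1 and cm[1][0]=1 count the misclassified samples
```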
#as this model has not classified all the observations correctly, we need an accuracy measure.
acc=accuracy_score(test_y,prediction) #to check the accuracy of the model
#(again stored under a new name so the accuracy_score function is not shadowed)
print(acc) #thus we get about 84% of the predictions correct.
#now we can also check the number of misclassified values from the prediction
print('misclassified values: %d' % (test_y!=prediction).sum())
#thus we got 1427 misclassified values
'''
Now we can improve the accuracy of this model by removing the
insignificant input variables, so that the number of misclassified
output values is reduced.
'''
#so we have to build a new model after removing all the insignificant input variables.
#reindexing the salary status names to 0,1 using the .map function (integer encoding), as done earlier
data2['SalStat']=data2['SalStat'].map({' less than or equal to 50,000':0,' greater than 50,000':1})
print(data2['SalStat'])
#we can find that gender, native country, race and job type are insignificant
cols=['gender','nativecountry','race','JobType'] #create a list of the insignificant variables
new_data=data2.drop(cols,axis=1) #removing the 4 columns from data2 and saving the result to new_data
new_data=pd.get_dummies(new_data,drop_first=True) #repeating the same command as earlier
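What drop with axis=1 does can be seen on a tiny made-up frame (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
kept = df.drop(['b', 'c'], axis=1)  # axis=1 removes columns, not rows
print(list(kept.columns))  # ['a']
```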
columns_list=list(new_data.columns)
features=list(set(columns_list)-set(['SalStat']))
y=new_data['SalStat'].values
x=new_data[features].values
train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.3,random_state=0)
logistic=LogisticRegression()
logistic.fit(train_x,train_y)
logistic.coef_
logistic.intercept_
prediction=logistic.predict(test_x)
print(prediction)
conf_matrix=confusion_matrix(test_y,prediction) #again using new names so the metric functions are not shadowed
print(conf_matrix)
acc=accuracy_score(test_y,prediction)
print(acc)
print('misclassified values: %d' % (test_y!=prediction).sum())
#thus removing the insignificant variables didn't improve the
#prediction: the accuracy slightly decreased
#and the number of misclassified values increased.