Minor Project
The challenge, however, was to identify the classification algorithm that could complete the
task for our data set with the highest accuracy. The accuracy of an algorithm varies with the
kind of problem it has to solve and the data set it uses. We therefore decided to assess the
accuracy of four algorithms (KNN, SVM, Random Forest, and Logistic Regression) on our
problem and data set, so that we could choose the best algorithm for the predictor in the placement
management system.
To determine each algorithm's accuracy, we trained it on the data set we had obtained and
compared its predictions against test data. From these results we can determine the True Positive, True
Negative, False Positive, and False Negative counts for each algorithm, and then compute accuracy as
(TP + TN) / (TP + TN + FP + FN).
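The counting and accuracy calculation described above can be sketched as follows. The label vectors are hypothetical placeholders, not the project's actual data:

```python
# Illustrative sketch: deriving TP/TN/FP/FN and accuracy from predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual placement status (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # predicted placement status (hypothetical)

# confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # → 0.75
```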
We aim to develop a placement predictor as part of a college-level placement management system, which
predicts the probability of students getting placed and helps them uplift their skills before the recruitment process
starts. We use machine learning for the placement prediction. We consider K-nearest neighbours,
Logistic Regression and Random Forest to classify students into appropriate clusters, and the result would help them
improve their profiles. The accuracy of each algorithm is noted, and the comparison of these
machine learning techniques would help both recruiters and students during placements and related
activities.
A. Keywords
Classification
Random Forest
Regression
Dimensional Space
Optimization
B. Prediction system
In this paper we use machine learning techniques to predict the placement status of students based on a dataset. The
parameters in the dataset considered for the prediction are quantitative scores, logical reasoning scores,
verbal scores, programming scores, CGPA, number of hackathons attended, number of certifications and number of
current backlogs. The placement prediction is done by machine learning using Logistic Regression, Random Forest, KNN and
SVM.
Table 1:- Dataset used for Prediction and Analysis
C. Architecture Diagram
Fig 1:- Architecture for Data Processing , Model Training, Prediction and Accuracy check.
The data frame for the machine learning algorithms is created with the pandas library from the above sample dataset.
Null data fields are handled by dataset.fillna(method='ffill'). We use sklearn, an open-source Python machine
learning library, to build and evaluate the models.
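The preprocessing step described above can be sketched as follows. The column names are assumed from the parameter list given earlier in this paper, and the values are made up for illustration:

```python
# Sketch of building the data frame and forward-filling null fields.
import pandas as pd

data = {
    "CGPA": [8.1, 7.4, None, 6.9],          # hypothetical values
    "Quantitative": [78, None, 65, 80],     # hypothetical values
    "Placed": [1, 0, 0, 1],
}
df = pd.DataFrame(data)

# Forward-fill nulls; equivalent to the dataset.fillna(method='ffill')
# call described above.
df = df.ffill()
print(df.isna().sum().sum())  # → 0 (no nulls remain)
```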
A. KNN
KNN stands for k-nearest neighbours. This is a simple
algorithm that can be used to solve both classification and regression problems. It is a supervised machine learning
algorithm, meaning it learns from labelled data.
The basic working of this algorithm revolves around the idea that similar things lie in close proximity
to each other; for the algorithm to provide fruitful results, this assumption must hold for the data. Similarity in
KNN is expressed using distance.
The algorithm:
Load the data.
Choose a value for K, the number of neighbours.
For each point to be classified, compute its distance to every example in the training data.
Take the K nearest examples and assign the class label that occurs most often among them (or, for regression, the average of their values).
However, the error of the algorithm depends on the value selected for K. So, to find the K best suited for the given
data, it is advisable to run the algorithm several times with different values of K and pick the K with the least error.
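The search over K described above can be sketched as follows, using a synthetic dataset in place of the project's student data:

```python
# Sketch: trying several values of K and keeping the most accurate one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the student dataset (8 features, binary label)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best_k, best_acc = None, 0.0
for k in range(1, 16, 2):  # odd K values avoid tie votes in binary voting
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```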
Advantages:
This is a fairly simple and easy-to-implement algorithm
Building a model, tuning several parameters or making additional assumptions are not required
This is a versatile algorithm, being able to be used in regression, classification and even search problems.
Disadvantages:
The algorithm becomes significantly slower as the number of examples and/or predictors/independent
variables increases.
This disadvantage makes KNN an impractical choice where predictions need to be made
rapidly. However, given enough computational power, KNN can be used in problems where
similar items have to be identified.
B. SVM
SVM (Support Vector Machine) is a supervised machine learning algorithm that can be used for both classification and
regression problems. However, it is mostly used for classification problems.
Each data item is plotted as a point in n-dimensional space, where n is the number of features and the value of each
feature is the value of a particular coordinate. Classification is then performed by finding the hyper-plane
that separates the two classes well.
The problem lies in choosing the right hyper-plane among the many possible ones.
Scikit-learn is a Python library that implements various machine learning algorithms, and SVM too can
be used through it.
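Fitting an SVM with scikit-learn, as mentioned above, can be sketched as follows on synthetic data standing in for the student dataset:

```python
# Minimal sketch of an SVM classifier with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the student dataset
X, y = make_classification(n_samples=150, n_features=8, random_state=1)

# A linear kernel searches for a separating hyper-plane
clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict(X[:3]))  # class labels for the first three points
```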
Advantages:
This algorithm performs best when there is a clear
margin of separation
Disadvantages:
Performance is affected when large data sets are used as
the required training time is more.
Performance is also affected when the data set has too
much noise
SVM doesn’t directly provide probability estimates,
rather a computationally intensive five-fold cross-
validation is required.
C. Logistic Regression
Logistic regression is a classification technique that is well suited to binary classification. Its decision boundary,
which is generally linear, is derived from a probability interpretation; this results in a nonlinear optimization
problem for parameter estimation. The parameters can be estimated by maximising the likelihood using any nonlinear
optimization solver.
The goal of this technique is, given a new data point, to predict the class from which the data point is likely to have
originated. Input features can be quantitative or qualitative.
Instead of a hyperplane or straight line, logistic regression uses the logistic function to map the output of a
linear equation to a value between 0 and 1.
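The logistic (sigmoid) function mentioned above can be sketched as follows; it squashes any real-valued linear output into the interval (0, 1):

```python
# Sketch of the logistic function: sigmoid(z) = 1 / (1 + e^(-z))
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # → 0.5 (the decision boundary)
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```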
Advantages
Disadvantages
It is useful only for predicting discrete outcomes.
It should not be used if the number of observations in the
dataset is smaller than the number of features.
It assumes linearity between the independent and
dependent variables.
D. Random Forest
We have a plethora of classification algorithms at our disposal, including SVM, logistic
regression, decision trees and the Naive Bayes classifier. But in the hierarchy of classifiers, the Random Forest
classifier sits near the top. Since the random forest classifier is a group of individual decision trees, we shall first
look into how decision trees work.
A decision tree is basically a flowchart-like structure in which each node except the leaf nodes is a test on a feature
(i.e., what the outcome will be if some activity, such as flipping a coin, is done), the leaf nodes represent the class
labels (the decision taken after all features are evaluated) and the branches represent the conjunctions of features
that lead to those class labels.
The classification rules of a decision tree are the paths from the root node to the leaf nodes.
Now let us look into random forest classifiers. As mentioned earlier, a random forest is a collection of decision trees. The basic
idea behind it is "the wisdom of the crowds": a large number of uncorrelated
models, or in this case trees, operating as a group provide a much more solid output than any of the constituent models.
So, in a random forest, each individual tree, with its own properties and classification rules, tries to find an
appropriate class label for the problem, and each tree gives out its own answer. A vote is then taken within the random forest, and
the class label with the most votes is considered the final class
label for the problem. This provides a more accurate model for class label prediction.
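The majority-vote idea described above can be sketched as follows; the individual trees of a fitted `RandomForestClassifier` are exposed via its `estimators_` attribute, and the data here is synthetic:

```python
# Sketch: inspecting the per-tree votes inside a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the student dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each individual tree gives out its own answer for one sample
votes = [int(tree.predict(X[:1])[0]) for tree in forest.estimators_]
print(sum(votes), "of", len(votes), "trees vote for class 1")
print("forest prediction:", forest.predict(X[:1])[0])
```

(Note: scikit-learn's forest actually averages predicted probabilities rather than taking a hard vote, but for well-separated data the two agree.)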
Advantages:
It can balance errors in data sets where classes are imbalanced
Large data sets with higher dimensionality can be handled
It can handle thousands of input variables and identify the most significant ones, making it
a good dimensionality reduction method
Disadvantages:
It does a better job on classification
problems than on regression problems,
as it is harder for it to produce
continuous values than discrete ones
IV. RESULT AND ANALYSIS
The final results of running the various machine learning algorithms are given in the table below. We considered KNN,
Logistic Regression, Random Forest and SVM for the analysis. We trained each algorithm and predicted the placement
status of students on the same dataset, and found the True Positive, False Positive, False Negative, True Negative
counts and accuracy of each algorithm, as tabulated below.
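The comparison procedure described above can be sketched as follows, training all four classifiers on the same (here synthetic) dataset and tabulating the confusion-matrix counts and accuracy of each:

```python
# Sketch: training KNN, Logistic Regression, Random Forest and SVM on
# the same dataset and comparing TP/TN/FP/FN and accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the student dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    results[name] = (tp + tn) / (tp + tn + fp + fn)
    print(f"{name}: TP={tp} TN={tn} FP={fp} FN={fn} acc={results[name]:.2f}")
```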
Data Sample
REFERENCES
[1]. Shreyas Harinath, Aksha Prasad, Suma H and Suraksha A. Student Placement Prediction using Machine
Learning, International Research Journal of Engineering and Technology (IRJET), Volume 06, Issue 04, April 2019.
[2]. Senthil Kumar Thangavel, Divya Bharathi P and Abhijith Shankar. Student Placement Analyzer: A
Recommendation System Using Machine Learning, International Conference on Advanced Computing and
Communication Systems (ICACCS-2017), Jan 06-07, 2017, Coimbatore, India.
[3]. K. Sreenivasa Rao, N. Swapna and P. Praveen Kumar. Educational Data Mining for Student Placement
Prediction Using Machine Learning Algorithms, International Research Journal of Engineering and Technology
(IRJET), 2018.