IJCSNS International Journal of Computer Science and Network Security, VOL.18 No.1, January 2018

College Recommender System Using Students' Preferences/Voting: A System Development with Empirical Study
Mr. Y. Subba Reddy and Prof. P. Govindarajulu

Department of Computer Science


S.V. University, Tirupati, AP, INDIA

Manuscript received January 5, 2018
Manuscript revised January 20, 2018

Abstract
Recommender systems are an active research area with applications in many domains. College recommendation systems for undergraduate students are a challenging area that needs to be explored thoroughly. A college recommendation system supports undergraduate students in their college selection process by offering a good number of suggestions. In this paper an effective weighted clustering process, WCLUSTER, is implemented using the R-tree data structure. Instead of traditional data clustering approaches, an improved approach using top-k queries is applied for clustering the college data, based on students' preferences/voting. A new technique is proposed for finding similarity measures between objects by using both the values of attributes and the corresponding voting/preferences/ratings for those attributes. Traditional methods use distance measures for finding similarity between objects. The proposed method uses voting/preferences/ratings for finding similarity between objects through the top-k query ranking of objects. The preferences were obtained through a well-structured questionnaire with which responses from college students were gathered. The proposed algorithm was executed on these sets of responses as preferences. To speed up query execution, a multidimensional indexing structure called R-tree was used. Pruning techniques were applied for scalability.
This paper introduces a recommendation system for college/course selection. The experimental results show that applying WCLUSTER in this domain is superior to traditional and previous approaches.

Key words:
recommender system, voting, weighted cluster, top-k query, reverse top-k query, multi-dimensional index tree

1. Introduction

There are more than 750 engineering colleges in Andhra Pradesh and Telangana, admission to which is based on web counseling. It is challenging for students to choose the right college to join. There are many things to consider while deciding on a college for admission. Infrastructure, faculty, facilities, placements and other related aspects of a college influence the admission decision. Students have their own preferences while joining a college. They need the list of colleges that meet their preferences. As the number of colleges available in the state is large, students are required to analyze and gather the information needed for their decision making. It is a very tiresome job for a student to go through the college profile list (such as the national and regional rank of the college, placements, pass percentage, staff quality, infrastructure and particularly the fee details). There is a need for a system that considers and analyzes the profile attributes of a college along with the preferences of the students towards the college profile attributes.

2. Recommendation systems

2.1 The evolution of Recommendation systems

Recommendation systems collect information regarding items. They gather preferences and profiles and analyze them to advise the user to make the right decisions regarding products, people, policies, and services (Subba Reddy Y. and P. Govindarajulu [19]). As the availability of electronic and web content grows day after day, researchers are relying more on content to extract the vital information for better recommendations. As a result, recommendation systems have become popular in assisting numerous decision-making contexts.

The basic idea of recommender systems is to utilize sources of web content about customers and to infer customer interests (C.C. Aggarwal [3]). Here the user is an entity to which the recommendation is provided, and an item is a product/service being recommended. Recommendation analysis predicts future preferences by analyzing the previous interaction between users and items, because past behaviors are often good indicators of future choices.

It is very hard to design and implement a large-scale online service that can find what is to be recommended to the customers based on their past purchase history. For example, Amazon gives product recommendations and Yahoo makes web page recommendations. The process of constructing an efficient and effective recommendation system is a challenging task. The underlying reason is the

large size of the product (object) space and context space. The main goal of recommender systems is to assist their users in finding their preferred objects from the large set of available objects. The voting of a particular customer on a particular object is learned through a random payoff, and the recommender system receives this payoff from the customer's response to the recommendation. For example, in a course recommendation, the payoffs are the ratings on a scale of 1 to 10, where the ratings are given by the students. In the case of web page recommendations, the payoffs are counted by customers' clicks, where the Boolean value 1 denotes a click and 0 denotes no click.

Web mining is clearly an important technique for finding frequent data patterns from the Internet, data warehouses, data marts, data sets and so on. The World Wide Web (WWW) is a powerful platform and it is considered to be the ultimate provider of the information superhighway, used to store and retrieve information and also to mine useful knowledge and then use it for predicting the interests/requirements of customers. Web data is huge, unstructured and dynamic in nature. Hence, recommendation systems are the potentially desired information systems for predicting the feature values according to the requirement of the customer. Web recommendation information systems are very useful for navigating through web pages and getting the desired information quickly.

Nowadays recommendation systems are popular and they try to suggest different types of items to different users. The items may be books, chairs, tables, pens, movies, music, washing machines, computers, printers, plotters and so on. For example, Amazon.com recommends various items to various users based on what they previously visited, purchased, ordered, enquired about, referred, booked and so on.

Zhibo Wang, et al. [23] proposed a unique similarity-based metric to find the similarity of users in terms of their lifestyles, and they constructed the Friendbook system to recommend friends based on their lifestyles.

Recommendation systems have developed in parallel with web technology (J. Bobadilla et al. [15]). In their initial period they were based on demographic, content-based and collaborative filtering. Now they are in a position to incorporate social information also. A knowledge-based recommendation system considers the user-centric requirements rather than the user's past history in order to make recommendations.

Hector Nunez, et al. [12] discussed the comparison of different similarity measures for improving the classification process. The authors said that automatic knowledge acquisition and management methods are needed to build consistent, robust, reliable, fault-tolerant, and effective decision support systems.

2.2 Recommendation systems for college selection

Fazeli Soude, et al. [7] said that recommender systems are being used in many real-world applications such as e-commerce applications like Amazon and eBay. Recommender systems must be accurate and useful to as many users as possible. The fundamental goal of educational recommender systems is to satisfy many quality features such as accuracy, usefulness, effectiveness, novelty, completeness, and diversity. Recommender systems must satisfy user-centric requirements. User-centric recommender systems are more useful than data-centric recommender systems.

Recommender systems were developed for various domains associated with people's daily life, such as product recommendation, service recommendation, people recommendation and so on. This kind of recommendation increases both user convenience and purchase transactions of products and/or services. Course/college recommendation for students is a challenging domain that has not yet reached the target community thoroughly.

Since there are many options for colleges/courses, students have to spend a lot of time exploring the details and they may not do it in a proper way. Students need a system that accepts the students' preferences and recommends the right college/course. College selection is one of the problems that the student community needs to solve. Recommender systems help the students decide in which college they should study. The existing methods for recommendation are content-based filtering, collaborative filtering, and rule mining approaches. The content-based filtering approach recommends an item to a user by clustering the items and the user pairs into groups. This clustering is used to obtain the similarity between user and item. Personal information of the user is not considered here. Queen Esther Booker created a prototype of a system for course recommendations [18]. The system accepts user requirements as keywords and recommends courses for students.

The collaborative filtering (CF) approach recommends an item to a user by grouping similar users based on user profiles and predicts the user's interest in the items. Hana introduced a system based on the CF approach to recommend courses for a student by analyzing and matching the student's academic records [11]. Then the system analyses

and recommends a course that meets the student's profile. Elham S. Khorasani et al. proposed a collaborative filtering model based on Markov chains to recommend courses based on historical data [7].

The rule mining approach focuses on recommending a series of items to a user by discovering association rules. Itmazi and Megias developed a recommendation system based on rule mining to recommend learning objects [14].

3. The need for improvements in college recommendation systems

The majority of students make mistakes in their preference list due to lack of knowledge, inappropriate and inaccurate analysis of colleges, and anxious predictions. Hence, they become unhappy and regret their choice after admission. An automated system that does all this work will help the students a lot. Today there are no systems that consider the student preferences and recommend the right alternatives. A few systems in the field can make predictions based on the rank obtained by the students. Therefore an improved intelligent system is needed to assist students in their college selection process, one which considers the college profile attributes and the students' preferences for each of the attributes. This weighted approach can provide better information through an efficient grouping of related items (colleges). Using these weighted groups of colleges with related profile attributes, one can suggest a better list of colleges that meet the preferences given by the students.

4. Weighted clustering with r-tree and top-k queries

4.1 Top-k Queries

Customer voting/preferences play a major role in market data analysis. The database is the backbone of any modern organization. Different types of queries are available for effective database operations. Almost all business tasks need the results of different types of queries such as the k-nearest neighbor query, range query, aggregate query, outlier detection query, group query, top-k query, reverse top-k query and so on. The top-k query is an important database query that can be used for finding the ranks of database objects. Top-k queries are frequently used in database and information retrieval systems and applications. Top-k queries retrieve the k highest-ranked objects from the given set of objects by using a linear score function and customer preferences. The main purpose of a top-k query is to rank objects by their score function value, which is based on the voting/preferences given to the values of the attributes. The score function is a linear function that gives the sum of the products of the attribute values and their corresponding voting/preferences. Mathematically the linear score function is denoted as

$$\mathrm{score}(\mathrm{object}) = \sum_{i=1}^{n} \mathrm{Voting}(i) \times \mathrm{object}(i)$$

Here n is the number of attributes of the object and i runs from 1 to n. Voting(i) represents the preference value of the ith attribute, and object(i) represents the value of the ith attribute. All the objects are represented in the multidimensional space. The data sets needed to represent these computations are O, C and V, where O is the set of objects, C is the set of customers and V is the set of voting/preferences of customers: O = {O_ij}, C = {C_ij} and V = {V_ij}, where i represents rows and j represents columns.

Hristidis Vagelis, et al. [13] said that database systems cannot efficiently produce the top results of a given preference query because they need to evaluate the special weight function over all the tuples of the selected relation, whereas the PREFER system they developed answers preference queries efficiently and effectively by using special materialized views that have been pre-processed and stored.

4.2 Reverse top-k Queries

The reverse top-k query is directly associated with the top-k query. Top-k queries retrieve k products, whereas reverse top-k queries retrieve the customers whose top-k result sets contain a given product. As noted in Section 4.1, top-k queries rank objects by using a linear score function and customer preferences. The main advantage of the reverse top-k query is that it identifies the sets of customers influenced by a product; the influence of a product is defined as the cardinality of its reverse top-k result set. With the help of the reverse top-k query it is possible to find the influence of products, and it is used in market data analysis. The reverse top-k result is directly related to the number of customers who prefer or value a particular product. Many top-k queries are consolidated into one reverse top-k query. That is, there exists a one-to-many relationship from a reverse top-k query to top-k queries.

Akrivi Vlachou, et al. [1] said that finding the most influential database tuples from a given database of tuples is very useful in real-world applications such as market

data analysis and decision making. The authors proposed two algorithms for finding the most influential database objects. The first one uses properties of the sky-band (SB) set for limiting the maximum number of resultant candidate objects, and the second one follows the branch and bound (BB) algorithm paradigm and uses an upper bound on the influence score.

Many techniques are available for evaluating reverse top-k queries, but they are costly in terms of overhead: they require significant processing, because multiple top-k queries must be executed to find the total number of customers who prefer the queried object. The reverse top-k query produces sets of customers based on object preferences. These sets represent the customers who prefer to include the object in their favorite lists. The reverse top-k query is one type of tool for estimating the impact or demand of the object in the market.

Vlachou Akrivi et al. [21] proposed a reverse top-k query with two versions: monochromatic and bichromatic reverse top-k queries. The authors proposed an efficient threshold-based algorithm for answering bichromatic reverse top-k queries.

Amit Singh, et al. [2] proposed an approximate solution to answer reverse nearest neighbor queries in high dimensional spaces. The authors said that the approach is mainly based on a strong correlation between k-nearest neighbor (k-NN) and reverse nearest neighbor (RNN) in connection with the Boolean range query (BRQ).

Note that the performance of the reverse top-k query mainly depends on the number of top-k query executions for each object, and top-k query execution in turn depends on the voting/preferences of customers. The reverse top-k query retrieves the set of customers whose top-k result sets contain the object. Reverse top-k sets are frequently used for finding the potential demand of the objects in the market. Reverse top-k query executions are costly. Hence there is a need for approximate reverse top-k query executions, both for increased scalability and for speedup of the overall execution. Effective planning techniques are also required. The performance of the R-tree index data structure decreases as the dimensionality of the data sets increases, and the performance of all the algorithms that are based on the R-tree will deteriorate. In such cases, alternative efficient and effective indexing techniques and algorithms are needed.

Elke Achtert, et al. [6] said that all the existing generalized reverse k-nearest neighbor (RkNN) search methods are only applicable to Euclidean distances but not to general metric objects. As a result, the authors proposed the first approach for efficient reverse k-nearest neighbor search in arbitrary metric spaces (RkNNSAMS), in which the value of k is given at query run time.

4.3 R-tree

The R-tree is a multidimensional indexing tree data structure. It is a popular, frequently used, height-balanced index structure and is very useful for the efficient management of very large training datasets, particularly in many real-time applications involving data-critical operations. The multidimensional R-tree index is very useful and efficient for customer voting based similarity search. In customer voting based similarity search, the R-tree is used with slight modifications, and a finite set of constants is applied on the bounds of the similarity values of the query points when inserting index entries.

In general, efficient and fast access to very large datasets requires a multidimensional data access technique for many real-time tasks. The R-tree organizes data records in the form of hyper-rectangles, usually called minimum bounding rectangles (MBRs), which are organized in the form of a tree hierarchy. The R-tree is height balanced and all object data are stored in the leaves. Small rectangles are placed at the bottom level, and as the R-tree is traversed from the bottom to the top, a specific set of lower-level small rectangles is grouped into one big higher-level rectangle. Lee Ken C. K., et al. [16] said that the R-tree and its variants, R+-trees, R*-trees, and aR-trees, are data partitioning index techniques useful for clustering data objects in terms of minimum bounding boxes with an abstract mechanism. They proposed a variant of the reverse nearest neighbor query called the ranked reverse nearest neighbor query, and then proposed two algorithms for executing the proposed query efficiently. These two algorithms are k-counting and k-Browsing.

Each MBR is defined by two points, the lower left corner and the upper right corner, and is represented as M(lower (x1, y1), upper (x2, y2)). In general, the lower point (x1, y1) and the upper point (x2, y2) need not be part of the actual data set. For efficient query processing of customer voting based similarity search, index creation is unavoidable for large data, and the R-tree is used for index creation.

Duc Thang et al. [5] said that speed, usability, simplicity, and reasonably good performance are always

better than an algorithm that performs best only in some cases and is rarely used because of its high complexity. Data clustering is one of the most important topics in data mining. Clustering is a method of arranging data objects into convenient and meaningful subgroups for further analysis, study, use, and application for effective data management. At present, the k-means algorithm is in the top-10 list of the most important data mining algorithms. The main advantages of the k-means algorithm are scalability, simplicity, robustness, understandability, speed, and ease of use. Its main disadvantages are that selecting the initial number of cluster centers is difficult and that its time complexity is O(n^2).

Charif Haydar and Anne Boyer [3] proposed a clustering algorithm called mutual vote (MV) based on a statistical model. The authors said that their proposed clustering algorithm adjusts automatically to the data set and requires a minimum of parameters.

Dino Ienco, et al. [4] said that clustering data objects containing only categorical attributes is a tedious task because defining a distance value between pairs of categorical attributes is difficult. The authors proposed a framework to find a distance measure between categorical attributes. Madhavi et al. [15] formulated measures on data containing categorical attributes. They categorized existing measures as context-free and context-sensitive measures for categorical data. Usue Mori et al. [20] said that the well-known Euclidean distance and the common measures used for non-temporal data are not always the best methods for finding similarity between time series data because they do not deal with noise and misalignments in the time series data. The authors said that the Euclidean distance suffers from the noise and outliers problem.

Yung-Shen Lin et al. [22] said that similarity measures are being used extensively in text classification and clustering. In the literature, the methods used for similarity comparison include Euclidean distance, Manhattan distance, taxicab distance, cosine similarity measure, city-block distance, Bray-Curtis measure, Jaccard coefficient, extended Jaccard coefficient, Hamming distance, Dice coefficient, IT-Sim and so on. The authors proposed a new measure for computing the similarity between two documents and extended it to measure the similarity between two sets of documents. The proposed measure is applied in many real applications such as k-means-like clustering, classification, and hierarchical clustering.

5. Proposed algorithm

ALGORITHM WCLUSTER (Threshold, Root, D)
INPUT
    Threshold: user-specified similarity limit
    Root: indexed tree
    D: the dataset
OUTPUT
    Set of clusters

1. Initialize cluster number i = 1
2. While D is not empty do
3.     Object = first object in D
4.     Cluster set c_i = Theta-Similarity-Query (Root, Threshold, Object)
5.     i = i + 1
6.     Update input dataset D = D - c_i
7. End-While

ALGORITHM Theta-Similarity-Query (Root, theta, q)
Input
    Root: root node of the R-tree
    theta: the similarity measure threshold value
    q: the query object
Output
    result-set: the set of objects similar to q

1. result-set = empty set
2. node = Root
3. if (minimum-similarity(node, q) ≥ theta) then
4.     result-set = result-set UNION p for every object p in sub-tree(node)
5.     return (result-set)
6. end-if
7. if (node.type = leaf-node) then
8.     for every p_i in the node do
9.         reverse p_i vector = execute reverse top-k (p_i)
10.        if (minimum-similarity(p_i, q) ≥ theta) then
11.            result-set = result-set UNION p_i
12.        end-if
13.    end-for
14. else
15.    for every sub-tree of node do
16.        if (maximum-similarity(sub-tree, q) ≥ theta) then
17.            result-set = result-set UNION Theta-Similarity-Query (sub-tree, theta, q)
18.        end-if
19.    end-for
20. end-if
21. return (result-set)

Sub Algorithm Minimum_Similarity(p, q)
Input

    p: the object (college) present in the leaf node of the R-tree
    q: the queried object (college)
Output
    a numeric value representing the similarity measure between the two objects

1. a = set of students that reference the college object p
2. b = set of students that reference the college object q
3. similarity = |a ∩ b| / |a ∪ b|    (Jaccard similarity)
4. return similarity

Reverse Top-k computation
Algorithm ReverseTopkFull()
    reverseTopk[][] = new int[colleges][students]
    for i = 1 to number of colleges do
    {
        col = 0
        for j = 0 to (number of rows in topkResultSet) - 1
        {
            for k = 0 to (number of elements in row j) - 1
            {
                if (topkResultSet[j][k] == i) then
                    reverseTopk[i-1][col++] = j + 1
            }
        }
    }

Algorithm reverseTopklist(obj)
Input
    obj: college object
Output
    List of students

    for i = 1 to number of colleges do
    {
        if (collegelist[i][1] = obj) then
            return reverseTopk[i-1]    (the list of students for the ith college)
        endif
    }

The WCLUSTER algorithm makes use of the above similarity search algorithm. WCLUSTER provides the exhaustive set of clusters. In each step of the iterative process, a cluster is separated from the whole dataset and the remaining dataset is the candidate for the next iteration. The process ends when all the elements of the master dataset have been clustered.

The algorithm Theta-Similarity-Query returns all the objects similar to the given object q. The object may be any item such as a tuple, product, book, patient, medicine, profile, mobile, wine and so on. The R-tree index structure is mainly used for fast searching. In each iteration of the search a node is examined. If the maximum-similarity value of the node is greater than or equal to theta, all the nodes within its sub-tree are searched recursively, the tuples of each node are checked against the minimum-similarity condition, and the qualifying tuples or objects are added to the result set. Whenever a leaf node is reached, the Jaccard similarity measure is applied to all the objects of the leaf node by executing a reverse top-k query for each object; at the same time the condition similar(p, q) ≥ theta is tested and each object that satisfies it is added to the result set. Different types of pruning techniques are applied during the computation of the similarity measure.

6. Comparison of proposed algorithm with traditional methods

Data grouping in recommender systems traditionally follows the k-means approach. The k-means approach treats each attribute alike and does not consider weights with respect to priority attributes. In addition, the traditional approach needs high computational effort. The proposed approach using the R-tree saves a significant amount of computation time. The traditional approach needs comparatively more iterations for clustering than the proposed R-tree based method. A single similarity search in the proposed approach is sub-linear, whereas traditional methods like the k-means algorithm need O(n^2) time. The time complexity of a search operation in the R-tree is O(log n). In the best case all the colleges belong to a single cluster and the R-tree is searched once. Hence the best-case time complexity is O(log n). In the worst case no two engineering colleges have the same profile of attributes, and the R-tree is searched n times, where n is the number of engineering colleges. Hence the worst-case time complexity of the proposed algorithm is O(n log n). The average-case time complexity of the algorithm lies between O(log n) and O(n log n), and it can be estimated as

$$T_{avg}(n) \approx \frac{\log n}{2} + \frac{n \log n}{2} \approx \frac{n \log n}{2} \approx O(n \log n)$$

Hence, the best-case, average-case and worst-case time complexities of the proposed algorithm are O(log n), O(n log n) and O(n log n) respectively. In many real-time cases the average time complexity is considered to be the best estimator of algorithm time complexity.
Hence, in terms of time complexity, the proposed algorithm is superior to many of the traditional clustering algorithms.
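The clustering loop of Section 5 and the best-case and worst-case call counts discussed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the R-tree search is replaced by a plain linear scan (an assumption made to keep the example short), the reverse top-k sets are supplied directly as hypothetical data, and all function and variable names are illustrative.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def theta_similarity_query(seed: str, candidates: List[str],
                           voters: Dict[str, Set[str]], theta: float) -> List[str]:
    """Linear-scan stand-in (assumption) for the R-tree based similarity search."""
    return [c for c in candidates if jaccard(voters[seed], voters[c]) >= theta]


def wcluster(voters: Dict[str, Set[str]], theta: float) -> List[List[str]]:
    """Peel off one cluster per iteration until every college is clustered."""
    remaining = sorted(voters)
    clusters: List[List[str]] = []
    while remaining:
        seed = remaining[0]
        cluster = theta_similarity_query(seed, remaining, voters, theta)
        if seed not in cluster:  # a seed with no voters still forms its own cluster
            cluster.insert(0, seed)
        clusters.append(cluster)
        remaining = [c for c in remaining if c not in cluster]
    return clusters


# Hypothetical reverse top-k sets: college -> students who rank it in their top-k.
voters = {
    "c1": {"s1", "s2", "s3"},
    "c2": {"s1", "s2", "s4"},
    "c3": {"s5", "s6"},
    "c4": {"s5", "s6", "s7"},
    "c5": set(),
}

print(wcluster(voters, theta=0.5))  # [['c1', 'c2'], ['c3', 'c4'], ['c5']]
```

With theta = 0.0 every college joins the first seed, so the similarity query runs once (the best case above); with theta = 1.0 each of the five colleges ends up alone, so the query runs n times (the worst case).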

Georgoulas Konstantinos, et al. [10] introduced a new observing the two graphs shown in Figure-1 and Figure-2
user-centric approach for finding object’s similarities. New it is clear that for the datasets with small sizes the
approach considers not only values of attributes of objects difference between execution times of existing k-means
but also preferences of attributes of objects are used in algorithm and proposed W-clustering algorithm is very
finding similarities between objects. Authors said that small and the difference in execution times will increase
proposed technique is very much useful for business rapidly as the sizes of datasets increase. For very large
organizations in finding business status details of a datasets the algorithm k-means is not scalable whereas the
particular product/object and a more efficient, effective, proposed W-cluster algorithm is scalable to the maximum
optimal marketing business policy can be established and
products can be clustered based on the preferences of
customers.

Table 1: Execution times of the existing K-means clustering algorithm
Sno   Number of colleges   Execution time in seconds   Clusters
1     50                   9                           6
2     100                  23                          9
3     150                  53                          13
4     200                  94                          14
5     296                  170                         18

Table 2: Execution times of the proposed W-clustering algorithm with R-tree
Sno   Number of colleges   Number of students   Execution time in seconds   Clusters
1     50                   50                   1                           9
2     100                  100                  4                           13
3     150                  150                  12                          14
4     200                  200                  31                          17
5     296                  300                  46                          19

Fig. 1 Execution times of k-means and proposed algorithms

Fig. 2 Execution times of k-means and proposed algorithms

The experimentally obtained execution times of the existing
K-means clustering algorithm and the proposed W-clustering
algorithm with R-tree are shown in TABLE-1 and TABLE-2
respectively. The same data are plotted as a column chart in
Figure-1 and as a line chart in Figure-2. After
extent and it is suitable for many real-world applications
because of the large-data indexing capability of the R-tree
indexing technique. Figure-3 shows that the number of clusters
in the proposed W-cluster technique increases gradually as the
size of the dataset increases.

Fig. 3 Number of clusters in k-means and W-cluster

7. Data preparation
The data set was collected from intermediate and B.Tech
students, with a sample of 2,000 students from various
colleges. The data were gathered through a structured online
questionnaire and consist of the students' opinions and
preferences regarding the engineering colleges they would
like to join. The actual attribute information was also
collected for about 500 engineering colleges. These two data
sets were used to apply the proposed methodology.

8. The application
The college recommender system is implemented in Java, and
its main application is to support an optimal decision in
selecting the best college for EAMCET admissions. The
proposed algorithm was applied to a college data set of 296
records, each containing 7 attributes. The system also uses a
student data set that contains the students' individual
preferences for the various college attributes. Both data
sets are used during college clustering. The experiments are
organized into cases that combine fixed and variable
parameters. The experimentally obtained results are presented
in the form of tables and figures.
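
The execution times reported in TABLE-1 and TABLE-2 can be compared directly. The following is a minimal Python sketch, not part of the authors' Java system, that computes how many times longer K-means took than the proposed algorithm for each data set size. It assumes that the rows of the two tables are comparable because they share the same college counts, although TABLE-2 additionally involves a student data set. The function name `speedups` is hypothetical.

```python
# Execution times (seconds) as reported in TABLE-1 and TABLE-2.
colleges = [50, 100, 150, 200, 296]
kmeans_sec = [9, 23, 53, 94, 170]
wcluster_sec = [1, 4, 12, 31, 46]


def speedups(baseline, proposed):
    """Return the baseline/proposed time ratio per experiment, rounded to 2 places."""
    return [round(b / p, 2) for b, p in zip(baseline, proposed)]


ratios = speedups(kmeans_sec, wcluster_sec)
for n, r in zip(colleges, ratios):
    print(f"{n} colleges: K-means takes {r}x as long as W-cluster")
```

On the reported figures the ratio falls from 9.0 at 50 colleges to about 3.7 at 296 colleges, so the proposed algorithm is faster in every row, with a narrower margin on the larger data sets.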

9. Results

The developed system is experimentally verified using two
real-world data sets, namely colleges and students. Different
output parameter values are noted and their relationships are
plotted on graphs and charts.

Table 3: fixed variables set
Total number of colleges        296
Total number of students        50
Similarity between colleges     0.2
Maximum number of attributes    7

Table 4: execution results for various values of k in top-k

Case No: 1   K in top-k: 5   Execution time: 3 min 38 sec   Number of Clusters: 4
Clusters data:
[1, 104, 105, 106, 108, 112, 113, 114, 115, 117, 118, 120, 121, 122, 124, 125, 136, 139, 14, 140, 142, 143, 144, 145, 146, 147, 148, 149, 150,
153, 154, 161, 167, 168, 169, 170, 172, 173, 179, 182, 185, 186, 187, 189, 190, 192, 193, 196, 197, 199, 20, 201, 203, 204, 207, 208, 209,
213, 214, 216, 217, 218, 22, 221, 222, 223, 224, 229, 231, 234, 241, 242, 245, 247, 249, 25, 255, 26, 260, 265, 268, 270, 275, 279, 28, 282,
287, 29, 31, 33, 34, 35, 36, 37, 39, 40, 46, 48, 49, 51, 52, 56, 6, 60, 63, 64, 7, 71, 73, 82, 83, 84, 89, 94, 97]

[228, 230, 235, 236, 239, 240, 251, 254, 257, 259, 267, 276, 288, 290, 291, 292, 293, 294, 30, 41, 57, 65, 68, 70, 85]

[112, 139, 142, 143, 144, 145, 147, 167, 172, 173, 187, 192, 193, 196, 197, 20, 201, 204, 207, 208, 214, 22, 222, 224, 255, 260, 265, 275, 28,
29, 31, 33, 35, 36, 37, 39, 56, 60, 63, 64, 7, 71, 73, 82, 83, 84, 97]

[206, 228, 230, 235, 236, 239, 240, 251, 254, 259, 276, 280, 290, 291, 292, 293, 294, 41, 57, 68, 70]

Case No: 2   K in top-k: 10   Execution time: 3 min 55 sec   Number of Clusters: 6
Clusters data:
[1, 104, 105, 106, 108, 112, 113, 114, 115, 117, 118, 120, 121, 122, 124, 125, 136, 139, 14, 140, 142, 143, 144, 145, 146, 147, 148, 149, 150,
153, 154, 161, 167, 168, 169, 170, 172, 173, 179, 182, 185, 186, 282, 284, 285, 287, 29, 291, 292, 294, 30, 31, 33, 34, 35, 36, 37, 39, 40]

[41, 46, 48, 49, 51, 52, 56, 57, 6, 60, 63, 64, 65, 66, 67, 68, 7, 70, 71, 73, 82, 83, 84, 85, 89, 94, 97, 187, 189, 190, 192, 193, 196, 197, 199, 20,
201, 202, 203, 204, 206, 207, 208, 209, 213, 214, 216, 217, 218, 22, 221, 222, 223, 224]

[228, 229, 230, 231, 234, 235, 236, 239, 240, 241, 242, 245, 246, 247, 249, 25, 251, 254, 255, 257, 26, 260, 265, 268, 270, 275, 276, 279, 28,
280,
[259, 267, 288, 290, 293]

[112, 172, 173, 204, 22, 251, 254, 255, 265, 275, 279, 280, 282, 284, 285, 291, 292, 294, 37, 39, 57, 64, 68, 7]

[259, 272, 274, 290, 293]

[22, 240, 251, 254, 276, 291, 292, 294, 37, 41, 57, 68, 7]

Case No: 3   K in top-k: 15   Execution time: 3 min 52 sec   Number of Clusters: 6
Clusters data:
[1, 104, 105, 106, 108, 112, 113, 114, 115, 117, 118, 120, 121, 122, 124, 125, 136, 139, 14, 140, 142, 143, 144, 145, 146, 147, 148, 149, 150,
153, 154, 161, 167, 168, 169, 170, 172, 173, 179, 182, 185, 186, 25, 251, 254, 255, 257, 26, 260, 265, 268, 270, 275, 276, 279, 28, 280, 282,
284, 285, 287, 29, 291, 292, 294, 30, 31, 33, 34, 35, 36, 37, 39, 40]

[187, 189, 190, 192, 193, 196, 197, 199, 20, 201, 202, 203, 204, 206, 207, 208, 209, 213, 214, 216, 217, 218, 22, 221, 222, 223, 224, 228,
229, 230, 231, 234, 235, 236]

[41, 46, 48, 49, 51, 52, 56, 57, 6, 60, 63, 64, 65, 66, 67, 68, 7, 70, 71, 73, 82, 83, 84, 85, 89, 94, 97, 239, 240, 241, 242, 245, 246, 247, 249,
259, 267, 288, 290, 293]

[112, 172, 173, 204, 22, 251, 254, 255, 265, 275, 279, 280, 282, 284, 285, 291, 292, 294, 37, 39, 57, 64, 68, 7]

[259, 272, 274, 290, 293]

[22, 240, 251, 254, 276, 291, 292, 294, 37, 41, 57, 68, 7]

In a similar way, experiments are executed for different
values of k in top-k and the obtained results are shown in
TABLE-4. The execution times are tabulated against the
different values of k in top-k.
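
To make the role of k concrete, the sketch below shows one common way of answering a top-k preference query: each college is scored by a weighted sum of its attributes, and the k highest-scoring colleges form the student's top-k list. This is an illustration only. The linear scoring function, the attribute values, the weights and the name `top_k` are all assumptions, and the brute-force scan stands in for the R-tree index that the paper uses to avoid scoring every college.

```python
import heapq


def top_k(colleges, weights, k):
    """Return the ids of the k colleges with the highest weighted score.

    colleges: dict mapping college id -> list of attribute values
    weights:  one student's preference weights, one per attribute
    """
    def score(attrs):
        return sum(w * a for w, a in zip(weights, attrs))

    # nlargest iterates over the dict keys and orders them by score, best first.
    return heapq.nlargest(k, colleges, key=lambda cid: score(colleges[cid]))


# Hypothetical data: three attributes per college, normalised to [0, 1].
colleges = {
    1: [0.9, 0.2, 0.5],
    2: [0.4, 0.8, 0.6],
    3: [0.7, 0.7, 0.1],
    4: [0.1, 0.3, 0.9],
}
student_weights = [0.5, 0.3, 0.2]

print(top_k(colleges, student_weights, 2))
```

With these hypothetical values the scores are 0.61, 0.56, 0.58 and 0.32, so the top-2 list contains colleges 1 and 3. Raising k lengthens every student's list, which is why the same college can appear in more lists as k grows.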

Table 5: top-k versus execution time
Serial No.   K in top-k   Execution time in sec
1            5            3 min 38 sec = 218
2            10           3 min 55 sec = 235
3            15           3 min 52 sec = 232
4            20           3 min 43 sec = 223
5            25           3 min 52 sec = 232
6            30           3 min 27 sec = 207
7            35           3 min 43 sec = 223
8            40           3 min 30 sec = 210
9            45           4 min 22 sec = 262
10           50           3 min 54 sec = 234
11           55           3 min 49 sec = 239
12           60           3 min 54 sec = 234

Fig. 4 Relationship between k value in Top-k and execution Time

Figure-4 shows the relationship between the k value in top-k
and the corresponding execution time. Here the maximum
college data size and student preferences size are kept
constant. Figure-4 shows that there are no drastic ups and
downs in execution time for the various values of k because
the data size is kept constant. The range of execution times
is approximately fixed. Here execution time depends mainly
on the size of the data set. Execution time increases as the
data set size increases.
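
The dependence of execution time on data set size can be quantified with a least-squares line. The sketch below is an illustration added here, not part of the authors' system. It fits such a line to the student counts and execution times in seconds exactly as they are reported in Table 8. The function name `fit_line` is hypothetical.

```python
# Student counts and execution times (seconds) as reported in Table 8.
students = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
seconds = [309, 521, 689, 845, 1024, 1221, 1436, 1632, 1925, 2149]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(students, seconds)
print(f"about {slope:.2f} seconds per additional student")
print(f"intercept about {intercept:.0f} seconds")
```

On the reported values the fitted slope is close to 2 seconds per additional student, which is consistent with execution time growing roughly in proportion to the number of student preference records.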

Table 6: top-k versus number of clusters
Serial No.   K in top-k   Number of Clusters
1            5            4
2            10           6
3            15           7
4            20           10
5            25           12
6            30           13
7            35           15
8            40           17
9            45           19
10           50           21

Fig. 5 top-k versus number of clusters

Figure-5 shows that the number of clusters increases as the
value of k in the top-k list increases. This is expected:
when k increases, the same object appears in more preference
lists and consequently the number of preference groups
(clusters) increases. Hence Figure-5 shows that the number of
clusters increases progressively with increasing k.

Table 7: fixed parameter list
Total tuples          = 296
K in top-k            = 50
Theta similarity      = 0.2
Maximum attributes    = 7

Table 8: total data set size versus variable sizes of weights and execution times
Serial No.   Maximum Students   Execution Time           Number of Clusters
1            100                5 min 9 sec = 309        20
2            200                8 min 41 sec = 521       17
3            300                11 min 29 sec = 689      20
4            400                14 min 5 sec = 845       21
5            500                17 min 4 sec = 1024      21
6            600                20 min 21 sec = 1221     21
7            700                23 min 56 sec = 1436     20
8            800                27 min 12 sec = 1632     21
9            900                31 min 55 sec = 1925     21
10           1000               35 min 49 sec = 2149     21

Fig. 6 Relationship between maximum preferences and execution times

FIGURE-6 shows that there exists a linear relationship
between the number of preferences and the execution time. It
depicts the natural behaviour that execution time increases
as the data size increases. For smaller preference sets the
execution time follows a linear relationship, and for larger
preference sets it follows a sub-linear relationship. That
is, scalability is linear for simple data sets, whereas it is
sub-linear when the number of tuples in the data set
increases gradually. Scalability also decreases as the
dimensionality of the data set increases.

Table 9: fixed parameter list
Total Colleges        100
Total Students        100
K in top-k            50
Maximum attributes    7

Table 10: various theta values
Serial No.   Theta value   Execution time in sec   Number of clusters
1            0.1           23                      18
2            0.2           25                      17
3            0.3           25                      14
4            0.4           26                      11
5            0.5           23                      9
6            0.6           23                      7
7            0.7           21                      6

Fig. 7. Relation between similarity measure and number of clusters

FIGURE-7 shows that the total number of clusters generated
decreases smoothly as the similarity threshold increases.
This is because, when the similarity threshold is set very
high, many objects do not satisfy the threshold and
consequently are not included in any of the clusters. Many
objects are excluded from the clustering process because
their similarity is below the threshold value, which results
in a decrease in the number of clusters.
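
The effect of the similarity threshold can be illustrated with a small, simplified grouping routine. The sketch below is an assumption-laden illustration rather than the authors' W-cluster procedure. It represents each college by the set of students who rank it in their top-k list, measures similarity with the Jaccard coefficient (an assumed choice), and produces non-overlapping groups, whereas the clusters reported in the tables overlap. The data and the names `jaccard` and `threshold_clusters` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_clusters(voters, theta):
    """Greedy grouping of colleges around seed colleges.

    voters: dict mapping college id -> set of student ids ranking it in their top-k
    theta:  similarity threshold in [0, 1]
    A college that matches no other unassigned college at this threshold
    is left out of every cluster.
    """
    clusters, assigned = [], set()
    for seed in sorted(voters):
        if seed in assigned:
            continue
        members = [seed] + [
            c for c in sorted(voters)
            if c != seed and c not in assigned
            and jaccard(voters[seed], voters[c]) >= theta
        ]
        if len(members) > 1:  # singletons are excluded, not clustered
            clusters.append(members)
            assigned.update(members)
    return clusters


# Hypothetical data: six colleges and the students who prefer them.
voters = {
    1: {1, 2, 3, 4},
    2: {1, 2, 3, 5},
    3: {1, 2, 6, 7},
    4: {8, 9},
    5: {8, 9, 10},
    6: {11},
}

for theta in (0.3, 0.65, 0.7):
    print(theta, threshold_clusters(voters, theta))
```

With this toy data the routine forms two clusters at theta = 0.3, one at 0.65 and none at 0.7, which mirrors the trend in Table 10: a stricter threshold leaves more colleges unmatched and therefore yields fewer clusters.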

Table 11: Summary of the results of the first three cases executed. The remaining cases resemble the first three and are therefore not summarized again.

S No.: 1   Number of Colleges: 50   Number of Students: 50   Execution time in sec: 12   Number of clusters: 2
Actual clusters:
[14, 20, 22, 25, 26, 28, 29, 30, 31, 33, 34, 35, 36, 37, 39, 40, 41, 46, 48, 49, 6, 7]

[112, 167, 172, 173, 193, 204, 22, 228, 230, 235, 236, 239, 240, 255, 265, 275, 276, 282, 285, 33,
35, 36, 37, 39, 41, 56, 64, 7, 70, 84]

S No.: 2   Number of Colleges: 100   Number of Students: 100   Execution time in sec: 81   Number of clusters: 4
Actual clusters:
[1, 14, 20, 22, 31, 34, 37, 48, 49, 51, 52, 6, 63, 7, 73, 82, 83, 89, 94, 97]

[28, 29, 30, 33, 35, 36, 39, 40, 41, 46, 56, 57, 60, 65, 66, 67, 68, 70, 71, 84, 85]

[1, 22, 31, 37, 63, 7, 83, 89, 97]

[25, 26, [64, 63]

S No.: 3   Number of Colleges: 150   Number of Students: 150   Execution time in sec: 145   Number of clusters: 5
Actual clusters:
[1, 104, 105, 106, 108, 112, 113, 114, 115, 117, 118, 120, 124, 125, 136, 14, 140, 146, 147, 150,
20, 22, 25, 26, 31, 34, 37, 48, 49, 51, 52, 6, 63, 64, 7, 73, 82, 83, 89, 94, 97]

[121, 122, 139, 142, 143, 144, 145, 30, 33, 35, 36, 40, 41, 46, 56, 57, 60, 65, 66, 67, 68, 70, 71,
85]

[1, 112, 147, 22, 31, 37, 52, 63, 64, 7, 73, 82, 83, 89, 97, [122, 29, 33, 35, 36, 39, 40, 41, 46, 56,
57, 67, 68, 70, 84]

[147, 20, 6, 63, 148, 149, 28, 33, 35, 36, 41, 56, 57, 67, 68, 70]

[150, 104, 83, 94, 97, 85, 33, 35, 36, 56, 84]

Table 12: relationships among colleges, students, execution time and number of clusters
SNo.   Number of Colleges   Number of Students   Execution time in sec   Number of clusters
1      50                   50                   12                      2
2      100                  100                  81                      4
3      150                  150                  145                     5
4      200                  200                  211                     5
5      296                  300                  836                     18
6      296                  400                  836                     20

Fig. 8 relationships among colleges, students, execution time and clusters

FIGURE-8 shows the relationships among colleges, students,
execution times and the clusters formed after execution.
Execution time increases linearly with the number of colleges
up to a certain point; beyond that point the execution time
curve follows an exponential growth rate, as is the case with
many real-world large data sets. The number of clusters
increases smoothly as the numbers of colleges and students
increase.

10. Conclusions
A novel technique for college recommendation was presented.
The important problem of building a college recommender
system was addressed with the proposed grouping and
recommendation technique. The proposed technique is well
suited to the kinds of data currently available. Intelligent
and time-saving recommendation systems can be developed by
embedding the proposed R-Tree and top-k query approaches. The
approach was implemented and applied to develop a recommender
system for college selection based on students' preferences.
The results showed that the proposed technique is more
reliable, more intelligent and faster than the existing
approaches.
A novel technique for top engineering college recommendation
is developed. The engineering college recommender system for
students is intended to solve many of the problems that
frequently occur during the EAMCET admission process with
respect to student voting, preferences, ratings and opinions.
A new intelligent and time-saving system is developed based
on the R-Tree and top-k query approaches and on the
voting/preferences of the students. The developed system is
tested on data collected from various engineering colleges.
The college data set represents all the profile attributes of
the engineering colleges. Students' votes/preferences with
respect to the college attributes are also collected and used
in the present recommendation system. Experimental results
show that the proposed system is reliable, faster, intelligent
and more useful for aspirants of engineering college
admissions. In

the future, the system can be extended to admissions for
institutions such as IITs, IIITs and NITs. The same setup can
also be extended to many more applications relating to
recommender systems that can exhibit the same improvements.

Acknowledgement
To collect the related data, an online survey was conducted
using a questionnaire. I am always thankful to all who
participated and cooperated in the data collection.

Y. Subba Reddy received the M.Sc (Computer Science) degree
from Bharathidasan University, Tiruchirapalli, TN and the M.E
degree in Computer Science & Engineering from Sathyabama
University, Chennai, TN. He is a research scholar in the
Department of Computer Science, Sri Venkateswara University,
Tirupati, AP, India. His research focus is on Data Mining in
Clustering and Similarity measures.

P. Govindarajulu is a Professor in the Department of Computer
Science, Sri Venkateswara University, Tirupati, AP, India. He
received his M.Tech. from IIT Madras (Chennai) and his Ph.D
from IIT Bombay (Mumbai). His areas of research are
Databases, Data Mining, Image Processing, Intelligent Systems
and Software Engineering.
