
ISSN (Online) : 0974-5645

ISSN (Print) : 0974-6846


Indian Journal of Science and Technology, Vol 8(4), 301–306, February 2015 DOI : 10.17485/ijst/2015/v8i4/60376

Personalized Search Engine using Social Networking Activity
Nathaneal Ramesh* and J. Andrews
Department of Computer Science and Engineering, Sathyabama University, Chennai - 600119,
Tamil Nadu, India; nathaneal31@gmail.com, andrews_593@yahoo.com

Abstract
The main objective of this research work is to provide personalized search results by creating a user
profile based on social networking activity. The user profile is constructed from the pages liked by the individual user
on their Facebook account. The search results are re-ranked according to the constructed user profile to obtain
personalized search results. Lingo, a novel algorithm, is used for clustering the data. The search results are
retrieved for the user through the Carrot2 search engine API. In the past, personalized search engines acquired data from surfing
history implicitly, or explicitly by machine learning, whereas this work acquires data implicitly through user likes and also
explicitly through user-defined categories. Facebook likes are given by each user purely according to personal interest. Thus,
these data play a vital role in providing accurate search results that match each user's
interests.

Keywords: Information Retrieval, Personalization, Search Engine, Social Network and Web Search

1. Introduction

A web search engine is designed to search and retrieve information from the World Wide Web (WWW). The search results are presented in a list and are often called hits. The results for a given query may include web pages, images, videos and other types of files.

A query is the text that the user submits to the search engine. Most commercial search engines return roughly the same results for the same query, regardless of the individual user's real interest. Since queries submitted to search engines tend to be very short and ambiguous, they are unlikely to express the user's precise needs. For example, if the query is "apple", the search results will contain documents about both the apple fruit and Apple computers. But the user may be a computer professional who is interested only in Apple computers. In that case, documents about Apple computers should be displayed at the top of the search results.

Thus, a personalized search engine returns the search results most relevant to the user's interests. For the query "puma", a zoologist would be interested in the puma (cougar), an animal species living in mountainous regions, whereas a typical user would be interested in Puma clothes and sportswear, their new arrivals and purchasing. User profiling is a fundamental component of any personalization application. Personalized search engines rank the search results based on the user's interests. Previously, the user profiles for these personalized search engines were constructed either implicitly or explicitly.

Implicit learning relies on the surfing activities of the user, that is, on observing each user's behaviour, such as the time spent reading an online document, or on tracking the pages visited by the user [1]. Explicit learning makes the system learn through training, by asking for feedback in the form of preferences or ratings. Explicit construction of user profiles has several drawbacks. The user might provide inconsistent or incorrect information, the profile that is built will be static whereas the user's interests may change over time, and the construction of

*Author for correspondence



the user profile places a burden on the user that they are unwilling to accept. Explicit construction therefore requires user intervention. This can be overcome by constructing user profiles automatically and implicitly while users browse the web. But even implicit profile construction is not accurate in all cases, as it depends on the system's judgement. Fuzzy logic has also been used to improve the performance of search engines and obtain better search results [2].

Personalized search engines can be improved further by using the user's social networking activities. Networking sites such as Facebook and Twitter can be used to track a user's daily activities and thereby deduce their interests. Over the last decade, the World Wide Web and web search engines have dramatically transformed the way people share information. Recently, a new way of sharing and locating information, known as social networking, has become highly popular. While numerous studies [3,4,5] have focussed on the hyperlinked structure of the Web and have exploited it for searching content, few studies [6,7] have examined the information exchange in online social networks. The proposed system uses a few parameters from Facebook to construct the user profile, and the personalized search engine is constructed by implicit as well as explicit learning.

A web search engine often returns thousands of pages in response to a broad query, making it very difficult for users to browse the results or to identify the relevant information they need. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories [8,9], as is done by enterprise search engines such as Northern Light and Vivisimo, by consumer search engines such as PolyMeta and Helioid, or by open source software such as Carrot2 [10].

2. Problem Statement

The task of the system is to derive the user's interests based on his/her activity on social networking sites such as Facebook and Twitter, and to re-rank the search results based on the user interest profile.

• The system recognizes the interests of the user and returns the appropriate results to the user.
• The system enables the user to continually feed back his/her interests without deliberate effort, thus reducing the need to train the system.

3. Proposed System

In this work, a personalized search engine is proposed that implicitly constructs a user profile from the user's Facebook activities. Although the Facebook activities from which user interests are derived are explicit, the user performs them for social networking purposes and not for personalization. Therefore, this is an implicit learning method. Search results are obtained from any popular search engine such as Google, Yahoo, Bing or AltaVista. These search results are re-ranked based on the user's interests.

Lingo, a novel algorithm for clustering search results that emphasizes cluster description quality, is used here. Lingo first attempts to ensure that a human-perceivable cluster label can be created and only then assigns documents to it. Specifically, frequent phrases are extracted from the input documents, on the assumption that they are the most informative source of human-readable topic descriptions. Lingo then reduces the original term-document matrix using singular value decomposition (SVD), which uncovers any latent structure of diverse topics in the search results. Finally, group descriptions are matched with the extracted topics and relevant documents are assigned to them.

SVD breaks a t × d matrix A into three matrices U, Σ and V such that $A = U \Sigma V^{T}$. Here, U is a t × t orthogonal matrix whose column vectors are called the left singular vectors of A, V is a d × d orthogonal matrix whose column vectors are called the right singular vectors of A, and Σ is a t × d diagonal matrix with the singular values of A in decreasing order along its diagonal. The rank $r_A$ of matrix A is equal to the number of its non-zero singular values. The first $r_A$ columns of U form an orthogonal basis for the column space of A, an essential fact used by Lingo [11].

Figure 1 shows the overall architecture of the system and its modules. The user logs in to the search engine using their Facebook account. The user then enters a search query, and the results for it are obtained from an existing search engine such as Google. The search results are then pre-processed to extract the key terms related to the user query. This processing is done with the Lingo clustering algorithm. The resulting cluster labels are taken to be the key terms for the given query and are called concepts. After cluster formation, the search results are returned to the user

302 Vol 8 (4) | February 2015 | www.indjst.org Indian Journal of Science and Technology

Figure 1. Functional architecture.

using the Carrot2 search engine. Carrot2 connects to the local server and pulls the relevant data from Google. The
steps involved in the Lingo clustering algorithm are depicted in the flow chart (Figure 2) and are explained in detail
in Sections 3.1 to 3.5.
Figure 2. Flow of Lingo Clustering.
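
The SVD properties that Lingo relies on in Section 3 can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up toy term-document matrix (the values are illustrative and are not taken from this work). It rebuilds A from U, Σ and V, counts the non-zero singular values to obtain the rank, and takes the first r_A columns of U as a basis of the column space.

```python
import numpy as np

# Toy term-document matrix A (t = 4 terms, d = 3 documents); illustrative values only.
# The third column is the sum of the first two, so the rank is 2.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [0.0, 0.0, 0.0],
])

# full_matrices=True returns U as t x t and Vt (= V transposed) as d x d.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the t x d diagonal matrix Sigma from the singular values (decreasing order).
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Rank = number of non-zero singular values, up to a numerical tolerance.
rank = int(np.sum(s > 1e-10))

# The first `rank` columns of U span the column space of A.
basis = U[:, :rank]

print("rank:", rank)
print("reconstruction ok:", np.allclose(U @ Sigma @ Vt, A))
```

Projecting A onto `basis` (that is, computing `basis @ basis.T @ A`) leaves A unchanged, which is the sense in which these columns form a basis of the column space.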

3.1 Pre-Processing

Stemming and stop word removal are very common operations in Information Retrieval. It should be noted that these processes do not always give positive results; in certain applications stemming does not improve the overall quality at all. Nevertheless, recent experiments show that pre-processing is of great importance in the Lingo clustering approach, because the input snippets are generated automatically from the original documents and are generally very small. Although SVD can deal with noisy data, without pre-processing most of the discovered abstract concepts would be related to meaningless frequent terms. Pre-processing is carried out in three steps: (i) HTML tags, entities and other characters, except for sentence boundaries, are removed using text filtering; (ii) each snippet's language is identified and the appropriate stemming is applied; and (iii) stop word removal completes the pre-processing.

3.2 Frequent Phrase Extraction

Frequent phrases are ordered sequences of terms that occur frequently in the input documents. When writing about a subject, it is very common to use keywords related to that subject in order to maintain the reader's attention. Good writing, however, also uses synonyms and pronouns to avoid annoying repetition. The effect of synonyms is partially handled by the Lingo algorithm, which uses the SVD-decomposed term-document matrix to identify the abstract concepts. To be a candidate for a cluster label, a frequent phrase or a single term must satisfy the following general guidelines:

• It appears in the input documents at least a certain number of times,
• It does not cross sentence boundaries,
• It is a complete phrase,
• It neither begins nor ends with a stop word.

3.3 Cluster Label Induction

Once the frequent phrases and the individual frequent terms that exceed the term frequency threshold are known, they are used for cluster label induction. Cluster label induction follows three steps: (1) building the term-document matrix, (2) abstract concept discovery, and (3) phrase matching and label pruning.

The term-document matrix is constructed from the single terms that exceed the predefined term frequency threshold. The weight of each term is calculated using the standard term frequency, inverse document frequency (tf-idf) formula. In abstract concept discovery, the orthogonal basis of the term-document matrix is found using the SVD method. The vectors of this basis (the columns of the SVD's U matrix) represent the abstract concepts that appear in the input documents. The value of k, the number of abstract concepts retained, is estimated by comparing the Frobenius norms of the term-document matrix A and its k-rank approximation $A_k$. Let the threshold q be a percentage value that determines to what extent the k-rank approximation should retain the original information in matrix A. Then k is defined as the minimum value that satisfies the following condition:

$\|A_k\|_F / \|A\|_F \ge q$,

where $\|X\|_F$ denotes the Frobenius norm of matrix X.

The phrase matching and label pruning step is responsible for discovering the group descriptions. It relies on an important observation: both the abstract concepts and the frequent phrases are expressed in the same vector space. How close a phrase or a single term is to an abstract concept can therefore be calculated using the classic cosine distance. Let P be a matrix of size t × (p + t) whose columns represent the frequent phrases and single terms, where t is the number of frequent terms and p is the number of frequent phrases. For the ith abstract concept, given by the column vector $U_i$ of U, the vector $m_i$ is calculated as

$m_i = U_i^{T} P$.

The phrase corresponding to the maximum component of $m_i$ is selected as the human-readable description of the ith abstract concept, and the value of the cosine becomes the score of the cluster label candidate. The processing for a single abstract concept extends to the entire $U_k$ matrix through a single matrix multiplication, $M = U_k^{T} P$.

The last step of label induction is to prune overlapping label descriptions. Let V be a vector of cluster label candidates and their scores. Another term-document matrix Z is created, in which the documents are the cluster label candidates. After column length normalization, $Z^{T} Z$ is calculated, which yields a matrix of similarities between the cluster labels. For each row, the columns that exceed the label similarity threshold are selected, and all but the single cluster label candidate with the maximum score are discarded.

3.4 Cluster Content Discovery

Input documents are assigned to the cluster labels using the Vector Space Model (VSM). The assignment process resembles document retrieval based on the VSM. Define a matrix Q in which each cluster label is represented as a column vector, and let

$C = Q^{T} A$,

where A is the original term-document matrix for the input documents. The element $c_{ij}$ of the C matrix indicates the strength of membership of the jth document in the ith cluster. A document is added to a cluster if $c_{ij}$ exceeds the snippet assignment threshold, which is another control parameter of the algorithm. Documents not assigned to any cluster are placed in an artificial cluster called "Others".

3.5 Final Cluster Formation

In the final stage, the clusters are sorted for display based on their score, which is calculated using the following formula:

$C_{score} = \text{label score} \times \|C\|$,

where $\|C\|$ is the total number of documents assigned to cluster C.

The scoring function is simple, but it prefers well-described and larger groups over smaller, possibly noisy ones. The Lingo algorithm does not use any cluster merging strategy or hierarchy induction.

When the user logs in, their Facebook activities are simultaneously retrieved using the Graph API. The user's activity data is then segregated into likes and status posts. The likes are segregated into specific categories by sending each page name to Wikipedia. These likes are then separated into recent likes and the history of likes (other than recent). The statuses and posts are retrieved and categorized based on the tags used in them. They too are separated into recent posts and the history of posts (other than recent). An overall user profile is created by merging the like-based and the status-based profiles. Re-ranking is then done based on the user profile, and the personalized results are returned to the user.

4. Experimental Setup

There are three systems in total for building the personalised search engine. Figure 3 depicts how the search results are retrieved from a common search engine such as Google, Yahoo, Bing or AltaVista and then clustered. The clustered results are stored in a MySQL database.

304 Vol 8 (4) | February 2015 | www.indjst.org Indian Journal of Science and Technology
Nathaneal Ramesh and J. Andrews

Figure 4 depicts how a user profile is constructed from the user's Facebook activities. Based on these activities,
a user profile that holds an interest score for each category is constructed. These user interest scores are stored in
a separate MySQL database. The categories are chosen from the categories available in Facebook.
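
As an illustration of how per-category interest scores might be derived from likes, the sketch below counts likes per category and gives recent likes a higher weight than older ones. The weighting scheme, the 30-day window, the function name and the sample data are all assumptions made for this example; the paper does not specify how the scores are computed.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_profile(likes, now, recent_days=30, recent_weight=2.0, history_weight=1.0):
    """likes: iterable of (category, liked_at) pairs.

    Returns a dict mapping category -> interest score, normalized to sum to 1.
    The weights and the recency window are hypothetical parameters.
    """
    cutoff = now - timedelta(days=recent_days)
    raw = defaultdict(float)
    for category, liked_at in likes:
        # Recent likes count more than the history of likes.
        raw[category] += recent_weight if liked_at >= cutoff else history_weight
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()} if total else {}

# Hypothetical likes of one user.
likes = [
    ("Journals", datetime(2015, 1, 20)),   # recent
    ("Journals", datetime(2014, 6, 1)),    # history
    ("Movies", datetime(2014, 3, 15)),     # history
]
profile = build_profile(likes, now=datetime(2015, 2, 1))
print(profile)
```

In a deployment like the one described here, the resulting dictionary would correspond to the rows written to the user interest score database.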
Figure 5 shows how the clustered database and the user interest score database are combined. This gives a user
interest score for each URL returned by the clustered database. The URLs are re-ranked
based on the interest score and displayed to the user.
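
The re-ranking step can be sketched as a stable sort of the result URLs by the interest score of their category. This is a minimal sketch under assumptions: the function name, the example URLs and the idea that each URL carries a single category label are hypothetical, and categories missing from the profile are given a score of zero.

```python
def rerank(results, profile):
    """results: list of (url, category) in the engine's original order.
    profile: dict mapping category -> interest score.
    Returns the URLs sorted by decreasing interest score.
    """
    scored = [(profile.get(category, 0.0), url) for url, category in results]
    # sorted() is stable, so URLs with equal scores keep their original order.
    return [url for _, url in sorted(scored, key=lambda item: item[0], reverse=True)]

# Hypothetical profile and search results for the query "Education".
profile = {"Journals": 0.75, "Movies": 0.25}
results = [
    ("http://example.org/education-system", "Other"),
    ("http://example.org/film-schools", "Movies"),
    ("http://example.org/education-journals", "Journals"),
    ("http://example.org/education-levels", "Other"),
]
print(rerank(results, profile))
```

Keeping the sort stable means that results the profile cannot distinguish stay in the order chosen by the underlying search engine.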

Figure 3. Clustering architecture.

Figure 4. User profile construction.

Figure 5. Search result.

5. Results and Discussion

In this section, the results and performance measures of the proposed system are discussed. Consider a scenario with two user profiles (Figure 6). The first user is more interested in journals, whereas the second user is more interested in movies and journals. These data are obtained from the likes given by each user on their Facebook account. When these users submit the query 'Education' to Bing search, it returns results about education levels or the education system, in which neither user is interested. The personalized search engine, however, gives search results that reflect each user's interests. The bar chart depicts how much content of interest to the user is provided in the first result page by the etool (Bing) engine and by the proposed personalized search engine. The Bing engine gives the same results to both user 1 and user 2, while the personalized engine gives more results about movies to user 2 and more results about reading journals or newspapers to user 1.

Figure 6. Bar chart depicting interest relevance for Bing and the personalized engine.

The performance measure shows that the proposed personalized search engine returns different search results to each user. The search results depend on the user interest profile constructed from Facebook activity. Thus, the personalized engine is both dynamic and interest based.

6. Conclusion and Future Work

The search results obtained when the user enters a search query are re-ranked based on the social networking activities of the user. The user profile is constructed from the user's social networking activity by creating a like-based user profile. This profile is then used for re-ranking, and thus personalized search results are obtained according to the user's preferences.

In future, this work can be extended by extracting the statuses of the user and creating status-based profiles. The processing speed can be improved by using more specific algorithms, because the time currently taken to retrieve and calculate the score for search results is not user-friendly. More categories can be added to give more specific results across a wider range of fields.

7. References

1. Geetha Rani S, Sorana Mageswari M. A link-click-concept based Ranking Algorithm for Ranking Search Results. Indian Journal of Science and Technology. 2014 Oct; 7(10):1712–9.
2. Rezaei HR, Dehkordi MN, Moghadam RA. Improving performance of search engines based on fuzzy classification. Indian Journal of Science and Technology. 2012 Nov; 5(11):3607–11.
3. Sieg A, Mobasher B, Burke R. Learning ontology based user profiles: a semantic approach to personalized web search. IEEE Intelligent Informatics Bulletin. 2009 Nov; 8(1):7–18.
4. Radlinski F, Joachims T. Evaluating the robustness of learning from Implicit Feedback. ICML Workshop on Learning in Web Search; 2005. p. 42–50.
5. Xu Z, Luo X, Zhang S, Wei X, Mei L, Hu C. Mining temporal explicit and implicit semantic relations between entities using web search engines. Future Generat Comput Syst. 2014; 37:468–77.
6. Carmel D, Zwerdling N, Guy I, Ofek-Koifman S, Har'el N, Ronen I, Uziel E, Yogev S. Personalized Social Search Based on the User's Social Network. Proceedings of the 18th ACM Conference on Information and Knowledge Management; 2009. p. 1227–36.
7. Prates C, Fritzen E, Siqueira S, Helena M, de Andrade LCV. Contextual web searches in Facebook using learning materials and discussion messages. Journal Computers in Human Behavior. 2013; 29(2):386–94.
8. Lu Q, Conrad JG, Al-Kofahi K, Keenan W. Legal document clustering with built-in topic segmentation. Proceedings of the 20th ACM International Conference on Information and Knowledge Management; 2011 Oct. p. 383–92.
9. Teitler BE, Sankaranarayanan J, Samet H, Adelfio MD. Online document clustering using GPUs. New Trends in Databases and Information Systems. Springer International Publishing; 2014. p. 245–54.
10. Available from: http://en.wikipedia.org/wiki/Carrot2
11. Waterworth A. New South Wales, Sydney: University of Sydney. Available from: http://clusteringalgorithms.blogspot.in/2007/07/lingo-algorithm.html
