ClubCF: A Clustering-Based Collaborative Filtering Approach for Big Data Application
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING
Received 14 November 2013; revised 30 January 2014; accepted 25 February 2014. Date of publication 10 March 2014;
date of current version 30 October 2014.
Digital Object Identifier 10.1109/TETC.2014.2310485
ABSTRACT
Spurred by service computing and cloud computing, an increasing number of services are
emerging on the Internet. As a result, service-relevant data become too big to be effectively processed
by traditional approaches. In view of this challenge, a clustering-based collaborative filtering approach
is proposed in this paper, which aims at recruiting similar services in the same clusters to recommend
services collaboratively. Technically, this approach is enacted in two stages. In the first stage, the
available services are logically divided into small-scale clusters for further processing. In the second stage,
a collaborative filtering algorithm is applied to one of the clusters. Since the number of services in
a cluster is much smaller than the total number of services available on the web, this is expected to reduce
the online execution time of collaborative filtering. Finally, several experiments are conducted to verify the
availability of the approach on a real data set of 6225 mashup services collected from ProgrammableWeb.
INDEX TERMS
I. INTRODUCTION
Big data has emerged as a widely recognized trend, attracting attention from government, industry, and academia [1].
Generally speaking, Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources.
Big Data applications, in which data collection has grown
tremendously and is beyond the ability of commonly used
software tools to capture, manage, and process within a
tolerable elapsed time, are on the rise [2]. The most fundamental challenge for Big Data applications is to explore
the large volumes of data and extract useful information or
knowledge for future actions [3].
With the prevalence of service computing and cloud
computing, more and more services are deployed in cloud
infrastructures to provide rich functionalities [4]. Service
users nowadays encounter unprecedented difficulties in
finding ideal ones among the overwhelming number of services. Recommender systems (RSs) are techniques and intelligent applications that assist users in a decision-making process where
they want to choose some items among a potentially overwhelming set of alternative products or services. Collaborative filtering (CF) methods, such as item-based and user-based methods,
are the dominant techniques applied in RSs [5]. The basic
assumption of user-based CF is that people who agreed in
the past tend to agree again in the future. Unlike
user-based CF, the item-based CF algorithm recommends to a
user the items that are similar to what he/she has preferred
before [6]. Although traditional CF techniques are sound and
have been successfully applied in many e-commerce RSs,
they encounter two main challenges in big data applications:
1) making decisions within an acceptable time; and 2) generating
ideal recommendations from so many services. Concretely,
computing the similarity between every pair of users or services,
a critical step in traditional CF algorithms, may take
too much time or even exceed the processing capability of
current RSs. Consequently, service recommendation based
on similar users or similar services would either lose its
timeliness or could not be done at all. In addition, all services
2168-6750 © 2014 IEEE. Translations and content mining are permitted for academic research only.
Personal use is also permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
with sufficient information to be able to invoke the business
functions exposed by a service provider.
Although the definitions of service are distinct and
application-specific, they have common elements which
mainly include service descriptions and service functionalities. In addition, rating is an important user activity that
reflects users' opinions on services. Service rating is an
especially important element in service recommendation.
As more and more services emerge on the
Internet, a huge volume of service-relevant elements is
generated and distributed across the network, which cannot
be effectively accessed by traditional database management
systems. To address this problem, Bigtable is used to store
services in this paper. Bigtable [12] is a distributed storage system of Google for managing structured data that
is designed to scale to a very large size across thousands
of commodity servers. A Bigtable is a sparse, distributed,
persistent multi-dimensional sorted map. The map is indexed
by a row key, column key, and a timestamp; each value in
the map is an uninterpreted array of bytes. Column keys are
grouped into sets called column families, which form the
basic unit of access control. A column key is named using
the following syntax: family:qualifier, where family refers
to the column family and qualifier identifies a column within that family. Each
cell in a Bigtable can contain multiple versions of the same
data which are indexed by timestamp. Different versions of a
cell are stored in decreasing timestamp order, so that the most
recent versions can be read first.
In this paper, all services are stored in a Bigtable which is
called service Bigtable. The corresponding elements will be
drawn from service Bigtable during the process of ClubCF.
Formally, service Bigtable is defined as follows.
Definition 1: A service Bigtable is defined as a table
expressed in the format of
<Service_ID> <Timestamp> {<Description>: [<d1>, <d2>, ...];
<Functionality>: [<f1>, <f2>, ...];
<Rating>: [<u1>, <u2>, ...]}
The elements in the expression are specified as follows:
1. Service_ID is the row key for uniquely identifying a service.
2. Timestamp is used to identify the time when the record is written in service Bigtable.
3. Description, Functionality and Rating are three column families.
4. The identifier of a description word, e.g., d1 and d2, is used as a qualifier of Description.
5. The identifier of a functionality, e.g., f1 and f2, is used as a qualifier of Functionality.
6. The identifier of a user, e.g., u1 and u2, is used as a qualifier of Rating.
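The layout in Definition 1 can be pictured as a nested map from row key to "family:qualifier" column to timestamped versions. The following is a minimal in-memory sketch of that idea only; it is not the Bigtable API, and the class name `ServiceTable`, its methods, and the sample rating values are hypothetical names introduced for illustration.

```python
import time


class ServiceTable:
    """Toy in-memory stand-in for the service Bigtable (illustrative only)."""

    FAMILIES = ("Description", "Functionality", "Rating")

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value), newest first
        self._rows = {}

    def put(self, service_id, family, qualifier, value, timestamp=None):
        if family not in self.FAMILIES:
            raise ValueError(f"unknown column family: {family}")
        ts = time.time() if timestamp is None else timestamp
        row = self._rows.setdefault(service_id, {})
        versions = row.setdefault(f"{family}:{qualifier}", [])
        versions.append((ts, value))
        # Keep versions in decreasing timestamp order so the newest is read first.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def latest(self, service_id, family, qualifier):
        """Return the most recent value of one cell, or None if it is absent."""
        row = self._rows.get(service_id, {})
        versions = row.get(f"{family}:{qualifier}", [])
        return versions[0][1] if versions else None

    def family(self, service_id, family):
        """Return {qualifier: latest value} for one column family of a row."""
        prefix = family + ":"
        row = self._rows.get(service_id, {})
        return {
            col[len(prefix):]: versions[0][1]
            for col, versions in row.items()
            if col.startswith(prefix)
        }


table = ServiceTable()
table.put("s1", "Description", "d1", "driving", timestamp=1)
table.put("s1", "Rating", "u1", 4, timestamp=1)   # hypothetical rating
table.put("s1", "Rating", "u1", 5, timestamp=2)   # newer version of the same cell
```

Reading `table.latest("s1", "Rating", "u1")` returns the newest version of the cell, which mirrors the decreasing-timestamp ordering described for Bigtable cells.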
A slice of service Bigtable is illustrated in Table I. The
row key is s1 . The Description column family contains the
words for describing s1 , e.g., driving. The Functionality
column family contains the service functionalities,
It can be inferred from this formula that the larger
$|D'_t \cap D'_j|$ is, the more similar the two services are. $|D'_t \cup D'_j|$
is the scaling factor which ensures that description similarity
is between 0 and 1.
The functionalities in $F_t$ are retrieved from service
Bigtable where row key = $s_t$ and column family =
Functionality. The functionalities in $F_j$ are retrieved from
service Bigtable where row key = $s_j$ and column family =
Functionality. Then, functionality similarity between $s_t$
and $s_j$ is computed using the Jaccard similarity coefficient (JSC) as follows:
$$F\_sim(s_t, s_j) = \frac{|F_t \cap F_j|}{|F_t \cup F_j|} \qquad (2)$$
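As a small illustration of the Jaccard similarity coefficient used for both description and functionality similarity, the sketch below computes it over Python sets. The function name `jaccard` and the sample functionality names are assumptions made for this example; returning 0.0 for two empty sets is also a choice of this sketch, not something the text specifies.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Both sets empty: similarity is undefined; this sketch returns 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical functionality sets for two services.
f_t = {"mapping", "search"}
f_j = {"mapping", "photo"}
f_sim = jaccard(f_t, f_j)  # one shared element out of three distinct elements
```

Because the size of the union is at least the size of the intersection, the result always lies between 0 and 1.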
Step 1.3: Compute Characteristic Similarity
Characteristic similarity between $s_t$ and $s_j$ is computed
as the weighted sum of description similarity and functionality
similarity:
$$C\_sim(s_t, s_j) = \alpha \times D\_sim(s_t, s_j) + \beta \times F\_sim(s_t, s_j) \qquad (3)$$
In this formula, $\alpha \in [0, 1]$ is the weight of description
similarity, $\beta \in [0, 1]$ is the weight of functionality similarity,
and $\alpha + \beta = 1$. The weights express the relative importance
of the two.
Provided the number of services in the recommender
system is n, the characteristic similarities of every pair of services
are calculated and form an $n \times n$ characteristic similarity
matrix D. An entry $d_{t,j}$ in D represents the characteristic
similarity between $s_t$ and $s_j$.
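Steps 1.1 to 1.3 can be sketched together as the construction of the characteristic similarity matrix. This is a minimal sketch assuming each service is given as a set of (already stemmed) description words and a set of functionalities; the function names, the sample data, and the default weight of 0.5 are assumptions for illustration, not values taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def characteristic_similarity_matrix(descriptions, functionalities, alpha=0.5):
    """Build the n x n matrix D with d[t][j] = alpha*D_sim + beta*F_sim."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha  # the two weights sum to 1
    n = len(descriptions)
    matrix = [[0.0] * n for _ in range(n)]
    for t in range(n):
        for j in range(t, n):
            d_sim = jaccard(descriptions[t], descriptions[j])
            f_sim = jaccard(functionalities[t], functionalities[j])
            # The measure is symmetric, so fill both halves at once.
            matrix[t][j] = matrix[j][t] = alpha * d_sim + beta * f_sim
    return matrix


# Hypothetical data for two services.
descriptions = [{"map", "drive"}, {"map", "photo"}]
functionalities = [{"geocoding"}, {"geocoding"}]
D = characteristic_similarity_matrix(descriptions, functionalities, alpha=0.5)
```

With these inputs the description similarity is 1/3 and the functionality similarity is 1, so the off-diagonal entry is 0.5 × 1/3 + 0.5 × 1 = 2/3.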
Step 1.4: Cluster Services
Clustering is a critical step in our approach. Clustering
methods partition a set of objects into clusters such that
objects in the same cluster are more similar to each other
than objects in different clusters according to some defined
criteria.
Generally, cluster analysis algorithms have been utilized
where huge volumes of data are stored [16]. Clustering algorithms can
be either hierarchical or partitional. Some standard partitional
approaches (e.g., K-means) suffer from several limitations:
1) results depend strongly on the choice of the number of clusters K, and the correct value of K is initially unknown; 2) cluster size is not monitored during execution of the K-means
algorithm, so some clusters may become empty (collapse),
which causes premature termination of the algorithm;
3) the algorithms converge to a local minimum [17]. Hierarchical
clustering methods can be further classified as agglomerative or divisive, depending on whether the clustering
hierarchy is formed in a bottom-up or top-down fashion. Many current state-of-the-art clustering systems exploit
agglomerative hierarchical clustering (AHC) as their clustering strategy, due to its simple processing structure and
acceptable level of performance. Furthermore, it does not
require the number of clusters as input. Therefore, we use an
AHC algorithm [18], [19] for service clustering as follows.
Assume there are n services. Each service is initialized to
be a cluster of its own. At each reduction step, the two most
similar clusters are merged until only K (K < n) clusters
remain.
VOLUME 2, NO. 3, SEPTEMBER 2014
Algorithm 1 AHC algorithm for service clustering
Input: A set of services $S = \{s_1, \ldots, s_n\}$,
a characteristic similarity matrix $D = [d_{i,j}]_{n \times n}$,
the number of required clusters K.
Output: $Dendrogram_k$ for k = 1 to |S|.
1. $C_i = \{s_i\}, \forall i$;
2. $d_{C_i,C_j} = d_{i,j}, \forall i, j$;
3. for k = |S| down to K
4.     $Dendrogram_k = \{C_1, \ldots, C_k\}$;
5.     $l, m = \arg\max_{i \neq j} d_{C_i,C_j}$;
6.     $C_l = Join(C_l, C_m)$;
7.     for each $C_h \in S$
8.         if $C_h \neq C_l$ and $C_h \neq C_m$
9.             $d_{C_l,C_h} = Average(d_{C_l,C_h}, d_{C_m,C_h})$;
10.        end if
11.    end for
12.    $S = S - \{C_m\}$;
13. end for
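The merging loop of Algorithm 1 can be sketched in Python as follows. This is a minimal, unoptimized sketch under two assumptions: line 5 selects the pair of distinct clusters with the largest similarity, and `Average` is the plain mean of the two similarities. The function name `ahc` and the sample matrix are hypothetical, and only the final K clusters are returned rather than every dendrogram level.

```python
from itertools import combinations


def ahc(similarity, k):
    """Merge the two most similar clusters until only k clusters remain."""
    n = len(similarity)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    # Pairwise similarities between current clusters, keyed by (low id, high id).
    sim = {(i, j): similarity[i][j] for i, j in combinations(range(n), 2)}

    def get(a, b):
        return sim[(a, b) if a < b else (b, a)]

    while len(clusters) > k:
        # Most similar pair of distinct clusters (l, m), with l < m.
        l, m = max(combinations(sorted(clusters), 2), key=lambda p: get(*p))
        clusters[l] = clusters[l] + clusters[m]  # Join(C_l, C_m)
        del clusters[m]                          # S = S - {C_m}
        for h in clusters:
            if h != l:
                key = (l, h) if l < h else (h, l)
                # Average of the similarities of the two merged clusters to C_h.
                sim[key] = (get(l, h) + get(m, h)) / 2.0
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical 4 x 4 characteristic similarity matrix.
D = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
result = ahc(D, k=2)
```

For this matrix the first merge joins services 0 and 1 (similarity 0.9) and the second joins services 2 and 3 (similarity 0.8), leaving two clusters.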
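$$R\_sim(s_t, s_j) = \frac{\sum_{u_i \in U_t \cap U_j} (r_{u_i,s_t} - \bar{r}_{s_t})(r_{u_i,s_j} - \bar{r}_{s_j})}{\sqrt{\sum_{u_i \in U_t \cap U_j} (r_{u_i,s_t} - \bar{r}_{s_t})^2}\,\sqrt{\sum_{u_i \in U_t \cap U_j} (r_{u_i,s_j} - \bar{r}_{s_j})^2}} \qquad (4)$$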
Here, $U_t$ is the set of users who rated $s_t$ while $U_j$ is the set
of users who rated $s_j$, $u_i$ is a user who rated both $s_t$ and $s_j$,
$r_{u_i,s_t}$ is the rating of $s_t$ given by $u_i$, which is retrieved from service
Bigtable where row key = $s_t$ and column key = Rating:$u_i$,
and $r_{u_i,s_j}$ is the rating of $s_j$ given by $u_i$, which is retrieved from
$$R\_sim'(s_t, s_j) = \frac{2 \times |U_t \cap U_j|}{|U_t| + |U_j|} \times R\_sim(s_t, s_j) \qquad (5)$$
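Formulas (4) and (5) can be sketched as two small functions over per-service rating dictionaries (user identifier mapped to rating). This is a sketch under stated assumptions: the mean rating of a service is taken over all users who rated it, a similarity of 0.0 is returned when the correlation is undefined, and the function names and sample ratings are hypothetical.

```python
from math import sqrt


def rating_similarity(ratings_t, ratings_j):
    """Pearson correlation over the users who rated both services."""
    common = set(ratings_t) & set(ratings_j)
    if not common:
        return 0.0
    # Assumption: each mean is taken over all users who rated that service.
    mean_t = sum(ratings_t.values()) / len(ratings_t)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    num = sum((ratings_t[u] - mean_t) * (ratings_j[u] - mean_j) for u in common)
    den_t = sqrt(sum((ratings_t[u] - mean_t) ** 2 for u in common))
    den_j = sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    if den_t == 0 or den_j == 0:
        return 0.0  # correlation undefined; treated as no similarity here
    return num / (den_t * den_j)


def enhanced_rating_similarity(ratings_t, ratings_j):
    """Scale the Pearson value by the share of co-rating users."""
    common = set(ratings_t) & set(ratings_j)
    total = len(ratings_t) + len(ratings_j)
    if total == 0:
        return 0.0
    return 2 * len(common) / total * rating_similarity(ratings_t, ratings_j)


# Hypothetical ratings on a 5-point scale.
r_a = {"u1": 5, "u2": 3, "u3": 1}
r_b = {"u1": 5, "u2": 4, "u3": 3, "u4": 4}
plain = rating_similarity(r_a, r_b)
enhanced = enhanced_rating_similarity(r_a, r_b)
```

The scaling factor equals 1 only when both services were rated by exactly the same users, so the enhanced value discounts correlations that rest on few co-ratings.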
To verify ClubCF, a mashup dataset is used in the experiments. Mashup is an ad hoc composition technology of
Web applications that allows users to draw upon content
retrieved from external data sources to create value-added
TABLE 2. The experimental environments.
places).
$\frac{|D'_2 \cap D'_5|}{|D'_2 \cup D'_5|} = \frac{1}{6}$.
with s4 .
with s1 .
the seven mashup services. A rating matrix is established as shown in
Table X. The ratings are on a 5-point scale, and 0 means the
user did not rate the mashup. As u3 has not rated s4 (a not-yet-experienced item), u3 is regarded as an active user and s4 is
regarded as a target mashup. By computing the predicted rating
of s4, it can be determined whether s4 is a recommendable
service for u3. Furthermore, s1 is chosen as another target
mashup. By comparing the predicted rating and the real
rating of s1, the accuracy of ClubCF is verified in this
case.
Since s4 and s1 both belong to the cluster C2, rating similarity and enhanced rating similarity are computed between
mashup services within C2 by using formulas (4) and (5). The
rating similarities and enhanced rating similarities between s4
and every other mashup service in C2 are listed in Table XI,
while the two kinds of similarities between s1 and every
other mashup service in C2 are listed in Table XII.
Step 2.2: Select Neighbors
Rating similarity is computed using the Pearson correlation
coefficient, which ranges in value from −1 to +1. The value
of −1 indicates perfect negative correlation while the value
of +1 indicates perfect positive correlation. Without loss of
generality, the rating similarity threshold γ in formula (6) is
set to 0.4.
Since the enhanced rating similarity between s4 and s1 is
0.467 (i.e., R_sim'(s4, s1) = 0.467) and the enhanced rating
similarity between s4 and s3 is 0.631 (i.e., R_sim'(s4, s3) =
0.631), which are both greater than the threshold γ, s1 and s3 are chosen as
the neighbors of s4, i.e., N(s4) = {s1, s3}.
Since the enhanced rating similarity between s1 and s3 is
0.839 (i.e., R_sim'(s1, s3) = 0.839) and the enhanced rating
similarity between s1 and s4 is 0.467 (i.e., R_sim'(s1, s4) =
0.467), which are both greater than the threshold γ, s3 and s4 are chosen as
the neighbors of s1, i.e., N(s1) = {s3, s4}.
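The neighbor selection of Step 2.2 amounts to keeping the services whose enhanced rating similarity to the target exceeds the threshold. The sketch below replays the two cases above with the reported similarity values; the function name `select_neighbors` is hypothetical, and the extra service `sx` with similarity 0.2 is an invented entry added only to show a service being filtered out.

```python
def select_neighbors(target, enhanced_sims, threshold=0.4):
    """Return the services whose enhanced similarity to target exceeds threshold."""
    return {
        service
        for service, sim in enhanced_sims.items()
        if service != target and sim > threshold
    }


# Enhanced rating similarities reported in the text, plus one hypothetical
# service "sx" that falls below the threshold.
sims_to_s4 = {"s1": 0.467, "s3": 0.631, "sx": 0.2}
sims_to_s1 = {"s3": 0.839, "s4": 0.467, "sx": 0.2}

neighbors_s4 = select_neighbors("s4", sims_to_s4)
neighbors_s1 = select_neighbors("s1", sims_to_s1)
```

Raising the threshold shrinks the neighbor sets, which is the effect discussed later for computation time.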
Step 2.3: Compute Predicted Rating
According to formula (7), the predicted rating of s4 for u3 is
$P_{u_3,s_4} = 1.97$ and the predicted rating of s1 for u3 is
$P_{u_3,s_1} = 1.06$.
Thus, s4 is not a good mashup service for u3 and will not
be recommended to u3. In addition, as the real rating of s1
given by user u3 is 1 (see Table X) while its predicted rating
is 1.06, it can be inferred that ClubCF can produce an accurate
prediction in this case.
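Formula (7) is not reproduced in this part of the text, so the following sketch should be read as an assumption rather than the paper's exact definition: it uses the common mean-centered form of item-based prediction, in which the target service's mean rating is adjusted by the similarity-weighted deviations of the active user's ratings on the neighbor services. All names and sample ratings are hypothetical, and the sketch is not expected to reproduce the values 1.97 and 1.06, which depend on Table X.

```python
def predict_rating(user, target, neighbors, ratings, sims):
    """Mean-centered item-based prediction (assumed form, not the paper's text).

    ratings: service -> {user: rating}; sims: neighbor service -> similarity.
    """
    def mean(service):
        values = ratings[service].values()
        return sum(values) / len(values)

    num = 0.0
    den = 0.0
    for service in neighbors:
        if user in ratings[service]:
            # Deviation of the user's rating from the neighbor's mean rating.
            num += sims[service] * (ratings[service][user] - mean(service))
            den += abs(sims[service])
    if den == 0:
        return mean(target)  # no usable neighbor: fall back to the target mean
    return mean(target) + num / den


# Hypothetical ratings: u3 has not rated the target service "t".
ratings = {
    "t": {"u1": 4, "u2": 2},
    "a": {"u1": 5, "u2": 3, "u3": 4},
    "b": {"u1": 2, "u3": 4},
}
sims = {"a": 0.5, "b": 0.5}
predicted = predict_rating("u3", "t", ["a", "b"], ratings, sims)
```

Here the target mean is 3, u3's rating equals the mean of service a and is one point above the mean of service b, so the prediction is 3 + (0.5 × 0 + 0.5 × 1) / 1 = 3.5.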
B. Experimental Evaluation
In fact, ClubCF is a revised version of the traditional item-based CF approach, adapted to the big data environment.
Therefore, to verify its accuracy, we compare the MAE of
ClubCF with that of a traditional item-based CF approach (IbCF)
described in [26]. For each test mashup service in each fold,
its predicted rating is calculated with the IbCF and ClubCF
approaches separately.
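The comparison relies on the mean absolute error (MAE), i.e., the average absolute difference between predicted and real ratings over the test cases. A minimal sketch is given below; the function name is an assumption, and the single data point reuses the predicted rating 1.06 and real rating 1 of s1 from the case study, while the three-element example is invented.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Case-study pair: predicted 1.06 versus real rating 1 for s1 and u3.
case_study_error = mean_absolute_error([1.06], [1])

# Hypothetical batch of three test ratings.
batch_error = mean_absolute_error([1, 2, 3], [2, 2, 5])
```

A lower MAE means the predicted ratings are closer to the real ones, so the approach with the smaller MAE on the same folds is the more accurate one.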
In all, ClubCF spends less computation time than item-based CF. Since the number of services in a cluster is
smaller than the total number of services, the time for computing
rating similarity between every pair of services
is greatly reduced.
As the rating similarity threshold γ increases, the computation time of ClubCF decreases, because the number
of neighbors of the target service decreases as γ
increases. However, the decrease in computation time
of IbCF is visible only when γ = 0.4. This is because the
number of neighbors found within a cluster may be smaller than
the number found among all services, so ClubCF may spend less time
on computing predicted ratings.
users. However, implicit feedback does not always provide
sure information about the users' preferences.
In the ClubCF approach, the description and functionality
information is considered as metadata to measure the characteristic similarities between services. According to these
similarities, all services are merged into smaller-size clusters. Then the CF algorithm is applied to the services within
the same cluster. Compared with the above approaches, this
approach does not require extra inputs from users and suits
different types of services. Moreover, the clustering algorithm used in ClubCF need not consider the dependence of
nodes.
VII. CONCLUSION AND FUTURE WORK
[8] Z. Liu, P. Li, Y. Zheng, and M. Sun, "Clustering to find exemplar terms for keyphrase extraction," in Proc. Conf. Empirical Methods Natural Lang. Process., May 2009, pp. 257-266.
[9] X. Liu, G. Huang, and H. Mei, "Discovering homogeneous web service community in the user-centric web environment," IEEE Trans. Services Comput., vol. 2, no. 2, pp. 167-181, Apr./Jun. 2009.
[10] H. H. Li, X. Y. Du, and X. Tian, "A review-based reputation evaluation approach for Web services," J. Comput. Sci. Technol., vol. 24, no. 5, pp. 893-900, Sep. 2009.
[11] K. Zielinnski, T. Szydlo, R. Szymacha, J. Kosinski, J. Kosinska, and M. Jarzab, "Adaptive SOA solution stack," IEEE Trans. Services Comput., vol. 5, no. 2, pp. 149-163, Jun. 2012.
[12] F. Chang et al., "Bigtable: A distributed storage system for structured data," ACM Trans. Comput. Syst., vol. 26, no. 2, pp. 1-39, Jun. 2008.
[13] R. S. Sandeep, C. Vinay, and S. M. Hemant, "Strength and accuracy analysis of affix removal stemming algorithms," Int. J. Comput. Sci. Inf. Technol., vol. 4, no. 2, pp. 265-269, Apr. 2013.
[14] V. Gupta and G. S. Lehal, "A survey of common stemming techniques and existing stemmers for Indian languages," J. Emerging Technol. Web Intell., vol. 5, no. 2, pp. 157-161, May 2013.
[15] A. Rodriguez, W. A. Chaovalitwongse, L. Zhe, H. Singhal, and H. Pham, "Master defect record retrieval using network-based feature association," IEEE Trans. Syst., Man, Cybern., Part C, Appl. Rev., vol. 40, no. 3, pp. 319-329, Oct. 2010.
[16] T. Niknam, E. Taherian Fard, N. Pourjafarian, and A. Rousta, "An efficient algorithm based on modified imperialist competitive algorithm and K-means for data clustering," Eng. Appl. Artif. Intell., vol. 24, no. 2, pp. 306-317, Mar. 2011.
[17] M. J. Li, M. K. Ng, Y. M. Cheung, and J. Z. Huang, "Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters," IEEE Trans. Knowl. Data Eng., vol. 20, no. 11, pp. 1519-1534, Nov. 2008.
[18] G. Thilagavathi, D. Srivaishnavi, and N. Aparna, "A survey on efficient hierarchical algorithm used in clustering," Int. J. Eng., vol. 2, no. 9, Sep. 2013.
[19] C. Platzer, F. Rosenberg, and S. Dustdar, "Web service clustering using multidimensional angles as proximity measures," ACM Trans. Internet Technol., vol. 9, no. 3, pp. 11:1-11:26, Jul. 2009.
[20] G. Adomavicius and J. Zhang, "Stability of recommendation algorithms," ACM Trans. Inf. Syst., vol. 30, no. 4, pp. 23:1-23:31, Aug. 2012.
[21] J. Herlocker, J. A. Konstan, and J. Riedl, "An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms," Inf. Retr., vol. 5, no. 4, pp. 287-310, Oct. 2002.
[22] A. Yamashita, H. Kawamura, and K. Suzuki, "Adaptive fusion method for user-based and item-based collaborative filtering," Adv. Complex Syst., vol. 14, no. 2, pp. 133-149, May 2011.
[23] D. Julie and K. A. Kumar, "Optimal web service selection scheme with dynamic QoS property assignment," Int. J. Adv. Res. Technol., vol. 2, no. 2, pp. 69-75, May 2012.
[24] J. Wu, L. Chen, Y. Feng, Z. Zheng, M. C. Zhou, and Z. Wu, "Predicting quality of service for selection by neighborhood-based collaborative filtering," IEEE Trans. Syst., Man, Cybern., Syst., vol. 43, no. 2, pp. 428-439, Mar. 2013.
[25] Y. Zhao, G. Karypis, and U. Fayyad, "Hierarchical clustering algorithms for document datasets," Data Mining Knowl. Discovery, vol. 10, no. 2, pp. 141-168, Nov. 2005.
[26] Z. Zheng, H. Ma, M. R. Lyu, and I. King, "QoS-aware web service recommendation by collaborative filtering," IEEE Trans. Services Comput., vol. 4, no. 2, pp. 140-152, Feb. 2011.
[27] M. R. Catherine and E. B. Edwin, "A survey on recent trends in cloud computing and its application for multimedia," Int. J. Adv. Res. Comput. Eng. Technol., vol. 2, no. 1, pp. 304-309, Feb. 2013.
[28] X. Liu, Y. Hui, W. Sun, and H. Liang, "Towards service composition based on mashup," in Proc. IEEE Congr. Services, Jul. 2007, pp. 332-339.
[29] X. Z. Liu, G. Huang, Q. Zhao, H. Mei, and M. B. Blake, "iMashup: A mashup-based framework for service composition," Sci. China Inf. Sci., vol. 57, no. 1, pp. 1-20, Jan. 2013.
[30] H. Elmeleegy, A. Ivan, R. Akkiraju, and R. Goodwin, "Mashup advisor: A recommendation tool for mashup development," in Proc. IEEE Int. Conf. Web Services, Oct. 2008, pp. 337-344.
[31] O. Greenshpan, T. Milo, and N. Polyzotis, "Autocompletion for mashups," Proc. VLDB Endowment, vol. 2, no. 1, pp. 538-549, Aug. 2009.
[32] M. Weiss, S. Sari, and N. Noori, "Niche formation in the mashup ecosystem," Technol. Innov. Manag. Rev., May 2013.
[33] S. An, W. Liu, S. Venkatesh, and H. Yan, "Unified formulation of linear discriminant analysis methods and optimal parameter selection," Pattern Recognit., vol. 44, no. 2, pp. 307-319, Feb. 2011.
[34] W. Dou, X. Zhang, J. Liu, and J. Chen, "HireSome-II: Towards privacy-aware cross-cloud service composition for big data applications," IEEE Trans. Parallel Distrib. Syst., 2013, to be published.
[35] S. Agarwal and A. Nath, "A study on implementing Green IT in Enterprise 2.0," Int. J. Adv. Comput. Res., vol. 3, no. 1, pp. 43-49, Mar. 2013.
[36] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, "Evaluating collaborative filtering recommender systems," ACM Trans. Inf. Syst., vol. 22, no. 1, pp. 5-53, Jan. 2004.
[37] J. Mai, Y. Fan, and Y. Shen, "A neural networks-based clustering collaborative filtering algorithm in e-commerce recommendation system," in Proc. Int. Conf. Web Inf. Syst. Mining, Jun. 2009, pp. 616-619.
[38] N. Mittal, R. Nayak, M. C. Govil, and K. C. Jain, "Recommender system framework using clustering and collaborative filtering," in Proc. 3rd Int. Conf. Emerging Trends Eng. Technol., Nov. 2010, pp. 555-558.
[39] X. Li and T. Murata, "Using multidimensional clustering based collaborative filtering approach improving recommendation diversity," in Proc. IEEE/WIC/ACM Int. Joint Conf. Web Intell. Intell. Agent Technol., Dec. 2012, pp. 169-174.
[40] Z. Zhou, M. Sellami, W. Gaaloul, M. Barhamgi, and B. Defude, "Data providing services clustering and management for facilitating service discovery and replacement," IEEE Trans. Autom. Sci. Eng., vol. 10, no. 4, pp. 1-16, Oct. 2013.
[41] M. C. Pham, Y. Cao, R. Klamma, and M. Jarke, "A clustering approach for collaborative filtering recommendation using social network analysis," J. Univ. Comput. Sci., vol. 17, no. 4, pp. 583-604, Apr. 2011.
[42] R. D. Simon, X. Tengke, and W. Shengrui, "Combining collaborative filtering and clustering for implicit recommender system," in Proc. IEEE 27th Int. Conf. Adv. Inf. Netw. Appl., Mar. 2013, pp. 748-755.
WANCHUN DOU was born in 1971. He is a
Professor with the Department of Computer Science and Technology, Nanjing University, China.
He received the Ph.D. degree from the Nanjing
University of Science and Technology, China, in
2001. Then, he continued his research work as a
Post-Doctoral Researcher with the Department of
Computer Science and Technology, Nanjing University, from 2001 to 2002. In 2005, he visited the
Hong Kong University of Science and Technology
as a Visiting Scholar. His main research interests include knowledge management, cooperative computing, and workflow technologies.