ML & Knowledge Discovery in Databases I
Volume Editors
Dimitrios Gunopulos
University of Athens, Greece
E-mail: dg@di.uoa.gr
Thomas Hofmann
Google Switzerland GmbH, Zurich, Switzerland
E-mail: thofmann@google.com
Donato Malerba
University of Bari “Aldo Moro”, Bari, Italy
E-mail: malerba@di.uniba.it
Michalis Vazirgiannis
Athens University of Economics and Business, Greece
E-mail: mvazirg@aueb.gr
CR Subject Classification (1998): I.2, H.2.8, H.2, H.3, G.3, J.1, I.7, F.2.2, F.4.1
timely completion of the review process under strict deadlines. Special thanks
should also be given to the Workshop Co-chairs, Bart Goethals and Katharina
Morik, the Tutorial Co-chairs, Fosca Giannotti and Maguelonne Teisseire, the
Discovery Challenge Co-chairs, Alexandros Kalousis and Vassilis Plachouras, the
Industrial Session Co-chairs, Alexandros Ntoulas and Michail Vlachos, the Demo
Track Co-chairs, Michelangelo Ceci and Spiros Papadimitriou, and the Best Pa-
per Award Co-chairs, Sunita Sarawagi and Michèle Sebag. We further thank the
keynote speakers, workshop organizers, the tutorial presenters and the organizers
of the discovery challenge.
Furthermore, we are indebted to the Publicity Co-chairs, Annalisa Appice
and Grigorios Tsoumakas, who developed and implemented an effective dissem-
ination plan and supported the Program Chairs in the production of the pro-
ceedings, and also to Margarita Karkali for the development, support and timely
update of the conference website. We further thank the members of the ECML
PKDD Steering Committee for their valuable help and guidance.
The conference was financially supported by the following generous sponsors
who are worthy of special acknowledgment: Google, Pascal2 Network, Xerox, Ya-
hoo Labs, COST-MOVE Action, Rapid-I, FP7-MODAP Project, Athena RIC /
Institute for the Management of Information Systems, Hellenic Artificial Intel-
ligence Society, Marathon Data Systems, and Transinsight. Additional support
was generously provided by Sony, Springer, and the UNESCO Privacy Chair Pro-
gram. This support has given us the opportunity to specify low registration rates,
provide video-recording services and support students through travel grants for
attending the conference. The substantial effort of the Sponsorship Co-chairs,
Ina Lauth and Ioannis Kopanakis, was crucial in order to attract these spon-
sorships, and therefore, they deserve our special thanks. Special thanks should
also be given to the five organizing institutions, namely, University of Bari “Aldo
Moro”, Athens University of Economics and Business, University of Athens, Uni-
versity of Ioannina, and University of Piraeus for supporting in multiple ways
our task.
We would like to especially acknowledge the members of the Local Organiza-
tion team, Maria Halkidi, Despoina Kopanaki and Nikos Pelekis, for making all
necessary local arrangements and Triaena Tours & Congress S.A. for efficiently
handling finance and registrations. The essential contribution of the student vol-
unteers also deserves special acknowledgment.
Finally, we are indebted to all researchers who considered this conference as
a forum for presenting and disseminating their research work, as well as to all
conference participants, hoping that this event will stimulate further expansion
of research and industrial activities in machine learning and data mining.
Finally, we would like to thank the General Chairs, Aristidis Likas and Yannis
Theodoridis, for their critical role in the success of the conference, the Tutorial,
Workshop, Demo, Industrial Session, Discovery Challenge, Best Paper, and Local
Chairs, the Area Chairs and all reviewers, for their voluntary, highly dedicated
and exceptional work, and the ECML PKDD Steering Committee for their help
and support. Our last and warmest thanks go to all the invited speakers, the
speakers, all the attendees, and especially to the authors who chose to submit
their work to the ECML PKDD conference and thus enabled us to build up this
memorable scientific event.
General Chairs
Aristidis Likas University of Ioannina, Greece
Yannis Theodoridis University of Piraeus, Greece
Program Chairs
Dimitrios Gunopulos University of Athens, Greece
Thomas Hofmann Google Inc., Zurich, Switzerland
Donato Malerba University of Bari “Aldo Moro,” Italy
Michalis Vazirgiannis Athens University of Economics and Business,
Greece
Workshop Chairs
Katharina Morik University of Dortmund, Germany
Bart Goethals University of Antwerp, Belgium
Tutorial Chairs
Fosca Giannotti Knowledge Discovery and Delivery Lab, Italy
Maguelonne Teisseire TETIS Lab. Department of Information
System and LIRMM Lab. Department of
Computer Science, France
Publicity Chairs
Annalisa Appice University of Bari “Aldo Moro,” Italy
Grigorios Tsoumakas Aristotle University of Thessaloniki, Greece
Sponsorship Chairs
Ina Lauth IAIS Fraunhofer, Germany
Ioannis Kopanakis Technological Educational Institute of Crete,
Greece
Organizing Committee
Maria Halkidi University of Piraeus, Greece
Despina Kopanaki University of Piraeus, Greece
Nikos Pelekis University of Piraeus, Greece
Steering Committee
José Balcázar
Francesco Bonchi
Wray Buntine
Walter Daelemans
Aristides Gionis
Bart Goethals
Katharina Morik
Dunja Mladenic
John Shawe-Taylor
Michèle Sebag
Additional Reviewers
Pedro Abreu, Raman Adaikkalavan, Artur Aiguzhinov, Darko Aleksovski, Tristan Allard, Alessia Amelio, Panayiotis Andreou, Fabrizio Angiulli, Josephine Antoniou, Andrea Argentini, Krisztian Balog, Teresa M.A. Basile, Christian Beecks, Antonio Bella, Aurelien Bellet, Dominik Benz, Thomas Bernecker, Alberto Bertoni, Jerzy Blaszczynski, Brigitte Boden, Samuel Branders, Janez Brank, Agnès Braud, Giulia Bruno, Luca Cagliero, Yuanzhe Cai, Ercan Canhas, Eugenio Cesario, George Chatzimilioudis, Xi Chen, Anna Ciampi, Marek Ciglan, Tyler Clemons, Joseph Cohen, Carmela Comito, Andreas Constantinides, Michael Corsello, Michele Coscia
Sponsoring Institutions
We wish to express our gratitude to the sponsors of ECML PKDD 2011 for their
essential contribution to the conference: Google, Pascal2 Network, Xerox, Yahoo
Labs, COST/MOVE (Knowledge Discovery from Moving Objects), FP7/MODAP
(Mobility, Data Mining, and Privacy), Rapid-I (report the future), Athena/IMIS
(Institute for the Management of Information Systems), EETN (Hellenic Arti-
ficial Intelligence Society), MDS Marathon Data Systems, Transinsight, SONY,
UNESCO Privacy Chair, Springer, Università degli Studi di Bari “Aldo Moro,”
Athens University of Economics and Business, University of Ioannina, National
and Kapodistrian University of Athens, and the University of Piraeus.
Table of Contents – Part I
Smart Cities: How Data Mining and Optimization Can Shape Future
Cities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Olivier Verscheure
Regular Papers
Preference-Based Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Riad Akrour, Marc Schoenauer, and Michele Sebag
Common Substructure Learning of Multiple Graphical Gaussian
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Satoshi Hara and Takashi Washio
Sparse Kernel-SARSA(λ) with an Eligibility Trace . . . . . . . . . . . . . . . . . . . 1
Matthew Robards, Peter Sunehag, Scott Sanner, and
Bhaskara Marthi
Fast Projections Onto $\ell_{1,q}$-Norm Balls for Grouped Feature Selection . . . 305
Suvrit Sra
The Minimum Code Length for Clustering Using the Gray Code . . . . . . . 365
Mahito Sugiyama and Akihiro Yamamoto
Demo Papers
Celebrity Watch: Browsing News Content by Exploiting Social
Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Omar Ali, Ilias Flaounas, Tijl De Bie, and Nello Cristianini
Rakesh Agrawal
Human Dynamics: From Human Mobility to
Predictability
Albert-László Barabási
Embracing Uncertainty: Applied Machine
Learning Comes of Age
Christopher Bishop
Highly Dimensional Problems in Computational
Advertising
Andrei Broder
Yahoo! Research
broder@yahoo-inc.com
Learning from Constraints
Marco Gori
University of Siena
marco@dii.unisi.it
Permutation Structure in 0-1 Data
Heikki Mannila
Reading Customers' Needs and Expectations
with Analytics
Vasilis Aggelis
Piraeus Bank
aggelisv@piraeusbank.gr
Abstract. Customers are every bank's greatest asset. Do we know them
fully? Are we ready to fulfill their needs and expectations? The use of
analytics is one of the keys to improving our relationship with customers.
Moreover, analytics can bring gains to both customers and banks. Customer
segmentation, targeted cross- and up-sell campaigns, and data mining are
tools that deliver strong results and contribute to a customer-centric turn.
Algorithms and Challenges on the GeoWeb
Radu Jurca
Data Science and Machine Learning at Scale
Neel Sundaresan
Abstract. Large social commerce network sites like eBay must constantly
grapple with building scalable machine learning algorithms for search
ranking, recommender systems, classification, and other tasks. Large data
availability is both a boon and a curse. While it offers far more diverse
observations, the same diversity, combined with sparsity and a lack of
reliable labeled data at scale, introduces new challenges. The availability
of large data also helps exploit correlational factors, while requiring
creativity in discarding irrelevant data. In this talk we discuss all of
this and more in the context of eBay's large data problems.
Smart Cities: How Data Mining and
Optimization Can Shape Future Cities
Olivier Verscheure
Preference-Based Policy Learning
Riad Akrour, Marc Schoenauer, and Michèle Sebag
TAO
CNRS − INRIA − Université Paris-Sud
FirstName.Name@inria.fr
1 Introduction
Many machine learning approaches in robotics, ranging from direct policy learning [3] and learning by imitation [7] to reinforcement learning [20] and inverse optimal control [1, 14], rely on simulators serving as a full motion and world model of the robot. Naturally, the actual performance of the learned policies depends on the accuracy and calibration of the simulator (see [22]).
The question investigated in this paper is whether robotic policy learning can
be achieved in a simulator-free setting, while keeping the human labor and com-
putational costs within reasonable limits. This question is motivated by machine
learning application to swarm robotics [24, 29, 18]. Swarm robotics aims at ro-
bust and reliable robotic systems through the interaction of a large number of
small-scale, possibly unreliable, robot entities. Within this framework, the use
of simulators suffers from two limitations. Firstly, the simulation cost increases
more than linearly with the size of the swarm. Secondly, robot entities might dif-
fer among themselves due to manufacturing tolerance, entailing a large variance
among the results of physical experiments and severely limiting the accuracy of
simulated experiments.
The presented approach, inspired from both energy-based learning [21] and
learning-to-rank [12, 4], is called Preference-based policy learning (PPL). PPL
proceeds by iterating a 4-step process:
• In the demonstration phase, the robot demonstrates a candidate policy.
• In the teaching or feedback phase, the expert ranks this policy compared to
archived policies, based on her agenda and prior knowledge.
• In the learning phase, a policy return estimate (energy criterion) is learned
based on the expert ranking, using an embedded learning-to-rank algorithm. A
key issue concerns the choice of the policy representation space (see below).
• In the self-training phase, new policies are generated, using an adaptive trade-
off between the policy return estimate and the policy diversity w.r.t. the archive.
A candidate policy is selected to be demonstrated and the process is iterated.
PPL initialization proceeds by demonstrating two policies, which are ordered by
the expert.
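As a rough sketch of this loop in Python, the following code iterates the four phases; every helper name (demonstrate, expert_rank, learn_return_estimate, generate_candidates) is a hypothetical placeholder standing in for the components described above, not the authors' implementation.

```python
import numpy as np

def diversity(mu, archive):
    # Novelty w.r.t. the archive: distance to the nearest archived BvR histogram.
    return min(np.linalg.norm(mu - nu) for nu in archive)

def ppl(initial_policies, demonstrate, expert_rank,
        learn_return_estimate, generate_candidates, n_iter, alpha=1.0):
    # Initialization: demonstrate two policies and let the expert order them.
    archive = [demonstrate(pi) for pi in initial_policies]
    ranking = expert_rank(archive)
    for t in range(n_iter):
        # Learning phase: fit the policy return estimate J_t from the ranking.
        J = learn_return_estimate(archive, ranking)
        # Self-training phase: trade off estimated return against diversity.
        candidates = generate_candidates(archive)
        best = max(candidates,
                   key=lambda mu: J(mu) + alpha * diversity(mu, archive))
        # Demonstration and feedback phases.
        archive.append(demonstrate(best))
        ranking = expert_rank(archive)
    return archive, ranking
```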
A first contribution of the PPL approach compared to inverse reinforcement
learning [1, 14] is that it does not require the expert to know how to solve the task
and to demonstrate a perfect behavior (see also [27]); the expert is only required
to know whether some behavior is more able to reach the goal than some other
one. Compared to learning by demonstration [7], the demonstrated trajectories
are feasible by construction (whereas the teacher and the robot usually have
different degrees of freedom). Compared to policy gradient approaches [20], the
continued interaction with the expert offers some means to detect and avoid the
convergence to local optima, through the expert’s feedback.
On the other hand, PPL faces one main challenge. The human ranking effort
needed to yield a competent policy must be limited; the sample complexity
must be of the order of a few dozen to a couple hundred. Therefore the policy
return estimate must enable fast progress in the policy space, which requires
the policy representation space to be carefully designed (a general concern, as
noted by [1]). On the other hand, the simulator-free setting precludes the use of
any informed features such as the robot distance to obstacles or other robots.
The second contribution of the paper is an original representation of the pol-
icy space referred to as behavioral representation (BvR), built as follows. Each
policy demonstration generates a robotic log, reporting the sensor and actuator
values observed at each time step. The robotic logs are incrementally processed
using $\epsilon$-clustering [11]; each cluster is viewed as a sensori-motor state (sms).
A demonstration can thus be represented by a histogram, associating to each
sensori-motor state the fraction of time spent in this sms. The BvR represen-
tation complies with a resource-constrained framework and it is agnostic w.r.t.
the policy return and the environment. Although the number of states exponen-
tially increases with the clustering precision and the intrinsic dimension of the
sensori-motor space, it can however be controlled to some extent as $\epsilon$ is set by the expert. BvR refinement is a perspective for further research (section 5).
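To illustrate how a BvR histogram could be computed, here is a minimal sketch assuming a simple nearest-center rule for the incremental $\epsilon$-clustering; the function names and the clustering rule are assumptions of this sketch, not the exact procedure of [11].

```python
import numpy as np

def epsilon_cluster(log, centers, eps):
    """Incremental eps-clustering of sensori-motor vectors.

    Each row of `log` holds the sensor/actuator values of one time step;
    a vector farther than eps from every existing center opens a new
    cluster (a new sensori-motor state).  Returns the state index of
    every time step and the (possibly grown) list of centers."""
    states = []
    for x in log:
        if centers:
            d = [np.linalg.norm(x - c) for c in centers]
            i = int(np.argmin(d))
            if d[i] <= eps:
                states.append(i)
                continue
        centers.append(x.copy())          # discover a new sensori-motor state
        states.append(len(centers) - 1)
    return states, centers

def bvr_histogram(states, n_states):
    """BvR: fraction of time spent in each sensori-motor state."""
    h = np.zeros(n_states)
    for s in states:
        h[s] += 1.0
    return h / len(states)
```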
2 Related Work
Policy learning is most generally tackled as an optimization problem. The main
issues regard the search space (policy or state/action value function), the defini-
tion of the objective, and the exploration of the search space, aimed at optimizing
the objective function. The infinite horizon case is considered throughout this
section for notational convenience.
Among the most studied approaches is reinforcement learning (RL) [26],
for its guarantees of global optimality. The background notation involves a state space $S$, an action space $A$, a reward function $r : S \to \mathbb{R}$ defined on the state space, and a transition function $p(s, a, s')$ expressing the probability of arriving in state $s'$ upon taking action $a$ in state $s$, under the Markov property. A policy $\pi : (S, A) \to \mathbb{R}$ maps each state in $S$ onto some action in $A$ with a given probability. The policy return is defined as the expectation of the cumulative reward gathered along time, conditionally to the initial state $s_0$. Denoting $a_h \sim \pi(s_h, a)$ the action selected by $\pi$ in state $s_h$, $s_{h+1} \sim p(s_h, a_h, s)$ the state of the robot at step $h + 1$ conditionally to being in state $s_h$ and selecting action $a_h$ at step $h$, and $r_{h+1}$ the reward collected in $s_{h+1}$, the policy return is

$$J(\pi; s_0) = \mathbb{E}_{\pi, s_0}\left[\sum_{h=0}^{\infty} \gamma^h r_h\right]$$

where $\gamma < 1$ is a discount factor enforcing the boundedness of the return and favoring the reaping of rewards as early as possible in the robot's lifetime. The so-called Q-value function $Q^\pi(s, a)$ estimates the expectation of the cumulative reward gathered by policy $\pi$ after selecting action $a$ in state $s$. As estimation errors accumulate along time, RL scalability is limited with respect to the size of the state and action spaces. Moreover, in order to enforce the Markov property, the description of state $s$ must provide all information about the robot's past trajectory relevant to its further decisions, thereby increasing the size of the state space.
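For illustration, the policy return can be estimated by Monte Carlo rollouts when a simulator or the physical robot is available; in the sketch below, policy(s) and env_step(s, a) are hypothetical interfaces, not part of the paper.

```python
import numpy as np

def policy_return(policy, env_step, s0, gamma=0.95, horizon=1000, n_rollouts=100):
    """Monte Carlo estimate of J(pi; s0) = E[sum_h gamma^h r_h].

    `policy(s)` samples an action, `env_step(s, a)` returns (s', r);
    the infinite horizon is truncated at `horizon` steps."""
    returns = []
    for _ in range(n_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```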
As RL suffers from both the scalability issue and the difficulty of defining a
priori a reward function conducive to the task at hand [19], a way to “seed”
the search with a good solution was sought, referred to as inverse optimal
control. One possibility, pioneered by [3] under the name of behavioral cloning
and further developed by [7] under the name of programming by demonstration
relies on the exploitation of the expert’s traces. These traces can be used to turn
policy learning into a supervised learning problem [7]. Another possibility is to
use the expert’s traces to learn a reward function, along the so-called inverse
reinforcement learning approach [19]. The sought reward function should admit
the expert policy π ∗ (as evidenced from his traces) as solution, and further en-
force some return margin between the actual expert actions and other actions
in the same state, in order to avoid indeterminacy [1] (since a constant reward
function would have any policy as an optimal solution). The search space is care-
fully designed in both cases. In the former approach, an “imitation metric” is
used to account for the differences between the expert and robot motor spaces.
In the latter one, a few informed features (e.g. the average speed of the vehi-
cle, the number of times the car deviates from the road) are used together with
their desired sign (the speed should be maximized, the number of deviations
should be minimized) and the policy return is sought as a weighted sum of the
feature expectations. A min-max relaxation thereof is proposed by [27], maxi-
mizing the minimum policy return over all weight vectors. Yet another approach,
inverse optimal heuristic control [14] proposes a hybrid approach between be-
havioral cloning and inverse optimal control, addressing the low-dimensionality
restrictions on optimal inverse control. Formally, an energy-based action selec-
tion model is proposed using a Gibbs model which combines the reward function
(as in inverse optimal control) and a logistic regression function (learned using
behavioral cloning from the available trajectories).
The main limitation of inverse optimal control lies in the way it is seeded:
the expert’s traces can hardly be considered to be optimal in the general case,
even more so if the expert and the robot live in different sensori-motor spaces.
While [27] gets rid of the expert’s influence, its minmax approach yields very
conservative policies, as the relative importance of the features is unknown.
A main third approach aims at direct policy learning, using policy gradient
methods [20] or global optimization methods [28]. Direct policy learning most
usually assumes a parametric policy space Θ; policy learning aims at finding the
optimal θ∗ parameter in the sense of a policy return function J:
Find θ∗ = arg max {J(θ), θ ∈ Θ}
Depending on J, three cases are distinguished. The first case is when J is ana-
lytically known on a continuous policy space Θ ⊂ IRD . It then comes naturally
to use a gradient-based optimization approach, gradually moving the current
policy θt along the gradient ∇J [20] (θt+1 = θt + αt ∇θ J(θt )). The main issues
concern the adjustment of αt , the possible use of the inverse Hessian, and the
rate of convergence toward a (local) optimum. A second case is when J is only
known as a computable function, e.g. when learning a Tetris policy [28]. In such
cases, a wide scope optimization method such as Cross-Entropy Method [8] or
PPL is initialized from two policies with specific properties (section 3.5), which
are ordered by the expert; the policy and constraint archives are initialized ac-
cordingly. After detailing all PPL components, an analytical convergence study
of PPL is presented in section 3.6.
BvR thus differs from the feature counts used in [1, 14] as features are learned
from policy trajectories as opposed to being defined from prior knowledge. Com-
pared to the discriminant policy representation built from the set of policy
demonstrations [17], the main two differences are that BvR is independent of
the policy return estimate, and that it is built online as new policies explore new
regions of the sensori-motor space. As the number of states $n_t$, and thus the BvR dimension, increases along time, BvR consistency follows from the fact that $\mu_\pi \in [0, 1]^{n_t}$ is naturally mapped onto $(\mu_\pi, 0)$ in $[0, 1]^{n_t+1}$ (the $i$-th coordinate of $\mu_\pi$ is 0 if $s_i$ was not discovered at the time $\pi$ was demonstrated).
The price to pay for the representation consistency is that the number of states exponentially increases with the precision parameter $\epsilon$; on the other hand, it increases like $\epsilon^{-b}$, where $b$ is the intrinsic dimension of the sensori-motor data.
A second limitation of BvR is to be subject to the initialization noise, meant
as the initial location of the robot when the policy is launched. The impact of
the initial conditions however can be mitigated by discarding the states visited
during the burn-in period of the policy.
For the sake of notational simplicity, $\mu_i$ will be used instead of $\mu_{\pi_i}$ when no confusion is to be feared. Let the policy archive $\Pi_t$ be given as $\{\mu_1, \ldots, \mu_t\}$ at step $t$, assuming with no loss of generality that the $\mu_i$ are ordered after the expert preferences. The constraint archive $C_t$ thus includes the set of $\frac{t(t-1)}{2}$ ordering constraints $\mu_i \succ \mu_j$ for $i > j$.
Using a standard constrained convex optimization formulation [4, 13], the policy return estimate $J_t$ is sought as a linear mapping $J_t(\mu) = \langle w_t, \mu \rangle$, with $w_t \in \mathbb{R}^{n_t}$ the solution of $(P)$:

$$(P) \qquad \text{Minimize } \frac{1}{2}\|w\|^2 + C \sum_{i,j=1,\, i>j}^{t} \xi_{i,j} \quad \text{subject to } \langle w, \mu_i \rangle - \langle w, \mu_j \rangle \ge 1 - \xi_{i,j} \text{ and } \xi_{i,j} \ge 0 \text{ for all } i > j$$
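Problem (P) is a standard ranking SVM. One minimal way to solve it, sketched below under the assumption that scikit-learn is acceptable, is the usual reduction to binary classification over pairwise difference vectors; the paper itself only specifies the constrained formulation [4, 13].

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_return_estimate(mus, C=1.0):
    """Fit J_t(mu) = <w, mu> from an archive sorted from least to most
    preferred.  Problem (P) is solved by training a linear SVM without
    intercept on the pairwise differences mu_i - mu_j, a standard
    reduction of learning-to-rank to binary classification."""
    diffs, labels = [], []
    for i in range(len(mus)):
        for j in range(i):                 # mu_i preferred over mu_j for i > j
            diffs.append(mus[i] - mus[j]); labels.append(+1)
            diffs.append(mus[j] - mus[i]); labels.append(-1)
    svm = LinearSVC(C=C, fit_intercept=False)
    svm.fit(np.array(diffs), np.array(labels))
    w = svm.coef_.ravel()
    return lambda mu: float(w @ mu)
```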
The policy return estimate $J_t$ features some good properties. Firstly, it is consistent despite the fact that the dimension $n_t$ of the state space might increase with $t$; following the same arguments as above, the $w_t$ coordinate related to a state which has not yet been discovered is set to 0. By construction, it is independent of the policy parameterization and can be transferred among different policy spaces; likewise, it does not involve any information but the information available to the robot itself; therefore it can be computed on-board and provides the robot with a self-assessment.
Finally, although $J_t$ is defined at the global policy level, $w_t$ provides some valuation of the sensori-motor states. If a high positive weight $w_t[i]$ is associated to the $i$-th sensori-motor state $s_i$, this state is considered to significantly and positively contribute to the quality of a policy, comparatively to the policies considered so far. Learning $J_t$ thus significantly differs from learning the optimal RL value function $V^*$. By definition, $V^*(s_i)$ estimates the maximum expected cumulative reward ahead of state $s_i$, which can only increase as better policies are discovered. In contrast, $w_t[i]$ reflects the fact that visiting state $s_i$ is conducive to discovering better policies, comparatively to the policies viewed so far by the expert. In particular, $w_t[i]$ can increase or decrease along $t$; some states might be considered as highly beneficial in the early stages of the robot training, and discarded later on.
3.5 Initialization
The PPL initialization is challenging, for policy behaviors corresponding to uniformly drawn $\theta$ are usually quite uninteresting and therefore hard to rank. The first two policies are thus generated as follows. Given a set $P_0$ of randomly generated policies, $\pi_1$ is selected as the policy in $P_0$ with maximal information quantity ($J_0(\mu) = -\sum_i \mu[i] \log \mu[i]$). Policy $\pi_2$ is the one in $P_0$ with maximum diversity w.r.t. $\pi_1$ ($\pi_2 = \arg\max \{\Delta(\mu, \mu_{\pi_1}),\ \mu \in P_0\}$). The rationale for this initialization is that $\pi_1$ should experiment with as many distinct sensori-motor states as possible, and $\pi_2$ should experiment with as many sensori-motor states different from those of $\pi_1$ as possible, to facilitate the expert ordering and yield an informative policy return estimate $J_2$. Admittedly, the use of the information quantity, and more generally BvR, only makes sense if the robot faces a sufficiently rich environment. In an empty environment, i.e. when there is nothing to see, the robot can only visit a single sensori-motor state.
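A minimal sketch of this initialization, assuming the candidate BvR histograms are numpy arrays and the diversity measure $\Delta$ is supplied as a callable:

```python
import numpy as np

def entropy(mu):
    """Information quantity J_0(mu) = -sum_i mu[i] log mu[i] (0 log 0 := 0)."""
    p = mu[mu > 0]
    return float(-(p * np.log(p)).sum())

def initialize(candidate_histograms, delta):
    """Pick pi_1 with maximal entropy, then pi_2 maximally diverse from pi_1."""
    mus = list(candidate_histograms)
    mu1 = max(mus, key=entropy)
    mu2 = max(mus, key=lambda mu: delta(mu, mu1))
    return mu1, mu2
```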
Fig. 1. The RiverSwim problem: (a) State and Action Space; (b) Parametric and Behavioral representation
Policy generation (section 3.4) considers both the current $J_t$ and the diversity $E_t$ with respect to the policy archive $\Pi_t$, given as $E_t(\mu) = \min\{\Delta(\mu, \mu_u),\ \mu_u \in \Pi_t\}$. In the RiverSwim setting, with $\ell(\mu)$ denoting the rightmost state reached by the policy, it comes:

$$\Delta(\mu, \mu_u) = \frac{|\ell(\mu) - \ell(\mu_u)| - \min(\ell(\mu), \ell(\mu_u))}{\sqrt{\ell(\mu)\,\ell(\mu_u)}}$$
For $i < j$, $\frac{|i-j| - \min(i,j)}{\sqrt{ij}} = \frac{j - 2i}{\sqrt{ij}}$. It follows that the function $i \mapsto \frac{|\ell(\mu) - i| - \min(\ell(\mu), i)}{\sqrt{\ell(\mu)\, i}}$ is strictly decreasing on $[1, \ell(\mu)]$ and strictly increasing on $[\ell(\mu), +\infty)$. It reaches its minimum value of $-1$ for $i = \ell(\mu)$. As a consequence, for any policy $\mu$ in the archive, $E_t(\mu) = -1$.
Let $\mu_t$ and $\mu^*$ respectively denote the best policy in the archive $\Pi_t$ and the policy that goes exactly one step further right. We will now prove the following result:

Proposition 2: The probability that $\mu^*$ is generated at step $t$ is bounded from below by $\frac{1}{eN}$. Furthermore, after $\mu^*$ has been generated, it will be selected according to the selection criterion $F_t$ (section 3.4) and demonstrated to the expert.
Proof
Generation step: Considering the boolean parametric representation space $\{0, 1\}^N$, the generation of new policies proceeds using the standard bitflip mutation, flipping each bit of the current $\mu_t$ with probability $\frac{1}{N}$. The probability of generating $\mu^*$ from $\mu_t$ is lower-bounded by $\frac{1}{N}(1 - \frac{1}{N})^{\ell(\mu_t)}$ (flip bit $\ell(\mu_t) + 1$ and do not flip any bit before it). Furthermore, $(1 - \frac{1}{N})^{\ell(\mu_t)} > (1 - \frac{1}{N})^{N-1} > \frac{1}{e}$, and hence the probability of generating $\mu^*$ from $\mu_t$ is lower-bounded by $\frac{1}{eN}$.
Selection step: As shown above, $E_t(\mu^*) > -1$ and $E_t(\mu) = -1$ for all $\mu \in \Pi_t$. As the candidate policy selection is based on the maximization of $F_t(\mu) = \langle w_t, \mu \rangle + \alpha_t E_t(\mu)$, and using $\langle w_t, \mu_t \rangle = \langle w_t, \mu_{t+1} \rangle = t$, it follows that $F_t(\mu_t) < F_t(\mu^*)$ and, more generally, that $F_t(\mu) < F_t(\mu^*)$ for all $\mu \in \Pi_t$. Consider now a policy $\mu$ that is not in the archive though $\ell(\mu) < t$ (the archive does not need to contain all possible policies). From Proposition 1, it follows that $\langle w, \mu \rangle < t - \frac{1}{N}$. Because $E_t(\mu)$ is bounded, there exists a sufficiently small $\alpha_t$ such that $F_t(\mu) = \langle w_t, \mu \rangle + \alpha_t E_t(\mu) < F_t(\mu^*)$.
Furthermore, thanks to the monotonicity of $E_t(\mu)$ w.r.t. $\ell(\mu)$, one has $F_t(\mu_{t+i}) < F_t(\mu_{t+j})$ for all $i < j$, where $\ell(\mu_{t+i}) = \ell(\mu_t) + i$. Better RiverSwim policies will thus be selected along the policy generation and selection steps.
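The inequality $(1 - \frac{1}{N})^{N-1} > \frac{1}{e}$ invoked in the generation step can be checked numerically; a minimal sketch:

```python
import math

# Numeric check of the bound used in the generation step: for every N >= 2,
# (1 - 1/N) ** (N - 1) > 1/e, hence the probability of generating mu* from
# mu_t by bitflip mutation exceeds 1/(e*N).
for N in (2, 4, 16, 256, 4096):
    assert (1 - 1 / N) ** (N - 1) > 1 / math.e
print("bound holds for all tested N")
```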
4 Experimental Validation
This section reports on the experimental validation of the PPL approach. For
the sake of reproducibility, all reported results have been obtained using the
publicly available simulator Roborobo [5].
4.2 2D RiverSwim
In this 4 × 4 grid world, the robot starts in the lower left corner of the grid; the
sought behavior is to reach the upper right corner and to stay there. The expert
2
In each time step the candidate policy with highest average diversity compared to
its k-nearest neighbors in the policy archive is selected, with k = 1 for the 2D
RiverSwim, and k = 5 for both other experiments.
preference goes toward the policy going closer to the goal state, or reaching
earlier the goal state. The action space includes the four move directions with
deterministic transitions (the robot stays in the same position upon marching into the wall). For a time horizon $H = 16$, the BvR includes only 6334 distinguishable policies (to be compared with the actual $4^{16}$ distinct policies). The small size of
this grid world allows one to enumeratively optimize the policy return estimate,
with and without the exploration term. Fig. 2 (left) displays the true policy
return vs the number of calls to the expert. In this artificial problem, Novelty-
Search outperforms PPL and discovers an optimal policy in the 20-th step. PPL
discovers an optimal policy at circa the 30-th step. The exploration term plays an
important role in PPL performance; Fig. 2 (right) displays the average weight αt
of the exploration term. In this small-size problem, a mere exploration strategy
is sufficient and typically Novelty-Search discovers two optimal policies in the
first 100 steps on average; meanwhile, PPL discovers circa 25 optimal policies in
the first 100 steps on average.
Fig. 2. The 2D RiverSwim experiment: true policy return vs. number of calls to the expert for PPL, PPL without exploration, and Novelty-Search (left); weight of the exploration term vs. number of calls to the expert (right)
Fig. 3. The maze experiment: the arena (left) and best performances averaged over 41 runs
needs 140 demonstrations (Fig. 4, left). This speed-up is explained as PPL fil-
ters out unpromising policies. Novelty-Search performs poorly on this problem
comparatively to PPL and even comparatively to Expert-Only, which suggests
that Novelty-Search might not scale up well with the size of the policy space.
This experiment also establishes the merits of the behavioral representation. Fig. 4 (right) displays the average performance of PPL$_{BvR}$ and PPL$_{param}$ when no exploration term is used, so as to only assess the accuracy of the policy return estimate. As can be seen, learning is very limited when done in the parametric representation by PPL$_{param}$, resulting in no speedup compared to Expert-Only (Fig. 3, right). A third lesson is that the entropy-based initialization does not seem to bring any significant improvement, as PPL$_{BvR}$ and PPL$_{w/o\,init}$ obtain the same results after the very first steps (Fig. 3).
Fig. 4. The maze experiment: PPL vs Expert-Only (left); accuracy of the parametric and behavioral policy return estimates (right)
unit distances at the moment the tile is explored. Both robots start in (distinct)
random positions. The difficulty of the task is that most policies wander in the
arena, making it unlikely for the two robots to be anywhere close to one another
at any point in time.
The experimental results (Fig. 5, right) mainly confirm the lessons drawn from the previous experiment. PPL improves on Expert-Only, with a speed-up of circa 10, while Novelty-Search is slightly but significantly outperformed by Expert-Only. Meanwhile, the entropy-based initialization does not make much of a difference after the first steps, and might even become counter-productive in the end.
Fig. 5. Synchronous exploration: the arena (left) and the best average performance over 41 runs (right)
phases, taking inspiration from online learning on a budget [9]. A third research
avenue is to reconsider the expert preferences in a Multiple-Instance perspective
[10]; clearly, what the expert likes/dislikes might be a fragment of the policy
trajectory, more than the entire trajectory.
References
[1] Abbeel, P., Ng, A.Y.: Apprenticeship Learning via Inverse Reinforcement Learn-
ing. In: Brodley, C.E. (ed.) Proc. 21st Intl. Conf. on Machine Learning (ICML
2004). ACM Intl. Conf. Proc. Series, vol. 69, p. 1. ACM, New York (2004)
[2] Auger, A.: Convergence Results for the (1,λ)-SA-ES using the Theory of ϕ-
irreducible Markov Chains. Theoretical Computer Science 334(1-3), 35–69 (2005)
[3] Bain, M., Sammut, C.: A Framework for Behavioural Cloning. In: Furukawa, K.,
Michie, D., Muggleton, S. (eds.) Machine Intelligence, vol. 15, pp. 103–129. Oxford
University Press, Oxford (1995)
[4] Bakir, G., Hofmann, T., Scholkopf, B., Smola, A.J., Taskar, B., Vishwanathan,
S.V.N.: Machine Learning with Structured Outputs. MIT Press, Cambridge (2006)
[5] Bredeche, N.: http://www.lri.fr/~bredeche/roborobo/
[6] Brochu, E., de Freitas, N., Ghosh, A.: Active Preference Learning with Discrete
Choice Data. In: Proc. NIPS 20, pp. 409–416 (2008)
[7] Calinon, S., Guenter, F., Billard, A.: On Learning, Representing and Generalizing
a Task in a Humanoid Robot. IEEE Trans. on Systems, Man and Cybernetics, Spe-
cial Issue on Robot Learning by Observation, Demonstration and Imitation 37(2),
286–298 (2007)
[8] de Boer, P.-T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A Tutorial on the
Cross-Entropy Method. Annals OR 134(1), 19–67 (2005)
[9] Dekel, O., Shalev-Shwartz, S., Singer, Y.: The Forgetron: A Kernel-Based Percep-
tron on a Budget. SIAM J. Comput. 37, 1342–1372 (2008)
[10] Dietterich, T.G., Lathrop, R., Lozano-Perez, T.: Solving the Multiple-Instance
Problem with Axis-Parallel Rectangles. Artif. Intelligence 89(1-2), 31–71 (1997)
[11] Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. John Wiley and Sons, Menlo Park, CA (1973)
[12] Joachims, T.: A Support Vector Method for Multivariate Performance Measures.
In: De Raedt, L., Wrobel, S. (eds.) Proc. 22nd ICML. ACM Intl. Conf. Proc.
Series, vol. 119, pp. 377–384. ACM, New York (2005)
[13] Joachims, T.: Training Linear SVMs in Linear Time. In: Eliassi-Rad, T., et al.
(eds.) Proc. 12th Intl. Conf. KDDM, pp. 217–226. ACM, New York (2006)
[14] Zico Kolter, J., Abbeel, P., Ng, A.Y.: Hierarchical Apprenticeship Learning with
Application to Quadruped Locomotion. In: Proc. NIPS 20. MIT Press, Cambridge
(2007)
[15] Lehman, J., Stanley, K.O.: Exploiting Open-Endedness to Solve Problems
Through the Search for Novelty. In: Proc. Artificial Life XI, pp. 329–336 (2008)
[16] Lehman, J., Stanley, K.O.: Exploiting Open-Endedness to Solve Problems through
the Search for Novelty. In: Proc. ALife 2008, MIT Press, Cambridge (2008)
[17] Levine, S., Popovic, Z., Koltun, V.: Feature Construction for Inverse Reinforce-
ment Learning. In: Proc. NIPS 23, pp. 1342–1350 (2010)
[18] Liu, W., Winfield, A.F.T.: Modeling and Optimization of Adaptive Foraging in
Swarm Robotic Systems. Intl. J. Robotic Research 29(14), 1743–1760 (2010)
[19] Ng, A.Y., Russell, S.: Algorithms for Inverse Reinforcement Learning. In: Langley,
P. (ed.) Proc. 17th ICML, pp. 663–670. Morgan Kaufmann, San Francisco (2000)
[20] Peters, J., Schaal, S.: Reinforcement Learning of Motor Skills with Policy Gradi-
ents. Neural Networks 21(4), 682–697 (2008)
[21] Ranzato, M.-A., Poultney, C.S., Chopra, S., LeCun, Y.: Efficient Learning of
Sparse Representations with an Energy-Based Model. In: Schölkopf, B., Platt,
J.C., Hoffman, T. (eds.) Proc. NIPS 19, pp. 1137–1144. MIT Press, Cambridge
(2006)
[22] Saxena, A., Driemeyer, J., Ng, A.Y.: Robotic Grasping of Novel Objects using
Vision. Intl. J. Robotics Research (2008)
[23] Schwefel, H.-P.: Numerical Optimization of Computer Models. John Wiley & Sons,
New York (1981) 2nd edn. (1995)
[24] Stirling, T.S., Wischmann, S., Floreano, D.: Energy-efficient Indoor Search by
Swarms of Simulated Flying Robots without Global Information. Swarm Intelli-
gence 4(2), 117–143 (2010)
[25] Strehl, A.L., Li, L., Wiewiora, E., Langford, J., Littman, M.L.: PAC Model-free
Reinforcement Learning. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg,
A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503, pp. 881–888.
Springer, Heidelberg (2007)
[26] Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cam-
bridge (1998)
[27] Syed, U., Schapire, R.: A Game-Theoretic Approach to Apprenticeship Learning.
In: Proc. NIPS 21, pp. 1449–1456. MIT Press, Cambridge (2008)
[28] Thiery, C., Scherrer, B.: Improvements on Learning Tetris with Cross Entropy.
ICGA Journal 32(1), 23–33 (2009)
[29] Trianni, V., Nolfi, S., Dorigo, M.: Cooperative Hole Avoidance in a Swarm-bot.
Robotics and Autonomous Systems 54(2), 97–103 (2006)
Constraint Selection for Semi-supervised
Topological Clustering
K. Allab and K. Benabdeslem
1 Introduction
2.1 Constraints
The operating assumption behind all constrained clustering methods is that
the constraints provide information about the true (desired) partition, and that
more information will increase the agreement between the output partition and
the true partition. Constraints provide guidance about the desired partition and
make it possible for clustering algorithms to increase their performance [10].
In this paper, we propose to adapt a topology-based clustering to hard constraints: a Must-Link constraint (ML) involving $x_i$ and $x_j$ specifies that they must be placed in the same cluster; a Cannot-Link constraint (CL) involving $x_i$ and $x_j$ specifies that they must be placed in different clusters.
The quality of the partition $(P_c)_{c \in C}$ and its associated prototype vectors $(w_c)_{c \in C}$ is given by the following energy function [20]:

$$E^T((P_c)_{c \in C}, (w_c)_{c \in C}) = \sum_{x_i \in X} \sum_{c \in C} h^T(\delta(f(x_i), c))\, \|w_c - x_i\|^2 \qquad (1)$$

where $P_r$ represents the set of elements which belong to the neuron $r$. Since $h^T(0) = 1$ (when $r = c$), $E^T$ can be decomposed into two terms:

$$E_1^T = \sum_{r \in C} \sum_{x_i \in P_r} \|w_r - x_i\|^2 \qquad (3)$$

$$E_2^T = \sum_{r \in C} \sum_{c \neq r} \sum_{x_i \in P_c} h^T(\delta(r, c))\, \|w_r - x_i\|^2 \qquad (4)$$
$E_1^T$ corresponds to the distortion used in partitioning-based clustering algorithms like K-Means. $E_2^T$ is specific to the SOM algorithm. Note that if the neighborhood relationship is not considered ($\delta(.) = 0$), optimizing the objective function subject to constraints could be resolved by the known COP-KMeans [4].
Our first contribution here consists in adapting SOM to $ML$ and $CL$ constraints by minimizing equation (1) subject to these constraints. For this optimization problem we can consider several versions proposed by [21,20,15]. All these versions proceed in two steps: an assignment step for calculating $f$ and an adaptation step for calculating $w_c$.
In this work, we use the version proposed by Heskes and Kappen [21]. The assignment step consists in minimizing $E^T$ with $w_c$ fixed, and the adaptation step minimizes the same objective function but with the prototypes fixed. Although the two optimizations are performed accurately, we cannot guarantee that the energy is globally minimized by this algorithm. However, if we fix the neighborhood structure ($T$ is fixed), the algorithm converges towards a stable state after a finite number of steps [20].
Since the energy is a sum of independent equations, we can replace the two optimization problems by a set of equivalent simple problems. The formulation of (1) shows that the energy is constructed as the sum over all observations of a measure of adequacy from $\mathbb{R}^D \times C$ into $\mathbb{R}^+$ defined by:

$$\gamma^T(x, r) = \sum_{c \in C} h^T(\delta(r, c))\, \|w_c - x\|^2 \qquad (5)$$

which gives:

$$E^T((P_c)_{c \in C}, (w_c)_{c \in C}) = \sum_{x_i \in X} \gamma^T(x_i, f(x_i)) \qquad (6)$$
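As an illustration, Equation (5) and the resulting assignment step can be sketched as follows; W, delta, and h_T are assumed to be the prototype array, the grid distance between neurons, and the neighborhood kernel at temperature T (hypothetical names for this sketch).

```python
import numpy as np

def gamma_T(x, r, W, delta, h_T):
    """Adequacy measure gamma^T(x, r) = sum_c h_T(delta(r, c)) ||w_c - x||^2.

    W is the (n_neurons, D) array of prototypes, delta(r, c) the grid
    distance between neurons, h_T the neighborhood kernel."""
    return sum(h_T(delta(r, c)) * np.sum((W[c] - x) ** 2)
               for c in range(len(W)))

def assign(x, W, delta, h_T):
    """Assignment step of the batch SOM: map x to its best-matching neuron."""
    costs = [gamma_T(x, r, W, delta, h_T) for r in range(len(W))]
    return int(np.argmin(costs))
```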
where

$$g_c(x_i) = \begin{cases} 0 & \text{if } \exists x_j \in X \,|\, (f(x_j) = c) \wedge (x_i, x_j) \in \Omega_{CL} \\ 1 & \text{otherwise} \end{cases} \qquad (11)$$
Equations (7) and (10) represent the modified batch version of the SOM algorithm (which we call S3OM, for Semi-Supervised Self-Organizing Map) with our changes in both the assignment and adaptation steps. The algorithm takes in a data set ($X$), a must-link constraint set ($\Omega_{ML}$), and a cannot-link constraint set ($\Omega_{CL}$). It returns a partition of the observations in $X$ that satisfies all specified constraints. The major modification is that, when updating cluster assignments, we ensure that none of the specified constraints are violated. We attempt to assign each point $x_i$ to its closest neuron $c$ in the map. This will succeed unless a constraint would be violated. If there is another point $x_j$ that must be assigned to the same neuron as $x_i$ but is already in some other neuron, or there is another point $x_k$ that cannot be grouped with $x_i$ but is already in $c$ or in the neighborhood of $c$, then $x_i$ can be placed neither in $c$ nor in its neighborhood.
We continue down the sorted list of neurons until we find one that can legally host $x_i$. Constraints are never broken; if a legal neuron cannot be found for $x_i$, the empty partition ($\emptyset$) is returned. Note that with the use of equation (11), each reference vector $w_c$ is updated only by the observations $x_i \in X$ which have no CL constraints with any element $x_j$ belonging to $c$.
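A minimal sketch of this constraint-checked assignment, in the spirit of COP-KMeans; the neighbors helper and the data layout are assumptions of this sketch, not the paper's exact pseudocode.

```python
def violates(x_idx, neuron, assignment, ml_pairs, cl_pairs, neighbors):
    """Check whether placing observation x_idx on `neuron` breaks a constraint.

    `assignment` maps already-placed observations to neurons; `neighbors(c)`
    returns the neurons in the topological neighborhood of c (hypothetical
    helper).  CL constraints also exclude the neighborhood, as in the text."""
    for (i, j) in ml_pairs:                        # must-link: same neuron
        other = j if i == x_idx else i if j == x_idx else None
        if other is not None and other in assignment \
                and assignment[other] != neuron:
            return True
    forbidden = {neuron} | set(neighbors(neuron))
    for (i, j) in cl_pairs:                        # cannot-link
        other = j if i == x_idx else i if j == x_idx else None
        if other is not None and other in assignment \
                and assignment[other] in forbidden:
            return True
    return False

def constrained_assign(x_idx, sorted_neurons, assignment,
                       ml_pairs, cl_pairs, neighbors):
    """Walk the neurons sorted by gamma^T; return None if no legal host."""
    for c in sorted_neurons:
        if not violates(x_idx, c, assignment, ml_pairs, cl_pairs, neighbors):
            return c
    return None   # the empty-partition case
```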
3 Constraint Selection
Considering other results obtained in [3,4,14,22,23], we observed that integrating constraints generally improves clustering performance. But sometimes they can have ill effects, even when they are generated from the data labels that are used to evaluate accuracy. So it is important to know why some constraint sets increase clustering accuracy while others have no effect or even decrease accuracy. To that end, the authors in [13] defined two important measures, informativeness and coherence, that capture relevant properties of constraint sets.
Fig. 2. Two properties of constraints (red lines for $ML$ and green lines for $CL$): (a) Informativeness: m and c are informative. (b) Coherence: the projected overlap between m and c ($over_c m$) is not null; thus, the coherence of the subset {m, c} is null.
3.1 Informativeness
This measure represents the amount of conflict between the constraints and
the underlying objective function and search bias of an algorithm. It is based on
measuring the number of constraints that the clustering algorithm cannot predict
using its default bias. Given a possibly incomplete set of constraints Ω and an
algorithm A, we generate the partition PA by running A on the data set without
any constraints (Figure 2(a)). We then calculate the fraction of constraints in Ω
that are unsatisfied by $P_A$:

$$I_A(\Omega) = \frac{1}{|\Omega|} \sum_{\alpha \in \Omega} unsat(\alpha, P_A) \qquad (12)$$
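A direct transcription of Equation (12), assuming each constraint is encoded as a ('ML', x, y) or ('CL', x, y) triple and the partition maps points to cluster ids:

```python
def informativeness(constraints, partition):
    """I_A(Omega): fraction of constraints that the unconstrained
    partition P_A fails to satisfy."""
    def violated(kind, x, y):
        same = partition[x] == partition[y]
        # ML is violated when the pair is split; CL when it is joined.
        return same if kind == 'CL' else not same
    return sum(violated(*c) for c in constraints) / len(constraints)
```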
3.2 Coherence
This measure represents the amount of agreement between the constraints them-
selves, given a metric d that specifies the distance between points. It does not
require knowledge of the optimal partition P ∗ and can be computed directly.
The coherence of a constraint set is independent of the algorithm used to per-
form constrained clustering. One view of an M L(x, y) (or CL(x, y)) constraint
is that it imposes an attractive (or repulsive) force within the feature space
along the direction of a line formed by (x, y), within the vicinity of x and y.
Two constraints, one an M L constraint (m) and the other a CL constraint (c),
are incoherent if they exert contradictory forces in the same vicinity. Two con-
straints are perfectly coherent if they are orthogonal to each other. To determine
the coherence of two constraints, m and c, we compute the projected overlap of
each constraint on the other as follows.
Let $\vec{m}$ and $\vec{c}$ be the vectors connecting the points constrained by $m$ and $c$, respectively. The coherence of a given constraint set $\Omega$ is defined as the fraction of constraint pairs that have zero projected overlap (Figure 2(b)):

$$Coh_d(\Omega) = \frac{\sum_{m \in \Omega_{ML},\, c \in \Omega_{CL}} \delta(over_c m = 0 \wedge over_m c = 0)}{|\Omega_{ML}|\,|\Omega_{CL}|} \qquad (13)$$
where $over_c m$ represents the distance between the two projected points linked by $m$ over $c$, and $\delta$ counts the overlapped projections. See [13] for more details.
From equation (13), we can easily define a specific measure for each constraint as follows:

$$Coh_d(m) = \frac{\sum_{c \in \Omega_{CL}} \delta(over_c m = 0)}{|\Omega_{CL}|} \qquad (14)$$

$$Coh_d(c) = \frac{\sum_{m \in \Omega_{ML}} \delta(over_m c = 0)}{|\Omega_{ML}|} \qquad (15)$$
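One plausible reading of the projected-overlap test is sketched below for constraints given as pairs of numpy points; the exact overlap computation in [13] may differ in details.

```python
import numpy as np

def projected_overlap(seg_a, seg_b):
    """Length of the overlap when segment `seg_a` is projected onto the
    line supporting segment `seg_b` (each segment is a pair of points)."""
    (a1, a2), (b1, b2) = seg_a, seg_b
    d = b2 - b1
    d = d / np.linalg.norm(d)
    # Scalar coordinates of all four endpoints along b's direction.
    ta = sorted((np.dot(a1 - b1, d), np.dot(a2 - b1, d)))
    tb = sorted((0.0, np.dot(b2 - b1, d)))
    return max(0.0, min(ta[1], tb[1]) - max(ta[0], tb[0]))

def coherence(ml_segments, cl_segments):
    """Coh_d(Omega): fraction of (ML, CL) pairs whose mutual projected
    overlaps are both zero (Equation (13))."""
    count = sum(projected_overlap(m, c) == 0 and projected_overlap(c, m) == 0
                for m in ml_segments for c in cl_segments)
    return count / (len(ml_segments) * len(cl_segments))
```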
4 Experimental Results
Extensive experiments were carried out on the data sets in Table 1. These data sets were deliberately chosen for evaluating the clustering performance of S3OM and comparing it with other state-of-the-art techniques: COP-KMeans (CKM) [4], PC-KMeans (PKM), M-KMeans (MKM), MPC-KMeans (MPKM) [14], Cop-Bcoloring (CBC) [8], label propagation on SOM (lpSM) [16], and the Belkin-Niyogi approach [17].
Table 2. Average Rand Index of S3OM vs five constrained clustering algorithms (for
1000 trials). Firstly without constraints (Unc) and secondly, with 25 randomly selected
constraints (Con).
Fig. 3. Results of S3OM according to coherence rate, with fully informative constraints
To understand how these constraint set properties affect our constrained algorithm, we performed the following experiment. We measured the accuracy on random informative constraint sets with different rates of coherence (calculated by (13)). We can see that all databases exhibited important increases in accuracy when coherence increases in a set of randomly selected constraints (Figure 3). In fact, for all data sets, the accuracy increases steadily and quickly when integrating "good" background information.
Table 3. Average Rand Index of S3OM vs two semi-supervised learning algorithms (for
100 trials) with some FCPS data sets. Firstly without constraints (Unc) and secondly,
with 100 softly selected constraints (Con).
(Accuracy = 100%) [16]. We can also remark that S3OM achieves better performance than the other methods, which confirms the contribution of the selected constraints to the improvement of the clustering model's quality. For example, on EngyTime, S3OM achieves an accuracy of 61.21% without any constraints. By using the 100 softly selected constraints, S3OM can reach an optimal averaged accuracy of 96.8%.
Table 4. Performance of S3OM over the "Leukemia" data set with hard selected constraints vs. soft selected ones

          Hard                      Soft
 #ML   #CL   Acc           #ML   #CL   Acc
  1     4    62.0            2     7   55.9
  5     5    70.9            6     9   66.5
 11     4    75.7           13     7   77.0
 11     9    86.7           14    17   84.7
  8    17    89.6           11    21   91.1
 14    16    97.2           16    21   88.6
4.5 Visualization
In this section, we present some visual inspections of the "Chainlink" data set, which poses an important data structure problem. It describes two chains tied together in 3D space. The aim is to see whether our proposals are able to disentangle the two clusters and represent the real structure of the data.
Fig. 7. Behaviors of the Chainlink maps produced by S3OM. Orange neurons represent the first chain (Majorelle blue neurons the second one). Green lines for $ML$ constraints (red lines for $CL$).
5 Conclusion
In this work, the contributions are two-fold. First, a new algorithm, S3OM, was developed for semi-supervised clustering by considering instance-level constraints in the objective function optimization of the batch version of SOM. The deterministic aspect of this algorithm allowed us to perform hard constraint satisfaction in a topological clustering. Second, we studied two constraint set properties, informativeness and coherence, that provide a quantitative basis for explaining why a given constraint set increases or decreases performance. These measures were used for selecting the most useful constraints for clustering. The constraint selection was done in both hard and soft fashions.
References
1. Basu, S., Davidson, I., Wagstaff, K.: Constrained clustering: Advances in algo-
rithms, theory and applications. Chapman and Hall/CRC Data Mining and Knowl-
edge Discovery Series (2008)
2. Frank, A., Asuncion, A.: UCI Machine Learning Repository. Technical report, University of California (2010)
3. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a mahalanobis metric
from equivalence constraints. Journal of Machine Learning Research 6, 937–965
(2005)
4. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: Proc. of the 18th International Conference on Machine Learning, pp. 577–584 (2001)
5. Lu, Z., Leen, T.K.: Semi-supervised learning with penalized probabilistic cluster-
ing. In: Advances in Neural information Processing Systems 17 (2005)
6. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Clustering with instance level
constraints. In: Proc. of the 17th International Conference on Machine Learning,
pp. 1103–1110 (2000)
7. Davidson, I., Ravi, S.S.: Agglomerative hierarchical clustering with constraints:
theorical and empirical results. In: Proc. of ECML/PKDD, pp. 59–70 (2005)
8. Elghazel, H., Benabdeslem, K., Dussauchoy, A.: Constrained graph b-coloring
based clustering approach. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK
2007. LNCS, vol. 4654, pp. 262–271. Springer, Heidelberg (2007)
9. Davidson, I., Ravi, S.S.: The complexity of non-hierarchical clustering with instance
and cluster level constraints. Data Mining and Knowledge Discovery 61, 14–25
(2007)
10. Davidson, I., Ravi, S.S.: Clustering with constraints: feasibility issues and the k-
means algorithm. In: Proc. of the SIAM International Conference on Data Mining,
pp. 138–149 (2005)
11. Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semi-supervised graph clustering, a
kernel approach. In: Proc. of the 22th International Conference on Machine Learn-
ing, pp. 577–584 (2005)
12. Davidson, I., Ester, M., Ravi, S.S.: Efficient incremental clustering with constraints.
In: Proc. of 13th ACM Knowledge Discovery and Data Mining (2007)
13. Davidson, I., Wagstaff, K., Basu, S.: Measuring constraint-set utility for partitional
clustering algorithms. In: Proc. of ECML/PKDD (2006)
14. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning
in semi-supervised clustering. In: Proc. of the 21th International Conference on
Machine Learning, pp. 11–18 (2004)
15. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (2001)
16. Herrmann, L., Ultsch, A.: Label propagation for semi-supervised learning in self-
organizing maps. In: Proc. of the 6th WSOM (2007)
17. Belkin, M., Niyogi, P.: Using manifold structure for partially labelled classification.
In: Proc. of Advances in Neural Information Processing Systems (2003)
18. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training.
In: Proc. of COLT: Proc. of the Workshop on Computational Learning Theory, pp.
92–100 (1998)
19. Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised learning. The MIT Press,
Cambridge (2006)
20. Cheng, Y.: Convergence and ordering of Kohonen's batch map. Neural Computation 9(8), 1667–1676 (1997)
21. Heskes, T., Kappen, B.: Error potentials for self-organization. In: Proc. of IEEE
International Conference on Neural Networks, pp. 1219–1223 (1993)
22. Xing, E.P., Ng, A.Y., Jordan, M.I., Russel, S.: Distance metric learning, with ap-
plication to clustering with side-information. Advances in Neural Information Pro-
cessing Systems 15, 505–512 (2003)
23. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-
level constraints: Making the most of prior knowledge in data clustering. In: Proc.
of the 19th International Conference on Machine Learning, pp. 307–313 (2002)
24. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
25. Ultsch, A.: Fundamental clustering problems suite (fcps). Technical report, Uni-
versity of Marburg (2005)
26. Vesanto, J., Alhoniemi, E.: Clustering of the self organizing map. IEEE Transac-
tions on Neural Networks 11(3), 586–600 (2000)
27. Kalyani, M., Sushmita, M.: Clustering and its validation in a symbolic framework.
Pattern Recognition Letters 24(14), 2367–2376 (2003)
28. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
Is There a Best Quality Metric for Graph
Clusters?
1 Introduction
The problem of graph clustering consists of discovering natural groups (or clus-
ters) in a graph [17]. Graph clustering has become very popular recently, given
the large number of applications it has in areas like social network analysis
(finding groups of related people), e-commerce (doing recommendations based
on relations in a group) and bioinformatics (classifying gene expression data,
studying the spread of a disease in a population).
What the basic structure of a group is remains open to discussion, but the most classical and widely adopted view is the one based on the concept of homophily: similar elements have a greater tendency to group with each other than with other elements [16]. When working with graphs, homophily is usually viewed in terms of edge densities, with clusters having more edges linking their elements among themselves (high internal density) than linking them to the rest of the graph
2 Related Work
3 Quality Metrics
use only a graph’s topological information, like vertex distance or edge density,
to evaluate the quality of a given cluster.
3.2 Modularity
One of the most popular validation metrics for topological clustering, modularity
states that a good cluster should have a bigger than expected number of internal
edges and a smaller than expected number of inter-cluster edges when compared
to a random graph with similar characteristics [14]. The modularity score $Q$ for a clustering is given by Equation 1, where $e$ is a symmetric matrix whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in communities $i$ and $j$, and $Tr(e)$ is the trace of matrix $e$, i.e., the sum of the elements on its main diagonal:

$$Q = Tr(e) - \|e^2\| \qquad (1)$$
The modularity index Q often presents values between 0 and 1, with 1 repre-
senting a clustering with very strong community characteristics. However, some
limit cases may even present negative values. One example of such cases is in
the presence of clusters with only one vertex. In this case, those clusters have 0
internal edges and, therefore, contribute nothing to the trace. A sufficiently large
number of singleton clusters in a clustering can drive the trace so low that it
overshadows the contribution of its other, possibly well-formed, clusters and leads
to very low modularity values regardless of their quality.
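As an illustration, here is a minimal sketch (our own, not from the paper) that computes Q for an undirected, unweighted graph given as an edge list; the helper name and toy graph are ours:

import numpy as np

def modularity(edges, membership, k):
    """Q = Tr(e) - ||e^2|| for an undirected, unweighted graph, where
    e[i, j] is the fraction of edge endpoints joining communities i and
    j, and ||.|| sums all entries of the matrix."""
    e = np.zeros((k, k))
    for u, v in edges:
        ci, cj = membership[u], membership[v]
        e[ci, cj] += 0.5   # each edge splits its unit mass symmetrically,
        e[cj, ci] += 0.5   # so intra-cluster edges land on the diagonal
    e /= len(edges)
    return np.trace(e) - (e @ e).sum()

# Toy example: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(edges, {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}, k=2))  # ~0.357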
3.3 Silhouette Index
This metric uses concepts of cohesion and separation to evaluate clusters, using
the distance between nodes to measure their similarity [21]. The silhouette index
for a vertex v and for a cluster C_i is given by Equation 2:

S(C_i) = \frac{\sum_{v \in C_i} S_v}{|C_i|}, \quad \text{where} \quad S_v = \frac{b_v - a_v}{\max(a_v, b_v)} \qquad (2)
where a_v is the average distance between vertex v and all the other vertices
in its own cluster, and b_v is the average distance between v and all
the vertices in the nearest cluster that is not v's. The silhouette index for a
given cluster is the average value of silhouette for all its member vertices. The
silhouette index can assume values between −1 and 1, with a negative value
being undesirable, as it means that the average internal distance of the cluster
is greater than the external one.
The silhouette index presents some limitations, though. First, it is a very
expensive metric to compute, requiring an all-pairs shortest-path execution.
Second, it misbehaves in the presence of singleton clusters: since a singleton
possesses no internal edges, its internal distance will be 0, causing its silhouette
to wrongly score a perfect 1. As a consequence, clusterings with many singletons will
always have high silhouette scores, no matter the quality of the other clusters.
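A sketch of this computation on a connected, unweighted graph, using shortest-path length as the vertex distance (networkx); the function name and input layout are assumptions, and the comment marks the singleton pathology just described:

import networkx as nx

def cluster_silhouette(G, clusters, ci):
    """Average silhouette S(C_i) of cluster `ci`, using shortest-path
    length on a connected, unweighted graph as the vertex distance
    (the expensive all-pairs computation mentioned above).
    `clusters` maps cluster ids to sets of vertices."""
    dist = dict(nx.all_pairs_shortest_path_length(G))
    scores = []
    for v in clusters[ci]:
        others = [dist[v][u] for u in clusters[ci] if u != v]
        # A singleton has no internal distances, so a_v = 0 and its
        # silhouette is a "perfect" 1, the pathology described above.
        a_v = sum(others) / len(others) if others else 0.0
        b_v = min(sum(dist[v][u] for u in members) / len(members)
                  for cj, members in clusters.items() if cj != ci)
        scores.append((b_v - a_v) / max(a_v, b_v))
    return sum(scores) / len(scores)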
3.4 Conductance
The conductance [8] of a cut is a metric that compares the size of a cut (i.e.,
the number of edges cut) with the weight of the edges in either of the two
subgraphs induced by that cut. The conductance φ(G) of a graph is the minimum
conductance value over all its clusters.
Consider a cut that divides G into k non-overlapping clusters C_1, C_2, . . . , C_k.
The conductance of any given cluster, φ(C_i), can be obtained as shown in Equation 3,
where a(C_i) = \sum_{u \in C_i} \sum_{v \in V} w(u, v) is the sum of the weights of all edges
with at least one endpoint in C_i. This φ(C_i) value represents the cost of one cut
that bisects G into two vertex sets C_i and V \ C_i. Since we want to find a number
k of clusters, we will need k − 1 cuts to achieve that number. In this paper we
assume the conductance for the whole clustering to be the average value of those
(k − 1) φ cuts, as formalized in Equation 4.

\phi(C_i) = \frac{\sum_{u \in C_i} \sum_{v \notin C_i} w(\{u, v\})}{\min(a(C_i), a(\bar{C}_i))} \qquad (3)

\phi(G) = \mathrm{avg}(\phi(C_i)), \quad C_i \subseteq V \qquad (4)
Although the use of both internal and external conductance gives a better, well
rounded view of both internal density and external sparsity of a cluster, many
works use only the external conductance while evaluating cluster quality [10,11].
So, in this paper we will likewise use only the external conductance, referred to from
now on simply as conductance, to evaluate whether it is a good enough quality metric by
itself. One negative characteristic of conductance that can be pointed out is that
it tends to give better scores to clusterings with fewer clusters,
as more clusters will probably produce more cut edges. Also, the lack of internal edge
density information used in this kind of conductance may cause problems, as can
be seen in Figure 1, where both clusterings presented would have the same con-
ductance score, even though the one in Figure 1b is obviously better.
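A sketch of the external conductance of one cluster in an unweighted graph, following Equation 3, together with the clustering-level average of Equation 4 (helper names are ours):

def conductance(edges, cluster):
    """External conductance phi(C) of a vertex set in an unweighted
    graph (Equation 3): cut edges over the smaller of a(C) and a(C-bar),
    where a(S) counts edges with at least one endpoint in S."""
    cluster = set(cluster)
    cut = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    a_c = sum(1 for u, v in edges if u in cluster or v in cluster)
    a_cbar = sum(1 for u, v in edges if u not in cluster or v not in cluster)
    return cut / min(a_c, a_cbar)

def clustering_conductance(edges, clustering):
    """Equation 4: score a whole clustering by the per-cluster average."""
    return sum(conductance(edges, c) for c in clustering) / len(clustering)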
3.5 Coverage
The coverage of a clustering C (where C = C1 , C2 , . . . , Ck ) is given as the fraction
of the weight of all intra-cluster edges with respect to the total weight of all edges
in the whole graph G [1], as shown in Equation 7:
\mathrm{coverage}(C) = \frac{w(C)}{w(G)}, \quad \text{where} \quad w(C) = \sum_{i=1}^{k} \sum_{v_x, v_y \in C_i} w(E(v_x, v_y)) \qquad (7)
Coverage values usually range from 0 to 1. Higher values of coverage mean that
there are more edges inside the clusters than edges linking different clusters,
which translates to a better clustering. From its formulation, we can observe
that the main clustering characteristic needed for a high value of coverage is
inter-cluster sparsity. Internal cluster density is in no way taken into account
by this metric, which probably causes a strong bias toward clusterings with fewer
clusters. This can be seen in the example in Figure 1, where the clustering with
two clusters would receive a better score than the clearly better clustering with
three clusters.
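On an unweighted graph, coverage reduces to the fraction of intra-cluster edges; a minimal sketch (input layout assumed):

def coverage(edges, membership):
    """Coverage of a clustering (Equation 7) on an unweighted graph:
    the fraction of edges that fall inside a cluster. `membership`
    maps each vertex to its cluster id."""
    intra = sum(1 for u, v in edges if membership[u] == membership[v])
    return intra / len(edges)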
3.6 Performance
This metric counts the number of internal edges in a cluster along with the edges
that don’t exist between the cluster’s nodes and other nodes in the graph [22],
as can be seen in Equation 8:

\mathrm{perf}(C) = \frac{f(C) + g(C)}{\frac{1}{2}\,n(n-1)}, \quad \text{where} \qquad (8)

f(C) = \sum_{i=1}^{k} |E(C_i)|, \qquad g(C) = \sum_{i=1}^{k} \sum_{j > i} \bigl|\{\{u, v\} \notin E \mid u \in C_i,\, v \in C_j\}\bigr|
This formulation assumes an unweighted graph, but there are also variants for
weighted graphs [1]. Values range from 0 to 1, and higher values indicate that
a cluster is both internally dense and externally sparse and, therefore, a better
cluster. However, considering that complex networks tend to be sparse in
nature, when performance is applied to larger graphs g(C) is likely to become
so large that it dominates all other factors in the formula, awarding high
scores indiscriminately.
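A sketch of Equation 8 for an unweighted graph; the O(n²) pair enumeration also hints at why g(C) dominates on large sparse networks:

from itertools import combinations

def performance(nodes, edges, membership):
    """Performance of a clustering (Equation 8) on an unweighted graph:
    intra-cluster pairs that are edges plus inter-cluster pairs that
    are non-edges, over all n(n-1)/2 vertex pairs."""
    edge_set = {frozenset(e) for e in edges}
    f = g = 0
    for u, v in combinations(nodes, 2):
        same = membership[u] == membership[v]
        linked = frozenset((u, v)) in edge_set
        if same and linked:
            f += 1          # contributes to f(C): intra-cluster edge
        elif not same and not linked:
            g += 1          # contributes to g(C): inter-cluster non-edge
    n = len(nodes)
    return (f + g) / (n * (n - 1) / 2)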
4 Clustering Algorithms
To be able to compare different clusterings with the validation metrics available,
we selected representatives from four different categories of clustering
algorithms. The chosen algorithms were Markov clustering (MCL), bisecting
K-means, spectral clustering and normalized cut.
The clustering process of MCL consists of two iterative steps: expansion and
inflation. The expansion step of the algorithm is done taking the power of the
normalized adjacency matrix representing the graph using traditional matrix
multiplication. The inflation step consists in taking the Hadamard power of the
expanded matrix, followed by a scaling step to make the matrix stochastic again,
with the elements of each column corresponding to a probability value. MCL does
not need a pre-defined number of clusters as input; its only parameter
is the inflation value, which affects the coarsening of the graph (the lower
the value, the coarser the clustering).
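A bare-bones numpy sketch of these two steps (dense matrices, no pruning; real implementations such as Van Dongen's MCL are far more efficient):

import numpy as np

def mcl(adj, inflation=2.0, iterations=50):
    """Bare-bones MCL: alternate expansion (matrix squaring) and
    inflation (entry-wise power, then column re-normalization) on a
    column-stochastic matrix until it (typically) converges."""
    M = adj.astype(float) + np.eye(len(adj))  # self-loops aid convergence
    M /= M.sum(axis=0)
    for _ in range(iterations):
        M = M @ M                 # expansion
        M = M ** inflation        # inflation (Hadamard power)
        M /= M.sum(axis=0)        # make columns stochastic again
    # Rows that retain mass act as attractors; their non-zero columns
    # are (up to duplicates in this simplified version) the clusters.
    return [np.nonzero(row > 1e-6)[0].tolist()
            for row in M if row.max() > 1e-6]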
The normalized cut algorithm [18] partitions the graph by minimizing
the cost function given by Equation 9, where the volume of a set is the sum of
the weights of all edges with at least one endpoint inside it.

\mathrm{Ncut}(A, B) = \mathrm{cut}(A, B)\left(\frac{1}{\mathrm{Vol}(A)} + \frac{1}{\mathrm{Vol}(B)}\right) \qquad (9)
This cost function is designed to penalize cuts that generate subsets with highly
different sizes. So, by minimizing the normalized cut of a graph, we are dividing
sets of vertices with low similarity and that potentially have high internal simi-
larity. This technique also requires the desired number of clusters to be given as
an input.
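For concreteness, a sketch that merely evaluates this cost for a given bipartition of an unweighted graph (the actual algorithms minimize it, e.g., spectrally [18] or via the multilevel approach of GRACLUS [3]):

def normalized_cut(edges, A, B):
    """Ncut(A, B) = cut(A, B) * (1/Vol(A) + 1/Vol(B)) on an unweighted
    graph, with Vol(S) the number of edges having at least one endpoint
    in S (Equation 9)."""
    A, B = set(A), set(B)
    cut = sum(1 for u, v in edges
              if (u in A and v in B) or (u in B and v in A))
    vol = lambda S: sum(1 for u, v in edges if u in S or v in S)
    return cut * (1.0 / vol(A) + 1.0 / vol(B))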
5 Experiments
This section presents the experiments used to help evaluate the quality metrics
studied. We briefly describe our methodology and the graphs used, followed
by a discussion of the obtained results.
5.1 Methodology
We implemented the five quality metrics discussed in Section 3. To evaluate their
behavior, we applied them to clusters obtained through the execution of the four
classical graph clustering algorithms discussed in Section 4 on five large, real
world graphs that will be briefly discussed in the next subsection. This variety
of clustering algorithms and graphs is necessary to minimize the pollution of
the results by possible correlations between metrics and algorithms and/or graph
structures.
We used freely available implementations for all clustering algorithms: the
MCL implementation by Van Dongen, which is available within many Linux dis-
tributions, the implementation of bisecting K-means available in the Cluto¹ suite
of clustering algorithms, the spectral clustering algorithm implementation avail-
able in SCPS, by Nepusz [13] and the normalized cut clustering implementation
GRACLUS, by Dhillon [3].
Three different inflation indexes were chosen for the MCL algorithm, based
on the values suggested by the algorithm's documentation: 1.5, 2, and 3. The
number of clusters found by each MCL configuration was used as the input for
the other algorithms, so that we could compare clusterings with roughly the
same number of clusters.
Graphs. We used 7 different datasets derived from real complex networks. Two
of them are smaller, but with known expected partitions that could be used for
comparison, and the other five are bigger and with unknown expected partitions.
All graphs used are undirected and unweighted.
¹ http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
The first small dataset is the Karate club network. It was first presented
by Zachary [23] and depicts the relationships between the students in a karate
dojo. During Zachary’s study a fight between two teachers caused a division of
the dojo in two, with the students more related to one teacher moving to his
new dojo. Even though this dataset is small (34 vertices), it is interesting to
consider because it possesses information about the real social partition of the
graph, providing a ground truth for the clustering.
The other small dataset used was the American College football teams'
matches [5]. It represents a graph where the vertices are football teams and an
edge links two teams if they have played against each other. Since the teams
play mostly with other teams in the same league as theirs, with the exception
of some military school teams, which belong to no league and can play against
anyone, there is also an expected clustering already known for this graph. It is
composed of 115 vertices and 616 edges.
The five remaining networks were obtained from the Stanford Large Network
Dataset Collection2 . Two of them represent the network of collaborations in
papers submitted to the arXiv e-prints in two different areas of study, namely
Astrophysics and High Energy Physics. In those networks, researchers are the
vertices, and they are linked by edges if they collaborated in at least one paper.
The Astrophysics network is composed of 18,772 vertices and 396,160 edges,
while the High Energy Physics one has 12,008 vertices and 237,010 edges. Another
network based on the papers submitted to the arXiv e-prints was used, but
covering the citation network of authors in the High Energy Physics category.
In this case, an edge links two authors if one cites the other. This network has
34,546 vertices and 421,578 edges.
The last two networks are snapshots of a Gnutella P2P file-sharing network,
taken on two different dates. Here the vertices are the Gnutella clients and the
edges are the overlay network connections between them. The first snapshot was
collected on August 4, 2002 and comprises 10,876 vertices and 39,994 edges. The
second one was collected on August 30, 2002 and has 36,682 vertices and 88,328
edges.
5.2 Results
We first used the smaller datasets, the karate club and the college football, in
order to check how the algorithms and quality metrics behaved in small net-
works where the expected result was already known. The results for the Karate
club dataset can be seen in Table 1. The College Football dataset gave similar
results and was omitted for brevity. The results shown represent the case with
two clusters, which is the expected number for this dataset. It can be observed
that the scores obtained were fairly high. Also, the resulting clusters were very
similar to the expected ones, with variations of 2 or 3 wrongly clustered vertices.
However, those two study cases were very small and classical, so good results
here were only to be expected, as most of the quality metric biases we pointed
out in Section 3 are connected to bigger networks with many clusters.

² http://snap.stanford.edu/data/

Table 1. Karate Club dataset and its quality indexes for two clusters
We now turn to the larger datasets. The quality metric values for the Astrophysics
Collaboration network are available in Table 2. It is already possible to observe
some trends in the quality metrics' behavior, no matter which clustering algo-
rithm is used. For example, modularity, coverage and conductance always give
better results for smaller numbers of clusters. Also, we can see that, as expected
from our observations in Section 3, performance values have no discriminating
power to compare any of our results. The silhouette index presents a somewhat
erratic behavior in this case, without a clear tendency of better or worse results
for more or less clusters.
For the High Energy Physics Collaboration network, as we can see in Table 3,
the tendencies observed in the last network are still true. Also, silhouette index
shows a more pronounced bias toward larger numbers of clusters. If we look at
the cumulative distribution function (CDF) of cluster sizes (as shown in Figure 2
for just two instances of our experiments, but that is consistent with the rest of
the obtained results), we can see that bigger clusterings tend to have a larger
number of smaller clusters. So, this bias of the silhouette index is expected from
our observations in Section 3. Those same tendencies occur in the High Energy
Physics Citation network, as seen in Table 4.
Fig. 2. Cumulative distribution function P(X ≤ x) of cluster sizes for two instances of our experiments.
The quality metric scores for one of the Gnutella snapshot networks can be
seen in Table 5. The scores for the other one were very similar to it, so we
suppressed them for brevity. It is possible to notice that the results for those
graphs still present the same tendencies shown in the other cases, but with a
key difference: while silhouette and performance results show no big difference
from the other datasets, as they are easily fooled by high numbers of singleton
clusters and network size, respectively, modularity, coverage and conductance
give abysmally low quality scores. This happens because of the structure of a
Gnutella network, in which ordinary peers connect only to "superpeers", while the
superpeers also connect to each other. This structure leads to a very low
occurrence probability of 3-cliques (0.5% for the Gnutella networks against 31.8%
for the Astrophysics Collaboration network, for example). Also, the Gnutella
networks presented here are far sparser than the other studied networks, with
only 6.76% of all possible edges present in the graph for the 08/04/2002 snapshot
against 32.88% for the High Energy Physics citation one, for example.
Discussion. For all the generated cases, coverage, modularity and conductance
have better values for smaller numbers of clusters. This behavior is expected
from the formulation of coverage, since it observes the number of inter-cluster
edges, which tends to be smaller if there are fewer clusters to link to. The same
thing happens to conductance, as more inter-cluster edges mean more expensive
cuts. Without balancing the external conductance against the internal conductance,
the results give us only a partial and biased picture.
Concerning modularity, we already know that singleton clusters have a very
bad impact on the modularity score, and the more clusters there are, the bigger the
chance for singletons to occur. It is interesting to note that giving low scores
to singleton clusters is not wrong per se, but since those scores influence
the overall score, they can obfuscate the existence of well-scored clusters in the
final tally.
Silhouette Index generally gives better results for more clusters, which can
also be attributed to the larger occurrence of singletons, clusters that wrongly
give optimal results for SI.
Table 3. High energy physics collaboration network clusters and their quality indexes
Table 4. High energy physics citation network clusters and their quality indexes
Table 5. Gnutella peers network (08/04/2002) clusters and their quality indexes
quality metrics to identify said clusters. This observation that different kinds
of cluster structures exist and that the usual clustering methods wouldn’t work
with them was already discussed by Nepusz [12]. In that work, the authors argued
that, in a bipartite graph, each of the two sides of the bipartition should be
considered a cluster. Kumar et al. [9] also cite the existence of this kind of cluster
structure, pointing out that there are many on-line communities that behave as
bipartite subgraphs and giving the websites of cellphone carriers as an example:
they represent the same category of service, but will not have direct links to each
other.
It is interesting to notice that even the simplest instances of a bipartite
graph would score poorly on the quality metrics studied in this paper, as their
internal density is nonexistent and all their edges connect to other clusters. For
example, consider a small, 10-vertex bipartite graph with two 5-vertex partitions
connected by 11 edges. This simple case would give scores such as −0.29 for the
silhouette index, −0.5 for modularity, 0 for coverage, 0.31 for performance and 2
for conductance, results that are indeed very poor.
6 Conclusion
In this paper we presented a study of some of the most popular quality met-
rics for graph clustering, namely, the Silhouette Index, Modularity, Coverage,
Performance and Conductance. To evaluate those metrics, we compared their
results for clusters generated by four different clustering algorithms: Markovian,
Bisecting K-means, Spectral and Normalized Cut. We used seven different real
datasets in our experiments, with two of them having an already known opti-
mal clustering based on the semantics of the relationships between the elements
represented by their graphs.
Based on our experiments, we could identify some interesting behaviors for
those cluster quality assessing metrics. For example, Modularity, Conductance
58 H. Almeida et al.
and Coverage have a bias toward giving better results for smaller numbers of
clusters, while the other studied metrics have a completely opposite bias. This
indicates that all those metrics do not share a common view of what a true
clustering should look like.
Our results suggest that there is no such thing as a “best” quality metric
for graph clustering. Even more, the currently used quality metrics have strong
biases that do not always point in the direction of what is assumed to be a
well-formed cluster. Also, those biases can get even more pronounced in large
graphs, which are the ones that depend on those metrics the most, as they are
the hardest to manually evaluate any results.
Another point observed was that the structure of clusters can be different for
graphs with different origins. In our case, we saw clear differences in the results of
technological and social networks. Current clustering and evaluation techniques
seem to be inadequate to tackle those different kinds of complex networks.
As future work, we intend to study how particular aspects of a graph topology
can affect the structure of a cluster, so that we can evaluate clusters with different
characteristics, and not only the clique-like ones. We will also consider adding
other dimensions to the graph, such as weights and labels.
References
1. Brandes, U., Gaertler, M., Wagner, D.: Engineering graph clustering: Models and
experimental evaluation. J. Exp. Algorithmics 12, 1–26 (2008)
2. Danon, L., Díaz-Guilera, A., Duch, J., Arenas, A.: Comparing community structure
identification. Journal of Statistical Mechanics: Theory and Experiment 2005(09),
P09008 (2005)
3. Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors a
multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 1944–1957
(2007)
4. Van Dongen, S.: Graph clustering via a discrete uncoupling process. SIAM Journal
on Matrix Analysis and Applications 30(1), 121–141 (2008)
5. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proceedings of the National Academy of Sciences of the United States of
America 99(12), 7821–7826 (2002)
6. Good, B.H., de Montjoye, Y.A., Clauset, A.: Performance of modularity maximiza-
tion in practical contexts. Physical Review E 81(4), 046106+ (2010)
7. Gustafson, M., Lombardi, A.: Comparison and validation of community structures
in complex networks. Physica A: Statistical Mechanics and its Application 367,
559–576 (2006)
8. Kannan, R., Vempala, S., Vetta, A.: On clusterings: Good, bad and spectral. J.
ACM 51(3), 497–515 (2004)
9. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the web for
emerging cyber-communities. Comput. Netw. 31, 1481–1493 (1999)
Is There a Best Quality Metric for Graph Clusters? 59
10. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of
community structure in large social and information networks. In: WWW 2008:
Proceeding of the 17th International Conference on World Wide Web, pp. 695–704.
ACM, New York (2008)
11. Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for
network community detection. In: Proceedings of the 19th International Conference
on World Wide Web, WWW 2010, pp. 631–640. ACM, New York (2010)
12. Nepusz, T., Bazso, F.: Likelihood-based clustering of directed graphs, pp. 189–194
(March 2007)
13. Nepusz, T., Sasidharan, R., Paccanaro, A.: Scps: a fast implementation of a spectral
method for detecting protein families on a genome-wide scale. BMC Bioinformat-
ics 11(1), 120 (2010)
14. Newman, M.E., Girvan, M.: Finding and evaluating community structure in net-
works. Physical Review E 69(2) (February 2004)
15. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67(2), 026126 (2003)
16. Newman, M.E.J., Girvan, M.: Mixing Patterns and Community Structure in Net-
works. In: Pastor-Satorras, R., Rubi, M., Diaz-Guilera, A. (eds.) Statistical Me-
chanics of Complex Networks. Lecture Notes in Physics, vol. 625, pp. 66–87.
Springer, Berlin (2003)
17. Schaeffer, S.E.: Graph clustering. Computer Science Review 1(1), 27–64 (2007)
18. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern
Anal. Mach. Intell. 22, 888–905 (2000)
19. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering tech-
niques. In: Grobelnik, M., Mladenic, D., Milic-Frayling, N. (eds.) KDD-2000 Work-
shop on Text Mining, Boston, MA, August 20, pp. 109–111 (2000)
20. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure
for association patterns. In: KDD 2002: Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 32–41.
ACM, New York (2002)
21. Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-
Wesley Longman Publishing Co., Inc., Boston (2005)
22. van Dongen, S.M.: Graph Clustering by Flow Simulation. PhD thesis, University
of Utrecht, The Netherlands (2000)
23. Zachary, W.W.: An information flow model for conflict and fission in small groups.
Journal of Anthropological Research 33, 452–473 (1977)
24. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute
similarities. Proc. VLDB Endow. 2(1), 718–729 (2009)
Adaptive Boosting for Transfer Learning Using
Dynamic Updates
1 Introduction
Transfer learning methods have recently gained a great deal of attention in the
machine learning community and are used to improve classification of one dataset
(referred to as the target set) via training on a similar and possibly larger auxiliary
dataset (referred to as the source set). Such knowledge transfer can be gained by
integrating relevant source samples into the training model or by mapping the
source set training models to the target models. The knowledge assembled can be
transferred across domain tasks and domain distributions with the assumption
that they are mutually relevant, related, and similar. One of the challenges of
transfer learning is that it does not guarantee an improvement in classification
since an improper source domain can induce negative learning and degradation in
the classifier’s performance. Pan and Yang [13] presented a comprehensive
survey of transfer learning methods and discussed the relationship between trans-
fer learning and other related machine learning techniques. Methods for trans-
fer learning include an adaptation of Gaussian processes to the transfer learning
scheme via similarity estimation between source and target tasks [2]. A SVM
framework was proposed by Wu and Dietterich [18] where scarcity of target data is
offset by abundant low-quality source data. Pan, Kwok, and Yang [11] learned a
low-dimensional space to reduce the distribution difference between source
and target domains by exploiting Borgwardt’s Maximum Mean Discrepancy Em-
bedding (MMDE) method [1], which was originally designed for dimensionality
reduction. Pan et al. [12] proposed a more efficient feature-extraction algorithm,
known as Transfer Component Analysis (TCA), to overcome the computationally
expensive cost of MMDE. Several boosting-based algorithms have been modified
for transfer learning and will be more rigorously analyzed in this paper. The rest of
the paper is organized as follows: In Section 2, we discuss boosting-based transfer
learning methods and highlight their main weaknesses. In Section 3, we describe
our algorithm and provide its theoretical analysis. In Section 4, we provide an
empirical analysis of our theorems. Our experimental results along with related
discussion are given in Section 5. Section 6 concludes our work.
Notation   Description
X          feature space, X ∈ R^d
Y          label space = {−1, 1}
d          number of features
F          mapping function X → Y
D          domain
src        source (auxiliary) instances
tar        target instances
ε^t        classifier error at boosting iteration t
w          weight vector
N          number of iterations
n          number of source instances
m          number of target instances
t          index for boosting iteration
f̂_t        weak classifier at boosting iteration t
1I         indicator function
TrAdaBoost [5] is the first and most popular transfer learning method that
uses boosting as a best-fit inductive transfer learner. As outlined in Algorithm
1, TrAdaBoost trains the base classifier on the weighted source and target sets
in an iterative manner. After every boosting iteration, the weights of misclassi-
fied target instances are increased and the weights of correctly classified target
instances are decreased. This target update mechanism is based solely on the
training error calculated on the normalized weights of the target set and uses a
strategy adapted from the classical AdaBoost [8] algorithm. The Weighted Ma-
jority Algorithm (WMA) [10] is used to adjust the weights of the source set by
iteratively decreasing the weight of misclassified source instances by a constant
factor, set according to [10], and preserving the current weights of correctly clas-
sified source instances. The basic idea is that the weights of source instances that
are not correctly classified on a consistent basis would converge to zero by iteration
N/2 and would not be used in the final classifier's output, since that classifier only
uses boosting iterations N/2 → N.
Algorithm 1. TrAdaBoost
Require: Source and target instances D = {(x_src_i, y_src_i)} ∪ {(x_tar_i, y_tar_i)}, maximum number of iterations (N), base learning algorithm (f̂)
Ensure: Weak classifiers for boosting iterations N/2 → N
Procedure:
1: for t = 1 to N do
2:   Find the candidate weak learner f̂_t : X → Y that minimizes error for D
3:   Update source weights via WMA to decrease weights of misclassified instances
4:   Update target weights via AdaBoost using target error rate ε_tar^t
5:   Normalize weights for D
6: end for
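The following is a minimal, illustrative sketch of Algorithm 1 (not the authors' implementation): scikit-learn decision stumps as the base learner, labels in {−1, +1}, and the equivalent weighted-vote form of the final classifier over iterations N/2 → N:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost(Xs, ys, Xt, yt, N=20):
    """Sketch of Algorithm 1 with decision stumps as the weak learner
    and labels in {-1, +1}."""
    n, m = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    w = np.ones(n + m)
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / N))  # WMA constant
    learners, betas = [], []
    for t in range(N):
        w /= w.sum()                                       # step 5: normalize
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = h.predict(X) != y
        # Target error on the normalized target weights (step 4).
        eps = np.clip(w[n:][miss[n:]].sum() / w[n:].sum(), 1e-10, 0.499)
        beta_tar = eps / (1.0 - eps)
        w[:n][miss[:n]] *= beta_src      # WMA: shrink misclassified source
        w[n:][miss[n:]] /= beta_tar      # AdaBoost: grow misclassified target
        learners.append(h)
        betas.append(beta_tar)
    def classify(Xq):
        # Weighted-vote form of the final output over iterations N/2 .. N.
        votes = sum(-np.log(b) * h.predict(Xq)
                    for h, b in zip(learners[N // 2:], betas[N // 2:]))
        return np.sign(votes)
    return classify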
3 Proposed Algorithm
We will theoretically and empirically demonstrate the cause of early convergence
in TrAdaBoost and highlight the factors that drive it. We will incorporate an
adaptive “Correction Factor” in our proposed algorithm, Dynamic-TrAdaBoost,
to overcome some of the problems discussed in the previous section.
Algorithm 2. Dynamic-TrAdaBoost
Require:
• Source domain instances D_src = {(x_src_i, y_src_i)}
• Target domain instances D_tar = {(x_tar_i, y_tar_i)}
• Maximum number of iterations: N
• Base learner: f̂
Ensure: Target classifier output f : X → Y:

f(x) = sign( Π_{t=N/2}^{N} (β_tar^t)^{−f̂_t(x)} − Π_{t=N/2}^{N} (β_tar^t)^{−1/2} )

Procedure:
1: Initialize the weight vector w = {w_src ∪ w_tar} over D = {D_src ∪ D_tar}, where w_src = (w_src^1, . . . , w_src^n) and w_tar = (w_tar^1, . . . , w_tar^m)
2: Set β_src = 1 / (1 + √(2 ln(n)/N))
3: for t = 1 to N do
4:   Normalize weights: w ← w / (Σ_{i=1}^{n} w_src^i + Σ_{j=1}^{m} w_tar^j)
5:   Find the candidate weak learner f̂_t : X → Y that minimizes error for D weighted according to w
6:   Calculate the error of f̂_t on D_tar: ε_tar^t = (Σ_{j=1}^{m} w_tar^j · 1I[y_tar_j ≠ f̂_t(x_tar_j)]) / Σ_{j=1}^{m} w_tar^j
7:   Set β_tar^t = ε_tar^t / (1 − ε_tar^t)
8:   C^t = 2(1 − ε_tar^t)
9:   w_src_i^{t+1} = C^t · w_src_i^t · β_src^{1I[y_src_i ≠ f̂_t(x_src_i)]}, where i ∈ D_src
10:  w_tar_i^{t+1} = w_tar_i^t · (β_tar^t)^{−1I[y_tar_i ≠ f̂_t(x_tar_i)]}, where i ∈ D_tar
11: end for
We will refer to the cost (C^t) on line 8 as the “Correction Factor” and prove that it
addresses the source instances' rapid weight convergence, which will be termed
“Weight Drift”.
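Relative to the TrAdaBoost sketch given after Algorithm 1, the dynamic update amounts to one extra multiplicative factor on the source weights (lines 8 and 9 of Algorithm 2); a sketch:

# Inside the boosting loop of the earlier TrAdaBoost sketch, replace the
# source-weight update with the corrected version of Algorithm 2:
C_t = 2.0 * (1.0 - eps)        # line 8: correction factor C^t
w[:n] *= C_t                   # re-inflate all source weights ...
w[:n][miss[:n]] *= beta_src    # ... before the usual WMA shrink (line 9)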
This assumption (Axiom 1, which takes the source error rate ε_src^t to be zero,
i.e., all source instances are classified correctly) will not statistically hold true for
real datasets. It allows us to ignore the stochastic difference in the classifiers' error
rates at individual boosting iterations, so that we can calculate a “Correction Factor”
value for any boosting iteration. It will be demonstrated later (Theorem 5 and the
subsequent analysis) that there is an inverse correlation between Axiom 1 and
the impact of the “Correction Factor”: the impact of the “Correction Factor”
approaches unity (no correction needed) as the source error (ε_src^t) increases and
the assumption in Axiom 1 starts to break down.
With all source instances classified correctly, the source weights would not change
under the WMA:

w^{t+1}_{src} = \frac{w^{t}_{src}}{\sum_{i=1}^{n} w^{t}_{src_i}} = w^{t}_{src}

TrAdaBoost, on the other hand, updates the same source weights as:

w^{t+1}_{src} = \frac{w^{t}_{src}}{\sum_{i=1}^{n} w^{t}_{src_i} + \sum_{j=1}^{m} w^{t}_{tar_j} \left(\frac{1-\varepsilon^{t}_{tar}}{\varepsilon^{t}_{tar}}\right)^{\mathbb{1}[y_{tar_j} \neq \hat{f}_t(x_{tar_j})]}}
This indicates that in TrAdaBoost, all source weights converge by a factor
in direct correlation with the value of:

\sum_{j=1}^{m} w^{t}_{tar_j} \left(\frac{1-\varepsilon^{t}_{tar}}{\varepsilon^{t}_{tar}}\right)^{\mathbb{1}[y_{tar_j} \neq \hat{f}_t(x_{tar_j})]}
Now that the cause of the quick convergence of source instances has been examined,
the factors that bound this convergence and make it appear stochastic will be ana-
lyzed. It is important to investigate these bounds as they have significant impact
when trying to understand the factors that control the rate of convergence. These
factors will reveal how different datasets and classifiers influence that rate.
The minimum value of the updated source weight is attained as:

\min_{m,n,\varepsilon^{t}_{tar}} w^{t+1}_{src} = \frac{w^{t}_{src}}{\max_{m,n,\varepsilon^{t}_{tar}} \left( \sum_{i=1}^{n} w^{t}_{src_i} + \sum_{j=1}^{m} w^{t}_{tar_j} \left( \frac{1-\varepsilon^{t}_{tar}}{\varepsilon^{t}_{tar}} \right)^{\mathbb{1}[y_{tar_j} \neq \hat{f}_t(x_{tar_j})]} \right)}

This equation shows that the rate of convergence can be maximized as:
1. ε_tar^t → 0.
2. m/n → ∞.
It should be noted that the absolute value of m also indirectly bounds ε_tar^t, since
m^{−1} ≤ ε_tar^t < 0.5.
Theorem 2 illustrates that a fixed cost cannot control the convergence rate since
the cumulative effect of m, n, and εttar changes at every iteration. A new term has
to be calculated at every boosting iteration to compensate for “Weight Drift”.
Expanding the TrAdaBoost denominator under the uniform-weight approximation gives:

w^{t+1}_{src} = \frac{w^{t}_{src}}{\sum_{i=1}^{n} w^{t}_{src_i} + \sum_{j=1}^{m} w^{t}_{tar_j} \left( \frac{1-\varepsilon^{t}_{tar}}{\varepsilon^{t}_{tar}} \right)^{\mathbb{1}[y_{tar_j} \neq \hat{f}_t(x_{tar_j})]}} = \frac{w^{t}_{src}}{n w^{t}_{src} + A + B}

where A = m w^{t}_{tar}(1-\varepsilon^{t}_{tar}) is the weight mass of the correctly classified target instances and B = m w^{t}_{tar}\,\varepsilon^{t}_{tar} \cdot \frac{1-\varepsilon^{t}_{tar}}{\varepsilon^{t}_{tar}} = m w^{t}_{tar}(1-\varepsilon^{t}_{tar}) is the boosted mass of the misclassified ones. Substituting for A and B simplifies the source update of TrAdaBoost to:

w^{t+1}_{src} = \frac{w^{t}_{src}}{n w^{t}_{src} + 2 m w^{t}_{tar}(1-\varepsilon^{t}_{tar})}
The target weights, in turn, are divided by the same mass A + B when AdaBoost normalizes over the target set alone (where m w^{t}_{tar} = 1):

\sum_{j=1}^{m} w^{t}_{tar_j} \left( \frac{1-\varepsilon^{t}_{tar}}{\varepsilon^{t}_{tar}} \right)^{\mathbb{1}[y_{tar_j} \neq \hat{f}_t(x_{tar_j})]} = A + B

w^{t+1}_{tar} = \frac{w^{t}_{tar}}{A + B} = \frac{w^{t}_{tar}}{2 m w^{t}_{tar}(1-\varepsilon^{t}_{tar})} = \frac{w^{t}_{tar}}{2(1)(1-\varepsilon^{t}_{tar})}
Applying the “Correction Factor” to the source instances' weight update equates
the target instances' weight update mechanism of Dynamic-TrAdaBoost to that of
AdaBoost, since:

w^{t+1}_{tar} = \frac{w^{t}_{tar}}{C^{t}\, n w^{t}_{src} + 2 m w^{t}_{tar}(1-\varepsilon^{t}_{tar})}
= \frac{w^{t}_{tar}}{2(1-\varepsilon^{t}_{tar})\, n w^{t}_{src} + 2 m w^{t}_{tar}(1-\varepsilon^{t}_{tar})}
= \frac{w^{t}_{tar}}{2(1-\varepsilon^{t}_{tar})(n w^{t}_{src} + m w^{t}_{tar})}
= \frac{w^{t}_{tar}}{2(1-\varepsilon^{t}_{tar})(1)}
Theorem 5: The assumption in Axiom 1 can be approximated, \sum_{i=1}^{n} w^{t+1}_{src_i} \approx \sum_{i=1}^{n} w^{t}_{src_i}, regardless of ε_src^t, by increasing the number of boosting iterations (N):

\sum_{i=1}^{n} w^{t+1}_{src_i} = n w^{t}_{src}(1-\varepsilon^{t}_{src}) + n w^{t}_{src}(\varepsilon^{t}_{src})\, \beta_{src}^{\mathbb{1}[y_{src_i} \neq \hat{f}_t(x_{src_i})]}
= n w^{t}_{src}(1-\varepsilon^{t}_{src}) + n w^{t}_{src}(\varepsilon^{t}_{src})\, \beta_{src} \quad (\text{since } \mathbb{1}[y_{src_j} \neq \hat{f}_t(x_{src_j})] = 1 \text{ for the misclassified mass})
= n w^{t}_{src} \left( 1 - \varepsilon^{t}_{src}\, \frac{\sqrt{2\ln(n)/N}}{1+\sqrt{2\ln(n)/N}} \right) \quad \left(\text{since } \beta_{src} = \frac{1}{1+\sqrt{2\ln(n)/N}}\right),

which approaches \sum_{i=1}^{n} w^{t}_{src_i} = n w^{t}_{src} as N grows.
4 Empirical Analysis
4.1 “Weight Drift” and “Correction Factor” (Theorems 1, 2, 3, 5)
A simulation is used to demonstrate the effect of “Weight Drift” on source and
target weights. In Figure 1(a), the number of instances was held constant (n =
10000, m = 200) and the source error rate was set to zero as per Axiom 1.
According to the WMA, the weights should not change, w_src^{t+1} = w_src^t, since
ε_src^t = 0. The ratio of the weights of TrAdaBoost to those of the WMA was
plotted at different boosting iterations and with different target error rates
ε_tar^t ∈ {0.1, 0.2, 0.3, 0.4}. The simulation validates the preceding theorems.
The figure also demonstrates that for N = 30 and a weak learner with ε_tar^t ≈ 0.1,
TrAdaBoost would not be able to benefit from all 10,000 source instances even
though they were never misclassified. The final classifier uses boosting
iterations N/2 → N, i.e., 15 → 30, by which point the source instances' weights would
have already converged to zero. Dynamic-TrAdaBoost conserves these instances'
weights in order to utilize them for classifying the output label.
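An expected-value sketch of this simulation (scalar weight masses under the uniform-weight approximation rather than per-instance weights; parameter values follow the text):

# "Weight Drift" under Axiom 1 (eps_src = 0): the WMA alone would leave a
# source weight unchanged, so `ratio` tracks the TrAdaBoost-to-WMA weight
# ratio plotted in Figure 1(a). The expected target boost per round is
# 2*(1 - eps_tar), following the normalizer derived above.
n, m, N = 10000, 200, 30
for eps in (0.1, 0.2, 0.3, 0.4):
    src_mass, tar_mass, ratio = n / (n + m), m / (n + m), 1.0
    for t in range(N):
        tar_mass *= 2 * (1 - eps)          # expected target mass growth
        total = src_mass + tar_mass        # normalization constant
        src_mass, tar_mass = src_mass / total, tar_mass / total
        ratio /= total                     # per-round shrink of a source weight
    print(f"eps_tar={eps}: source weight ratio after {N} rounds = {ratio:.2e}")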
Fig. 1. The ratio of a correctly classified source weight for TrAdaBoost/WMA: (a) over 20 boosting iterations with different target error rates; (b) after a single iteration with different numbers of target instances and target error rates.
assumption made in Axiom 1. The following experiments examine how this term can be minimized:
The first experiment analyzed the effects of N and n on the sum of source
weights. The source error rate (ε_src^t) was set to 0.2, while the number of source
instances (n) varied from 100 to 10,100 and N ∈ {20, 40, 60, 80}. The plot in
Figure 2(a) demonstrates that the number of source instances (n) has little im-
pact on the total sum of source weights while N has more significance. This is
expected since the logarithmic value of n is already small and increases logarith-
mically with the increase in the number of source instances.
The second experiment considered the effects of N and ε_src^t on the sum of
source weights. The number of source instances (n) was set to 1000, with ε_src^t ∈
{0.05, . . . , 0.5} and N ∈ {20, 40, 60, 80}. It can be observed in Figure 2(b) that
the error rate does have a significant effect on decreasing the total source weight at
iteration t + 1. This effect can be only partially offset by increasing N, and a large
value of N would be required for a reasonable adjustment. However, this problem
is negated by the fact that the impact of the correction factor itself shrinks at high
error rates, as noted in observation 3 below.
Fig. 2. The ratio of a correctly classified source weight between iterations t + 1 and t: (a) for different numbers of source training instances (n) and boosting iterations (N); (b) for different source error rates ε_src^t and boosting iterations (N).
1. The number of source instances (n) has a negligible impact on the sum of
source weights, as it enters only logarithmically.
2. The number of boosting iterations (N) has a significant impact on the sum of
source weights and can be used to strengthen the assumption in Axiom 1.
3. High source error rates, ε_src^t → 0.5, weaken the assumption in Axiom 1, but
this is mitigated by the fact that the impact of the correction factor is reduced
at high error rates, since it approaches unity (no correction):

\lim_{\varepsilon^{t}_{tar} \to 0.5} C^{t} = \lim_{\varepsilon^{t}_{tar} \to 0.5} 2(1-\varepsilon^{t}_{tar}) \approx \lim_{\varepsilon^{t}_{src} \to 0.5} 2(1-\varepsilon^{t}_{src}) = 1
(Figure: classification accuracy of Dynamic-TrAdaBoost, Fixed-TrAdaBoost, and TrAdaBoost on three datasets, as the ratio of target to source instances varies from 0.0% to 10.0%.)
This extension can dynamically speed up the weight convergence of the labels
that exhibit low error rates and slow it for labels that exhibit high error
rates. A value cost ∈ R was also included to allow the user to set the emphasis
placed on balancing the labels' error rates. This cost value controls the steepness
of the convergence rate for a given label as mandated by that label's error rate
(ε_tar,label^t).
6 Conclusion
We investigated boosting-based transfer learning methods and analyzed their
main weaknesses. We proposed an algorithm with an integrated dynamic cost
to resolve a major issue in the most popular boosting-based transfer algorithm,
TrAdaBoost. This issue causes source instances to converge before they can be
used for transfer learning. We theoretically and empirically demonstrated the
cause and effect of this rapid convergence and validated that the addition of our
dynamic cost improves classification on several popular transfer learning datasets.
In the future, we will explore the possibility of using multi-resolution boosted
models [15] in the context of transfer learning.
References
1. Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schölkopf, B., Smola,
A.J.: Integrating structured biological data by kernel maximum mean discrepancy.
Bioinformatics 22(14), e49–e57 (2006)
2. Cao, B., Pan, S.J., Zhang, Y., Yeung, D., Yang, Q.: Adaptive transfer learning. In:
Proceedings of the AAAI Conference on Artificial Intelligence, pp. 407–412 (2010)
3. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: SMOTEBoost: Improving
prediction of the minority class in boosting. In: Lavrač, N., Gamberger, D., Todor-
ovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119.
Springer, Heidelberg (2003)
Adaptive Boosting for Transfer Learning Using Dynamic Updates 75
4. Dai, W., Xue, G.R., Yang, Q., Yu, Y.: Co-clustering based classification for out-
of-domain documents. In: Proceedings of the 13th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 210–219 (2007)
5. Dai, W., Yang, Q., Xue, G.R., Yu, Y.: Boosting for transfer learning. In: Proceed-
ings of the International Conference on Machine Learning, pp. 193–200 (2007)
6. Eaton, E., desJardins, M.: Set-based boosting for instance-level transfer. In: Pro-
ceedings of the 2009 IEEE International Conference on Data Mining Workshops,
pp. 422–428 (2009)
7. Eaton, E.: Selective Knowledge Transfer for Machine Learning. Ph.D. thesis, Uni-
versity of Maryland Baltimore County (2009)
8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. In: Proceedings of the Second European Conference
on Computational Learning Theory, pp. 23–37 (1995)
9. Lang, K.: Newsweeder: Learning to filter netnews. In: Proceedings of the 12th
International Machine Learning Conference, pp. 331–339 (1995)
10. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. In: Proceedings
of the 30th Annual Symposium on Foundations of Computer Science, pp. 256–261
(1989)
11. Pan, S.J., Kwok, J.T., Yang, Q.: Transfer learning via dimensionality reduction.
In: Proceedings of the National Conference on Artificial Intelligence, pp. 677–682
(2008)
12. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer
component analysis. In: Proceedings of the 21st International Joint Conference on
Artificial Intelligence, pp. 1187–1192 (2009)
13. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowl-
edge and Data Engineering 22(10), 1345–1359 (2010)
14. Pardoe, D., Stone, P.: Boosting for regression transfer. In: Proceedings of the 27th
International Conference on Machine Learning, pp. 863–870 (2010)
15. Reddy, C.K., Park, J.H.: Multi-resolution boosting for classification and regression
problems. Knowledge and Information Systems (2011)
16. Sun, Y., Kamel, M.S., Wong, A.K., Wang, Y.: Cost-sensitive boosting for classifi-
cation of imbalanced data. Pattern Recognition 40(12), 3358–3378 (2007)
17. Venkatesan, A., Krishnan, N., Panchanathan, S.: Cost-sensitive boosting for con-
cept drift. In: Proceedings of the 2010 International Workshop on Handling Con-
cept Drift in Adaptive Information Systems (2010)
18. Wu, P., Dietterich, T.G.: Improving svm accuracy by training on auxiliary data
sources. In: Proceedings of the Twenty-First International Conference on Machine
Learning, pp. 871–878 (2004)
19. Yao, Y., Doretto, G.: Boosting for transfer learning with multiple sources. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 1855–1862 (2010)
Peer and Authority Pressure in
Information-Propagation Models
1 Introduction
Fig. 1. Influence graph of a network consisting of (a) peer nodes only and (b) peer and authority nodes
example, mass media can strongly affect the opinions of individuals, whereas the
influence of any one individual on the mass media is most likely infinitesimal.
This distinction is often modeled with the use of edge weights.
In this paper we focus on this distinction and we apply a simple way to model
it: we posit that some agents are authorities (such as the mass media). These
nodes have high visibility in the network and they typically influence a large
number of non-authority nodes. We call these latter nodes the peer nodes. Peers
freely exchange information among themselves and therefore there are influence
links between them. Peers have no influence on authorities. That is, an influence
link that joins an authority to a peer is unidirectional from the authority to the
peer. In our model, we also ignore the influence that one authority node might
have on another. At a global level, the network of authorities and peers looks as
in Figure 1(b). That is, peers and authorities are clustered among themselves,
and there are only directed one-way links from authorities to peers.
The existence of authority nodes allows us to incorporate the authority influ-
ence (or pressure) into the classic information-propagation models. Given a net-
work of peers and authorities and the observed propagation patterns of different
information items (e.g., products, trends, or fads) our goal is to develop a frame-
work that allows us to categorize the items as authority- or peer-propagated.
To do so, we define a model that associates every propagated item with two
parameters that quantify the effect that authority and peer pressure has played
on the item’s propagation. Given data about the adoption of the item by the
nodes of a network we develop a maximum-likelihood framework for learning
the parameters of the item and use them to characterize the nature of its prop-
agation. Furthermore, we develop a randomization test, which we call the time-
shuffle test. This test allows us to evaluate the statistical significance of our
findings and increase our confidence that our findings are not a result of noise in
the input data. Our extensive experiments on real data from online media and
collaboration networks reveal the following interesting finding. In online social-
media networks, where the propagated items are news memes, there is evidence
of authority-based propagation. On the other hand, in a collaboration network
of scientists, where the items that propagate are themes, there is evidence that
peer influence governs the observed propagation patterns to a stronger degree
than authority pressure.
78 A. Anagnostopoulos, G. Brova, and E. Terzi
The main contribution of this paper lies in the introduction of authority pres-
sure as a part of the information-propagation process. Quantifying the effect that
peer and authority pressure plays in the diffusion of information items will give
us a better understanding of the underpinnings of viral markets. At the same
time, our proposed methodology will allow for the development of new types
of recommendation systems, advertisement strategies and election campaigns.
For example, authority nodes are better advertisement targets for authority-
propagated products; election campaign slogans, on the other hand, might gain
popularity due to the network effect and should be advertised accordingly.
Roadmap: The rest of the paper is organized as follows: In Section 2 we give
a brief overview of the related work. Sections 3 and 4 give an overview of our
methodology for incorporating authority pressure into the peer models. We show
an extensive experimental evaluation of our framework in Section 5 and we
conclude the paper in Section 6.
2 Related Work
Despite the large amount of work on peer models and on identification of au-
thority nodes, to our knowledge, our work is the first attempt to combine peer
and authority pressure into a single information-propagation model. Also, con-
trary to the goal of identifying authority nodes, our goal is to classify propagated
trends as being peer- or authority-propagated.
One of the first models to capture peer influence was by Bass [4], who de-
fined a simple model for product adoption. While the model does not take into
account the network structure, it manages to capture some commonly-observed
phenomena, such as the existence of a “tipping point.” More recent models such
as the linear-threshold model [9, 10] or the cascade model [10] introduce the de-
pendence of influence on the set of peers, and since then there has been a large
number of generalizations.
In a series of papers based on the analysis of medical data and offline social
networks Christakis, Fowler, and colleagues showed the existence of peer influ-
ence on social behavior and emotions, such as obesity, alcoholism, happiness,
depression, loneliness [7, 5, 14]. An important characteristic in these analyses is
the performance of statistical tests through modifying the social graph to provide
evidence for peer influence. It was found that in general influence can extend up
to three degrees of separation. Around the same time, Anagnostopoulos et al. [2]
and Aral et al. [3] provided evidence that a lot of the correlated behavior among
peers can be attributed to other factors such as homophily, the tendency of indi-
viduals to associate and form ties with similar others. The time-shuffle test that
we apply later is a randomization test used in [2] to rule out influence effects
from peers. Although clearly related, the above work is only complementary to
ours: none of the above papers considers authorities as a factor that determines
the propagation of information.
Recently, there have been many studies related to the spreading of ideas, news
and opinions in the blogosphere. Authors refer to all these propagated items as
memes. Gomez-Rodriguez et al. [8] try to infer who influences whom based on the
time information over a large set of different memes. Contrary to our work where
the underlying network is part of the input, Gomez-Rodrigues et al. assume that
the network structure is unknown. In fact, their goal is to discover this hidden
network and the key assumption of the method is that a node only gets influenced
by its neighbors. Therefore, they do not account for authority influence.
More recently, Yang and Leskovec [16] applied a nonparametric modeling
approach to learn the direct or indirect influence of a set of nodes (e.g. news
sites) to other blogs or tweets. Although one can consider the discovered set of
nodes as authority nodes, the model of Yang and Leskovec does not take into
account the network of peers. Our work is mostly focused on the interaction and
the separation of peer and authority influence within a social-network ecosystem.
Recent work by Wu et al. [15] focuses on classifying twitter users as “elite”
and “ordinary”; elite users (e.g., celebrities, media sources, or organizations) are
those with large influence on the rest of the users. Exploiting the twitter-data
characteristics the authors discover that a very small fraction of the popula-
tion (0.05%) is responsible for the generation of half of the content in twitter.
Although related, the focus of our paper is rather different: our goal is not to
identify the authorities and the peers of the network. Rather, we want to clas-
sify the trends as those that are being authority-propagated versus those being
peer-propagated.
Related in spirit is also the work of Amatriain et al. [1]; their setting and
their techniques, however, are entirely different than ours: they consider the
problem of collaborative filtering and they compare the information obtained by
consulting “experts” as opposed to “nearest neighbors” (i.e., nodes similar to the
node under consideration). The motivation for that work is the fact that data
on nearest neighbors is often sparse and noisy, as opposed to the more global
information of experts.
3 Model
Fashion trends, news items or research ideas propagate amongst peers and
authorities. We collectively refer to all the propagated trends as information
items (or simply items). We call the nodes (peers or authorities) that have
adopted a particular item active and the nodes that have not adopted the same
item as inactive.
We assume that the propagation of every item happens in discrete time steps;
we assume that we have a limited observation period from timestamp 1 to times-
tamp T . At every point in time t ∈ {1, . . . , T }, each inactive node u decides
whether to become active. The probability that an inactive node u becomes ac-
tive is a function P (x, y) of the number x of peers that can influence u that are
already active and the number y of active authorities. In principle, function P
can be any function that is increasing in both x and y. As we will see in the next
section, we will focus on a simple function that fits our purposes.
4 Methodology
In this section, we present our methodology for measuring peer and authority
pressure in information propagation. Based on that we offer a characterization
of trends as peer- or authority-propagated trends. Peer-propagated trends are
those whose observed propagation patterns can be largely explained due to peer
pressure. Authority-propagated trends are those that have been spread mostly
due to authority influence.
We start in Section 4.1 by explaining how logistic regression can be used to
quantify the extent of peer and authority pressure. In Section 4.2 we define a
randomization test that we use in order to quantify the statistical significance
of the logistic regression results.
P(x, y) = \frac{e^{\alpha \ln(x+1) + \beta \ln(y+1) + \gamma}}{1 + e^{\alpha \ln(x+1) + \beta \ln(y+1) + \gamma}}, \qquad (1)
where α, β and γ are the coefficients of the logistic function. The values of α
and β capture respectively the strength of peer and authority pressure in the
propagation of item i. More specifically α, β take values in R. Large values of α
provide evidence for peer influence in the propagation of item i. Large values of
β provide evidence for authority influence in the propagation of i. For every item
i, we call α the peer coefficient and β the authority coefficient of i. Parameter
γ models the impact of factors other than peer and authority pressure in the
propagation of the item. For example, the effect of random chance is encoded in the
value of γ.
While in general there is no closed form solution for the above maximum likeli-
hood estimation problem, there are many software packages that can solve such
a problem quite efficiently. For our experiments, we have used Matlab’s statistics
toolbox.
We apply this analysis to every propagated item and thus obtain the maximum-
likelihood estimates of the peer and authority coefficients for each one of them.
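As a concrete illustration, the following sketch fits the coefficients of Equation (1) for a single item by maximum likelihood, using scikit-learn in place of MATLAB's statistics toolbox; the data layout is an assumption:

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_item_coefficients(exposures, outcomes):
    """Fit alpha, beta, gamma of Equation (1) for one item by maximum
    likelihood. `exposures` holds one (x, y) pair per observation of an
    inactive node (active influencing peers, active authorities), and
    `outcomes` is 1 if that node became active at the next time step.
    penalty=None disables regularization (scikit-learn >= 1.2)."""
    X = np.log1p(np.asarray(exposures, dtype=float))  # [ln(x+1), ln(y+1)]
    model = LogisticRegression(penalty=None, solver="lbfgs").fit(X, outcomes)
    alpha, beta = model.coef_[0]
    return alpha, beta, model.intercept_[0]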
estimation on input (H, D′). Let 𝒟 be the set of all possible randomized versions
that can be created from the input dataset D via the time-shuffle test. Then we
define the strength of peer influence S_α to be the fraction of randomized datasets
D′ ∈ 𝒟 for which α > α(D′), namely,

S_α = Pr_{D′}(α > α(D′)).   (4)

Note that the probability is taken over all possible randomized versions D′ ∈ 𝒟
of the original dataset D. Similarly, we define the strength of authority influence
S_β to be the fraction of randomized datasets D′ for which β > β(D′), namely,

S_β = Pr_{D′}(β > β(D′)).   (5)
Both the peer and the authority strengths take values in [0, 1]; the larger the
value of the peer (resp. authority) strength the stronger the evidence of peer
influence (resp. authority influence) in the data.
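In practice the strengths are estimated over a finite sample of shuffled datasets; a Monte-Carlo sketch, where fit_fn and dataset.time_shuffle() are assumed helpers (the latter permuting activation timestamps as in [2]):

def peer_strength(alpha, fit_fn, dataset, shuffles=100):
    """Monte-Carlo estimate of S_alpha (Equation 4): the fraction of
    time-shuffled datasets whose refitted peer coefficient stays below
    the coefficient alpha fitted on the real data. `fit_fn` refits
    Equation (1) and `dataset.time_shuffle()` is an assumed helper that
    returns a copy with activation times randomly permuted."""
    wins = sum(alpha > fit_fn(dataset.time_shuffle())[0]
               for _ in range(shuffles))
    return wins / shuffles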
5 Experimental Results
In this section, we present our experimental evaluation both on real and synthetic
data. Our results on real data coming from online social media and computer-
science collaboration networks reveal the following interesting findings: In on-
line social-media networks the fit of our model indicates a stronger presence of
authority pressure, as opposed to the scientific collaboration network that we
examine. Our results on synthetically-generated data show that our methods
recover the authority and peer coefficients accurately and efficiently.
sites, 13,095 news sites and 124,694 directed links between the blogs (edges).
Although the dataset contains 71,568 memes, their occurrence follows a power-
law distribution and many memes occur very infrequently. In our experiments,
we only use the set of the 100 most frequently appearing memes. We
denote this set of memes by MF . For every meme m ∈ MF we construct a
different extended influence graph. The set of authorities for this meme Am is
the subset of the top-50 authorities in A that have most frequently used this
particular meme.
The Bibsonomy dataset [11]. BibSonomy is an online system with which in-
dividuals can bookmark and tag publications for easy sharing and retrieval. In
this dataset, the influence graph G = (V, E) consists of peers that are scientists.
There is a link between two scientists if they have co-authored at least three
papers together. The influence links in this case are bidirectional since any of
the co-authors can influence each other. The items that propagate in the net-
work are tags associated with publications. A node is active with respect to a
particular tag if at least one of the node’s publications has been associated with
the tag. For a given tag t, the set of authorities associated with this tag, At ,
are the top-20 authors with the largest number of papers tagged with t. These
authors are part of the extended influence graph of tag t, but not part of the
original influence graph.
For our experiments, we have selected papers from conferences. There are a
total of 62,932 authors, 9,486 links and 229 tags. Again, we experiment with the
top-100 most frequent tags.
Implementation. Our implementation consists of two parts. For each item, we
first count the number of users who were active and inactive at each time period;
that is, we evaluate the matrices N(x, y) and N̄(x, y) in Equation (3). Then, we
run the maximum-likelihood regression to find the best estimates for α, β, and γ
in Equation (1). We ran all experiments on an AMD Opteron running at 2.4GHz.
Our unoptimized MATLAB code processes one meme from the MemeTracker
dataset in about 204 seconds. On average, the counting step requires 96% of this
total running time; the remaining 4% is the time required for the regression step.
For the Bibsonomy dataset, the average total time spent on a tag is 38 seconds.
Again, 95% of this time is spent on counting and 5% on regression.
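A sketch of the counting step, with an assumed input layout (one list of (x, y, became_active) tuples per time step, covering the nodes still inactive at that step):

from collections import defaultdict

def count_exposures(snapshots):
    """Counting step: tally, over all time steps, how many still-inactive
    nodes with x active influencing peers and y active authorities became
    active (N) or stayed inactive (N_bar). `snapshots` is assumed to yield
    one list of (x, y, became_active) tuples per time step."""
    N, N_bar = defaultdict(int), defaultdict(int)
    for step in snapshots:
        for x, y, became_active in step:
            (N if became_active else N_bar)[(x, y)] += 1
    return N, N_bar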
The goal of our first experiment is to demonstrate that the integration of au-
thority influence in the information-propagation models can help to explain ob-
served phenomena that peer models had left unexplained. For this we use the
MemeTracker and the Bibsonomy datasets to learn the parameters α, β and γ
for each of the propagated items. At the same time, we use the peer-only version
of our model by setting β = 0 and learn the parameters α and γ for each one
of the propagated items. This way, the peer-only model does not attempt to
distinguish authority influence, and is similar to the models currently found in
the literature.
Fig. 2. MemeTracker dataset. Histogram of the values of the peer and externality co-
efficients α and γ recovered for the pure peer (β = 0) and the integrated peer and
authority model.
The results for the MemeTracker dataset are shown in Figure 2. More specifically, Figure 2(a) shows the histograms of the values of α recovered under each of the two models. The two histograms show that the distributions of the peer-coefficient values obtained using the two models are very similar. On the other hand, the histograms of the values of the externality coefficient obtained for the two models (shown in Figure 2(b)) are rather distinct. In this latter pair of his-
tograms, we can see that the values of the externality coefficient obtained in the
peer-only model are larger than the corresponding values we obtain using our
integrated peer and authority model. This indicates that the peer-only model
could only explain a certain portion of the observed propagation patterns asso-
ciating the unexplained patterns to random effects. The addition of an authority
parameter explains a larger portion of the observed data, attributing much less
of the observations to random factors. The results for the Bibsonomy dataset
(Figure 3) indicate the same trend.
Fig. 3. Bibsonomy dataset. Histogram of the values of the peer and externality co-
efficients α and γ recovered for the pure peer (β = 0) and the integrated peer and
authority model.
Fig. 4. MemeTracker dataset. Frequency distribution of the recovered strength of peer
and authority influence.
The above result is illustrated in Figure 4. These histograms show the number
of memes that have a particular strength of peer (Figure 4(a)) and authority
influence (Figure 4(b)). We obtain these results by estimating the strength of
peer and authority influence using 100 random instances of the influence graph
generated by the time-shuffle test (see Section 4.2). The two histograms shown
in Figures 4(a) and 4(b) indicate that for most of the memes in the dataset,
authority pressure is a stronger factor affecting their propagation compared to
peer influence. More specifically, the percentage of memes with peer strength
greater than 0.8 is only 18% while the percentage of memes with authority
strength greater than 0.8 is 45%. Also, 46% of memes have peer strength below
0.2, while only 22% of memes have authority strength below 0.2.
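The strength estimates above come from the time-shuffle test of Section 4.2, which is not reproduced in this excerpt. The sketch below therefore rests on an assumption of ours: that the strength of a coefficient is the fraction of time-shuffled instances whose recovered coefficient falls below the one recovered on the real data. The helper fit_coefficient is hypothetical.

    import random

    def influence_strength(activations, fit_coefficient, n_shuffles=100):
        """activations: list of (node, time) adoption events.
        fit_coefficient: callable returning the recovered coefficient
        (alpha or beta) for a given list of activation events.
        Assumed strength: fraction of shuffled datasets whose coefficient
        is smaller than the one fit on the real data."""
        real = fit_coefficient(activations)
        nodes = [n for n, _ in activations]
        times = [t for _, t in activations]
        smaller = 0
        for _ in range(n_shuffles):
            random.shuffle(times)             # break the temporal order
            if fit_coefficient(list(zip(nodes, times))) < real:
                smaller += 1
        return smaller / n_shuffles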
We demonstrate some anecdotal examples of peer- and authority-propagated
memes in Figure 5. The plot is a two-dimensional scatterplot of the peer and
authority strength of each one of the top-100 most frequent memes. The size of
the marker associated with each meme is proportional to the meme’s frequency.
A lot of the memes in the lower left, that is, memes with both low peer and
authority strength, are commonly seen phrases that are arguably not subject to
social influence. The memes in this category tend to be short and generic. For
example, “are you kidding me” and “of course not” were placed in this category.
The meme “life liberty and the pursuit of happiness” is also placed in the same
category. Although this last quote is not as generic as the others, it still does
not allude to any specific event or controversial political topic. The low social
correlation attributed to these memes indicates that they sporadically appear
in the graph, without relation to each other. Using an equi-depth histogram we extract the top-5 most frequent memes with low peer and low authority
strength and show them in Group 1 of Table 1.
Fig. 5. MemeTracker dataset. Peer strength (x-axis) and authority strength (y-axis) of
the top-100 most frequent memes. The size of each circle is proportional to the frequency of the meme.
Diagonally opposite, in the upper right part of the plot, are the memes with
high peer and high authority strength. These are particularly widely-spread
quotes that were pertinent to the 2008 U.S. Presidential Election, and that
frequently appeared in both online news media sites and blog posts. Examples
of memes in this category include “joe the plumber” and President Obama’s
slogan, “yes we can.” Finally, the meme “this is from the widows the orphans
and those who were killed in iraq” is also in this category. This is a reference to
the much-discussed incident where an Iraqi journalist threw a shoe at President
Bush. The top-5 most frequent memes with high peer and authority strengths
are also shown in Group 2 of Table 1. Comparing the memes in Groups 1 and
2 in Table 1, one can verify that, on average, the quotes with high peer and
high authority strength are much longer and more specific than those with low
peer and low authority strengths. As observed before, exceptions to this trend
are the presidential campaign memes “joe the plumber” and “yes we can”.
Memes with low peer and high authority strength (left upper part of the
scatterplot in Figure 5) tend to contain quotes of public figures, or refer to
events that were covered by the news media and were then referenced in blogs.
One example is “I barack hussein obama do solemnly swear,” the first line of
the inaugural oath. The inauguration was covered by the media, so the quotes
originated in news sites and the bloggers began to discuss it immediately after.
Typically, memes in this group all occur within a short period of time. In con-
trast, memes with both high peer and high authority influence are more likely to
gradually gain momentum. The top-5 most frequent memes with low peer and
high authority strength are also shown in Group 3 of Table 1.
We expect that memes with high peer and low authority strength (lower right part of the scatterplot in Figure 5) are mostly phrases that are not present in the mainstream media, but are very popular within the world of bloggers. An example of such a meme, as extracted by our analysis, is “mark my words it will not be six months before the world tests barack obama like they did john kennedy.” This is a quote by Joe Biden that generates many more high-ranked blog results than news sites on a Google search. Another example is “i think we should all be fair and balanced don’t you,” attributed to Senator Schumer in an interview on Fox News, which was not covered by mainstream media but was an active topic of discussion for bloggers. The top-5 most frequent memes with high peer and low authority strength are also shown in Group 4 of Table 1.

Table 1. MemeTracker dataset. Examples of memes with different peer and authority strengths. Bucketization was done using equi-depth histograms.

Group 1: Top-5 frequent memes with low peer and low authority strength.
1. life liberty and the pursuit of happiness
2. hi how are you doing today
3. so who are you voting for
4. are you kidding me
5. of course not

Group 2: Top-5 frequent memes with high peer and high authority strength.
1. joe the plumber
2. this is from the widows the orphans and those who were killed in iraq
3. our opponent is someone who sees america it seems as being so imperfect imperfect enough that he’s palling around with terrorists who would target their own country
4. yes we can yes we can
5. i guess a small-town mayor is sort of like a community organizer except that you have actual responsibilities

Group 3: Top-5 frequent memes with low peer and high authority strength.
1. i need to see what’s on the other side i know there’s something better down the road
2. i don’t know what to do
3. oh my god oh my god
4. how will you fix the current war on drugs in america and will there be any chance of decriminalizing marijuana
5. i barack hussein obama do solemnly swear

Group 4: Top-5 frequent memes with high peer and low authority strength.
1. we’re in this moment and if we fail to do the right thing heaven help us
2. if you know what i mean
3. what what are you talking about
4. i think we should all be fair and balanced don’t you
5. our national leaders are sending u s soldiers on a task that is from god
We now turn to the Bibsonomy dataset, which corresponds to a scientific collaboration network. One should interpret tags as research topics or themes. Our findings indicate that when choosing a research direction, scientists are more likely to be influenced by people they have collaborated with rather than by experts in their field.
The above result is illustrated in Figure 6. These histograms show the num-
ber of tags that have a particular strength of peer (Figure 6(a)) and authority
influence (Figure 6(b)). We obtain these results by estimating the strength of
peer and authority influence using 100 random dataset instances generated by
the time-shuffle test (see Section 4.2).
Fig. 6. Bibsonomy dataset. Frequency distribution of the recovered strength of peer and authority influence.
Fig. 7. Synthetic data. Relative error for the peer coefficient α and the authority
coefficient β.
To measure the accuracy of the estimation on synthetic data, we use the relative recovery error. If x is the value of the coefficient used in the generation process and x̂ is the recovered value, then the relative error is given by

RelErr(x, x̂) = |x − x̂| / x.
The relative error takes values in the range [0, ∞), where a smaller value indicates
better accuracy of the maximum-likelihood estimation method.
Figure 7 shows the relative recovery errors for different sets of values for α and
β in this simulated graph, where darker colors represent smaller relative errors.
In most cases, for both the peer and the authority coefficients, the relative error
is below 0.2 indicating that the recovered values are very close to the ones used
in the data-generation process.
Fig. 8. Synthetic data. Recovering the peer coefficient α and the authority coefficient
β for the real data and the data after time-shuffle randomization.
Time-Shuffle Test on Synthetic Data: Figures 8(a) and 8(b) show the
recovered value of peer and authority coefficient respectively, as a function of the
value of the same parameter used in the data-generation process. One can observe
that in both cases the estimated value of the parameter is very close to the input
parameter. Visually, this is represented by the proximity of the recovered curve to the y = x line.
6 Conclusions
Given the adoption patterns of network nodes with respect to a particular item,
we have proposed a model for deciding whether peer or authority pressure played
a central role in its propagation. For this, we have considered an information-
propagation model where the probability of a node adopting an item depends
on two parameters: (a) the number of the node’s neighbors that have already
adopted the item and (b) the number of authority nodes that appear to have
the item. In other words, our model extends traditional peer-propagation mod-
els with the concept of authorities that can globally influence the network. We
developed a maximum-likelihood framework for quantifying the effect of peer
and authority influence in the propagation of a particular item and we used this
framework for the analysis of real-life networks. We find that accounting for au-
thority influence helps to explain more of the signal which many previous models
classified as noise. Our experimental results indicate that different types of net-
works demonstrate different propagation patterns. The propagation of memes in
online media seems to be largely affected by authority nodes (e.g., news-media
sites). On the other hand, there is no evidence for authority pressure in the
propagation of research trends within scientific collaboration networks.
There is a set of open research questions that arise from our study. First,
various generalizations could fit in our framework: peers or authorities could
influence authorities, nodes or edges could have different weights indicating
stronger/weaker influence pressures, and so on. More importantly, while our
methods compare peer and authority influence, it would be interesting to ac-
count for selection effects [2, 3] that might affect the values of the coefficients
of our model. Such a study can give a stronger signal about the exact source
of influence in the observed data. Furthermore, in this paper we have consid-
ered the set of authority nodes to be predefined. It would be interesting to see
whether the maximum-likelihood framework we have developed can be extended
to automatically identify the authority nodes, or whether some other approach
(e.g., one based on the HITS algorithm) is required.
References
1. Amatriain, X., Lathia, N., Pujol, J.M., Kwak, H., Oliver, N.: The wisdom of the
few: A collaborative filtering approach based on expert opinions from the web. In:
SIGIR (2009)
2. Anagnostopoulos, A., Kumar, R., Mahdian, M.: Influence and correlation in social
networks. In: KDD (2008)
3. Aral, S., Muchnik, L., Sundararajan, A.: Distinguishing influence-based contagion
from homophily-driven diffusion in dynamic networks. Proceedings of the National
Academy of Sciences, PNAS 106(51) (2009)
4. Bass, F.M.: A new product growth model for consumer durables. Management
Science 15, 215–227 (1969)
5. Cacioppo, J.T., Fowler, J.H., Christakis, N.A.: Alone in the crowd: The structure
and spread of loneliness in a large social network. Journal of Personality and Social
Psychology 97(6), 977–991 (2009)
6. Christakis, N., Fowler, J.: Connected: The surprising power of our social networks
and how they shape our lives. Back Bay Books (2010)
7. Fowler, J.H., Christakis, N.A.: The dynamic spread of happiness in a large social
network: Longitudinal analysis over 20 years in the framingham heart study. British
Medical Journal 337 (2008)
8. Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and
influence. In: KDD (2010)
9. Granovetter, M.: Threshold models of collective behavior. The American Journal
of Sociology 83, 1420–1443 (1978)
10. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: KDD (2003)
11. Knowledge and Data Engineering Group, University of Kassel, Benchmark Folk-
sonomy Data from BibSonomy, Version of June 30th (2007)
12. Leskovec, J., Backstrom, L., Kleinberg, J.M.: Meme-tracking and the dynamics of
the news cycle. In: KDD (2009)
13. Onnela, J.-P., Reed-Tsochas, F.: Spontaneous emergence of social influence in on-
line systems. Proceedings of the National Academy of Sciences, PNAS (2010)
14. Rosenquist, J.N., Fowler, J.H., Christakis, N.A.: Social network determinants of
depression. Molecular Psychiatry 16(3), 273–281 (2010)
15. Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who says what to whom on
twitter. In: WWW, pp. 705–714 (2011)
16. Yang, J., Leskovec, J.: Modeling information diffusion in implicit networks. In:
ICDM (2010)
Constrained Logistic Regression for
Discriminative Pattern Mining
1 Introduction
In many real-world applications, it is often crucial to quantitatively characterize the
differences across multiple subgroups of complex data. Consider the following moti-
vating example from the biomedical domain: Healthcare experts analyze cancer data
containing various attributes describing the patients and their treatment. These experts
are interested in understanding the difference in survival behavior of the patients be-
longing to different racial groups (Caucasian-American and African-American) and in
measuring this difference across various geographical locations. The survival-behavior distributions of these two racial groups of cancer/non-cancer patients may be similar in one location but completely different in other locations. The experts would like
to simultaneously (i) model the cancer patients in each location and (ii) quantify the
differences in the racial groups across various locations. The problem goes one step
further: the eventual goal is to rank the locations based on the differences in the can-
cer cases of the two racial groups. In other words, the experts want to find the locations
where the difference in the predictive (cancer) models for the two racial groups is higher
and the locations where such difference is negligible. Depending on such information,
more health care initiatives will be organized in certain locations to reduce the racial disparities among cancer patients [22].
In this problem, the main objective is not only to classify the cancer and non-cancer
patients, but also to identify the disparities (distribution differences) in the cancer patients across multiple subpopulations (or subgroups) in the data. The traditional solutions for this research problem partially address the dissimilarity issue, but fail to provide any comprehensive technique in terms of the prediction models. It is vital to develop an integrated framework that can model the disparities and simultaneously develop a predictive model.
To handle such problems, the methods for modeling the data should go beyond opti-
mizing a standard prediction metric and should simultaneously identify and model the
differences between two multivariate data distributions. Standard predictive models in-
duced on the datasets capture the characteristics of the underlying data distribution to a
certain extent. However, the main objective of such models is to accurately predict on
the future data (from the same distribution); they will not capture the differences between two multivariate data distributions.
Our approach is to consider the change between datasets as the change in underly-
ing class distributions. Our Supervised Distribution Difference (SDD) measure defined
in Sec. 3.2 aims to detect the change in the classification criteria. The kind of distribution changes we are trying to find can be illustrated using an example. Figure 1(a) visualizes two binary datasets. Performing any univariate or multivariate distribution-difference analysis will lead to the conclusion that these two datasets are “different”, or provide us with some rules which differ in support (contrast set mining). We agree with such an analysis, but only to the extent of considering these two datasets without their class labels. When we consider these two datasets as having two classes which need to be separated as much as possible using some classification method, we conclude that these two datasets are not different in terms of
their classification criteria. A nearly similar Logistic Regression (LR) classifier (or
any other linear classification models) can be used to separate classes in both these
datasets. Thus, our criterion for detecting distribution change is the change in the classification model.
Fig. 1. (a) Two binary datasets with similar classification criteria (b) Dataset 1 with linear class
separators (c) Dataset 2 with linear class separators and (d) Base LR model with Dataset 1 and
Dataset 2 model
Typically, many near-optimal models are available for representing the datasets. By placing an accuracy threshold on the selected models, we further reduce the number of such models and simultaneously ensure that the new models are still able to classify each dataset accurately.
In this paper, we propose a new framework for constrained learning of predictive
models that can simultaneously predict and measure the differences between datasets
by enforcing some additional constraints in such a way that the induced models are as
similar as possible. Data can be represented by many forms of a predictive model, but
not all of these versions perform well in terms of their predictive ability. Each predic-
tive modeling algorithm will heuristically, geometrically, or probabilistically optimize
a specific criterion and obtain an optimal model in the model space. There are other
models in the model space that are also optimal or close to the optimal model in terms of
the specific performance metric (such as accuracy or error rate). Each of these models
will be different, yet each will be a good representation of the data as long as its predictive accuracy is close to that of the best model induced. In
our approach, we search for two such models corresponding to the two datasets under
the constraint that they must be as similar as possible. The distance between these two
models can then be used to quantify the difference between the underlying data dis-
tributions. Such constrained model building is extensively studied in the unsupervised
scenarios [3] and is relatively unexplored in the supervised cases. We chose to develop
our framework using LR models due to their popularity, simplicity, and interpretability, which are critical factors for the problem that we are dealing with in this paper.
The rest of the paper is organized as follows: Section 2 discusses the previous works
related to the problem described. Section 3 introduces the notations and concepts use-
ful for understanding the proposed algorithm. Section 4 introduces the proposed con-
strained LR framework for mining distribution changes. The experimental results on
both synthetic and real-world datasets are presented in Section 5. Finally, Section 6 concludes our discussion.
2 Related Work
In this section, we describe some of the related topics available in the literature and
highlight some of the primary contributions of our work.
(1) Dataset Distribution Differences - Despite the importance of the problem, only
a small amount of work is available in describing the differences between two data
distributions. Earlier approaches for measuring the deviation between two datasets used
simple data statistics after decomposing the feature space into smaller regions using tree
based models [22,12]. However, the final result obtained is a data-dependent measure
and do not give any understanding about the features responsible for measuring that
difference. One of the main drawbacks of such an approach is that they construct a
representation that is independent of the other dataset thus making it hard for any sort
of comparison. On the contrary, if we incorporate the knowledge of the other class
while building models for both the subgroups, they provide more information about the
similarities and dissimilarities in the distributions. This is the basic idea of our approach.
Some other statistical and probabilistic approaches [25] measure the differences in the
data distributions in an unsupervised setting without the use of class labels.
(2) Discriminative Pattern Mining - The majority of pattern-based mining methods for different, unusual statistical characteristics of the data [19] fall into the categories of contrast set mining [4,14], emerging pattern mining [8], discriminative pattern mining [10,21] and subgroup discovery [11,16]. Applying most of these methods on a given dataset with two subgroups will only give us the differences in terms of attribute-value pair combinations (or patterns) without any quantitative measures, i.e., differences of class distribution within a small region of the data, and does not provide a global view of the overall difference. In essence, though these approaches attempt to capture statistically significant rules that define the differences, they do not measure the data distribution differences and also do not provide any classification model. The above pattern mining algorithms do not take into account the change in the distribution of class labels; instead, they define the difference in terms of changes in attribute-value combinations only.
(3) Change Detection and Mining - There has been some work on change detection [17] and change mining [26,24] algorithms, which typically assume that some previous knowledge about the data is known and measure the change of the new model from a data stream. The rules that are not the same in the two models are used to indicate changes
in the dataset. These methods assume that we have a particular model/data at a given
snapshot and then measure the changes for the new snapshot. The data at the new snap-
shot will typically have some correlation with the previous snapshot in order to find any
semantic relations in the changes detected.
(4) Multi-task Learning and Transfer Learning - The other seemingly related family
of methods proposed in the machine learning community is transfer learning [7,23],
which adapts a model built on source domain DS (or distribution) to make a prediction
on the target domain DT . Some variants of transfer learning have been pursued under different names: learning to learn, knowledge transfer, inductive transfer, and multi-task learning. In multi-task learning [5], different tasks are learned simultaneously and may benefit from common (often hidden) features benefiting each task. The primary goal of our work is significantly different from transfer learning and multi-task learning, since
these methods do not aim to quantify the difference in the data distributions and they
are primarily aimed at improving the performance on a specific target domain. These
transfer learning tasks look for commonality between the features to enable knowledge
transfer or assume inherent distribution difference to benefit the target task.
The major distinction of our work compared to the above mentioned methods is that
none of the existing methods explore the distribution difference based on a ‘model’
built on the data. The primary focus of the research available in the literature for computing the difference between two data distributions has been ‘data-based’, whereas
our method is strictly ‘model-based’. In other words, all of the existing methods utilize
the data to measure the differences in the distributions. On the contrary, our method
computes the difference using constrained predictive models induced on the data. Such
constrained models have the potential to simultaneously model the data and compare
multiple data distributions. Hence, a systematic way to build a continuum of predictive
models is developed in such a manner that the models for the corresponding two groups
are at the extremes of the continuum and the model corresponding to the original data lies somewhere on this continuum. It should be highlighted that we compute the
distance between two datasets from the models alone; without referring back to the
original data. The major contributions of this work are:
– Develop a measure of the distance between two data distributions using the differ-
ence between predictive models without referring back to the original data.
– Develop a constrained version of the logistic regression algorithm that can capture the differences in data distributions.
– Experimental justification that the results from the proposed algorithm quantita-
tively capture the differences in data distributions.
3 Preliminaries
The notations used in this paper are described in Table 1. In this section, we will also
describe some of the basic concepts of the Logistic Regression and explain the notion
of supervised distribution difference.
Notation    Description
D_i         i-th dataset
F_i         i-th dataset classification boundary
C           Regularization factor
L           Objective function
w_k         k-th component of weight vector w
W_j         j-th weight vector
diag(v)     Diagonal matrix of vector v
s^N         Modified Newton step s
Z           Scaling matrix
H           Hessian matrix
J^v         Jacobian matrix of |v|
ε           Constraint on weight values
eps         Very small value (1e-6)
where g(z) = 1/(1 + e^{−z}). Let (x_1, x_2, ..., x_n) denote a set of training examples and (y_1, y_2, ..., y_n) the corresponding labels; x_{ik} is the k-th feature of the i-th sample. The joint distribution of the probabilities of the class labels of all n examples is:

Pr(y = y_1|x_1) Pr(y = y_2|x_2) ... Pr(y = y_n|x_n) = ∏_{i=1}^{n} Pr(y = y_i|x_i)    (3)

where z_i = Σ_{k=0}^{l} w_k x_{ik}. To maximize Eq. (4), Newton's method, which iteratively updates the weights using the following update equation, is applied:

w^{(t+1)} = w^{(t)} − (∂²L / ∂w ∂w)^{−1} (∂L / ∂w)    (5)

∂L/∂w_k = ∂/∂w_k Σ_{i=1}^{n} log g(y_i z_i) = Σ_{i=1}^{n} y_i x_{ik} g(−y_i z_i)    (6)

∂²L/∂w_j ∂w_k = −Σ_{i=1}^{n} x_{ij} x_{ik} g(y_i z_i) g(−y_i z_i)    (7)

For the regularized objective of Eq. (8), the derivatives become:

∂L/∂w_k = −Σ_{i=1}^{n} y_i x_{ik} g(−y_i z_i) + C w_k    (9)

∂²L/∂w_k ∂w_k = −Σ_{i=1}^{n} x²_{ik} g(−y_i z_i) + C    (10)
Let D1, D2 be two datasets having the same number of features, and let the curves F1 and F2 represent the decision boundaries of datasets D1 and D2, respectively. D represents the combined dataset (D1 ∪ D2) and F is the decision boundary for the combined dataset. For an LR model, these boundaries are defined as a linear combination of attributes, resulting in a linear decision boundary. We induce constrained LR models for D1, D2 which are as close as possible to that of D and yet have significant accuracy for D1, D2 respectively. In other words, F1 and F2 have minimum angular distance from F. Since many such decision boundaries exist, we optimize for the boundary with minimum angular distance from F that has the highest accuracy. Supervised Distribution Difference (SDD) is defined as the change in the classification criteria, measured as the deviation in the classification boundary while classifying as accurately as possible.
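Eq. (11), which formalizes the model distance, lies outside this excerpt. The text motivates an angular distance between decision boundaries, while the experimental plots report Euclidean distances between weight vectors, so the sketch below covers both readings; treat it as illustrative rather than as the paper's exact definition.

    import numpy as np

    def model_distance(w1, w2, angular=False):
        """Distance between two LR weight vectors: either the angle
        between them (motivated by the angular-distance discussion)
        or the Euclidean distance (as plotted in the experiments)."""
        if angular:
            cos = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
            return float(np.arccos(np.clip(cos, -1.0, 1.0)))
        return float(np.linalg.norm(w1 - w2))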
4 Proposed Algorithm
We will now develop a constrained LR model which can measure the supervised distri-
bution difference between multivariate datasets. Figure 2 shows the overall framework
of the proposed algorithm. We start by building an LR model (using Eq. (8)) for the com-
bined dataset D and the weight vector obtained for this base model is denoted by R.
The regularization factor C for D is obtained using the best performance for 10-fold
cross validation (CV) and then the complete model is obtained using the best value of
C. Similarly, LR models on datasets D1 and D2 are also obtained. For datasets D1 and
D2 , the CV accuracy for the best C is denoted by Acc for each dataset. The best value
of C obtained for each dataset is used while building the constrained model. After all
the required input parameters are obtained, constrained LR models are learnt individually for the datasets D1 and D2, satisfying the following constraint: the weight
vector of these new constrained models must be close to that of R (should not deviate
much from R). To enforce this constraint, we change the underlying implementation of
LR model to satisfy the following constraints:
|R_k − w_k| ≤ ε    (12)

where ε is the deviation we allow from the individual weight vectors of the model obtained
from D. The upper and lower bound for each individual component of the weight vec-
tors is obtained from above equation. To solve this problem, we now use constrained
optimization algorithm in the implementation of constrained LR models.
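The paper implements this via a reflective, scaled modified Newton method, described next. As an illustrative alternative that honors the same box constraint (12), one can hand the problem to a generic bound-constrained optimizer; the sketch below uses SciPy's L-BFGS-B with bounds [R_k − ε, R_k + ε] and is not the authors' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def fit_constrained_lr(X, y, R, eps, C=1.0):
        """Minimize the regularized negative log-likelihood subject to
        |R_k - w_k| <= eps (Eq. (12)); labels y are in {-1, +1}."""
        R = np.asarray(R, dtype=float)

        def nll(w):
            z = y * (X @ w)
            return np.sum(np.log1p(np.exp(-z))) + 0.5 * C * (w @ w)

        def grad(w):
            z = y * (X @ w)
            s = 1.0 / (1.0 + np.exp(z))       # equals g(-y_i z_i)
            return -X.T @ (y * s) + C * w

        eps_vec = np.broadcast_to(np.asarray(eps, dtype=float), R.shape)
        bounds = [(r - e, r + e) for r, e in zip(R, eps_vec)]
        res = minimize(nll, x0=R.copy(), jac=grad,
                       method="L-BFGS-B", bounds=bounds)
        return res.x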
The first derivative used in obtaining the LR model (Eq. (9)) is set to zero. In our model,
a scaled modified Newton step replaces the unconstrained Newton step [6]. The scaled
Fig. 2. Illustration of our approach to obtain Supervised Distribution Difference between two
multivariate datasets
modified Newton step arises from examining the Kuhn-Tucker necessary conditions for
Equations (8) and (12).
(Z(w))^{−2} ∂L/∂w = 0    (13)

Thus, we have an extra term (Z(w))^{−2} multiplied with the first partial derivative of the optimization problem (L). This term can be defined as follows:

Z(w) = diag(|v(w)|^{−1/2})    (14)

The underlying term v(w) is defined below for 1 ≤ i ≤ k:

v_i = w_i − (R_i + ε)    if ∂L_i(w)/∂w < 0 and (R_i + ε) < ∞
v_i = w_i − (R_i − ε)    if ∂L_i(w)/∂w ≥ 0 and (R_i − ε) > −∞

Thus, we can see that the epsilon constraint is used in modifying the first partial derivative of L. The scaled modified Newton step for the nonlinear system of equations given by Eq. (13) is defined as the solution to the linear system

Â Z s^N = −∂L̂/∂w    (15)

∂L̂/∂w = Z^{−1} ∂L/∂w    (16)

Â = Z^{−2} H + diag(∂L/∂w) J^v    (17)
The reflections are used to increase the step size and a single reflection step is defined
as follows. Given a step η that intersects a bound constraint, consider the first bound
constraint crossed by η; assume it is the ith bound constraint (either the ith upper or
lower bound). Then the reflection step is η^R = η except in the i-th component, where η_i^R = −η_i. In summary, our approach can be termed constrained minimization with
box constraints. It is different from LR which essentially performs an unconstrained
optimization. After the constrained models for the two datasets D1 and D2 are induced,
we can capture the model distance by Eq. (11). Algorithm 1 outlines our approach
for generating constrained LR models.
Most of the input parameters for the constrained LR algorithm are dataset dependent
and are obtained before running the algorithm as can be seen in the flowchart in Fig. 2.
The only parameter required is τ, which is set to 0.15. However, depending on the application domain of the dataset used, this value can be adjusted, as its main purpose is to allow for tolerance by losing some accuracy while comparing datasets. The constraint
is varied systematically using variable a on line 4. This way, we gradually set bounds
for weight vector to be obtained (lines 5, 6). The weight vector for the optimization is
initialized with uniform weights (line 8). Line 10 employs constrained optimization
using bounds provided earlier and terminates when the condition on line 13 is satisfied.
The tolerance value eps is set to 1e-6. After the weight vector for a particular constraint
is obtained, we would like to see if this model can be considered for representing the
dataset. Line 14 checks whether the accuracy of the current model is within the thresh-
old. It also compares the accuracy of the current model with that of the previously obtained model, and the better one is chosen for further analysis. The best model in the range of 1%
to 5% constraint of base weight vector R is selected. If no such model is found within
this range, then we gradually increase the constraint range (line 20) until we obtain the
desired model. The final weight vector is updated in line 15 and is returned after the
completion of full iteration. The convergence proof for the termination of constrained
optimization on line 10 is similar to the one given in [6].
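Since Algorithm 1 itself is not reproduced in this excerpt, the outer loop it describes can be sketched roughly as follows (fit_constrained_lr refers to the earlier sketch; the per-weight interpretation of the 1%-5% constraint of R is our assumption):

    import numpy as np

    def accuracy(X, y, w):
        """Fraction of correctly classified examples (labels in {-1, +1})."""
        return float(np.mean(np.sign(X @ w) == y))

    def best_constrained_model(X, y, R, Acc, tau=0.15, C=1.0):
        """Scan constraint widths a (1% to 5% of the base weights R),
        keep the most accurate constrained model whose accuracy is
        within tau of the unconstrained CV accuracy Acc, and enlarge
        the range when no admissible model is found."""
        best_w, best_acc = None, -1.0
        lo, hi = 0.01, 0.05
        while best_w is None and hi <= 1.0:   # widen until a model qualifies
            for a in np.arange(lo, hi + 1e-9, 0.01):
                eps = a * np.abs(np.asarray(R))   # per-weight bound width
                w = fit_constrained_lr(X, y, R, eps, C)
                acc = accuracy(X, y, w)
                if acc >= Acc - tau and acc > best_acc:
                    best_w, best_acc = w, acc
            lo, hi = hi + 0.01, hi + 0.05         # enlarge the range
        return best_w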
5 Experimental Results
We conducted our experiments on five synthetic and five real-world datasets [2]. The binary datasets are represented by a triplet (dataset, attributes, instances). The UCI datasets
used are (blood, 5, 748), (liver, 6, 345), (diabetes, 8, 768), (gamma, 11, 19020), and
(heart, 22, 267). Synthetic datasets used in our work have 500,000 to 1 million tuples.
Table 2. Difference between weight vectors for constrained and unconstrained LR models
LR        Constrained LR        LR        Constrained LR
-3.3732   -0.8015               1.2014    0.4258
-0.8693    0                    0.0641    0.0306
-1.2061   -0.0158              -0.5393    0.1123
-1.6274    0                   -3.5901    0
 5.0797    0.9244               0.7765    0.0455
Let N M.Fnum denote a dataset with N million tuples generated by classification func-
tion num. After computing the distance between the datasets, the main issue to be addressed is: how large should the distance be in order to conclude that the two datasets were generated by different underlying processes? The technique proposed in [12] an-
swers this question as follows: If we assume that the distribution G of distance values
(under the hypothesis that the two datasets are generated by the same process) is known,
then we can compute G using bootstrapping technique [9], and we can use standard sta-
tistical tests to compute the significance of the distance d between the two datasets. The
synthetic datasets were generated using the classification functions F1, F2, and F4. One of the datasets is constructed by unifying D with a new block of 50,000 instances generated by F4, where D = 1M.F1. Specifically, D1 = D ∪ 0.05M.F4, D2 = 0.5M.F1, D3 = 1M.F2, and D4 = 1M.F4.
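The significance test of [12, 9] can be sketched as follows; we use a permutation-style resampling of the pooled data to simulate the same-process hypothesis, and distance_between is a hypothetical helper standing in for the constrained-LR model distance.

    import numpy as np

    def significance(D1, D2, distance_between, n_boot=500, seed=0):
        """Empirical p-value of the observed distance under the
        hypothesis that D1 and D2 come from the same process."""
        rng = np.random.default_rng(seed)
        d_obs = distance_between(D1, D2)
        pooled = np.vstack([D1, D2])
        n1 = len(D1)
        G = []                               # distribution of distances
        for _ in range(n_boot):
            idx = rng.permutation(len(pooled))
            A, B = pooled[idx[:n1]], pooled[idx[n1:]]
            G.append(distance_between(A, B))
        # fraction of same-process distances at least as large as d_obs
        return float(np.mean(np.asarray(G) >= d_obs))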
Prior work [12] devised a “data-based” distance measure, along with a method for measuring the statistical significance of the derived distance. The experiments con-
ducted on synthetic datasets are explained in [1]. The distance value computed on these
datasets by [12] can be taken as the ground truth and our experiments on these datasets
follow the same pattern as that of the earlier results. Table 3 highlights that the relative ranking
among the datasets by distance is the same. Note that the distances are not directly comparable between [12] and Constrained LR; only the rankings based on the computed distances can be compared.
Table 3. The distances of all four datasets by constrained LR and Ganti’s method [12]
[Two plots: Log(Euclidean Distance) versus sampling percentage for the datasets D1–D4, and Euclidean Distance versus sampling percentage for the Liver, Diabetes, Heart, Gamma Telescope, and Blood Transfusion datasets.]
6 Conclusion
Standard predictive models induced on multivariate datasets capture certain characteris-
tics of the underlying data distribution. In this paper, we developed a novel constrained
logistic regression framework which produces accurate models of the data and simulta-
neously measures the difference between two multivariate datasets. These models were
built by enforcing additional constraints to the standard logistic regression model. We
demonstrated the advantages of the proposed algorithm using both synthetic and real-
world datasets. We also showed that the distance between the models obtained from
proposed method accurately captures the distance between the original multivariate data
distributions.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Database mining: A performance perspective. IEEE
Trans. Knowledge Data Engrg. 5(6), 914–925 (1993)
2. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://archive.ics.uci.edu/ml/
3. Basu, S., Davidson, I., Wagstaff, K.L.: Constrained Clustering: Advances in Algorithms,
Theory, and Applications. CRC Press, Boca Raton (2008)
4. Bay, S.D., Pazzani, M.J.: Detecting group differences: Mining contrast sets. Data Mining and
Knowledge Discovery 5(3), 213–246 (2001)
5. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
6. Coleman, T.F., Li, Y.: An interior trust region approach for nonlinear minimizations subject
to bounds. Technical Report TR 93-1342 (1993)
7. Dai, W., Yang, Q., Xue, G., Yu, Y.: Boosting for transfer learning. In: ICML 2007: Proceed-
ings of the 24th International Conference on Machine Learning, pp. 193–200 (2007)
8. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences.
In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining, pp. 43–52 (1999)
9. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman and Hall, London
(1993)
10. Fang, G., Pandey, G., Wang, W., Gupta, M., Steinbach, M., Kumar, V.: Mining low-support
discriminative patterns from dense and high-dimensional data. IEEE Transactions on Knowl-
edge and Data Engineering (2011)
11. Gamberger, D., Lavrac, N.: Expert-guided subgroup discovery: methodology and applica-
tion. Journal of Artificial Intelligence Research 17(1), 501–527 (2002)
12. Ganti, V., Gehrke, J., Ramakrishnan, R., Loh, W.: A framework for measuring differences in
data characteristics. J. Comput. Syst. Sci. 64(3), 542–578 (2002)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd edn. Springer, Heidelberg (2009)
14. Hilderman, R.J., Peckham, T.: A statistically sound alternative approach to mining contrast
sets. In: Proceedings of the 4th Australasian Data Mining Conference (AusDM), pp. 157–172
(2005)
15. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86
(1951)
16. Lavrač, N., Kavšek, B., Flach, P., Todorovski, L.: Subgroup discovery with cn2-sd. Journal
of Machine Learning Research 5, 153–188 (2004)
17. Liu, B., Hsu, W., Han, H.S., Xia, Y.: Mining changes for real-life applications. In: Data Ware-
housing and Knowledge Discovery, Second International Conference (DaWaK) Proceedings,
pp. 337–346 (2000)
18. Massey, F.J.: The kolmogorov-smirnov test for goodness of fit. Journal of the American
Statistical Association 46(253), 68–78 (1951)
19. Novak, P.K., Lavrac, N., Webb, G.I.: Supervised descriptive rule discovery: A unifying sur-
vey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning
Research 10, 377–403 (2009)
20. Ntoutsi, I., Kalousis, A., Theodoridis, Y.: A general framework for estimating similarity of
datasets and decision trees: exploring semantic similarity of decision trees. In: SIAM Inter-
national Conference on Data Mining (SDM), pp. 810–821 (2008)
21. Odibat, O., Reddy, C.K., Giroux, C.N.: Differential biclustering for gene expression analysis.
In: Proceedings of the First ACM International Conference on Bioinformatics and Compu-
tational Biology (BCB), pp. 275–284 (2010)
22. Palit, I., Reddy, C.K., Schwartz, K.L.: Differential predictive modeling for racial dispari-
ties in breast cancer. In: IEEE International Conference on Bioinformatics and Biomedicine
(BIBM), pp. 239–245 (2009)
23. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering 22(10), 1345–1359 (2010)
24. Pekerskaya, I., Pei, J., Wang, K.: Mining changing regions from access-constrained snap-
shots: a cluster-embedded decision tree approach. Journal of Intelligent Information Sys-
tems 27(3), 215–242 (2006)
25. Wang, H., Pei, J.: A random method for quantifying changing distributions in data streams.
In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS
(LNAI), vol. 3721, pp. 684–691. Springer, Heidelberg (2005)
26. Wang, K., Zhou, S., Fu, A.W.C., Yu, J.X.: Mining changes of classification by correspon-
dence tracing. In: Proceedings of the Third SIAM International Conference on Data Mining
(SDM), pp. 95–106 (2003)
27. Webb, G.I., Butler, S., Newlands, D.: On detecting differences between groups. In: Proceed-
ings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD), pp. 256–265 (2003)
α-Clusterable Sets
Abstract. In spite of the increasing interest in clustering research within the last decades, a unified clustering theory that is independent of a particular algorithm, of the underlying data structure, or even of the objective function has not been formulated so far. In the paper at hand, we take the first steps towards a theoretical foundation of clustering by proposing a new notion of “clusterability” of data sets based on the density of the data within a specific region. Specifically, we give a formal definition of what we call an “α-clusterable” set and we utilize this notion to prove that the principles proposed in Kleinberg’s impossibility theorem for clustering [25] are consistent. We further propose an unsupervised clustering algorithm which is based on the notion of the α-clusterable set. The proposed algorithm exploits the ability of the well-known and widely used particle swarm optimization [31] to maximize the recently proposed window density function [38]. The obtained clustering quality compares favorably to the corresponding clustering quality of various other well-known clustering algorithms.
1 Introduction
Cluster analysis is an important human process associated with the human ability to dis-
tinguish between different classes of objects. Furthermore, clustering is a fundamental
aspect of data mining and knowledge discovery. It is the process of detecting homogeneous groups of objects without any a priori knowledge about the clusters. A cluster is a
are dissimilar to the objects that belong to another cluster [9, 19, 20].
Over the last decades, there has been increasing scientific interest in clustering, and numerous applications in different scientific fields have appeared, including statistics [7], bioinformatics [37], text mining [43], marketing and finance [10, 26, 33], image seg-
mentation and computer vision [21] as well as pattern recognition [39], among others.
Many clustering algorithms have been proposed in the literature, which can be cate-
gorised into two major categories, hierarchical and partitioning [9, 22].
Partitioning algorithms consider the clustering as an optimization problem. There are
two directions. The first one discovers clusters through optimizing a goodness criterion
based on the distance of the dataset’s points. Such algorithms are k-means [27], ISO-
DATA [8] and fuzzy c-means [11]. The second one utilizes the notion of density and
considers clusters as high-density regions. The most characteristic algorithms of this
approach are DBSCAN [18], CLARANS [28] and k-windows [41].
Recent approaches for clustering apply population based globalized search algo-
rithms exploiting the capacity (cognitive and social behaviour) of the swarms and the
ability of an organism to survive and adjust in a dynamically changing and competitive
environment [1, 6, 12, 13, 14, 29, 32]. Evolutionary Computation (EC) refers to the
computer-based methods that simulate the evolution process. Genetic algorithms (GA),
Differential Evolution (DE) and Particle Swarm Optimization (PSO) are the main algo-
rithms of EC [16]. The principal issues of these methods consist of the representation
of the solution of the problem and the choice of the objective function.
Despite the considerable progress and innovations that have occurred over the last decades, there is a gap between the practical and theoretical foundations of clustering [2, 3, 25, 30]. The problem is aggravated by the lack of a unified definition of what a cluster is that would be independent of the similarity/dissimilarity measure or of the clustering algorithm. Going a step further, it is difficult to answer questions such as how many clusters exist in a dataset, without having any a priori knowledge of the underlying structure of the data, or whether a k-clustering of a dataset is meaningful.
All these weaknesses led to the study of the theoretical background of clustering, aiming to develop a general theory. Thus, Puzicha et al. [35] considered proximity-based data clustering as a combinatorial optimisation problem, and their proposed theory aimed to address two fundamental problems: (i) the specification of suitable objective functions, and (ii) the derivation of efficient optimisation algorithms.
In 2002, Kleinberg [25] developed an axiomatic framework for clustering and showed that there is no clustering function that can simultaneously satisfy three simple properties: the scale-invariance, richness and consistency conditions. Kleinberg’s
goal was to develop a theory of clustering that would not be dependent on any particu-
lar algorithm, cost function or data model. To accomplish that, a set of axioms was set
up, aiming to define what the clustering function is. Kleinberg’s result was that there is
no clustering function satisfying all three requirements.
Some years later, Ackerman and Ben-David [2] disagreed with Kleinberg’s impossibility theorem, claiming that Kleinberg’s result was, to a large extent, the outcome of a specific formalism rather than an inherent feature of clustering. They focused on a clustering-quality framework rather than attempting to define what a clustering function is. They developed a formalism and consistent axioms for the quality of a given data clustering. This led to a further investigation of interesting measures of the clusterability of data sets [3]. Clusterability is a measure of clustered structure in a data set. Although several notions of clusterability [17, 35] have been proposed in the literature, and although they share the same intuitive concept, these notions are pairwise incompatible, as Ackerman et al. have proved in [3]. Furthermore, they concluded that finding a close-to-optimal clustering of a well-clusterable data set is a computationally easy task, compared with the general clustering task, which is NP-hard [3].
In this work, we propose the notion of an α-clusterable set, which is based on the window density function [38]. We aim to capture the dense regions of points in the data set, given an arbitrary parameter α, which represents the size of a D-range, where D is the dimensionality of the data set. Intuitively, a cluster can be considered as a dense area of data points which is separated from other clusters by sparse areas of data or by areas without any data points. Under this consideration, a cluster can be seen as an α-clusterable set or as a union of all intersecting α-clusterable sets. Then, a clustering, called an α-clustering, is comprised of the set of all the clusters. In this theoretical framework, we are able to show that the properties of Kleinberg’s impossibility theorem are satisfied. Particularly, we prove that in the class of window density functions there exist clustering functions satisfying the properties of scale-invariance, richness and consistency. Furthermore, a clustering algorithm can be devised utilising this theoretical framework, having as its goal the detection of the α-clusterable sets.
α -clusterable sets.
Thus, we propose an unsupervised clustering algorithm that exploits the benefits of
a population-based algorithm, known as particle swarm optimisation, in order to detect
the centres of the dense regions of data points. These regions are actually what we call
α-clusterable sets. When all the α-clusterable sets have been identified, a merging procedure is executed in order to merge the regions that overlap with each other. After this process, the final clusters will have been formed and the α-clustering will have been detected.
The rest of the paper is organized as follows. In the next section we briefly present
the background work that our theoretical framework, which is analysed in Section 3,
is based on. In more detail, we present and analyse the proposed definitions of the α-clusterable set and the α-clustering, and furthermore we show that, using these concepts, the conditions of Kleinberg’s impossibility theorem for clustering hold and are consistent. Section 4 gives a detailed analysis of the experimental framework and the pro-
posed algorithm. In Section 5 the experimental results are demonstrated. Finally, the
paper ends in Section 6 with conclusions.
2 Background Material
For completeness purposes, let us briefly describe Kleinberg’s axioms [25] as well as the window density function [38].
As we have already mentioned above, Kleinberg, in [25], proposed three axioms for clustering functions and claimed that this set of axioms is inconsistent, meaning that there is no clustering function that satisfies all three axioms. Let X = {x_1, x_2, ..., x_N} be a data set with cardinality N and let d : X × X → R be a distance function over X, that is, ∀ x_i, x_j ∈ X, d(x_i, x_j) > 0 if and only if x_i ≠ x_j, and d(x_i, x_j) = d(x_j, x_i). It is worth observing that the triangle inequality is not required to be fulfilled, i.e., the distance function need not be a metric. Furthermore, a clustering function is a function f which, given a distance function d, separates the data set X into a set of Γ clusters.
The first axiom, scale-invariance, concerns the requirement that the clustering function has to be invariant to changes in the units of the distance measure. Formally, for any distance function d and any λ > 0, a clustering function f is scale-invariant if f(d) = f(λd).
The second property, called richness, deals with the outcome of the clustering function, and it requires that every possible partition of the data set can be obtained. Typically, a function f is rich if for each partition Γ of X there exists a distance function d over X such that f(d) = Γ.
The consistency property requires that if the distances between the points laid in the
same cluster are decreased and the distances between points laid in a different clusters
are increased, then the clustering result does not change. Kleinberg gave the following
definition:
Definition 1. Let Γ be a partition of X and let d, d′ be two distance functions on X. Then, the distance function d′ is a Γ-transformation of d if (a) ∀ x_i, x_j ∈ X belonging to the same cluster of Γ, it holds that d′(x_i, x_j) ≤ d(x_i, x_j), and (b) ∀ x_i, x_j ∈ X belonging to different clusters of Γ, it holds that d′(x_i, x_j) ≥ d(x_i, x_j). Furthermore, a function f is consistent if f(d) = f(d′) whenever the distance function d′ is a Γ-transformation of d.
Using the above axioms, Kleinberg stated the impossibility theorem [25]:
Theorem 1 (Impossibility Theorem). For each N ≥ 2, there is no clustering function f that satisfies scale-invariance, richness and consistency.
Given a point z ∈ R^D and a size α ∈ R, consider the window

S_{α,z} = {y ∈ X : z_i − α ≤ y_i ≤ z_i + α, ∀ i = 1, 2, ..., D}.

Then the Window Density Function (WDF) for the set X, with respect to a given size α ∈ R, is defined as:

WDF_α(z) = |S_{α,z}|,    (1)

where | · | indicates the cardinality of the set S_{α,z}.
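In code, WDF is a simple orthogonal range count; a minimal Python sketch of ours follows.

    import numpy as np

    def wdf(z, X, alpha):
        """Window Density Function of Eq. (1): the number of points of
        X inside the axis-aligned hypercube of half-side alpha centered
        at z, i.e., an orthogonal range count."""
        X = np.asarray(X)
        inside = np.all(np.abs(X - np.asarray(z)) <= alpha, axis=1)
        return int(inside.sum())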
WDF is a non-negative function that expresses the density of the region (orthogonal range) around the point z. The points that are included in this region can be effectively counted using Computational Geometry methods [5, 34]. For a given α, the value of WDF increases continuously as the density of the region within the window increases. Furthermore, for low values of α, WDF has many local maxima. As the value of α
increases, WDF reveals the number of local maxima that corresponds to the number of
clusters. However, for higher values of the parameter, WDF becomes smoother and the
clusters are not distinguished.
Thus, it is obvious that the determination of the dense regions depends on the size of the window. Actually, the parameter α captures our inherent view of the size of the dense regions that exist in the data set. To illustrate the effect of the parameter
α , we employ the following dataset Dset1 which contains 1600 data points in the 2-
dimensional Euclidean space (Fig. 1(a)).
The following figures exhibit the behaviour of the WDF function over distinct values of the α parameter. As we can conclude, as the value of the parameter α increases, denser and smoother regions of data points are detected. When α = 0.05 or α = 0.075, there are many maxima inside the real clusters of data points (Fig. 1(b) and Fig. 1(c), respectively). As α increases there is a clear improvement in the formation of groups, namely the dense regions become more distinct and separated, so between the values α = 0.1 and α = 0.25 we can detect the four real clusters of data points (Fig. 1(d) and Fig. 1(e), respectively). If the parameter α continues to grow, then the four maxima of the WDF function corresponding to the four clusters of data points, which were detected previously, merge into one single maximum, leading to the formation of one cluster (Fig. 1(f)).
In this section, we give the definitions needed to support the proposed theoretical framework for clustering. Based on the observation that a good clustering is one that places the data points in high-density areas which are separated by areas of sparse points or areas with no points, we define the notion of an α–clusterable set as well as the notion of an α–clustering. To do this, we exploit the benefits of the window density function and its ability to find local dense regions of data points without investigating the whole dataset.
Definition 3 (α–Clusterable Set). Let X be the data set comprised of the points {x_1, x_2, . . . , x_N}. A set of data points x_m ∈ X is defined as an α–clusterable set if there exist a positive real value α ∈ R, a hyper–rectangle H_α of size α and a point z ∈ H_α at which the window density function centered at z is unimodal. Formally,

C_{α,z} = { x_m | x_m ∈ X ∧ ∃ z ∈ H_α : WDF_α(z) ≥ WDF_α(y), ∀ y ∈ H_α }.    (2)
Remark 1. It is worth mentioning that although the points y and z lie in the hyper–rectangle H_α, they are not necessarily points of the data set. Also, the hyper–rectangle H_α is a bounding box of the data set X, and a set C_{α,z} is a subset of X. In addition, the α–clusterable set is a highly dense region, due to the fact that the value of the WDF function is maximised there. Furthermore, the point z can be considered the centre of the α–clusterable set. Thus, given an α and a sequence of points z_i ∈ H_α, a set that comprises a number of α–clusterable sets can be considered a close-to-optimal clustering of X.
We explain the above notions by giving an example. Let X be a dataset of 1000 random data points drawn from the normal (Gaussian) distribution (Fig. 2). The four clusters have the same cardinality, thus each of them contains 250 points. As we can notice, there exists a proper value of the parameter α, α = 0.2, such that the hyper–rectangles H_α capture the whole clusters of points. These hyper–rectangles can be considered as the α–clusterable sets. Also, it is worth mentioning that there is only one point z inside each α–clusterable set such that the window density function is unimodal.
Furthermore, we define an α –clustering function for a data set X, that takes a window
density function, with respect to a given size α , on X and returns a partition C of α –
clusterable sets of X.
Definition 5 (α–Clustering Function). A function f_α(WDF_α, X) is an α–clustering function if, for a given window density function with respect to a real value parameter α, it returns a clustering C of X such that each cluster of C is an α–clusterable set of X.
Next, we prove that the clustering function f_α fulfills the properties of scale-invariance, consistency and richness. Intuitively, the scale-invariance property states that under any uniform change in the scale of the domain space of the data, the high-density areas are maintained and remain separated by sparse regions of points. Richness means that there exist a parameter α and points z such that an α-clustering function f can be constructed which partitions the dataset X into α-clusterable sets. Finally, consistency means that if we shrink the dense areas (the α-clusterable sets) and simultaneously expand the sparse areas between them, then we obtain the same clustering solution.
Lemma 1 (Scale-Invariance). Every α –clustering function is scale-invariant.
Proof. According to the definition of scale-invariance, every clustering function has
this property if for every distance measure dist and any λ > 0 it holds that f (dist) =
f (λ dist). Thus, in our case an α –clustering function, fα , is scale-invariant since it holds
that:
fα (WDFα (z), X) = fλ α (WDFλ α (λ z), X),
for every positive number λ. This is so because if a data set X is scaled by a factor λ > 0, then the window density function of each point will remain the same. Indeed, if a uniform scale is applied to the dataset, then we can find a scale factor λ such that a scaled window, of size λα, contains the same amount of points as the window of size α. More specifically, for each data point y ∈ X that belongs to a window which has centre the point z and size α, it holds that:

z_i − α ≤ y_i ≤ z_i + α  ⟺  λz_i − λα ≤ λy_i ≤ λz_i + λα,  ∀ i = 1, . . . , D.

So, if the point y ∈ X belongs to the window of size α and centre z, then the point y' = λy will belong to the scaled window, which has size λα and centre the point z' = λz. Thus the lemma is proved.
Lemma 2 (Richness). Every α–clustering function is rich.

Proof. It is obvious that for each non-trivial α–clustering C of X, there exists a window density function for the set X, with respect to a size α, such that:

f(WDF_α(z), X) = C.

In other words, given a data set of points X we can find a WDF and a size α such that each window with size α and centre the point z will be an α–clusterable set. Thus the lemma is proved.
4 Experimental Framework
In this section we propose an unsupervised algorithm, in the sense that it does not require a predefined number of clusters, in order to detect the α–clusterable sets lying in the dataset X. Determining the correct number of clusters is a critical open issue in cluster analysis; Dubes refers to it as "the fundamental problem of cluster analysis" [15], because the number of clusters is often tough to determine or, even worse, impossible to define.
Thus, the main goal of the algorithm is to identify the dense regions of points,
in which the window density function is unimodal. These regions constitute the α –
clusterable sets that enclose the real clusters of the dataset. The algorithm runs iteratively, identifying the centre of an α–clusterable set and removing the data points that lie within it. This process continues until no data points are left in the dataset. In order to detect the centres of the dense regions we utilised a well-known population-based optimisation algorithm called Particle Swarm Optimisation (PSO) [23]. PSO is inspired by swarm behaviour, such as flocks of birds collaboratively searching for food. In recent decades there has been a rapid increase of scientific interest in Swarm Intelligence, and particularly in Particle Swarm Optimization, and numerous approaches
have been proposed in many application fields [16, 24, 31]. Recently, Swarm Intelli-
gence and especially Particle Swarm Optimisation have been utilised in Data Mining
and Knowledge Discovery, producing promising results [1, 40].
In [6] an algorithm called IUC has been proposed, which utilises the window density function as the objective function and the Differential Evolution algorithm in order to evolve the clustering solution of the data set towards the best position in the data set. It also uses an enlargement procedure in order to detect all the points lying in the same cluster. In the paper at hand, we exploit the benefits of the Particle Swarm Optimisation algorithm to search the space of potential solutions efficiently, so as to find the global optimum of the window density function. Each particle represents the centre of a dense region of the dataset, and the particles fly through the search space forming flocks around the peaks of the window density function. Thus, the algorithm detects the centres of the α–clusterable sets one at a time.
It is worth noting that the choice of the value of the parameter α plays an important role in the identification of the real number of clusters and depends on several factors. For instance, the value of α may be too small for the hyper-rectangle to capture a whole cluster; likewise, if the data points form dense regions of varying cardinality, hyper-rectangles of constant size α may again fail to capture the whole clusters of the dataset. The following figures describe these cases more clearly. We conclude that a small choice of the parameter α leads to the detection of small dense regions as α–clusterable sets. However, as can be noticed, even the detection of a small cluster of data points then needs more than one α–clusterable set (Fig. 3(a)). On the other hand, increasing α allows the small clusters of the data set to be detected using only one α–clusterable set each; however, the detection of the big cluster needs more α–clusterable sets, the union of which describes the whole cluster. It has to be mentioned here that the union of overlapping α–clusterable sets is still an α-clusterable set: we can find a point z which is the centre of the union and whose window density function value is maximal in a hyper-rectangle of size α' > α, meaning that WDF_{α'}(z) is unimodal.
In order to avoid the above situations, we propose and implement a merging pro-
cedure that merges the overlapping α –clusterable sets, so that the outcome of the al-
gorithm represents the real number of clusters in the data set. Specifically, two dense
regions (α –clusterable sets) are going to merge if and only if the overlap between them
contains at least one data point.
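A minimal sketch of this merging step, under the assumption that each detected α–clusterable set is represented by an axis-aligned window of half-width α around its centre (the helper names are ours, not the authors' implementation):

```python
import numpy as np

def windows_share_point(X, z1, z2, alpha):
    """True if the intersection of the two windows of half-width alpha
    centred at z1 and z2 contains at least one data point of X."""
    lo = np.maximum(z1, z2) - alpha     # lower corner of the intersection
    hi = np.minimum(z1, z2) + alpha     # upper corner of the intersection
    if np.any(lo > hi):                 # empty intersection
        return False
    return bool(np.any(np.all((X >= lo) & (X <= hi), axis=1)))

def merge_overlapping_sets(X, centres, alpha):
    """Union-find merging of alpha-clusterable sets whose overlap
    contains a data point; returns a cluster id per centre."""
    parent = list(range(len(centres)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(centres)):
        for j in range(i + 1, len(centres)):
            if windows_share_point(X, centres[i], centres[j], alpha):
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(centres))]
```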
Fig. 3. (a) Effect of parameter value α = 0.2; (b) effect of parameter value α = 0.25.
Subsequently, we summarise the above analysis and propose the new clustering algorithm. It is worth noting that the detection of α–clusterable sets, which are highly dense regions of the dataset, through the window density function is a maximization problem, whereas Particle Swarm Optimisation is a minimization algorithm; hence −WDF_α(z) is utilised as the fitness function.
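A compact sketch of the resulting detect-and-remove loop, using a textbook global-best PSO with −WDF_α(z) as the fitness; the parameter values and helper names are illustrative, not the authors' exact settings:

```python
import numpy as np

def wdf(X, z, alpha):
    return int(np.all(np.abs(X - z) <= alpha, axis=1).sum())

def pso_find_centre(X, alpha, n_particles=30, iters=100,
                    w=0.72, c1=1.49, c2=1.49, seed=None):
    """Global-best PSO minimising -WDF, i.e. locating a densest window."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)        # search the bounding box
    pos = rng.uniform(lo, hi, size=(n_particles, X.shape[1]))
    vel = np.zeros_like(pos)
    pbest, pbest_fit = pos.copy(), np.array([-wdf(X, p, alpha) for p in pos])
    gbest = pbest[np.argmin(pbest_fit)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        fit = np.array([-wdf(X, p, alpha) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmin(pbest_fit)].copy()
    return gbest

def detect_alpha_clusterable_sets(X, alpha):
    """Repeatedly find a dense-window centre, then remove its points."""
    centres, remaining = [], X.copy()
    while len(remaining):
        z = pso_find_centre(remaining, alpha)
        inside = np.all(np.abs(remaining - z) <= alpha, axis=1)
        if not inside.any():              # safeguard for isolated points
            inside = np.zeros(len(remaining), bool)
            inside[0] = True
        centres.append(z)
        remaining = remaining[~inside]
    return centres
```

The detected centres can then be post-processed with the merging procedure sketched above.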
5 Experimental Results
The objective of the conducted experiments was three-fold. First, we want to investi-
gate the behaviour of the algorithm regarding the resizing of the window. Second, we
(a) Entropy and purity vs. window size α for Dset1; (b) entropy and purity vs. window size α for Dset2; (c) entropy and purity vs. window size α for Dset3.
Table 1. The mean values and standard deviation of entropy and purity for each algorithm over
the four datasets
Dset1 Dset2
Entropy Purity Entropy Purity
IUC DE1 8.55e-3(0.06) 99.7%(0.02) 4.54e-2(0.11) 98.9%(0.03)
IUC DE2 1.80e-2(0.1) 99.4%(0.03) 3.08e-2(0.09) 99.2%(0.03)
IUC DE3 1.94e-4(0.002) 100%(0.0) 7.16e-2(0.13) 98.2%(0.03)
IUC DE4 6.01e-3(0.06) 99.8%(0.01) 4.21e-2(0.10) 99.0%(0.02)
IUC DE5 2.46e-2(0.01) 99.2%(0.03) 6.95e-2(0.13) 98.3%(0.03)
DEUC DE1 1.70e-1(0.1) 91.0%(0.05) 3.39e-2(0.02) 90.5%(0.01)
DEUC DE2 1.36e-1(0.09) 92.3%(0.05) 3.22e-2(0.02) 90.3%(0.01)
DEUC DE3 1.66e-1(0.09) 90.4%(0.05) 2.90e-2(0.02) 90.8%(0.01)
DEUC DE4 1.45e-1(0.09) 91.1%(0.04) 3.16e-2(0.02) 90.4%(0.01)
DEUC DE5 1.39e-1(0.1) 92.9%(0.05) 2.88e-2(0.02) 90.6%(0.01)
k-means 1.10e-1(0.21) 96.7%(0.06) 3.45e-1(0.06) 90.5%(0.03)
k-windows 0.00e-0(0.0) 99.2%(0.02) 2.20e-2(0.08) 95.4%(0.01)
DBSCAN 0.00e-0(—) 100%(—) 3.74e-1(—) 100.0%(—)
PSO 0.05-Cl 0.00e-0(0.0) 100%(0.0) 6.44e-2(0.11) 98.2%(0.03)
PSO 0.075-Cl 0.00e-0(0.0) 100%(0.0) 1.86e-1(0.16) 95.1%(0.04)
PSO 0.1-Cl 0.00e-0(0.0) 92.048%(0.01) 3.07e-1(0.08) 92.0%(0.0)
PSO 0.2-Cl 5.54e-2(0.17) 98.2%(0.06) 3.68e-1(0.01) 91.4%(0.0)
PSO 0.25-Cl 4.3e-2(0.15) 98.6%(0.05) 3.66e-1(0.01) 91.4%(0.0)
Dset4 Dset5
Entropy Purity Entropy Purity
IUC DE1 2.52e-3(0.02) 94.7%(0.05) 2.7e-3(0.03) 99.7%(0.01)
IUC DE2 7.59e-3(0.04) 96.0%(0.04) 7.9e-3(0.04) 99.5%(0.02)
IUC DE3 1.02e-2(0.05) 95.5%(0.04) 8.0e-3(0.04) 99.6%(0.02)
IUC DE4 0.00e+0(0.0) 96.6%(0.01) 1.06e-3(0.05) 99.4%(0.02)
IUC DE5 5.04e-3(0.03) 97.0%(0.01) 2.12e-3(0.07) 99.0%(0.02)
DEUC DE1 6.86e-3(0.01) 90.7%(0.02) 2.63e-3(0.21) 87.4%(0.07)
DEUC DE2 6.04e-3(0.01) 91.0%(0.02) 2.90e-3(0.19) 86.4%(0.06)
DEUC DE3 6.16e-3(0.07) 91.2%(0.01) 2.94e-3(0.21) 86.4%(0.07)
DEUC DE4 7.17e-3(0.01) 89.9%(0.02) 3.09e-3(0.24) 86.0%(0.07)
DEUC DE5 6.38e-3(0.01) 90.1%(0.02) 2.79e-3(0.22) 86.8%(0.07)
k-means 2.69e-1(0.18) 89.9%(0.07) 3.99e-3(0.25) 86.8%(0.09)
k-windows 4.18e-5(0.0) 98.3%(0.003) 0.00e-0(0.0) 99.7%(0.006)
DBSCAN 8.54e-4(—) 99.2%(—) 0.00e-0(0.0) 100%(—)
PSO 0.05-Cl 0.00e-0(0.0) 99.9%(0.0) 0.00e-0(0.0) 99.0%(0.0)
PSO 0.075-Cl 1.03e-2(0.05) 99.5%(0.02) 0.00e-0(0.0) 100.0%(0.0)
PSO 0.1-Cl 7.95e-2(0.12) 96.9%(0.05) 0.00e-0(0.0) 100.0%(0.0)
PSO 0.2-Cl 4.62e-1(0.15) 81.9%(0.06) 1.02e-2(0.05) 99.5%(0.02)
PSO 0.25-Cl 1.76e-0(0.17) 45.5%(0.05) 1.30e-1(0.06) 94.7%(0.06)
Table 2. The mean values and standard deviation of entropy, purity and number of clusters
the performance of the clustering increases when the size of the window becomes larger. However, if the size of the window exceeds a specific value, which is related to the dataset, the quality of the clustering deteriorates.
The scalability of the algorithm depends on the window density function, and specifically on the complexity of determining the points that lie in a specific window. This is the well-known orthogonal range search problem, which has been studied extensively, and many algorithms have been proposed in the literature to address it [4, 34]. A preprocessing phase is employed to construct the data structure that stores the data points. For high-dimensional applications, data structures like the Multidimensional Binary Tree [34] are preferable, while for low-dimensional applications with a large number of points Alevizos's approach [4] is more suitable. In this work, we utilise the Multidimensional Binary Tree, so the preprocessing time is O(DN log N), while the data structure requires O(s + DN^{1−1/D}) time to answer a query, where s is the number of points reported [38].
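As an illustration of this preprocessing/query split, a sketch using SciPy's k-d tree as a stand-in for the Multidimensional Binary Tree of [34]; note that the Chebyshev ball of radius α is exactly the orthogonal range of the window:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
X = rng.random((100_000, 3))           # N points in D = 3 dimensions

tree = cKDTree(X)                      # preprocessing: build the tree once

def wdf_range_query(z, alpha):
    """WDF via orthogonal range search: points within Chebyshev
    (p = inf) distance alpha of z, i.e. inside the window."""
    return len(tree.query_ball_point(z, r=alpha, p=np.inf))

print(wdf_range_query(np.array([0.5, 0.5, 0.5]), 0.05))
```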
6 Conclusions
Although clustering is a fundamental process for discovering knowledge from data, it is still difficult to give a clear, coherent and general definition of what a cluster is, or of whether a dataset is clusterable or not. Furthermore, much research has focused on practical aspects of clustering, leaving the theoretical background almost untouched. In this study, we have presented a theoretical framework for clustering and introduced a new notion of clusterability, called the "α–clusterable set", which is based on the notion of the window density function. In particular, an α–clusterable set is a dense region of points of a dataset X inside which the window density function is unimodal. The set of these α–clusterable sets forms a clustering solution, denoted as α–clustering. Moreover, we prove, in contrast to the general framework of Kleinberg's impossibility theorem, that this α–clustering solution of a data set X satisfies the properties of scale-invariance, richness and consistency. Furthermore, to validate the theoretical framework, we propose an unsupervised algorithm based on particle swarm optimisation. The experimental results are promising, since its performance is better than or similar to that of other well-known algorithms and, in addition, the proposed algorithm exhibits good scalability properties.
References
[1] Abraham, A., Grosan, C., Ramos, V.: Swarm Intelligence in Data Mining. Springer, Hei-
delberg (2006)
[2] Ackerman, M., Ben-David, S.: Measures of clustering quality: A working set of axioms for
clustering. In: Advances in Neural Information Processing Systems (NIPS), pp. 121–128.
MIT Press, Cambridge (2008)
[3] Ackerman, M., Ben-David, S.: Clusterability: A theoretical study. Journal of Machine
Learning Research - Proceedings Track 5, 1–8 (2009)
[4] Alevizos, P.: An algorithm for orthogonal range search in d ≥ 3 dimensions. In: Proceed-
ings of the 14th European Workshop on Computational Geometry (1998)
[5] Alevizos, P., Boutsinas, B., Tasoulis, D.K., Vrahatis, M.N.: Improving the orthogonal range
search k-windows algorithms. In: 14th IEEE International Conference on Tools and Artifi-
cial Intelligence, pp. 239–245 (2002)
[6] Antzoulatos, G.S., Ikonomakis, F., Vrahatis, M.N.: Efficient unsupervised clustering through
intelligent optimization. In: Proceedings of the IASTED International Conference Artificial
Intelligence and Soft Computing (ASC 2009), pp. 21–28 (2009)
[7] Arabie, P., Hubert, L.: An overview of combinatorial data analysis. In: Clustering and Clas-
sification, pp. 5–64. World Scientific Publishing Co., Singapore (1996)
[8] Ball, G., Hall, D.: A clustering technique for summarizing multivariate data. Behavioral
Sciences 12, 153–155 (1967)
[9] Berkhin, P.: Survey of data mining techniques. Technical report, Accrue Software (2002)
[10] Berry, M.J.A., Linoff, G.: Data mining techniques for marketing, sales and customer sup-
port. John Wiley & Sons Inc., USA (1996)
[11] Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Aca-
demic Publishers, Norwell (1981)
[12] Chen, C.Y., Ye, F.: Particle swarm optimization algorithm and its application to clustering
analysis. In: IEEE International Conference on Networking, Sensing and Control, vol. 2,
pp. 789–794 (2004)
[13] Cohen, S.C.M., Castro, L.N.: Data clustering with particle swarms. In: IEEE Congress on
Evolutionary Computation, CEC 2006, pp. 1792–1798 (2006)
[14] Das, S., Abraham, A., Konar, A.: Automatic clustering using an improved differential evo-
lution algorithm. IEEE Transactions on Systems, Man and Cybernetics 38, 218–237 (2008)
[15] Dubes, R.: Cluster Analysis and Related Issue. In: Handbook of Pattern Recognition and
Computer Vision, pp. 3–32. World Scientific, Singapore (1993)
[16] Engelbrecht, A.P.: Computational Intelligence: An Introduction. John Wiley & Sons, Ltd.,
Chichester (2007)
[17] Epter, S., Krishnamoorthy, M., Zaki, M.: Clusterability detection and initial seed selection
in large datasets. Technical Report 99-6, Rensselaer Polytechnic Institute, Computer Sci-
ence Dept. (1999)
[18] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clus-
ters in large spatial databases with noise. In: Proceedings of 2nd International Conference
on Knowledge Discovery and Data Mining, pp. 226–231 (1996)
[19] Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publish-
ers, San Francisco (2006)
[20] Jain, A.K., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs
(1988)
[21] Jain, A.K., Flynn, P.J.: Image segmentation using clustering. In: Advances in Image Under-
standing: A Festschrift for Azriel Rosenfeld, pp. 65–83. Wiley-IEEE Computer Society
Press, Singapore (1996)
[22] Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Sur-
veys 31, 264–323 (1999)
[23] Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of IEEE Interna-
tional Conference on Neural Networks, vol. 4, pp. 1942–1948 (1995)
[24] Kennedy, J., Eberhart, R.C.: Swarm Intelligence. Morgan Kaufmann Publishers, San Fran-
cisco (2001)
[25] Kleinberg, J.: An impossibility theorem for clustering. In: Advances in Neural Information
Processing Systems (NIPS), pp. 446–453. MIT Press, Cambridge (2002)
[26] Lisi, F., Corazza, M.: Clustering financial data for mutual fund management. In: Mathe-
matical and Statistical Methods in Insurance and Finance, pp. 157–164. Springer, Milan
(2007)
[27] MacQueen, J.B.: Some methods for classification and analysis of multivariate observations.
In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probabil-
ity, vol. 1, pp. 281–297. University of California Press (1967)
[28] Ng, R., Han, J.: CLARANS: A method for clustering objects for spatial data mining. IEEE
Transactions on Knowledge and Data Engineering 14(5), 1003–1016 (2002)
[29] Omran, M.G.H., Engelbrecht, A.P.: Self-adaptive differential evolution methods for un-
supervised image classification. In: Proceedings of IEEE Conference on Cybernetics and
Intelligent Systems, pp. 1–6 (2006)
[30] Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, S.: The effectiveness of Lloyd-type meth-
ods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on
Foundations of Computer Science, pp. 165–176. IEEE Computer Society, Washington, DC
(2006)
[31] Parsopoulos, K.E., Vrahatis, M.N.: Particle Swarm Optimization and Intelligence: Ad-
vances and Applications. Information Science Publishing (IGI Global), Hershey (2010)
[32] Paterlini, S., Krink, T.: Differential evolution and particle swarm optimisation in partitional
clustering. Computational Statistics & Data Analysis 50, 1220–1247 (2006)
[33] Pavlidis, N., Plagianakos, V.P., Tasoulis, D.K., Vrahatis, M.N.: Financial forecasting
through unsupervised clustering and neural networks. Operations Research - An Interna-
tional Journal 6(2), 103–127 (2006)
[34] Preparata, F., Shamos, M.: Computational Geometry: An Introduction. Springer, New York
(1985)
[35] Puzicha, J., Hofmann, T., Buhmann, J.: A theory of proximity based clustering: Structure
detection by optimisation. Pattern Recognition 33, 617–634 (2000)
[36] Tasoulis, D.K., http://stats.ma.ic.ac.uk/d/dtasouli/public_html
[37] Tasoulis, D.K., Plagianakos, V.P., Vrahatis, M.N.: Unsupervised clustering in mRNA ex-
pression profiles. Computers in Biology and Medicine 36, 1126–1142 (2006)
[38] Tasoulis, D.K., Vrahatis, M.N.: The new window density function for efficient evolution-
ary unsupervised clustering. In: IEEE Congress on Evolutionary Computation, CEC 2005,
vol. 3, pp. 2388–2394. IEEE Press, Los Alamitos (2005)
[39] Theodoridis, S., Koutroubas, K.: Pattern Recognition. Academic Press, London (1999)
[40] van der Merwe, D.W., Engelbrecht, A.P.: Data clustering using particle swarm optimiza-
tion. In: Proceedings of the 2003 IEEE Congress on Evolutionary Computation, pp. 215–
220 (2003)
[41] Vrahatis, M.N., Boutsinas, B., Alevizos, P., Pavlides, G.: The new k-windows algorithm for
improving the k-means clustering algorithm. Journal of Complexity 18, 375–391 (2002)
[42] Xiong, H., Wu, J., Chen, J.: K-means clustering versus validation measures: A data-
distribution perspective. IEEE Transactions on Systems, Man and Cybernetics - Part B:
Cybernetics 39(2), 318–331 (2009)
[43] Zhao, Y., Karypis, G.: Criterion Functions for Clustering on High-Dimensional Data. In:
Grouping Multidimensional Data Recent Advances in Clustering, pp. 211–237. Springer,
Heidelberg (2006)
Privacy Preserving Semi-supervised Learning for
Labeled Graphs
1 Introduction
Label prediction of partially labeled graphs is one of the major machine learning
problems. Graph-based semi-supervised learning is useful when link information
is obtainable with lower cost than label information. Prediction of protein func-
tions is one familiar example [12]. In this problem, the functions and similarities
of proteins correspond to the labels and node similarities, respectively. Amassing
information about the protein functions requires expensive experimental analy-
ses, while protein similarities are often obtained computationally, with a lower
cost. Taking advantage of this gap, the semi-supervised approach successfully
achieves better classification accuracy, even when only a limited number of la-
beled examples is obtainable.
The key observation of this study is that the difficulty of preparing label information lies not only in its cost, but also in its privacy. Even when a large number of labels have already been collected, the labels might not be observable or may require extreme caution in handling. In this paper, we consider label prediction
in a situation where the entire graph cannot be observed by any entities, due to
privacy reasons. Such a situation is often found in networks among social entities,
such as individuals or enterprises, in the real world. The following scenarios pose
intuitive examples where privacy preservation is required in label prediction.
Consider a physical contact network of individuals, in which individuals and
their contacts correspond to nodes and links, respectively. Suppose an infectious
disease is transmitted by contact. Some of the individuals have tested their infec-
tion states. Regarding infection states as node labels, semi-supervised learning is
expected to predict the infection states of untested individuals, by exploiting the
existing test results and the contact network. However, the contact information
between individuals and their infection states can be too sensitive to disclose.
In this scenario, both the labels and the links must be kept private. In order
to formulate such situations, we consider three types of typical privacy models.
The first model, referred to as the public model, assumes that each node discloses
its label and links to all other nodes. One example of this model is social network
services such as Facebook or LinkedIn, when the privacy policy is set as "everyone
can see everything”. Users (nodes) disclose their friends (links) and status (node
labels), such as their education or occupations, to every user.
The second model, referred to as the label-aware model, assumes each node
discloses its label and links only to the nodes that it is linked to. Let the node
labels correspond to the flu infection states in the contact network. We can
naturally assume that each individual (node) is aware of the individuals with
whom he/she had contact before (links) and whether or not they had the flu
(labels of the nodes it is linking to), but he/she would never know the contact
information or the infection states of individuals he/she has never met.
The third model, referred to as the label-unaware model, assumes that each
node does not disclose any links or labels to others. Consider the contact net-
work again. Let links and labels correspond to sexual relationships and sexually
transmitted infections, respectively. In such a case, no one would disclose their
links and label to others; the label may not be disclosed even to individuals with
whom he/she had a relationship.
In addition, asymmetries of relationships need to be considered. For example,
in a business network whose links correspond to the ratio of the stock holdings,
a directed graph should be used to represent the scope of observation. Thus, the
privacy model to be employed depends on the nature and the sensitivity of the
information in the graphs.
Related Works. Existing label prediction methods are designed with the im-
plicit assumption that a supervisor exists who can view anything in the graph
(the public model). If we could introduce a trusted third party (TTP)1 as a
supervisor, then any label prediction algorithms, such as TSVM [8] or Clus-
ter kernel [12], would immediately work in the label-(un)aware model; however,
facilitating such a party is unrealistic in general (Table 1, the first line).
1 TTP is a party which never deviates from the specified protocol and does not reveal any auxiliary information.
The k-nearest neighbor (kNN) method predicts labels from the labels of the
k-nearest nodes; kNN works in the label-aware model (Table 1, the second line).
Label propagation (LP) [15] achieves label prediction in the label-aware model,
when algorithms are appropriately decentralized (Table 1, the third line, see
Sect. 3.2 for details). Note that even if it is decentralized, each node has to
disclose its labels to the neighboring nodes. That is, kNN and LP do not work
in the label-unaware model.
Secure function evaluation (SFE) [13], also referred to as Yao’s garbled cir-
cuits, is a methodology for secure multiparty computation. Using SFE, any func-
tion, including label prediction methods, can be carried out without revealing
any information except for the output value. That is, SFE allows execution of
label prediction in the label-unaware model (Table 1, the fourth line). Although
the computational cost of SFE is polynomially bounded, it can be too inefficient
for practical use. We implemented label prediction on SFE (LP/SFE), and the
efficiency of SFE is discussed with experiments in Sect. 6.
Privacy-preserving computations for graph mining are discussed in [4,10].
Both of them calculate HITS and PageRank from networks containing private
information. They compute the principal eigenvector of the probability transi-
tion matrix of networks without learning the link structure of the network. Our
solution for label prediction is similar to the above link analyses in the sense that both consider computation over graphs containing private graph structures. However, the targets of their protocols and ours are different; their protocols compute node rankings, while our protocol aims at node label prediction.
Our Contribution. As discussed above, no label prediction methods that work efficiently in the label-unaware model have been presented. We propose a novel
solution for privacy preserving label prediction in the label-unaware model. Com-
parisons between our proposal and existing solutions are summarized in Table
1. First, we formulate typical privacy models for labeled graphs (Sect. 2). Then,
a label propagation algorithm that preserves the privacy in the label-unaware
model is presented (Sect. 5). Our proposal theoretically guarantees that (1) no
information about links and labels is leaked through the entire process, (2) the
predicted labels as the final output are disclosed only to the node itself, and
(3) the final output is exactly equivalent to that of label propagation. Con-
nections between our protocol and differential privacy are also discussed in Sect. 5.3.
The experiments show that the computational time of our proposal is more than
100 times shorter than that of LP/SFE. We also examined our proposal using
the real social network data (Sect. 6).
Fig. 1. Graph privacy models. Links, link weights w_ij, and labels f_ik that are private from node i are depicted in gray.
3 Our Approach
Our aim is to develop label prediction that securely works under given graph
privacy models. First, we state the problem, based on graph privacy models and
label propagation [15]. Then, we show that a decentralization of label propa-
gation makes the computation secure in label-aware PWGs, but not secure in
label-unaware PWGs. At the end of this section, we clarify the issues to be
addressed to achieve secure label prediction in label-unaware PWGs.
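For reference, a centralised (non-private) sketch of the label propagation update that the protocols below decentralise, consistent with the scaled recursion visible in eq. (5): F^(t) = αPF^(t−1) + (1−α)Y, with P the row-normalised weight matrix. The toy graph and parameter choices are ours:

```python
import numpy as np

def label_propagation(W, Y, alpha=0.05, iters=100):
    """Centralised label propagation: F(t) = alpha*P@F(t-1) + (1-alpha)*Y.
    Y is n x K, one column per label class, with 1s on labelled nodes."""
    P = W / W.sum(axis=1, keepdims=True)   # row-normalised transition matrix
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)                # predicted label per node

# Toy graph: two triangles joined by one edge; one labelled node per side
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
Y = np.zeros((6, 2)); Y[0, 0] = 1; Y[5, 1] = 1
print(label_propagation(W, Y))   # nodes 0-2 -> class 0, nodes 3-5 -> class 1
```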
4 Cryptographic Tools
In a public key cryptosystem, encryption uses a public key that can be known
to everyone, while decryption requires knowledge of the corresponding private
key. Given a corresponding pair of (sk, pk) of private and public keys and a
message m, then c = Encpk (m; r) denotes the random encryption of m, and m =
Decsk (c) denotes the decryption. The encrypted value c uniformly distributes
over ZN = {0, ..., N−1}, if r is taken from ZN randomly. An additive homomorphic
cryptosystem allows the addition of encrypted values, without knowledge of the
private key. There is some operation · such that for any plaintexts m1 and m2, Enc_pk(m1; r1) · Enc_pk(m2; r2) = Enc_pk(m1 + m2; r) for some randomness r.
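As an illustration of this property, a sketch using the third-party python-paillier package (`phe`); note that it implements plain Paillier rather than the generalized variant of [2], and that this shows only the primitive, not the PPLP protocol itself:

```python
# pip install phe -- an additively homomorphic Paillier implementation
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

m1, m2 = 7, 35
c1, c2 = public_key.encrypt(m1), public_key.encrypt(m2)

# Addition of ciphertexts decrypts to the sum of the plaintexts...
assert private_key.decrypt(c1 + c2) == m1 + m2
# ...and multiplication by a plaintext scalar is also supported, which
# is what the scaled update of eq. (5) below relies on.
assert private_key.decrypt(3 * c1) == 3 * m1
```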
Let F̃^(t) = (f̃_ik^(t)). The convergence of the iterations of eq. 3 is shown as follows. Although eq. 4 cannot be decrypted by any node in the actual protocol, here we assume that a node could decrypt both terms of eq. 4. Then, we have

f̃_ik^(t) ← Σ_{j∈N(i)} p̃_ij f̃_jk^(t−1) + L^{t−1} ỹ_ik = L ( Σ_{j∈N(i)} α p_ij f̃_jk^(t−1) + L^{t−1} (1 − α) y_ik ).    (5)

We prove that f̃_ik^(t) = L^t f_ik^(t) holds by induction. When t = 1, f̃_ik^(1) = L f_ik^(1) obviously holds. Assuming f̃_ik^(u) = L^u f_ik^(u) for any u ∈ Z, f̃_ik^(u+1) = L^{u+1} f_ik^(u+1) is readily derived using eq. 5 and the assumption. Thus, f̃_ik^(t) = L^t f_ik^(t) holds. Consequently, F̃^(t)/L^t = F^(t) holds and the lemma is proved by Lemma 1.
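The role of the scaling factor L is easy to check numerically. A minimal plain-arithmetic sketch of the invariant f̃^(t) = L^t f^(t), under the assumptions p̃_ij = Lαp_ij, ỹ_ik = L(1−α)y_ik and f̃^(0) = f^(0) = Y (the toy matrices are ours, and this is not the encrypted protocol itself):

```python
import numpy as np

L, alpha, T = 10**3, 0.05, 5
rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)              # row-stochastic transition matrix
Y = np.eye(4)[:, :2]                           # toy label matrix

P_s, Y_s = L * alpha * P, L * (1 - alpha) * Y  # scaled quantities p~, y~

F, F_s = Y.copy(), Y.copy()                    # f(0) = f~(0) = Y
for t in range(1, T + 1):
    F = alpha * P @ F + (1 - alpha) * Y        # plain label propagation
    F_s = P_s @ F_s + L ** (t - 1) * Y_s       # scaled update of eq. (5)
    assert np.allclose(F_s / L ** t, F)        # invariant f~(t) = L^t f(t)
print("invariant holds for t = 1 ..", T)
```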
Note that the computation time of SFE is usually large; however, since both computations are not necessarily required at every update, the cost per iteration can be small and will not create a bottleneck. We discuss this further in Sect. 6.
labels of neighboring nodes, especially when the neighboring nodes have only a small number of links. Differential privacy provides a theoretical privacy definition in terms of outputs. Output perturbation using the Laplace mechanism can guarantee differential privacy for a specified function under some conditions. Each node can guess only a limited amount of information if such a mechanism is applied to the output, even when each node has strong background knowledge about the private information. The security of the computation and the secure disclosure of outputs are mutually independent in label prediction. Therefore, output perturbation can be readily combined with our protocol, although the design of the perturbation for label propagation is not straightforward. We do not pursue this topic any further in this paper; this problem remains for future work.
In order to perform the above update, note that: (1) node i has to send c_ik^(t−1) to N_out(i) in Step 2 (a), and (2) node i has to receive c_jk^(t−1) from j ∈ N_in(i) in Step 2 (b). In a directed graph, since W is row-private, node i can know to whom it is linking (N_out(i)), but cannot know from whom it is linked (N_in(i)). Each node in N_in(i) could send a request for connection to node i; however, the request itself violates the given privacy model. Thus, in directed graphs, the problem lies not only in the secrecy of messages but also in the anonymity of connections. For this, we make use of onion routing [7], which provides anonymous connection over a public network. By replacing every message passing that occurs in the protocol with onion routing, we obtain the PPLP protocol for label-unaware directed PWGs (PPLP-D). Techniques to convert a protocol in the label-aware model into one in the label-unaware model can be found in [10], too.
6 Experimental Analysis
We compared the label prediction methods in terms of the accuracy, privacy
loss, and computation cost.
Datasets. Two labeled graphs were taken from the real-world examples. Roman-
tic Network (ROMN) is a network in which students and their sexual contacts
correspond to the nodes and links, respectively [1]. We used the largest com-
ponent (288 nodes) of the original network. Five randomly selected nodes and the nodes within five steps of them were labeled as "infected" (80 nodes in total); the other nodes were labeled as "not-infected" (208 nodes). ROMN is undirected; the
weight matrix wij = 1 if there is a link between i and j, and wij = 0 otherwise.
MITN is a network in which mobile phone users, their physical proximities, and their affiliations correspond to the nodes, links, and node labels, respectively. The proximity is measured by the Bluetooth devices of the mobile phones in the MIT Reality Mining Project [5]. 43 nodes are labeled as "Media Lab" and 24 nodes are labeled as "Sloan" (67 nodes in total). MITN is a directed graph; the weight w_ij is set as the time length of user i detecting user j.
Settings. Three types of label propagation were implemented, decentralized
label propagation (LP), label propagation implemented by SFE (LP/SFE), and
our proposal, PPLP and PPLP-D. For comparison, kNN was also tested. In
PPLP and PPLP-D, we set the parameters as L = 10^3 and α = 0.05, and normalization was performed once per 100 updates of eq. 3. For kNN, we set
k = 1, 3, 5. Results were averaged over 10 trials.
In Sect. 6.1, we evaluated the trade-off between prediction accuracy and privacy loss. In Sect. 6.2, we evaluated the computational efficiency of PPLP, PPLP-D,
and LP/SFE with complexity analysis. The generalized Paillier cryptosystem [2]
with 1024-bit keys was used in PPLP and PPLP-D. For SFE implementation,
FairPlay [9] was used. Experiments were performed using Linux with 2.80GHz
(CPU), 2GB (RAM).
We compared the computational cost of our solutions empirically, because complexity analysis of computations implemented by SFE is difficult. The implementation details are as follows. In PPLP and PPLP-D, the entire computational procedure, including normalization using SFE, is implemented. In the computational cost of an update, one percent of the normalization cost is accounted for, because normalization is executed once per 100 updates. If LP/SFE is implemented in a naive manner, the computational time can be so large that experimental analysis is unrealistic. Instead, we decomposed LP/SFE into the single summations of decentralized label propagation (eq. 2) and implemented each summation using SFE. This implementation allows the nodes to leak elements of the label matrix computed in the middle of label propagation, but the required computation time is largely reduced. We regard the computation time of this relaxed LP/SFE as a lower bound on the computation time of LP/SFE in the experiments.
Fig. 3. Accuracies and computational costs of PPLP and PPLP-D vs. other methods. (a) and (b) show the error rates in the ROMN and MITN datasets with respect to the number of labeled nodes. (c) shows the computational time (msec) of a single update in an undirected graph with respect to the maximum number of links per node Δ. (d) shows the computational time of a single update of PPLP-D in a directed graph with respect to Δ and the network size n.
Table 3. The computation time until convergence (10 itr.) in ROMN (Δ = 9) and
disclosed information
labels and links is essential, PPLP or LP/SFE could be a good choice, because the accuracies of PPLP and LP/SFE, which observe no labels, are always equivalent to that of LP, which is allowed to observe all labels.
7 Conclusion
In this paper, we introduced novel privacy models for labeled graphs, and then
stated secure label prediction problems with these models. We proposed solu-
tions for secure label prediction, PPLP and PPLP-D, which allow us to execute
label prediction without sharing private links and node labels. Our methods are
scalable compared to existing privacy preserving methods. We experimentally
showed that our protocol completed the label prediction of a graph with 288
nodes in about 10 seconds. In undirected graphs, the complexity is proportional
to the maximum number of links, rather than the network size. We can con-
clude that our protocol achieves both privacy preservation and scalability, even
in large-scale networks. The computational cost in directed graphs is relatively larger than that in undirected graphs. Our future work will involve the implementation of
our proposal in real social problems.
References
1. Bearman, P., Moody, J., Stovel, K.: Chains of affection: The structure of adolescent
romantic and sexual networks. American J. of Sociology 110(1), 44–91 (2004)
2. Damgård, I., Jurik, M.: A Generalisation, a Simplification and Some Applications
of Paillier’s Probabilistic Public-Key System. In: Kim, K.-c. (ed.) PKC 2001. LNCS,
vol. 1992, pp. 119–136. Springer, Heidelberg (2001)
3. Damgård, I.B., Koprowski, M.: Practical threshold RSA signatures without a
trusted dealer. In: Pfitzmann, B. (ed.) EUROCRYPT 2001. LNCS, vol. 2045, pp.
152–165. Springer, Heidelberg (2001)
4. Duan, Y., Wang, J., Kam, M., Canny, J.: Privacy preserving link analysis on dy-
namic weighted graph. Comp. & Math. Organization Theory 11(2), 141–159 (2005)
5. Eagle, N., Pentland, A., Lazer, D.: Inferring social network structure using mobile
phone data. In: PNAS (2007)
6. Goldreich, O.: Foundations of cryptography: Basic applications. Cambridge Uni-
versity Press, Cambridge (2004)
7. Goldschlag, D., Reed, M., Syverson, P.: Onion routing. Communications of the
ACM 42(2), 39–41 (1999)
8. Joachims, T.: Transductive inference for text classification using support vector
machines. In: Proc. ICML (1999)
9. Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay: secure two-party computation
system. In: Proc. of the 13th USENIX Security Symposium, pp. 287–302 (2004)
10. Sakuma, J., Kobayashi, S.: Link analysis for private weighted graphs. In: Proceed-
ings of the 32nd International ACM SIGIR, pp. 235–242. ACM, New York (2009)
11. Sakuma, J., Kobayashi, S., Wright, R.: Privacy-preserving reinforcement learning.
In: Proceedings of the 25th International Conference on Machine Learning, pp.
864–871. ACM, New York (2008)
12. Weston, J., Leslie, C., Ie, E., Zhou, D., Elisseeff, A., Noble, W.: Semi-supervised
protein classification using cluster kernels. Bioinformatics 21(15), 3241–3247 (2005)
13. Yao, A.: How to generate and exchange secrets. In: Proc. of the 27th IEEE Annual
Symposium on Foundations of Computer Science, pp. 162–167 (1986)
14. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local
and global consistency. In: Advances in Neural Information Processing Systems
16: Proceedings of the 2003 Conference, pp. 595–602 (2004)
15. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using gaussian
fields and harmonic functions. In: ICML (2003)
Novel Fusion Methods for Pattern Recognition
Centre for Vision, Speech and Signal Processing (CVSSP) University of Surrey, UK
{m.rana,f.yan,k.mikolajczyk,j.kittler}@surrey.ac.uk
Abstract. Over the last few years, several approaches have been pro-
posed for information fusion including different variants of classifier level
fusion (ensemble methods), stacking and multiple kernel learning (MKL).
MKL has become a preferred choice for information fusion in object
recognition. However, in the case of highly discriminative and comple-
mentary feature channels, it does not significantly improve upon its triv-
ial baseline which averages the kernels. Alternative ways are stacking and
classifier level fusion (CLF) which rely on a two phase approach. There
is a significant amount of work on linear programming formulations of
ensemble methods particularly in the case of binary classification.
In this paper we propose a multiclass extension of binary ν-LPBoost,
which learns the contribution of each class in each feature channel. The
existing approaches to classifier fusion promote sparse feature combinations, due to regularization based on the ℓ1-norm, and lead to the selection of a subset of feature channels, which is suboptimal in the case of informative channels. Therefore, we generalize existing classifier fusion formulations to an arbitrary ℓp-norm for binary and multiclass problems, which results in a more effective use of complementary information. We also extend stacking to both binary and multiclass datasets. We present an extensive
evaluation of the fusion methods on four datasets involving kernels that
are all informative and achieve state-of-the-art results on all of them.
1 Introduction
The goal of this paper is to investigate machine learning methods for combin-
ing different feature channels for pattern recognition. Due to the importance
of complementary information in feature combination, much research has been
undertaken in the field of low level feature design to diversify kernels, leading to
a large number of feature channels (kernels) in typical pattern recognition tasks.
Kernels are often computed independently of each other, thus may be highly
redundant. On the other hand, different kernels capture different aspects of in-
traclass variability while being discriminative at the same time. Proper selection
and fusion of kernels is, therefore, crucial to optimizing the performance and to
addressing the efficiency issues in large scale pattern recognition applications.
The key idea of MKL [10,15,20], in the case of SVM, is to learn a linear com-
bination of given base kernels by maximizing the soft margin between classes
using ℓ1-norm regularization on the weights. In contrast to MKL, the main idea of
classifier level fusion [8] is to construct a set of base classifiers and then classify
where K̇_r(x) is the column corresponding to test sample x, Y is an m × m matrix with the labels y_i on the diagonal, and α is a vector of Lagrangian multipliers.
In MKL, the aim is to find a convex combination of kernels K = Σ_{r=1}^{n} β_r K_r by maximizing the soft margin [1,10,15,20,25] using the following program:

min_{w_r, ξ, b, β}  (1/2) Σ_{r=1}^{n} w_r^T w_r + C Σ_{i=1}^{m} ξ_i    (2)

s.t.  y_i ( Σ_{r=1}^{n} ⟨w_r, β_r Φ_r(x_i)⟩ + b ) ≥ 1 − ξ_i,   ξ ≥ 0,  β ≥ 0,  ‖β‖_p^p ≤ 1
The dual of Eq. (2) can be derived easily using Lagrange multiplier techniques.
The MKL primal for linear combination and its corresponding dual are derived
for different formulations in [1,9,10,15,20,24] and compared in [25] which also
extended MKL to the multiclass case. The dual problem can be solved by us-
ing several existing MKL approaches, e.g, SDP [10], SMO [1], SILP [20] and
simpleMKL [15]. The decision function for MKL SVM is the sign of f (x):
n
f (x) = βr K̇r (x)T Y α + b. (3)
r=1
Novel Fusion Methods for Pattern Recognition 143
n
f (x) = βr gr (x). (4)
r=1
Note that for the SVM, f (x) is a linear combination of the real valued output
of n SVMs, where gr (x) is given by Eq. (1). The decision function of MKL in
Eq. (3) shows that the same set of parameters {α, b} is shared by all participating
kernels. In contrast to MKL, the decision function of CLF methods in Eq. (4)
uses separate sets of SVM parameters, since different {α, b} embedded in gr (x)
can be used for each base learner. In that sense, MKL can be considered as a
restricted version of CLF [6]. The aim of ensemble learning is to find the optimal weight vector β for the linear combination of base classifiers given by Eq. (4).
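To make the contrast concrete, a sketch of binary classifier level fusion with precomputed kernels in scikit-learn: each base SVM carries its own {α, b}, and β weights the real-valued outputs as in Eq. (4). The names and toy kernels are ours; a uniform β corresponds to the unweighted-sum baseline:

```python
import numpy as np
from sklearn.svm import SVC

def clf_decision(K_train_list, K_test_list, y_train, beta, C=1.0):
    """Classifier level fusion, Eq. (4): f(x) = sum_r beta_r g_r(x),
    with g_r the real-valued output of an SVM trained on the r-th
    precomputed kernel (so each g_r has its own alpha and b)."""
    f = np.zeros(K_test_list[0].shape[0])
    for K_tr, K_te, b_r in zip(K_train_list, K_test_list, beta):
        svm = SVC(kernel="precomputed", C=C).fit(K_tr, y_train)
        f += b_r * svm.decision_function(K_te)
    return np.sign(f)

# Toy usage with two base kernels (linear and RBF) on random data
rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(40, 5)), rng.normal(size=(10, 5))
ytr = np.where(Xtr[:, 0] > 0, 1.0, -1.0)
lin = lambda A, B: A @ B.T
rbf = lambda A, B: np.exp(-((A[:, None] - B[None]) ** 2).sum(-1))
pred = clf_decision([lin(Xtr, Xtr), rbf(Xtr, Xtr)],
                    [lin(Xte, Xtr), rbf(Xte, Xtr)], ytr, beta=[0.5, 0.5])
```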
We define the margin (or classification confidence) for an example x_i as ρ_i := y_i f(x_i) = y_i Σ_{r=1}^{n} β_r g_r(x_i), and the normalized (smallest) margin as:

ρ := min_{1≤i≤m} y_i f(x_i) = min_{1≤i≤m} y_i Σ_{r=1}^{n} β_r g_r(x_i).    (5)
It has been argued that AdaBoost maximizes the smallest margin ρ on the training set [16]. Based on this idea and the idea of soft margin SVM formulations, the ν-LP-AdaBoost formulation has been proposed in [16]. The ν-LPBoost performs a sparse selection of feature channels due to the ℓ1 regularization, which is suboptimal if all feature channels carry complementary information. Similarly, in the case of the ℓ∞ norm, noisy feature channels may have a significant impact on the results. To address these problems, we generalize binary classifier fusion to arbitrary norms {ℓp, p ≥ 1}.
max_{β,ξ,ρ}  ρ − (1/(νm)) Σ_{i=1}^{m} ξ_i    (6)

s.t.  y_i Σ_{r=1}^{n} β_r f_r(x_i) ≥ ρ − ξ_i,  ∀ i = 1, ..., m,
      ‖β‖_p^p ≤ 1,  β ≥ 0,  ξ ≥ 0,  ρ ≥ 0,

where ξ_i are slack variables which accommodate negative margins. The regularization constant is given by 1/(νm), which corresponds to the C constant in SVM. Problem (6) is a nonlinear separable convex optimization problem and can be solved efficiently to a global optimal solution by standard optimization toolboxes1.
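Assuming the base-classifier outputs are collected in an m × n matrix G with G[i, r] = g_r(x_i), problem (6) can indeed be handed to an off-the-shelf convex solver. A sketch with CVXPY (note that ‖β‖_p^p ≤ 1 is equivalent to ‖β‖_p ≤ 1 for p ≥ 1; names and toy data are ours):

```python
import cvxpy as cp
import numpy as np

def generalized_fusion_binary(G, y, nu=0.5, p=2.0):
    """Binary lp-norm classifier fusion, problem (6).
    G: m x n matrix of base-classifier outputs; y: labels in {-1, +1}."""
    m, n = G.shape
    beta = cp.Variable(n, nonneg=True)
    xi = cp.Variable(m, nonneg=True)
    rho = cp.Variable(nonneg=True)
    margins = cp.multiply(y, G @ beta)          # y_i * sum_r beta_r g_r(x_i)
    prob = cp.Problem(cp.Maximize(rho - cp.sum(xi) / (nu * m)),
                      [margins >= rho - xi,
                       cp.norm(beta, p) <= 1])  # ||beta||_p^p <= 1
    prob.solve()
    return beta.value

# Toy usage: three base classifiers, the first two informative
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=50)
G = np.column_stack([y + 0.3 * rng.normal(size=50),
                     y + 0.5 * rng.normal(size=50),
                     rng.normal(size=50)])
print(generalized_fusion_binary(G, y, p=2.0))   # non-sparse weights
```

With p close to 1 the solver recovers the sparse ν-LPBoost-style selection, while larger p spreads the weight over all informative channels.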
ρ_i(x_i, β) := Σ_{r=1}^{n} β_{(N_C(r−1)+y_i)} g_{r,y_i}(x_i) − Σ_{r=1}^{n} Σ_{y_j=1, y_j≠y_i}^{N_C} β_{(N_C(r−1)+y_j)} g_{r,y_j}(x_i)    (7)
The classification confidence for example x_i depends upon β and the scores from the base classifiers. The main difference between the two margins is that here we take the responses (scores multiplied by the corresponding weights) of all negative classes, sum them, and subtract this sum from the response of the positive class. This is done for all n feature channels. The normalized (smallest) margin can then be defined as ρ := min_{1≤i≤m} ρ(x_i, β). Inspired by the LP formulations of AdaBoost (cf. [16] and references therein), we propose to maximize the normalized margin ρ to learn the linear combination of base classifiers. However, the generalization performance of the LP formulation of AdaBoost based on maximizing only the normalized margin is inferior to AdaBoost for noisy problems [16]. Moreover, Theorem 2 in [16] highlights the fact that the minimum bound on the generalization error is not necessarily achieved with a maximum margin. To address these issues, a soft margin SVM based formulation with slack variables is introduced in Eq. (8). This formulation does not force all the margins to be greater than zero. To avoid penalization of informative channels and to gain robustness against noisy feature channels, we change the regularization norm to handle any arbitrary norm ℓp, ∀ p ≥ 1. The final optimization problem is (replacing ρ_i with Eq. (7)):
max_{β,ξ,ρ}  ρ − (1/(νm)) Σ_{i=1}^{m} ξ_i    (8)

s.t.  Σ_{r=1}^{n} β_{(N_C(r−1)+y_i)} g_{r,y_i}(x_i) − Σ_{r=1}^{n} Σ_{y_j=1, y_j≠y_i}^{N_C} β_{(N_C(r−1)+y_j)} g_{r,y_j}(x_i) ≥ ρ − ξ_i,  i = 1, ..., m,    (9)

      ‖β‖_p^p ≤ 1,  ρ ≥ 0,  β ≥ 0,  ξ ≥ 0,  ∀ i = 1, ..., m,
where 1/(νm) is the regularization constant and gives a trade-off between the minimum classification confidence ρ and the margin errors. This formulation looks similar to Eq. (6); in fact we use the same objective function, but the main difference is the definition of the margin used in the constraints in Eq. (9). Eq. (9) employs a lower bound on the difference between the classification confidence (margin) of the true class and the joint confidence of all other classes. It is important to note that the total number of constraints is equal to the number of training examples m plus one regularization constraint for the ℓp-norm.
min_{β,ξ,ρ}  −ρ + (1/(νm)) Σ_{i=1}^{m} ξ_i    (10)

s.t.  Σ_{r=1}^{n} β_r g_{r,y_i}(x_i) − max_{y_j≠y_i} Σ_{r=1}^{n} β_r g_{r,y_j}(x_i) ≥ ρ − ξ_i,  ∀ i = 1, ..., m,    (11)

      ‖β‖_p^p ≤ 1,  β_r ≥ 0,  ξ_i ≥ 0,  ρ ≥ 0,  ∀ r = 1, . . . , n,  ∀ i = 1, . . . , m.
min_{B,ξ,ρ}  −ρ + (1/(νm)) Σ_{i=1}^{m} ξ_i    (12)

s.t.  Σ_{r=1}^{n} B_{r y_i} g_{r,y_i}(x_i) − Σ_{r=1}^{n} B_{r y_j} g_{r,y_j}(x_i) ≥ ρ − ξ_i,  ∀ y_j ≠ y_i,  i = 1, ..., m,    (13)
The first set of constraints (Eq. (13)) gives a lower bound on the pairwise difference between the classification confidences (margins) of the true class and each non-target class. Note that in this formulation N_C − 1 constraints are added for every training example, and the total number of constraints is m × (N_C − 1) + 1.
Discussion: The main difference between the three multiclass approaches discussed in this section lies in the definition of the feasible region, which is defined by Eq. (9), Eq. (11) and Eq. (13) for NLP-νMC, NLP-β and NLP-B, respectively. In NLP-β and Lp-β [6] the feasible region depends on the difference between the classification confidence of the true class and the closest non-target
class only. The total number of constraints in this case is m + 1. The feasible region of NLP-B and LP-B [6] is defined by the pairwise difference between the class confidence of the true class and each non-target class, added as one constraint at a time. In other words, each difference pair is added as an independent constraint without any interaction with the others. There are N_C − 1 such constraints for each example and the total number of constraints is m × (N_C − 1) + 1. The large number of constraints makes this approach less attractive for datasets with a large number of classes. For example, for Caltech101 [4] with only 15 images per class for training, the number of constraints for LP-B is more than 150 thousand (15 × 101 × 100 + 1 ≈ 1.5 × 10^5). In the case of our NLP-νMC, the feasible region depends upon the joint classification confidence of all the non-target classes subtracted from the class confidence of the true class. Thus, the feasible region of NLP-νMC is much smaller than the feasible region of NLP-B. Due to these joint constraints the total number of constraints for NLP-νMC is m + 1; e.g., for Caltech101 [4] with 15 images per class for training, the number of constraints for NLP-νMC is only 1516 (15 × 101 + 1), which is only 1% of the constraints in NLP-B. We can therefore apply NLP-νMC to large multiclass datasets, as opposed to NLP-B, especially for norms greater than 1. Note that the difference in complexity between NLP-νMC and NLP-β or binary classifier fusion is the extended weight vector β.
4 Extended Stacking
(3x1) and image quarters (2x2). The descriptors are clustered using k-means to
form a codebook of 4000 visual words. Each spatial grid is then represented by
histograms of codebook occurrences and a separate kernel matrix is computed
for each grid. The kernel function to compute entry (i, j) of the kernel matrix is
based on χ2 distance between features Fi and Fj .
K(F_i, F_j) = e^{−(1/A) dist(F_i, F_j)}    (14)

where A is a scalar normalizing the distance, set to the average χ² distance between all features.
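A sketch of the kernel computation of Eq. (14), assuming the common χ² distance dist(F_i, F_j) = Σ_a (F_{ia} − F_{ja})² / (F_{ia} + F_{ja}) between histograms (the exact χ² variant used is an assumption on our part):

```python
import numpy as np

def chi2_kernel(F, G=None, eps=1e-10):
    """K(F_i, G_j) = exp(-dist(F_i, G_j) / A), Eq. (14), with the
    chi-square distance between L1-normalised histograms and A set
    to the mean pairwise distance, as in the text."""
    G = F if G is None else G
    num = (F[:, None, :] - G[None, :, :]) ** 2
    den = F[:, None, :] + G[None, :, :] + eps      # eps avoids division by 0
    dist = (num / den).sum(axis=-1)
    A = dist.mean()                                # average chi^2 distance
    return np.exp(-dist / A)

# Toy usage: 6 histograms over a 4000-word codebook
rng = np.random.default_rng(0)
F = rng.random((6, 4000)); F /= F.sum(axis=1, keepdims=True)
K = chi2_kernel(F)                                 # 6 x 6 kernel matrix
```

For a train/test kernel, A would naturally be fixed from the training distances rather than recomputed on the test block.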
We apply Support Vector Machines (SVM) as base classifiers for the nonlinear classifier level fusion schemes and the stacking proposed in this paper, and compare them with MKL schemes. The regularization parameter for the SVM is in the set {2^{−2}, 2^{0}, 2^{3}, 2^{7}, 2^{10}, 2^{15}}. The regularization parameter ν for the different CF methods is in the range ν ∈ [0.05, 0.95] with a step size of 0.05. Both SVM and CF regularization parameters are selected on the validation set. The values of the norms for generalized classifier fusion are in the range p ∈ {1, 1 + 2^{−5}, 1 + 2^{−3}, 1 + 2^{−1}, 2, 3, 4, 8, 10^4}. We consider each value of p as a separate fusion scheme. Note that for p = 10^4 we get uniform weights, which corresponds to the unweighted sum or ℓ∞. Figure 1 shows the weights learnt on the training set of the aeroplane category of Pascal VOC 2007 for several values of p using CLF. The plotted weights correspond to the optimal value of the regularization parameter C of the SVM. The sparsity of the learnt weights can easily be observed for low values of p. The sparsity decreases as p increases, up to the uniform weights (corresponding to ℓ∞) obtained at p = 10^4. Weights can also be learnt corresponding to the best performing p on the validation set.
The mean average precisions for several fusion methods are given in Table 1. Row MKL shows the results for nine MKL methods with different regularization norms applied to the 5 base kernels. Note that MAP increases with the decrease in sparsity at higher values of the norm. A similar trend can be found in CLF. The low performance of MKL with ℓ1-norm, which leads to a sparse selection, indicates that the base kernels carry complementary information. Therefore, the non-sparse MKL or CLF methods, such as ℓ2-norm and ℓ∞-norm, give better results, as reported in Table 1. The unweighted sum in the case of MKL performs better than any other MKL method, which reflects that in the case of all-informative channels, learning the weights for MKL does not improve much on this dataset. The proposed non-sparse CLF (ℓ2) schemes outperform the state-of-the-art MKL (ℓ2-norm, ℓ∞-norm) by 2% and 1.1%, respectively. Stacking performs best among all the methods and outperforms MKL by 1.5%. Further improvements can be gained by fusing the stacking kernel together with the 5 base kernels in the case of both MKL and CLF. The combination of the base kernels plus the stacking kernel under MKL produced the state-of-the-art result on this dataset with a MAP of 66.24%, outperforming MKL and CLF by 3.3% and 2.3%, respectively.
5.2 Flower 17
Flower 17 [14] consists of 17 categories of flowers common in the UK, with 80 im-
ages in each category. The dataset is split into training (40 images per class),
Fig. 1. Pascal VOC 2007. Feature channel weights learned with various p for CLF (ℓp).
validation (20 images per class) and test (20 images per class) sets using 3 predefined random splits by the authors of the dataset. There are large appearance variations within each category and similarities with other categories. For the experiments we have used 7 RBF kernels computed from the 7 χ² distance matrices provided online3. The features used to compute these distance matrices include different types of shape, texture and color based descriptors, whose details can be found in [14]. We have used SVM as the base classifier and its regularization parameter is in the range {10^{−2}, 10^{−1}, ..., 10^{3}}. The regularization parameter ν for the different CLF methods is in the range ν ∈ {0.05, 0.1, . . . , 0.95}. Both SVM and CLF regularization parameters are selected on the validation set. To carry out a fair comparison, the regularization parameters and other settings are the same as in [6].
The results given in Table 2 show that the baseline for MKL, i.e., MKL-avg (ℓ∞),
gives 84.9% [6], and the baseline for classifier level fusion, i.e., CLF (ℓ∞),
gives 86.7%. The MKL results are obtained using the SHOGUN multiclass MKL
implementation for different norms. Nonlinear versions of classifier fusion perform
better than their sparse counterparts as well as state-of-the-art MKL. The
best results in CLF are obtained by the proposed NLP-νMC (ℓ2) and NLP-β (ℓ4).
They outperform the MKL baseline by more than 2.5% and multiclass MKL
by 0.6%. Stacking yields the best results on this dataset, outperforming the MKL
baseline by more than 4.5%, MKL by more than 2% and the best CLF method
by more than 1.5%. Combining the stacking kernel with the 7 base kernels using
multiclass MKL shows similar results. Note that the performance drops
when the stacking kernel is combined with the 7 base kernels using MKL (ℓ∞)
or CLF (ℓ∞). This highlights the importance of learning in fusion methods.
However, when the stacking kernel is combined with the 7 base kernels using
classifier fusion, it produces state-of-the-art results on this dataset, and outperforms
MKL, the best CLF method and stacking by 3%, 2.3% and 0.8%, respectively.
The second half of Table 2 shows a comparison with published state-of-the-art
results. To our knowledge, the best performing method
3 http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html
using the 7 distance matrices provided by the authors gives 86.7%, which is similar to the
CLF baseline. Our best CLF method outperforms it by 1.2%, our stacking
approach outperforms it by 2.7%, and our CLF combination of base plus stacking kernels
outperforms it by 3.5%. It is important to note that, when comparing fusion
methods, the base feature channels (kernels) must be the same across the different
schemes. For example, the comparison of Flower 17 results with the state of the art in [22]
is not justified, as it uses 30 kernels while normally the results are reported using
the 7 kernels provided online. Nevertheless, our best method outperforms it
by 1.2%, which can be considered a significant improvement in spite of using
4 times fewer feature channels.
5.3 Flower 102
For comparison with previously published results we use the same split as used by
other authors. The baseline for MKL gives 73.4%, and the baseline for CLF gives
73.0%. Multiclass MKL does not perform well on this dataset; the best
result is achieved by MKL (ℓ1) and is 3.5% lower than the trivial baseline.
The best among the classifier level fusion methods is the NLP-β (ℓ_{1+2^{−3}}) scheme. It performs
5.8% better than multiclass MKL, and 2.3% and 2.7% better than the MKL and CLF
baselines, respectively. Note that NLP-νMC performs worse than NLP-β, as
it has to estimate N_C times more parameters than NLP-β in the presence of few
training examples per category. We expect NLP-νMC to perform better in the
presence of more training data. Stacking achieves the best results on this dataset:
it performs 7.8% better than multiclass MKL, and 4.3% and 4.7% better than the
MKL and CLF baselines, respectively. The results can be further improved by
combining the stacking kernel with the 4 base kernels using MKL or CLF.
However, the performance drops when the stacking kernel is combined with the
4 base kernels using MKL (ℓ∞) or CLF (ℓ∞). This highlights the importance of
learning in fusion methods. We achieve state-of-the-art results on this dataset
by combining the stacking kernel with the 4 base kernels using CLF. This combination
performs 10% better than multiclass MKL, and 6.6%, 7% and 2.3%
better than the MKL baseline, CLF baseline and stacking, respectively. Note that
we are unable to compute the mean accuracy for NLP-B, especially for ℓp-norms
greater than 1, due to the large number of constraints in the optimization problem.
The results for MKL are reported from [13] for comparison. In comparison to
the published results, our best method achieves an improvement of 7.2%, which is a
significant gain given that we are not using any new information.
5.4 Caltech101
50 images per category are randomly selected for testing. The average accuracy
is computed over all 101 object classes. This process is repeated 3 times and the
mean accuracy over the 3 splits is reported for each method. In this experiment, we
combine 10 feature channels based on the features introduced in [12,19] with
dense sampling strategies. The RBF kernel function used to compute the kernel matrices
from the χ2 distance matrices is given in Eq. (14). The experimental setup is
the same as for Flower 17.
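Eq. (14) itself is not reproduced in this excerpt; a common construction for such kernels, assumed in the sketch below, sets the kernel width to the mean χ2 distance:

```python
import numpy as np

def chi2_rbf_kernel(D):
    """RBF kernel from a chi-square distance matrix D (n x n):
    K = exp(-D / gamma), with gamma set to the mean pairwise
    distance (an assumption; the paper's Eq. (14) may differ)."""
    gamma = np.mean(D)
    return np.exp(-D / gamma)
```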
The results of the proposed methods are presented in Table 4 and compared
with other techniques. The baseline for MKL gives 67.4% and the baseline for
CLF gives 68.5%. The best result among the MKL methods is achieved by multiclass MKL
(ℓ1). It performs 1.2% better than the MKL baseline and similarly to the CLF
baseline. Stacking does not perform well on this dataset: it performs 0.6% better
than the MKL baseline, but worse than both the CLF baseline
and multiclass MKL. Classifier level fusion achieves the best results on this dataset
(NLP-β (ℓ3)). It performs 1.8% and 0.7% better than the MKL and CLF baselines, and
0.6% better than multiclass MKL. The results can be further improved
by using the stacking kernel together with the 10 base kernels. We achieve state-of-the-art
results on this dataset by combining the stacking kernel with the 10 base kernels
using CLF. This combination performs 3.3%, 2.7%, 2.2% and 2.1% better than
the MKL baseline, stacking, the CLF baseline and multiclass MKL, respectively. Note that
we are unable to compute the mean accuracy for NLP-B, especially for ℓp-norms
greater than 1, due to the large number of constraints in the optimization problem.
It is well known that the type and the number of kernels have a large impact
on the overall performance. Therefore, a direct comparison of scores with the
published methods is not entirely fair. Nonetheless, it can be noted that the best
performing methods on Caltech101 in [7] and [6] using a single kernel give
60% and 61%, respectively. The performance in [6] using 8 kernels is close to 63%,
while the performance using 39 feature channels is 70.4%. Note that our best
method gives 70.7% using only 10 feature channels, which can be considered a
significant improvement given that we have used 4 times fewer feature channels.
6 Conclusions
In this paper we proposed a nonlinear separable convex optimization formulation
for multiclass classifier fusion (NLP-νMC), which learns the weight for each
class in every feature channel. We have also extended linear programming for
binary and multiclass classifier fusion (ensemble methods) to nonlinear separable
convex classifier fusion by incorporating arbitrary norms. Unlike the existing
methods, these formulations do not reject informative feature channels; they make
classifier fusion robust to both noisy and redundant feature channels, which
results in improved performance.
We also extended stacking to both binary and multiclass datasets.
By considering stacking as a separate feature channel, we can combine the stacking
kernel with the base kernels using any of the proposed fusion methods (a sketch is
given below). We have performed comparative experiments on challenging object
recognition benchmarks for both multi-label and multiclass cases. Our results show
that the optimal p is an intrinsic property of the kernel set and can differ between datasets.
It can be learnt systematically using a validation set. In general, if some channels
are noisy, the ℓ1-norm is better (sparse weights); for carefully designed features, non-sparse
solutions, e.g., the ℓ2-norm, are better. Note that both are special cases of
our approaches. The proposed methods perform better than the state-of-the-art
MKL methods. In addition, the non-sparse version of classifier fusion
performs better than sparse selection of feature channels. We achieve
state-of-the-art performance on all datasets by combining the stacking kernel
with the base kernels using classifier level fusion.
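As referenced above, here is a minimal sketch of building a stacking kernel from per-channel classifier scores (our own construction using scikit-learn; the paper's exact formulation may differ):

```python
import numpy as np
from sklearn.svm import SVC

def stacking_kernel(train_kernels, y_train, eval_kernels, C=1.0):
    """Train one SVM per base kernel (the first, parallelizable step),
    stack their decision values as a new feature channel, and build a
    linear kernel on it (the 'stacking' kernel)."""
    scores = []
    for K_tr, K_ev in zip(train_kernels, eval_kernels):
        clf = SVC(kernel="precomputed", C=C).fit(K_tr, y_train)
        scores.append(clf.decision_function(K_ev))  # one score per sample
    F = np.column_stack(scores)     # samples x channels score matrix
    return F @ F.T                  # linear kernel on the stacked scores
```

The resulting kernel can then be combined with the base kernels by MKL or CLF, as in the experiments above.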
The two-step training of classifier fusion may seem an overhead. However,
the first step is independent for each feature channel as well as for each class, and can
be performed in parallel. Independent training also makes the system applicable
to large datasets. Moreover, in MKL one has to train an SVM classifier in the α-step
before obtaining the optimal weights. As MKL optimizes the parameters jointly,
one may argue that the independent optimization of weights in classifier
fusion is less effective. However, as our consistently better results show, these
schemes seem to be more suitable for visual recognition problems. The proposed
classifier fusion schemes are attractive alternatives to the state-of-the-art
MKL approaches for both binary and multiclass problems, and address the
complexity issues of MKL.
References
1. Bach, F., Lanckriet, G., Jordan, M.: Multiple Kernel Learning, Conic Duality, and
the SMO Algorithm. In: ICML (2004)
2. Džeroski, S., Ženko, B.: Is combining classifiers with stacking better than selecting
the best one? ML 54(3), 255–273 (2004)
3. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The
pascal visual object classes (voc) challenge. IJCV 88(2), 303–338 (2010)
4. Fei-Fei, L., Fergus, R., Perona, P.: One-shot Learning of Object Categories. PAMI,
594–611 (2006)
5. Freund, Y., Schapire, R.: A Decision-Theoretic Generalization of On-Line Learning
and an Application to Boosting. In: CLT (1995)
6. Gehler, P., Nowozin, S.: On Feature Combination for Multiclass Object Classifica-
tion. In: ICCV (2009)
7. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Tech. Rep.
7694, California Institute of Technology (2007)
8. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. PAMI 20(3),
226–239 (1998)
9. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A., Laskov, P., Müller, K.: Efficient
and Accurate lp-norm MKL. In: NIPS (2009)
10. Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., Jordan, M.: Learning the
Kernel Matrix with Semidefinite Programming. JMLR 5, 27–72 (2004)
11. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid
Matching for Recognizing Natural Scene Categories. In: CVPR (2006)
12. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors.
PAMI 27(10), 1615–1630 (2005)
13. Nilsback, M.E., Zisserman, A.: Automated Flower Classification over a Large Num-
ber of Classes. In: ICCVGIP (2008)
14. Nilsback, M., Zisserman, A.: A visual Vocabulary for Flower Classification. In:
CVPR (2006)
15. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. JMLR 9,
2491–2521 (2008)
16. Rätsch, G., Schölkopf, B., Smola, A., Mika, S., Müller, K., Onoda, T.: Robust
Ensemble Learning for Data Analysis. In: PACKDDM (2000)
17. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label
classification. In: MLKDD, pp. 254–269 (2009)
18. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. JMLR 5, 101–141
(2004)
19. van de Sande, K., Gevers, T., Snoek, C.: Evaluation of color descriptors for object
and scene recognition. In: CVPR (2008)
20. Sonnenburg, S., Rätsch, G., Schafer, C., Schölkopf, B.: Large Scale Multiple Kernel
Learning. JMLR 7, 1531–1565 (2006)
21. Wolpert, D.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
22. Xie, N., Ling, H., Hu, W., Zhang, Z.: Use bin-ratio information for category and
scene classification. In: CVPR (2010)
23. Yan, F., Mikolajczyk, K., Barnard, M., Cai, H., Kittler, J.: Lp norm multiple kernel
fisher discriminant analysis for object and image categorisation. In: CVPR (2010)
24. Ying, Y., Huang, K., Campbell, C.: Enhanced protein fold recognition through a
novel data integration approach. BMCB 10(1), 267 (2009)
25. Zien, A., Ong, C.: Multiclass Multiple Kernel Learning. In: ICML (2007)
A Spectral Learning Algorithm for Finite State
Transducers
1 Introduction
The independence assumptions are that Pr[h_s | x, h_{1:s−1}] = Pr[h_s | x_{s−1}, h_{s−1}] and
Pr[y_s | x, h, y_{1:s−1}] = Pr[y_s | h_s]. That is, given the input symbol at time s − 1 and
the hidden state at time s − 1, the probability of the next state is independent
of anything else in the sequence; and given the state at time s, the probability
of the corresponding output symbol is independent of anything else. We usually
drop the subscript when the FST is obvious from the context.
Equation (1) shows that the conditional distribution defined by an FST P can
be fully characterized using standard transition, initial and emission parameters,
which we define as follows. For each symbol a ∈ X, let T_a ∈ R^{m×m} be the state
transition probability matrix, where T_a(i, j) = Pr[H_s = c_i | X_{s−1} = a, H_{s−1} = c_j].
Write α ∈ R^m for the initial state distribution, and let O ∈ R^{l×m} be the emission
probability matrix, where O(i, j) = Pr[Y_s = b_i | H_s = c_j]. Given b_i ∈ Y, we write D_{b_i}
to denote the m × m diagonal matrix with the ith row of O as its diagonal elements.
Similarly, we write D_α for the m × m diagonal matrix with α as its diagonal values.
To calculate probabilities of output sequences with FSTs we employ the notion
of observable operators, which is commonly used for HMMs [22,4,11,14]. The
following lemma shows how to express P(y|x) in terms of these quantities using
the observable operator view of FSTs.
Our learning algorithms will learn model parameterizations that are based
on this observable operators view of FSTs. As input they will receive a sample
of input-output pairs sampled from an input distribution D over X ∗ ; the joint
distribution will be denoted by D ⊗ P. In general, learning FSTs is known to be
hard. Thus, our learning algorithms need to make some assumptions about the
FST and the input distribution. Before stating them we introduce some notation.
For any a ∈ X let p_a = Pr[X_1 = a], and define an "average" transition matrix
T = Σ_a p_a T_a for P and μ = min_a p_a, which characterizes the spread of D.
Assumptions. An FST can be learned when D and P satisfy the following:
(1) l ≥ m, (2) D_α and O have rank m, (3) T has rank m, (4) μ > 0.
Assumptions 1 and 2 on the nature of the target FST have counterparts in HMM
learning. In particular, the assumption on D_α requires that no state has zero
initial probability. Assumption 3 is an extension to FSTs of a similar condition
for HMMs, but in this case it depends on D as well as on P. Condition 4 on the input
distribution ensures that all input symbols will be observed in a large sample.
For more details about the implications of these assumptions, see [13].
P = O T D_α O^T ,   (5)
P_ab = O T_a D_b T D_α O^T .   (6)
Algorithm LearnFST(X, Y, S, m)
Input:
– X and Y are the input and output alphabets
– S = {(x^1, y^1), . . . , (x^n, y^n)} is a training set of input-output sequences
– m is the number of hidden states of the FST
Output:
– Estimates of the observable parameters β_1, β_∞ and B_a^b for all a ∈ X and b ∈ Y
Note that if Assumptions 1–3 are satisfied, then P has rank m. In this case we
can perform an SVD P = U Σ V^* and take U ∈ R^{l×m} to contain its top m
left singular vectors. It is shown in [13] that under these conditions the matrix
U^T O is invertible. Finally, let ρ ∈ R^l be the vector of initial symbol probabilities, where
ρ(i) = Pr[Y_1 = b_i].
Estimates of all these matrices can be efficiently computed from a sample
obtained from D ⊗ P. Now we use them to define the following observable
representation for P:
β_1 = U^T ρ ,   (7)
β_∞^T = ρ^T (U^T P)^+ ,   (8)
B_a^b = (U^T P_ab)(U^T P)^+ .   (9)
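A minimal sketch of this estimation step follows (our own code; ρ̂, P̂ and P̂_ab are taken as empirical frequencies over the first symbols of each pair, with |x| ≥ 3 assumed for simplicity — the paper's exact normalization is not reproduced in this excerpt):

```python
import numpy as np

def learn_fst(S, k, l, m):
    """Sketch of LearnFST: estimate rho, P and P_ab from a sample S of
    integer-coded (x, y) pairs, then compute Eqs. (7)-(9)."""
    rho = np.zeros(l)               # rho(i) ~ Pr[Y1 = b_i]
    P = np.zeros((l, l))            # P(i,j) ~ Pr[Y2 = b_i, Y1 = b_j]
    Pab = np.zeros((k, l, l, l))    # Pab[a, b](i, j), cf. Eq. (6)
    na = np.zeros(k)
    for x, y in S:
        rho[y[0]] += 1
        P[y[1], y[0]] += 1
        Pab[x[1], y[1], y[2], y[0]] += 1   # condition on X2 = a (assumption)
        na[x[1]] += 1
    rho /= len(S)
    P /= len(S)
    for a in range(k):
        Pab[a] /= max(na[a], 1.0)

    U = np.linalg.svd(P)[0][:, :m]       # top-m left singular vectors of P
    UP_pinv = np.linalg.pinv(U.T @ P)
    beta1 = U.T @ rho                    # Eq. (7)
    beta_inf = rho @ UP_pinv             # Eq. (8), stored as a row vector
    B = {(a, b): U.T @ Pab[a, b] @ UP_pinv   # Eq. (9)
         for a in range(k) for b in range(l)}
    return beta1, beta_inf, B
```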
The next lemma shows how to compute FST probabilities using these new observable
operators.
Lemma 2 (Observable FST representation). Assume D and P obey Assumptions
1–3. For any a ∈ X, b ∈ Y, x ∈ X^t and y ∈ Y^t, the following hold:
β_1 = (U^T O) α ,   (10)
β_∞^T = 1^T (U^T O)^{−1} ,   (11)
B_a^b = (U^T O) A_a^b (U^T O)^{−1} ,   (12)
P(y|x) = β_∞^T B_{x_t}^{y_t} · · · B_{x_1}^{y_1} β_1 .   (13)
The proof is analogous to that of Lemma 3 of [13]. We omit it for brevity.
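For illustration, Eq. (13) translates directly into the following evaluation routine (consistent with the estimation sketch above):

```python
def fst_probability(beta1, beta_inf, B, x, y):
    """Eq. (13): P(y|x) = beta_inf^T B_{x_t}^{y_t} ... B_{x_1}^{y_1} beta1,
    applying the operator for (x_1, y_1) first."""
    v = beta1
    for a, b in zip(x, y):
        v = B[(a, b)] @ v
    return float(beta_inf @ v)
```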
α = O^+ ρ ,   (19)
T_a = O^+ P_a (D_α O^T)^+ .   (20)
The correctness of these expressions in the error-free case can be easily verified.
Though this method comes without an error analysis, the experiments
in Section 5 demonstrate that in some cases the parameters recovered with this
algorithm approximate the target FST better than the observable representation
obtained with LearnFST.
Note that this method is different from those presented in [18,13] for recovering
parameters of HMMs. Essentially, their approach requires finding a set of
eigenvectors, while our method recovers a set of joint eigenvalues. Furthermore,
our method could also be used to recover parameters of HMMs.
4 Theoretical Analysis
In this section the algorithm LearnFST is analyzed. We show that, under some
assumptions on the target FST and the input distribution, it will output a good
hypothesis with high probability whenever the sample is large enough. First we
discuss the learning model, then we state our main theorem, and finally we sketch
the proof. Our proof schema follows closely that of [13]; therefore, only the key
differences with their proof will be described, and, in particular, the lemmas
which are stated without proof can be obtained by mimicking their techniques.
This loss function corresponds to the L1 distance between D ⊗ P and D ⊗ P̂.
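The formal definition of d_D is not reproduced in this excerpt; a reconstruction consistent with the L1 characterization above (our notation) is

d_D(P, P̂) = Σ_{x ∈ X*} Σ_{y ∈ Y^{|x|}} D(x) |P(y|x) − P̂(y|x)| .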
4.2 Results
Our learning algorithm will be shown to work whenever D and P satisfy Assumptions
1–4. In particular, note that Assumptions 2 and 3 imply that the m-th singular values of
O and P, respectively σ_O and σ_P, are positive.
We proceed to state our main theorem. There, instead of restricting ourselves
to input-output sequences of some fixed length t, we consider the more general,
practically relevant case where D is a distribution over X*. In this case the bound
depends on λ = E_{X∼D}[|X|], the expected length of the input sequences.
Theorem 1. For any 0 < ε, δ < 1, if D and P satisfy Assumptions 1–4, and
LearnFST receives as input m and a sample with n ≥ N examples for some N
in
O( (λ^2 m l) / (ε^4 μ σ_O^2 σ_P^4) · log(k/δ) ) ,   (22)
then, with probability at least 1 − δ, the hypothesis P̂ returned by the algorithm
satisfies d_D(P, P̂) ≤ ε.
4.3 Proofs
The main technical difference between the algorithm LearnFST and spectral techniques
for learning HMMs is that in our case the operators B_a^b depend on the
input symbol as well as the output symbol. This implies that the estimation errors
of B_a^b for different input symbols depend on the input distribution; the
occurrence of μ in Equation (22) accounts for this fact.
The next lemma is almost identical to Lemma 10 in [13], and is repeated here for
completeness. Both this and the following one require that D and P satisfy
Assumptions 1–3 described above, and are stated in terms of three estimation-error
quantities: ε_1 and ε_∞, which bound the errors in the estimates β̂_1 and β̂_∞, and
ε_a for each a ∈ X, which bounds the errors in the estimated operators B̂_a^b.
Lemma 5. For all x ∈ X^t, let ε_x = ∏_{s=1}^t (1 + ε_{x_s}). The following hold:
Σ_{y ∈ Y^t} ||(U^T O)^{−1} (B_x^y β_1 − B̂_x^y β̂_1)||_1 ≤ (1 + ε_1) ε_x − 1 ,   (33)
Σ_{y ∈ Y^t} |P(y|x) − P̂(y|x)| ≤ (1 + ε_1)(1 + ε_∞) ε_x − 1 ,   (34)
where B_x^y denotes the product B_{x_t}^{y_t} · · · B_{x_1}^{y_1}.
where the first term is at most ε/2 by Lemma 5, and the second is bounded
using Markov's inequality.
5 Synthetic Experiments
In this section we present experiments using our FST learning algorithm with
synthetic data. We are interested in four different aspects. First, we want to evaluate
how the estimation error of the learning algorithm behaves as we increase
the training set size and the difficulty of the target. Second, how the estimation
error degrades with the length of the test sequences. Third, we want to
compare our algorithm with other, more naive, spectral methods for learning
FSTs. And fourth, we compare LearnFST with our other algorithm, which recovers
the parameters of an FST using a joint Schur decomposition.
For our first experiment, we generated synthetic data of increasing difficulty,
as predicted by our analysis, as follows. First, we randomly selected a distribution
over input sequences of length three, with input alphabet sizes ranging from 2
to 10, choosing among uniform, Gaussian and power distributions with random
parameters. Second, we randomly selected an FST, choosing output
alphabet sizes from 2 to 10, choosing a number of hidden states, and randomly
generating initial, transition and observation parameters. For a choice of input
distribution and FST, we computed the quantities appearing in the bound except
for the logarithmic term, and defined c = (λ^2 m l) / (μ σ_O^2 σ_P^4).
Fig. 2. (Left) Learning curves for models at increasing difficulties (c ≈ 10^6, 10^9, 10^10, 10^11), as predicted by our analysis. (Right) L1 distance with respect to the length of test sequences, for models trained with 32K, 128K and 512K training examples (k = 3, l = 3, m = 2).
According to our analysis, the quantity c is an estimate of the difficulty of learning the FST. In
this experiment we considered four random models that fall into different orders
of magnitude of c. For each model, we generated training sets of different sizes by sampling
from the corresponding distribution. Figure 2 (left) plots d_D(P, P̂) as a function
of the training set size, where each curve is an average of 10 runs. The curves
follow the behavior predicted by the analysis.
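A small sketch of computing c for a candidate model (our own code; the initial distribution α is assumed uniform here purely for illustration):

```python
import numpy as np

def difficulty(T_list, p, O, lam):
    """c = lambda^2 * m * l / (mu * sigma_O^2 * sigma_P^4).  T_list: the
    per-symbol transition matrices; p: input-symbol probabilities;
    O: emission matrix; lam: expected input length."""
    m, l = T_list[0].shape[0], O.shape[0]
    mu = float(np.min(p))
    T = sum(pa * Ta for pa, Ta in zip(p, T_list))   # "average" transitions
    alpha = np.full(m, 1.0 / m)                     # assumed uniform initial
    P = O @ T @ np.diag(alpha) @ O.T                # Eq. (5)
    sigma_O = np.linalg.svd(O, compute_uv=False)[m - 1]   # m-th sing. value
    sigma_P = np.linalg.svd(P, compute_uv=False)[m - 1]
    return lam ** 2 * m * l / (mu * sigma_O ** 2 * sigma_P ** 4)
```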
The results from our second experiment can be seen in Figure 2 (right), which
plots the error of learning a given model (with k = 3, l = 3 and m = 2) as a
function of the test sequence length t, for three training set sizes. The plot
shows that increasing the number of training samples has a clear impact on the
performance of the model on longer sequences. It can also be seen that, as we
increase the number of training samples, the curves flatten faster, i.e.,
the growth rate of the error with the sequence length decreases.
In the third experiment we compared LearnFST to two other baseline spectral
algorithms. These baselines are naive applications of the algorithm by Hsu
et al. [13] to the problem of FST learning. The first baseline (HMM) learns an
HMM that models the joint distribution D ⊗ P. The second baseline (k-HMM)
learns k different HMMs, one for each input symbol. This corresponds to learning
an operator B_a^b for each pair (a, b) ∈ X × Y using only the observations where
X_2 = a and Y_2 = b, ignoring the fact that one can use the same U, computed
with all samples, for every operator B_a^b. In this experiment, we randomly created
an input distribution D and a target FST P (using k = 3, l = 3, m = 2). Then
we randomly sampled training sequence pairs from D ⊗ P, and trained models
using the three spectral algorithms. To evaluate the performance we measured
the L1 distance on all sequence pairs of length 3. Figure 3 (left) plots the learning
curves resulting from averaging performance across 5 random runs of the
experiment. It can be seen that, with enough examples, the baseline algorithms
are outperformed by our method.
Fig. 3. (Left) Comparison with the spectral baselines (HMM, k-HMM, FST). (Right) Comparison with the joint decomposition method (svd vs. joint schur) on two randomly selected models.
Furthermore, the fact that the joint HMM outperforms the conditional FST with small sample sizes is consistent with the
well-known phenomenon in classification where generative models can outperform
discriminative models with small sample sizes [20].
The goal of our last experiment is to showcase the behavior of the algorithm presented
in Section 3.1, which recovers the parameters of an FST using a joint
Schur decomposition. Though we do not have a theoretical analysis of this algorithm,
several experiments indicate that its behavior tends to depend more on
the particular model than that of the other spectral methods. In particular, on
many models we observe an asymptotic behavior similar to that of LearnFST,
and on some of them we observe better absolute performance. Two
examples of this can be found in Figure 3 (right), where the accuracy versus
the number of examples is plotted for two different, randomly selected models
(with k = 3, l = 3, m = 2).
6 Experiments on Transliteration
In this section we present experiments on a real task in Natural Language Processing:
machine transliteration. The problem consists of mapping named entities
(e.g., person names, locations, etc.) between languages that have different alphabets
and sound systems, by producing a string in the target language that is
phonetically equivalent to the string in the source language. For example, the
English word "brooklyn" is transliterated into Russian as "бруклин".
Table 1. Properties of the transliteration dataset. “length ratio” is the average ratio
between lengths of input and output training sequences. “equal length” is the percent-
age of training sequence pairs of equal length.
Because orthographic and phonetic systems differ across languages, the lengths of paired
strings also differ in general. The goal of this experiment is to test the performance
of our learning algorithm on real data, and to compare it with a standard
EM algorithm for training FSTs.
We considered the English-to-Russian transliteration task of the NEWS shared
task [17]. Training and test data consist of pairs of strings. Table 1 gives additional
details on the dataset.
A standard metric to evaluate the accuracy of a transliteration system is the
normalized edit distance (ned) between the correct and predicted transliterations.
It counts the minimum number of character deletions, insertions and
substitutions needed to transform the predicted string into the
correct one, divided by the length of the correct string and multiplied by 100.
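For concreteness, a direct implementation of this metric (our own sketch):

```python
def ned(pred, gold):
    """Normalized edit distance: Levenshtein(pred, gold) / len(gold) * 100."""
    n, m = len(pred), len(gold)
    d = list(range(m + 1))              # single-row Levenshtein DP
    for i in range(1, n + 1):
        prev, d[0] = d[0], i
        for j in range(1, m + 1):
            cur = min(d[j] + 1,                              # deletion
                      d[j - 1] + 1,                          # insertion
                      prev + (pred[i - 1] != gold[j - 1]))   # substitution
            prev, d[j] = d[j], cur
    return 100.0 * d[m] / m
```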
In order to apply FSTs to this task we need to handle sequence pairs of unequal
lengths. Following the classic work on transliteration by Knight and Graehl
[16], we introduced special symbols in the output alphabet which account for an
empty emission and for every combination of two output symbols; thus, our FSTs
can map an input character to zero, one or two output characters. However,
the correct character alignments are not known. To account for this, for every
training pair we considered all possible alignments as having equal probability.2
It is easy to adjust our learning algorithm so that, when computing the probability
estimates (step 1 in the algorithm of Figure 1), we consider a distribution
over alignments between training pairs. This can be done efficiently with a simple
extension of the classic dynamic programming algorithm for computing edit
distances.
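A sketch of the alignment bookkeeping (our own code, not the paper's): the number of monotone alignments in which each input symbol emits zero, one or two output symbols can be computed with the same DP structure, and each alignment then receives weight 1/num_alignments:

```python
def num_alignments(n_in, n_out):
    """DP: count monotone alignments where every input symbol emits
    zero, one or two output symbols."""
    A = [[0] * (n_out + 1) for _ in range(n_in + 1)]
    A[0][0] = 1
    for i in range(1, n_in + 1):
        for j in range(n_out + 1):
            A[i][j] = (A[i - 1][j]                             # empty emission
                       + (A[i - 1][j - 1] if j >= 1 else 0)    # one symbol
                       + (A[i - 1][j - 2] if j >= 2 else 0))   # two symbols
    return A[n_in][n_out]
```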
At test time, predicting the best output sequence (summing over all hidden sequences)
is not tractable. We resorted to the standard approach of sampling,
where we used the FST to compute conditional estimates of the next output
symbol (see [13] for details on these computations).
The only free parameter of the FST is the number of hidden states (m).
There is a trade-off between increasing the number of hidden states, yielding
2 The alignments between sequences are a missing part in the training data, and
learning such alignments is in fact an important problem in FST learning (e.g.,
see [16]). However, note that our focus is not on learning alignments, but instead
on learning non-deterministic transductions between aligned sequences. In practice,
our algorithm could be used with an iterative EM method to learn both alignment
distributions and hidden states, and we believe future work should explore this line.
A Spectral Learning Algorithm for Finite State Transducers 169
Table 2. Normalized edit distance at test (ned) of a model as a function of the number
of hidden states (m), using all training samples. σ is the m-th singular value of P̂.
m 1 2 3 4 5
σ 0.0929 0.0914 0.0327 0.0241 0.0088
ned 21.769 21.189 21.224 26.227 71.780
Fig. 4. Learning curves for transliteration experiments using the spectral algorithm (m = 2 and m = 3) and EM (m = 2). Error is measured as normalized edit distance.
7 Conclusions
In this paper we presented a spectral learning algorithm for probabilistic non-deterministic
FSTs. The main results are strong PAC-style guarantees which, to
our knowledge, are the first for FST learning. Furthermore, we presented extensive
experiments demonstrating the effectiveness of the proposed method in practice,
when learning from both synthetic and real data.
An attractive property of our algorithm is its speed and scalability at training.
Experiments on a transliteration task show that, in practice, it is an effective
algorithm for learning FSTs. Our models could be used as building blocks to
solve complex tasks, such as parsing and translation of natural languages, and
planning in reinforcement learning.
Future work should improve the behavior of our algorithm on large input
alphabets by means of smoothing procedures. In practice, this should improve
the robustness of the method and make it applicable to a wider set of tasks.
Other lines of future research include: conducting a theoretical analysis of the
joint Schur approach for recovering parameters of HMM and FST, and exploring
the power of our algorithm for learning more general families of transductions.
References
1. Abe, N., Takeuchi, J., Warmuth, M.: Polynomial Learnability of Stochastic Rules
with Respect to the KL-Divergence and Quadratic Distance. IEICE Transactions
on Information and Systems 84(3), 299–316 (2001)
2. Bailly, R., Denis, F., Ralaivola, L.: Grammatical inference as a principal component
analysis problem. In: Proc. ICML (2009)
3. Bernard, M., Janodet, J.-C., Sebban, M.: A discriminative model of stochastic edit
distance in the form of a conditional transducer. In: Sakakibara, Y., Kobayashi, S.,
Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp.
240–252. Springer, Heidelberg (2006)
4. Carlyle, J.W., Paz, A.: Realization by stochastic finite automaton. Journal of Com-
puter and System Sciences 5, 26–40 (1971)
5. Casacuberta, F.: Inference of finite-state transducers by using regular grammars
and morphisms. In: Oliveira, A.L. (ed.) ICGI 2000. LNCS (LNAI), vol. 1891, pp.
1–14. Springer, Heidelberg (2000)
6. Chang, J.T.: Full reconstruction of markov models on evolutionary trees: Identifi-
ability and consistency. Mathematical Biosciences 137, 51–73 (1996)
7. Chen, S., Goodman, J.: An empirical study of smoothing techniques for language
modeling. In: Proc. of ACL, pp. 310–318 (1996)
8. Clark, A.: Partially supervised learning of morphology with stochastic transducers.
In: Proc. of NLPRS, pp. 341–348 (2001)
9. Clark, A., Costa Florêncio, C., Watkins, C.: Languages as hyperplanes: grammat-
ical inference with string kernels. Machine Learning, 1–23 (2010)
10. Eisner, J.: Parameter estimation for probabilistic finite-state transducers. In: Proc.
of ACL, pp. 1–8 (2002)
11. Fliess, M.: Matrices de Hankel. Journal de Mathematiques Pures et Appliquees 53,
197–222 (1974)
12. Haardt, M., Nossek, J.A.: Simultaneous schur decomposition of several nonsymmet-
ric matrices to achieve automatic pairing in multidimensional harmonic retrieval
problems. IEEE Transactions on Signal Processing 46(1) (1998)
13. Hsu, D., Kakade, S.M., Zhang, T.: A spectral algorithm for learning hidden markov
models. In: Proc. of COLT (2009)
14. Jaeger, H.: Observable operator models for discrete stochastic time series. Neural
Computation 12, 1371–1398 (2000)
15. Jelinek, F.: Statistical Methods for Speech Recognition (Language, Speech, and
Communication). MIT Press, Cambridge (1998)
16. Knight, K., Graehl, J.: Machine transliteration. Computational Linguistics 24(4),
599–612 (1998)
17. Li, H., Kumaran, A., Pervouchine, V., Zhang, M.: Report of news 2009 machine
transliteration shared task. In: Proc. Named Entities Workshop (2009)
18. Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden markov models.
In: Proc. of STOC (2005)
19. Neuts, M.F.: Matrix-geometric solutions in stochastic models: an algorithmic ap-
proach. Johns Hopkins University Press, Baltimore (1981)
20. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. In: NIPS (2002)
21. Ron, D., Singer, Y., Tishby, N.: On the learnability and usage of acyclic proba-
bilistic finite automata. In: Proc. of COLT, pp. 31–40 (1995)
22. Schützenberger, M.: On the definition of a family of automata. Information and
Control 4, 245–270 (1961)
23. Siddiqi, S.M., Boots, B., Gordon, G.J.: Reduced-Rank Hidden Markov Models. In:
Proc. AISTATS, pp. 741–748 (2010)
An Analysis of Probabilistic Methods for Top-N
Recommendation in Collaborative Filtering
1 Introduction
she will likely provide positive feedback, a recommendation list can hence be
built by drawing upon the (predicted) highly-rated items.
Under this perspective, a common approach to evaluating the predictive skills
of a recommender system is to minimize statistical error metrics, such as the
Root Mean Squared Error (RMSE). The common assumption is that small improvements
in RMSE would translate into an increase in the accuracy of the recommendation
lists. This assumption, however, does not necessarily hold. In [4], the
authors review the most common approaches to CF-based recommendation, and
compare them according to a new testing methodology which focuses on the accuracy
of the recommendation lists rather than on the rating prediction accuracy.
Notably, cutting-edge approaches characterized by low RMSE values achieve
performances comparable to naive techniques, whereas simpler approaches, such
as pure SVD, consistently outperform the other techniques. In an attempt
to find an explanation, the authors attribute the contrasting behavior to a "limitation
of RMSE testing, which concentrates only on the ratings that the user
provided to the system" and consequently "misses much of the reality, where all
items should count, not only those actually rated by the user in the past" [4].
The point is that pure SVD rebuilds the original rating matrix in terms of
latent factors, rather than trying to minimize the error on the observed data. In
practice, the underlying optimization problem is quite different, since it takes
into account the whole rating matrix, considering both observed and unobserved
preference values. As a consequence, it is likely to better identify the latent factors
and the hidden relationships between factors/users and factors/items. It
is natural then to ask whether more sophisticated latent factor models confirm
this trend, and are able to guarantee better results in terms of recommendation
accuracy, even when they provide poor RMSE performance.
Among the state-of-the-art latent factor models, probabilistic techniques offer
some advantages over traditional deterministic models: notably, they do not
minimize a particular error metric but are designed to maximize the likelihood
of the model given the data, which is a more general approach; moreover, they
can be used to model a distribution over rating values, which can in turn be used to
determine the confidence of the model in providing a recommendation; finally, they
allow prior knowledge to be included in the generative process, thus
enabling a more effective modeling of the underlying data distribution. However,
previous studies on recommendation accuracy do not take such probabilistic
approaches to CF into consideration, even though they appear rather promising
under the perspective devised above.
In this paper we adopt the testing methodology proposed in [4], and discuss
also other metrics [6] for assessing the accuracy of the recommendation list. Based
on these settings, we perform an empirical study of some paradigmatic probabilis-
tic approaches to recommendation. We study different techniques to rank items
in a probabilistic framework, and evaluate their impact in the generation of a
recommendation list. We shall consider approaches for both implicit and explicit
preference values, and show that latent factor models, equipped with the proper
ranking functions, achieve competitive advantages over traditional techniques.
The rest of the paper is organized as follows: the testing methodology and
the accuracy metrics are discussed in Sec. 2. Section 3 introduces the proba-
bilistic approaches to CF that we are interested in evaluating. The approaches
we include can be considered representative of wider classes which share the
same roots. In this context, our results can be extended to more sophisticated
approaches. Finally, in Sec. 4 we compare the approaches and assess their effec-
tiveness according to the selected testing methodology.
used to train the RS, while the latter is used for validation. It is worth noticing
that, while both T and S share the same dimensions as R, for each pair (u, i) we
have that S_ui > 0 implies T_ui = 0, i.e., no incompatible values overlap between
the training and test sets. Given a user in S, the set C_u is obtained by drawing
upon I_R − I_T(u). Next, we ask the system to predict a set of items which he/she
may like, and then measure the accuracy of the provided recommendation. Here,
the accuracy is measured by comparing the top-N items selected by resorting to
the RS with those appearing in I_S(u).
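A sketch of this protocol follows (our own code; `rank_fn` is a hypothetical per-user scoring function and is not from the paper):

```python
import numpy as np

def topN_hits(rank_fn, test_items, unrated_items, N=10, D=1000, rng=None):
    """Each relevant test item is ranked against D randomly drawn
    unrated items; it counts as a hit when it lands in the top-N."""
    rng = rng or np.random.default_rng(0)
    hits = 0
    for i in test_items:
        candidates = list(rng.choice(unrated_items, size=D, replace=False))
        scores = rank_fn(candidates + [i])
        order = np.argsort(-np.asarray(scores))    # best score first
        if list(order).index(D) < N:               # position of item i
            hits += 1
    return hits / max(len(test_items), 1)          # recall@N
```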
Relevance can be measured in several different ways; here we adopt two alternative
definitions. When V > 1 (i.e., an explicit preference value is available)
we denote as relevant all those items which received a rating greater than the
average rating in the training set, i.e.,
Implicit preferences assume instead that all items in I_S(u) are relevant.
Evaluating User Satisfaction. The above definitions of precision and recall aim
at evaluating the amount of useful recommendations in a single session. A different
perspective can be considered by assuming that a recommendation meets
user satisfaction if he/she can find in the recommendation list at least one item
which meets his/her interests. This perspective can be better modeled by a different
approach to measuring accuracy, as proposed in [5,4]. The approach relies
on a different definition of relevant items, namely:
T_u^r = {i ∈ I_S(u) | S_ui = V}
Notice that the above definition of precision does not penalize false positives:
the recommendation is considered successful if it matches at least one item of
interest. However, neither the amount of non-relevant "spurious" items nor the
position of the relevant item within the top-N is taken into account.
be used to model a distribution over rating values which can be used to infer
confidence intervals and to determine the confidence of the model in providing
a recommendation.
In the following we briefly introduce some paradigmatic probabilistic approaches
to recommendation, and discuss how these probabilistic models can be
used for item ranking, which is then employed to produce the top-N recommendation
list. The underlying idea of probabilistic models based on latent factors
is that each preference observation ⟨u, i⟩ is generated by one of k possible states,
which informally model the underlying reason why u has chosen/rated i. Based
on the mathematical model, two different inferences can then be supported and
exploited in item ranking, where the main difference [9] lies in the way data is
modeled according to the underlying model:
– Forced prediction: the model provides an estimate of P(r|u, i), which represents
the conditional probability that user u assigns a rating value r to a given item
i;
– Free prediction: the item selection process is included in the model, which is
typically based on the estimate of P(r, i|u). In this case we are interested in
predicting both the item selection and the preference of the user for each selected
item. P(r, i|u) can be factorized as P(r|i, u)P(i|u); the resulting model
still includes a forced prediction component which, however, is weighted by
the item selection component and thus allows a more precise estimate of
the user's preferences.
where
P(z|u) ∝ P(u_obs|z) θ_z
and u_obs represents the observed values (u, i, r) in R.
The probabilistic Latent Semantic Analysis approach (PLSA, [9]) specifies
a co-occurrence data model in which the user u and item i are conditionally
independent given the state Z of the latent factor. Differently from the previous
mixture model, where a single latent factor is associated with every user u, the
PLSA model associates a latent variable with every observation triplet (u, i, r).
[Fig. 1: plate-notation graphical models of the considered approaches, including (b) the User Communities Model, (c) PLSA, and (g) Probabilistic Matrix Factorization.]
Hence, different ratings of the same user can be explained by different latent
causes in PLSA (modeled as priors {θ_u}_{u=1,...,m} in Fig. 1(c)), whereas a mixture
model assumes that all ratings involving the same user are linked to the same
underlying community. PLSA directly supports item selection:
P(i|u) = Σ_z φ_{z,i} θ_{u,z}   (4)
Conversely, the Gaussian Mixture Model (G-PLSA, [8]) models β_{z,i} = (μ_{iz},
σ_{iz}) as a Gaussian distribution, and provides a normalization of ratings through
the user's mean and variance, thus allowing to model users with different rating
patterns. The corresponding rating probability is
P(r|u, i) = Σ_z N(r; μ_{iz}, σ_{iz}) θ_{u,z}   (6)
Latent Dirichlet Allocation [3] is designed to overcome the main drawback
of the PLSA-based models by introducing Dirichlet priors, which provide
a full generative semantics at the user level and avoid overfitting. Again, two different
formulations are available, based on whether we are interested in modeling
implicit (LDA) or explicit (User Rating Profile, URP [13]) preference values.
In the first case, we have:
P(i|u) = ∫ Σ_z φ_{z,i} θ_z P(θ|u_obs) dθ   (7)
(where P(θ|u_obs) is estimated in the inference phase). Analogously, for URP
we have
P(r|u, i) = ∫ Σ_z β_{z,i,r} θ_z P(θ|u_obs) dθ   (8)
The User Communities Model (UCM, [2]) adopts the same inference formula,
Eq. (3), as the multinomial model. Nevertheless, it introduces some key features
that combine the advantages of both the AM and the MMM, as shown
in Fig. 1(b). First, the exploitation of a unique prior distribution θ over the
user communities helps in preventing overfitting. Second, it adds flexibility in the
prediction by modeling an item as an observed (and hence randomly generated)
component. UCM directly supports a free-prediction approach.
Both the original approach and its Bayesian generalizations [17,20] are characterized
by high prediction accuracy.
Predicted Rating. The most intuitive way to provide item ranking in the recommendation
process relies on the analysis of the distribution over preference values
P(r|u, i) (assuming that we are modeling explicit preference data). Given this
distribution, there are several methods for computing the ranking for each pair
⟨u, i⟩; the most commonly used is the expected value E[R|u, i] = Σ_r r · P(r|u, i),
as it minimizes the MSE and thus the RMSE.
We will show in Sec. 4 that this approach fails in providing accurate recommendations
and discuss potential causes.
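In code, with P(r|u, i) stored as a vector over the rating scale 1..V (our own sketch):

```python
import numpy as np

def expected_rating(p_r):
    """E[R|u,i] = sum_r r * P(r|u,i), where p_r[r-1] = P(R = r | u, i)."""
    return float(np.arange(1, len(p_r) + 1) @ np.asarray(p_r))
```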
Item Selection. For co-occurrence preference approaches, the rank of each item
i with respect to the user u can be computed as the mixture
p_ui = P(i|u) = Σ_z P(z|u) P(i|z)   (11)
where P(i|z) is the probability that i will be selected by users represented by the
abstract pattern z. This distribution is a key feature of co-occurrence preference
approaches and of models based on free prediction. When P(i|z) is not directly
inferred by the model, we can still estimate it by averaging over all the
users who selected i:
P(i|z) ∝ Σ_u δ_T(u, i) P(z|u)
where δ_T(u, i) indicates whether u selected i in the training set T.
Item Selection and Relevance. In order to force the selection process to concentrate
on relevant items, we can extend the ranking discussed above by including
a component that represents the "predicted" relevance of an item with respect
to a given user:
p_ui = P(i, r > r_T | u) = P(i|u) P(r > r_T | u, i) = Σ_z P(z|u) P(i|z) P(r > r_T | i, z)   (12)
where P(r > r_T | i, z) = Σ_{r > r_T} P(r|i, z). In practice, an item is ranked on the
basis of its score, giving high priority to high-score items.
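A vectorized sketch of Eq. (12) (our own code; ratings 1..V are stored along the last axis, so r > r_T corresponds to indices r_T and above):

```python
import numpy as np

def selection_relevance_scores(P_z_u, P_i_z, P_r_iz, r_T):
    """Eq. (12): p_ui = sum_z P(z|u) P(i|z) P(r > r_T | i, z).
    Shapes: P_z_u (K,), P_i_z (K, n_items), P_r_iz (K, n_items, V)."""
    relevance = P_r_iz[:, :, r_T:].sum(axis=2)   # P(r > r_T | i, z)
    return P_z_u @ (P_i_z * relevance)           # one score per item
```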
4 Evaluation
In this section we apply the testing protocols presented in Sec. 2 to the
probabilistic approaches defined in the previous section. We use the MovieLens-1M
dataset, which consists of 1,000,209 ratings given by 6,040 users on approximately
3,706 movies, with a sparseness coefficient of 96% and an average of 132 ratings
per user and 216 per item. In the evaluation phase, we adopt
a Monte Carlo 5-fold validation, where each fold contains about 80% of the
overall ratings and the remaining data (20%) is used as the test set. The final results
are reported by averaging the values achieved in each fold.
In order to make our results comparable with the ones reported in [4], we consider
the Top-Pop and Item-Avg algorithms as baselines, and Pure-SVD as the main
competitor. Notice that there are some differences between our evaluation and
the one performed in the above cited study, namely: (i) we decided to employ
bigger test sets (20% of the overall data vs. 1.4%) and to cross-validate the results;
(ii) for lack of space we concentrate on MovieLens only, and omit further
evaluations on the Netflix data (which, however, in the original paper [4], confirm
Pure-SVD as the top performer); (iii) we decided to omit the "long tail" test,
aimed at evaluating the capability of suggesting non-trivial items, as it is out of
the scope of this paper.
In the following we study the effects of the ranking function on the accuracy
of the recommendation list. The results we report are obtained by varying the
length of the recommendation list in the range 1−20; the dimension of the
random sample is fixed at D = 1000. In a preliminary test, we found the optimal
number of components for Pure-SVD to be 50.
URP, UCM and G-PLSA, where the predicted rating is employed as the ranking
function. First of all, the following table summarizes the RMSE obtained by
these approaches:
Approach  RMSE    #Latent Factors
Item-Avg  0.9784  –
MMM       1.0000  20
G-PLSA    0.9238  70
UCM       0.9824  10
URP       0.8989  10
PMF       0.8719  30
The results for Recall and Precision are given in Fig. 2, where the respective
number of latent factors is given in brackets. Considering user satisfaction, almost
all the probabilistic approaches fall between the two baselines. Pure-SVD
significantly outperforms the best probabilistic performers, namely URP and
PMF. The trend for the probabilistic approaches does not change when considering Recall
and Precision, but in this case not even Pure-SVD is able to outperform
Top-Pop, which exhibits a consistent gain over all the considered competitors.
A first summary can be obtained as follows. First, we can confirm that there
is no monotonic relationship between RMSE and recommendation accuracy.
[Fig. 2: US-Recall, US-Precision, Recall and Precision for N = 1, . . . , 20 under expected-value ranking, for Item-Avg, Top-Pop, Pure-SVD(50), PMF(30), MMM(20), G-PLSA(70), URP(10) and UCM(10).]
All the approaches tend to have a non-deterministic behavior, and even the best
approaches provide unstable results depending on the size N. Further, ranking
by the expected value exhibits unacceptable performance for the probabilistic
approaches, which prove totally inadequate in this perspective. More generally,
any variant of this approach (which we do not report here for space limitations)
does not substantially change the results.
Things radically change when item occurrence is taken into consideration. Fig. 3
shows the recommendation accuracy achieved by probabilistic models which employ
Item Selection (LDA, PLSA, UCM and URP) and Item Selection & Relevance
(UCM and URP). The LDA approach significantly outperforms all the available
approaches. Surprisingly, UCM is the runner-up, as opposed to the behavior exhibited
with the expected value ranking. It is clear that the component P(i|z)
plays a crucial role here, which is further strengthened by the relevance ranking
component.
Also surprising is the behavior of URP, which still achieves a satisfactory
performance compared to Pure-SVD. However, it does not compare to LDA. The
reason can be found in the fact that the inference procedure in LDA directly
estimates the item selection probability P(i|u).
[Fig. 3: US-Recall, US-Precision, Recall and Precision for N = 1, . . . , 20 under selection and selection-relevance rankings, for Item-Avg, Top-Pop, Pure-SVD(50), LDA(20), PLSA(20), MMM(20), URP(10) and UCM(10).]
[Fig. 4: US-Recall and US-Precision as a function of the dimension of the random sample (250–1000), for Pure-SVD(50), LDA(20), and UCM(10)/URP(10) with selection-relevance ranking.]
4.3 Discussion
There are two main considerations arising from the above figures. One is that rating
prediction fails in providing accurate recommendations. The second is
the unexpectedly strong impact of the item selection component, when properly
estimated.
In an attempt to carefully analyze the pitfalls of rating prediction, we plot
in Fig. 5(a) the contribution to the RMSE of each single rating value in V for the
probabilistic techniques under consideration. Item-Avg acts as the baseline here.
While predictions are accurate for the values 3−4, they turn out rather inadequate
for the border values, namely 1, 2 and 5. This is mainly due to the nature of RMSE,
which penalizes larger errors.
[Fig. 5: (a) contribution to the RMSE per rating value; (b) percentage of ratings per rating value.]
Fig. 6. [(a) US-Recall of UCM(10) under three ranking functions: E[r|u,i], P(i|u) and P(i, r > 3|u); (b) US-Recall of Pure-SVD(50), LDA(20) with selection ranking, URP(10) with prediction ranking, and the Ensemble (LDA+URP) with selection-relevance ranking.]
This clearly supports the thesis that a low RMSE does not necessarily induce good accuracy, as the latter is mainly influenced by
the items in class 5 (where the approaches are more prone to fail). It is clear
that a better tuning of the ranking function should take this component into
account.
Also, by looking at the distribution of the rating values, we can see that the
dataset is biased towards the mean values and, more generally, that the low rating
values represent a lower percentage. This explains, on one side, the tendency of
the expected value to flatten towards a mean value (and hence to fail in providing
an accurate prediction). On the other side, the lack of low rating values suggests
an interpretation of the dataset as a Like/Dislike matrix, for which the item
selection tuning provides a better modeling.
Incidentally, the rating information, combined with item selection, provides
a marginal improvement, as testified by Fig. 6(a). Here, a closer look at the
UCM approach is taken, by plotting three curves relative to the three different
approaches to item ranking. Large recommendation lists tend to be affected by
the rating prediction.
Our experiments have shown that the item selection component plays the most important
role in recommendation ranking. However, better results can be achieved
by also considering a rating prediction component. To empirically prove the effectiveness
of such an approach, we performed a final test in which item ranking is
performed by employing an ensemble approach based on item selection and
relevance ranking. In this case, the components of the ranking come from different
models: the selection probability is computed according to an LDA model,
while the relevance ranking is computed by employing the URP model. Fig. 6(b)
shows that this approach outperforms LDA, achieving the best result in recommendation
accuracy (due to lack of space we show only the trend corresponding
to US-Recall).
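A sketch of the ensemble score for a single user (our own code; the LDA component supplies P(i|u), the URP component the rating distribution):

```python
import numpy as np

def ensemble_scores(p_i_u_lda, p_r_ui_urp, r_T):
    """Ensemble ranking: item selection from LDA times relevance from URP.
    p_i_u_lda: (n_items,); p_r_ui_urp: (n_items, V), ratings 1..V."""
    relevance = p_r_ui_urp[:, r_T:].sum(axis=1)   # P(r > r_T | u, i)
    return p_i_u_lda * relevance                  # combined score per item
```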
We have shown that probabilistic models, equipped with the proper ranking
function, exhibit competitive advantages over state-of-the-art RS in terms of
recommendation accuracy. In particular, we have shown that strategies based on
item selection guarantee significant improvements, and we have investigated the
reasons behind the failure of prediction-based approaches. The advantage
of probabilistic models lies in their flexibility, as they allow switching between
both methods within the same inference framework. The non-monotonic behavior of
RMSE also finds its explanation in the distribution of errors along the rating
values, thus suggesting different strategies for prediction-based recommendation.
Besides the above mentioned, there are other significant advantages in the
adoption of probabilistic models for recommendation. Recent studies pointed
out that there is more to recommendation than just rating prediction. A successful
recommendation should answer the simple question 'What is the user
actually looking for?', which is strictly tied to dynamic user profiling. Moreover,
prediction-based recommender systems do not consider one of the most important
applications from the retailer's point of view: suggesting to users products they
would not otherwise have discovered.
In [15] the authors argued that the popular testing methodology based on prediction
accuracy is rather inadequate and does not capture important aspects
of the recommendations, like non-triviality, serendipity, and users' needs and expectations,
and their studies have scaled down the usefulness of achieving a lower
RMSE [12]. In short, the evaluation of a recommender cannot rely exclusively
on prediction accuracy, but must take into account what is really displayed to
the user, i.e., the recommendation list, and its impact on his/her navigation.
Clearly, probabilistic graphical models, like the ones discussed in this paper,
provide several components which can be fruitfully exploited for the estimation
of such measures. Latent factors, probability of item selection and rating proba-
bility can help in better specify usefulness in recommendation. We plan to extend
the framework in this paper in this promising directions, by providing subjective
measures for such features and measuring the impact of such models.
An Analysis of Probabilistic Methods for Top-N Recommendation in CF 187
References
1. Agarwal, D., Chen, B.-C.: flda: matrix factorization through latent dirichlet allo-
cation. In: WSDM, pp. 91–100 (2010)
2. Barbieri, N., Guarascio, M., Manco, G.: A probabilistic hierarchical approach for
pattern discovery in collaborative filtering data. In: SMD (2011)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. The Journal of
Machine Learning Research 3, 993–1022 (2003)
4. Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on
top-n recommendation tasks. In: ACM RecSys, pp. 39–46 (2010)
5. Cremonesi, P., Turrin, R., Lentini, E., Matteucci, M.: An evaluation methodology
for collaborative recommender systems. In: AXMEDIS, pp. 224–231 (2008)
6. Ge, M., Delgado-Battenfeld, C., Jannach, D.: Beyond accuracy: evaluating recom-
mender systems by coverage and serendipity. In: ACM RecSys, pp. 257–260 (2010)
7. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to
weave an information tapestry. Communications of the ACM 35(12), 61–70 (1992)
8. Hofmann, T.: Collaborative filtering via gaussian probabilistic latent semantic anal-
ysis. In: SIGIR (2003)
9. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Transactions
on Information Systems (TOIS) 22(1), 89–115 (2004)
10. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. IJCAI,
688–693 (1999)
11. Jin, X., Zhou, Y., Mobasher, B.: A maximum entropy web recommendation system:
combining collaborative and content features. In: KDD, pp. 612–617 (2005)
12. Koren, Y.: How useful is a lower rmse? (2007),
http://www.netflixprize.com/community/viewtopic.php?id=828
13. Marlin, B.: Modeling user rating profiles for collaborative filtering. In: NIPS (2003)
14. Marlin, B., Marlin, B.: Collaborative filtering: A machine learning perspective.
Tech. rep., Department of Computer Science University of Toronto (2004)
15. McNee, S.M., Riedl, J., Konstan, J.A.: Being accurate is not enough: How accuracy
metrics have hurt recommender systems. In: ACM SIGCHI Conference on Human
Factors in Computing Systems, pp. 1097–1101 (2006)
16. Pazzani, M.J., Billsus, D.: Content-based recommendation systems. The Adaptive
Web: Methods and Strategies of Web Personalization, 325–341 (2007)
17. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using
markov chain monte carlo. In: ICML, pp. 880–887 (2008)
18. Salakhutdinov, R., Mnih, A.: Probabilistic matrix factorization. In: NIPS, pp.
1257–1264 (2008)
19. Sarwar, B., Karypis, G., Konstan, J., Reidl, J.: Item-based collaborative filtering
recommendation algorithms. In: WWW, pp. 285–295 (2001)
20. Shan, H., Banerjee, A.: Generalized probabilistic matrix factorizations for collab-
orative filtering. In: ICDM (2010)
21. Stern, D.H., Herbrich, R., Graepel, T.: Matchbox: large scale online bayesian rec-
ommendations. In: WWW, pp. 111–120 (2009)
Learning Good Edit Similarities with
Generalization Guarantees
1 Introduction
Similarity and distance functions between objects play an important role in
many supervised and unsupervised learning methods, among which the popular
k-nearest neighbors, k-means and support vector machines. For this reason, a lot
of research has gone into automatically learning similarity or distance functions
from data, which is usually referred to as metric learning. When data consists
in numerical vectors, a common approach is to learn the parameters (i.e., the
transformation matrix) of a Mahalanobis distance [1–4].
Because they involve more complex procedures, less work has been devoted to
learning such functions from structured data (for example strings or trees). Still,
there exists a few methods for learning edit distance-based functions. Roughly
We would like to acknowledge support from the ANR LAMPADA 09-EMER-007-02
project and the PASCAL 2 Network of Excellence.
D. Gunopulos et al. (Eds.): ECML PKDD 2011, Part I, LNAI 6911, pp. 188–203, 2011.
c Springer-Verlag Berlin Heidelberg 2011
Learning Good Edit Similarities with Generalization Guarantees 189
speaking, the edit distance between two objects is the cost of the best sequence of
operations (insertion, deletion, substitution) required to transform an object into
another, where an edit cost is assigned to each possible operation. Most general-
purpose methods for learning the edit cost matrix maximize the likelihood of the
data using EM-based iterative methods [5–9], which can imply a costly learning
phase. Saigo et al. [10] manage to avoid this drawback in the context of remote
homologies detection in protein sequences by applying gradient descent to a
specific objective function. Some of the above methods do not guarantee to
find the optimal parameters and/or are only based on a training set of positive
pairs: they do not take advantage of pairs of examples that have different labels.
Above all, none of these methods offer theoretical guarantees that the learned
edit functions will generalize well to unseen examples (while it is the case for
some Mahalanobis distance learning methods [4]) and lead to good performance
for the classification or clustering task at hand.
Recently, Balcan et al. [11, 12] introduced a theory of learning with so-called
(, γ, τ )-good similarity functions that gives intuitive, sufficient conditions for a
similarity function to allow one to learn well. Essentially, a similarity function
K is (, γ, τ )-good if an proportion of examples are on average 2γ more similar
to reasonable examples of the same class than to reasonable examples of the
opposite class, where a τ proportion of examples must be reasonable. K does
not have to be a metric nor positive semi-definite (PSD). They show that if K
is (, γ, τ )-good, then it can be used to build a linear separator in an explicit
projection space that has margin γ and error arbitrarily close to . This separator
can be learned efficiently using a linear program and is supposedly sparse.
In this article, we propose a novel edit similarity learning procedure driven
by the notion of good similarity function. Our approach (GESL, for Good Edit
Similarity Learning) is formulated as an efficient convex programming approach
allowing us to learn the edit costs so as to optimize the (, γ, τ )-goodness of
the resulting similarity function. We provide a bound based on the notion of
uniform stability [13] that guarantees that our learned similarity will generalize
well and induce low-error classifiers. This bound is independent of the size of the
alphabet, making GESL suitable for handling problems with large alphabet. To
the best of our knowledge, this work is the first attempt to establish a theoretical
relationship between a learned edit similarity function and its generalization and
discriminative power. We show in a comparative experimental study that GESL
has fast convergence and leads to more accurate and sparser classifiers than other
edit similarities.
This paper is organized as follows. In Section 2, we introduce a few notations,
and review the theory of Balcan et al. as well as some prior work on edit simi-
larity learning. In Section 3, which is the core of this paper, we present GESL,
our approach to learning good edit similarities. We then propose a theoretical
analysis of GESL based on uniform stability, leading to the derivation of a gen-
eralization bound. An experimental evaluation of our approach is provided in
Section 4. Finally, we conclude this work by outlining promising lines of research
on similarity learning.
190 A. Bellet, A. Habrard, and M. Sebban
φSi (x) = K(x, xi ), i ∈ {1, . . . , d}. Then, with probability at least 1 − δ over the
random sample S, the induced distribution φS (P ) in Rd has a linear separator
α of error at most + 1 at margin γ.
Therefore, if we are given an (, γ, τ )-good similarity function for a learning
problem P and enough (unlabeled) landmark examples, then with high proba-
bility there exists a low-error linear separator α in the explicit “φ-space”, which
is essentially the space of the similarities to the d landmarks. As Balcan et al.
mention, using du unlabeled examples and dl labeled examples, we can efficiently
find this separator α ∈ Rdu by solving the following linear program (LP):1
⎡ ⎤
dl
du
min ⎣1 − αj i K(xi , xj )⎦ + λα1 . (1)
α
i=1 j=1
+
Note that Problem (1) is essentially a 1-norm SVM problem [15] with an em-
pirical similarity map [11], and can be efficiently solved. The L1 -regularization
induces sparsity in α: it allows us to automatically select useful landmarks (the
reasonable points), ignoring the others, whose corresponding coordinates in α
will be set to zero during learning. We can also control the sparsity of the solu-
tion directly: the larger λ, the sparser α. Therefore, one does not need to know
in advance the set of reasonable points R, it is automatically worked out while
learning α.
Our objective in this paper is to make use of the theory of Balcan et al. to
efficiently learn (, γ, τ )-good edit similarities from data that will lead to effective
classifiers. In the next section, we review some past work on edit cost learning.
What makes the edit costs C hard and expensive to optimize is the fact that the
edit distance is based on an optimal script which depends on the edit costs them-
selves. This is the reason why, as we have seen earlier, iterative approaches are
very commonly used to learn C from data. In this section, we take a novel convex
programming approach based on the theory of Balcan et al. to learn (, γ, τ )-
good edit similarity functions from both positive and negative pairs without
requiring a costly iterative procedure. Moreover, this new framework allows us
to derive a generalization bound establishing the convergence of our method and
a relationship between the learned similarities and their (, γ, τ )-goodness.
Learning Good Edit Similarities with Generalization Guarantees 193
Note that to compute eG , we do not extract the optimal script with respect to
C: we use the Levenshtein script2 and apply custom costs C to it. Therefore,
since the edit script defined by #(x, x ) is fixed, eG (x, x ) is nothing more than
a linear function of the edit costs and can be optimized directly.
Recall that in the framework of Balcan et al., a similarity function must be
in [−1, 1]. To respect this requirement, we define our similarity function to be:
KG (x, x ) = 2e−eG (x,x ) − 1.
The motivation for this exponential form is related to the one for using exponen-
tial kernels in SVM classifiers: it can be seen as a way to introduce nonlinearity
to further separate examples of opposite class while moving closer those of the
same class. Note that KG may not be PSD nor symmetric. However, as we have
seen earlier, Balcan et al.’s theory does not require these properties, unlike SVM.
Criterion (2) bounds that of Definition 1 due to the convexity of the hinge loss.
It is harder to satisfy since the “goodness” is required with respect to each
reasonable point instead of considering the average similarity to these points.
Clearly, if KG satisfies (2), then it is (, γ, τ )-good with ≤ .
Let us consider a training sample of NT labeled points T = {zi = (xi , i )}N T
i=1
and a sample of landmark examples SL = {zj = (xj , j )}j=1 . Note that these
NL
Therefore, we want 1 − i j KG (xi , xj )/γ + = 0, hence i j KG (xi , xj ) ≥ γ. A
benefit from using this constraint is that it can easily be turned into an equivalent
linear one, considering the following two cases.
= j , we get:
1. If i
1−γ 1−γ
−KG (xi , xj ) ≥ γ ⇐⇒ e−eG (xi ,xj ) ≤ ⇐⇒ eG (xi , xj ) ≥ − log( ).
2 2
We can use a variable B1 ≥ 0 and write the constraint as eG (xi , xj ) ≥ B1 ,
with the interpretation that B1 = − log( 1−γ 2 ). In fact, B1 ≥ − log( 2 ).
1
2. Likewise, if i = j , we get eG (xi , xj ) ≤ − log( 2 ). We can use a variable
1+γ
B1 ≥ − log( 2 ), 0 ≤ B2 ≤ − log( 12 ), B1 − B2 = ηγ
1
Ci,j ≥ 0, 0 ≤ i, j ≤ A,
+1 .
GESL is a convex program, thus we can efficiently find its global optimum.
Using slack variables to express the hinge loss, it has O(NT NL + A2 ) variables
and O(NT NL ) constraints. Note that GESL is quite sparse: each constraint
involves at most one string pair and a limited number of edit cost variables,
making the problem faster to solve. It is also worth noting that our approach is
very flexible. First, it is general enough to be used with any definition of eG that
is based on an edit script (or even a convex combination of edit scripts). Second,
Learning Good Edit Similarities with Generalization Guarantees 195
one can incorporate additional convex constraints, which offers the possibility of
including background knowledge or desired requirements on C (e.g., symmetry).
Lastly, it can be easily adapted to the multi-class case.
In the next section, we derive a generalization bound guaranteeing not only
the convergence of our learning method but also the overall goodness of the
learned edit similarity function for the task at hand.
1 1
NT NL
FT (C) = V (C, zk , zk j ) + βC2 ,
NT NL j=1
k=1
where CT denotes the edit cost matrix learned by GESL from sample T . Our
objective is to derive an upper bound on the generalization loss L(CT ) with
respect to the empirical loss LT (CT ).
A learning algorithm is stable [13] when its output does not change signifi-
cantly under a small modification of the learning sample. We consider the follow-
ing definition of uniform stability meaning that the replacement of one example
must lead to a variation bounded in O(1/NT ) in terms of infinite norm.
Definition 3 (Jin et al. [4], Bousquet and Elisseeff [13]). A learning al-
gorithm has a uniform stability in NκT , where κ is a positive constant, if
κ
∀(T, z), |T | = NT , ∀i, sup |V (CT , z1 , z2 ) − V (CT i,z , z1 , z2 )| ≤ ,
z1 ,z2 NT
To prove that GESL has the property of uniform stability, we need the following
two lemmas (proven in Appendices 1 and 2).
Lemma 1. For any edit cost matrices C, C and any examples z, z :
|V (C, z, z ) − V (C , z, z )| ≤ C − C W.
Lemma 2. Let FT and FT i,z be the functions to optimize, CT and CT i,z their
corresponding minimizers, and β the regularization parameter. Let ΔC = (CT −
CT i,z ). For any t ∈ [0, 1]:
(2NT + NL )t2W
CT 2 − CT − tΔC2 + CT i,z 2 − CT i,z + tΔC2 ≤ ΔC.
βNT NL
Using Lemma 1 and 2, we can now prove the stability of GESL.
Theorem 2. Let NT and NL be respectively the number of training examples
and landmark points. Assuming that NL = αNT , α ∈ [0, 1], GESL has a uniform
2
stability in NκT , where κ = 2(2+α)W
βα .
Lemma 4. For any edit cost matrix learned by GESL using NT training exam-
ples and NL landmarks, with Bγ = max(ηγ , −log(1/2)), we have the following
bound:
(2NT + NL )( √2W + 3)Bγ
2κ βBγ
∀i, 1 ≤ i ≤ NT , ∀z, |DT − DT i,z | ≤ + .
NT NT NL
We are now able to derive our generalization bound over L(CT ).
Theorem 4. Let T be a sample of NT randomly selected training examples and
let CT be the edit costs learned by GESL with stability NκT using NL = αNT
landmark points. With probability 1 − δ, we have the following bound for L(CT ):
κ 2+α 2W ln(2/δ)
L(CT ) ≤ LT (CT ) + 2 + 2κ + + 3 Bγ
NT α βBγ 2NT
2(2+α)W 2
with κ = αβ and Bγ = max(ηγ , −log(1/2)).
22 ln(2/δ)
By fixing δ = 2 exp − (2κ+B) 2 /N
T
, we get = (2κ + B) 2NT . Finally, from
(3), Lemma 3 and the definition of DT , we have with probability at least 1 − δ:
κ ln(2/δ)
DT < ET [DT ] + ⇒ L(CT ) < LT (CT ) + 2 + (2κ + B) .
NT 2NT
This bound outlines three important features of our approach. First, it has a
convergence in O( N1T ), which is classical with the notion of uniform stability.
Second, this rate of convergence is independent of the alphabet size, which means
that our method should scale well to large alphabet problems. Lastly, thanks to
the relation between the optimized criterion and Definition 1 that we established
earlier, this bound also ensures the goodness in generalization of the learned
similarity function. Therefore, by Theorem 1, it guarantees that the similarity
will induce low-error classifiers for the classification task at hand.
198 A. Bellet, A. Habrard, and M. Sebban
4 Experimental Results
In this section, we provide an experimental evaluation of the approach presented
in Section 3. Using the learning rule (1) of Balcan et al., we compare three edit
similarity functions:4 (i) KG , learned by GESL,5 (ii) the Levenshtein distance
eL , and (iii) an edit similarity function pe learned with the method of Oncina
and Sebban [7].6 The task is to learn a model to classify words as either English
or French. We use the 2,000 top words lists from Wiktionary.7
First, we assess the convergence rate of the two considered edit cost learning
methods (i and iii). We keep aside 600 words as a validation set to tune the pa-
rameters, using 5-fold cross-validation and selecting the value offering the best
4
A similarity function that is not in [−1, 1] can be normalized.
5
In this series of experiments, we constrained the cost matrices to be symmetric in
order not to be dependent on the order in which the examples are considered.
6
We used their software SEDiL, available online at http://labh-curien.
univ-st-etienne.fr/SEDiL/
7
These lists are available at http://en.wiktionary.org/wiki/Wiktionary:
Frequency_lists. We only considered unique words (i.e., not appearing in both
lists) of length at least 4, and we also got rid of accent and punctuation marks. We
ended up with about 1,300 words of each language over an alphabet of 26 symbols.
Learning Good Edit Similarities with Generalization Guarantees 199
78
76 200
74
Classification accuracy
72 150
Model size
70
68 100
66
64 50
eL eL
62 pe pe
KG KG
60 0
0 200 400 600 800 1000 1200 1400 1600 1800 2000 0 200 400 600 800 1000 1200 1400 1600 1800 2000
NT NT
Fig. 1. Learning the costs: accuracy and sparsity results with respect to NT
80 250
75 200
Classification accuracy
Model size
70 150
65 100
60 50
eL eL
pe pe
KG KG
55 0
0 200 400 600 800 1000 1200 1400 1600 0 200 400 600 800 1000 1200 1400 1600
dl dl
Fig. 2. Learning the separator: accuracy and sparsity results with respect to dl
English French
high showed holy economiques americaines decouverte
liked hardly britannique informatique couverture
Table 2. Some discriminative patterns extracted from the reasonable points of Table
1 (^: start of word, $: end of word, ?: 0 or 1 occurrence of preceding letter)
reasonable points obtained with KG using a set of 1,200 examples to learn α.8
This small set actually carries a lot of discriminative patterns (shown in Table
2 along with their number of occurrences in each class over the entire dataset).
For example, words ending with ly correspond to English words, while those
ending with que characterize French words. Note that Table 1 also reflects the
fact that English words are shorter on average (6.99) than French words (8.26)
in the dataset, but the English (resp. French) reasonable points are significantly
shorter (resp. longer) than the average (mean of 5.00 and 10.83 resp.), which
allows better discrimination.
In this work, we proposed a novel approach to the problem of learning edit sim-
ilarities from data that induces (, γ, τ )-goodness. We derived a generalization
bound using the notion of uniform stability that is independent from the size of
the alphabet, making it suitable for problems involving large vocabularies. This
8
We used a high λ value in order to get a small set, thus making the analysis easier.
Learning Good Edit Similarities with Generalization Guarantees 201
bound is related to the goodness of the resulting similarity, which gives guar-
antees that the similarity will induce accurate models for the task at hand. We
experimentally showed that it is indeed the case and that the induced models
are also sparser than if we use other (standard or learned) edit similarities. Our
approach is flexible enough to be straightforwardly generalized to tree edit sim-
ilarity learning: one just has to redefine eG to be a tree edit script. Considering
that tree edit distances generally run in cubic time and that the methods for
learning tree edit similarities available in the literature are mostly EM-based
(thus requiring the distances to be recomputed many times), this seems a very
promising avenue to explore. Finally, learning (, γ, τ )-good Mahalanobis dis-
tance could also be considered.
A Appendices
A.1 Proof of Lemma 1
Proof. |V (C, z, z ) − V (C , z, z )| ≤ | 0≤i,j≤A (Ci,j − Ci,j
)#i,j (z, z )| ≤ C −
C #(z, z ). The first inequality uses the 1-lipschitz property of the hinge loss
and the fact that B1 ’s and B2 ’s cancel out. The second one comes from the
Cauchy-Schwartz inequality.9 Finally, since #(z, z ) ≤ W , the lemma holds.
(2NT + NL )t2W
≤ ΔC.
NT NL
The second line is obtained by the fact that every zk in T , zk
= zi , has at most
two landmark points different between T and T i,z , and z and zi at most NL
different landmarks. To complete the proof, we reorder the terms and use the 1-
lipschitz property, Cauchy-Schwartz, triangle inequalities and #(z, z ) ≤ W .
9
1-lipschitz implies |[X]+ − [Y ]+ | ≤ |X − Y |, Cauchy-Schwartz | ni=1 xi yi | ≤ xy.
10
Due to the limitation of space, the details of this construction are not presented in
this paper. We advise the interested reader to have a look at Lemma 20 in [13].
202 A. Bellet, A. Habrard, and M. Sebban
It suffices to apply this trick twice, combined with the triangle inequality and
the property of stability in NκT to lead to the lemma.
A.4 Lemma 5
In order to bound |DT − DT i,z |, we need to bound CT .
Lemma 5. Let (CT , B1 , B2 ) an optimal solution learned
by GESL from a sam-
Bγ
ple T , and let Bγ = max(ηγ , −log(1/2)), then CT ≤ β .
For the last inequality, note that V (0, zk , zk j ) is bounded either by Bγ or 0.
T 1
NL 1
Since N k=1 NT j=1 NL V (C, zk , zkj ) ≥ 0, we get βCT ≤ Bγ .
2
References
1. Yang, L., Jin, R.: Distance Metric Learning: A Comprehensive Survey. Technical
report, Dep. of Comp. Science and Eng., Michigan State University (2006)
2. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric
learning. In: Proc. of the Int. Conf. on Machine Learning (ICML), pp. 209–216
(2007)
3. Weinberger, K.Q., Saul, L.K.: Distance Metric Learning for Large Margin Nearest
Neighbor Classification. J. of Mach. Learn. Res. (JMLR) 10, 207–244 (2009)
4. Jin, R., Wang, S., Zhou, Y.: Regularized distance metric learning: Theory and
algorithm. In: Adv. in Neural Inf. Proc. Sys. (NIPS), pp. 862–870 (2009)
5. Ristad, E.S., Yianilos, P.N.: Learning String-Edit Distance. IEEE Trans. on Pat-
tern Analysis and Machine Intelligence. 20, 522–532 (1998)
6. Bilenko, M., Mooney, R.J.: Adaptive Duplicate Detection Using Learnable String
Similarity Measures. In: Proc. of the Int. Conf. on Knowledge Discovery and Data
Mining (SIGKDD), pp. 39–48 (2003)
7. Oncina, J., Sebban, M.: Learning Stochastic Edit Distance: application in hand-
written character recognition. Pattern Recognition 39(9), 1575–1587 (2006)
8. Bernard, M., Boyer, L., Habrard, A., Sebban, M.: Learning probabilistic models of
tree edit distance. Pattern Recognition 41(8), 2611–2629 (2008)
9. Takasu, A.: Bayesian Similarity Model Estimation for Approximate Recognized
Text Search. In: Proc. of the Int. Conf. on Doc. Ana. and Reco., pp. 611–615
(2009)
10. Saigo, H., Vert, J.-P., Akutsu, T.: Optimizing amino acid substitution matrices
with a local alignment kernel. BMC Bioinformatics 7(246), 1–12 (2006)
11. Balcan, M.F., Blum, A.: On a Theory of Learning with Similarity Functions. In:
Proc. of the Int. Conf. on Machine Learning (ICML), pp. 73–80 (2006)
12. Balcan, M.F., Blum, A., Srebro, N.: Improved Guarantees for Learning via Simi-
larity Functions. In: Proc. of the Conf. on Learning Theory (COLT), pp. 287–298
(2008)
13. Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learn-
ing Research 2, 499–526 (2002)
14. Wang, L., Yang, C., Feng, J.: On Learning with Dissimilarity Functions. In: Proc.
of the Int. Conf. on Machine Learning (ICML), pp. 991–998 (2007)
15. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm Support Vector Machines.
In: Adv. in Neural Inf. Proc. Sys. (NIPS), vol. 16, pp. 49–56 (2003)
16. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks.
Proc. of the National Academy of Sciences of the United States of America 89,
10915–10919 (1992)
17. McCallum, A., Bellare, K., Pereira, F.: A Conditional Random Field for
Discriminatively-trained Finite-state String Edit Distance. In: Conference on Un-
certainty in AI, pp. 388–395 (2005)
18. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combina-
torics, pp. 148–188. Cambridge University Press, Cambridge (1989)
Constrained Laplacian Score for
Semi-supervised Feature Selection
D. Gunopulos et al. (Eds.): ECML PKDD 2011, Part I, LNAI 6911, pp. 204–218, 2011.
c Springer-Verlag Berlin Heidelberg 2011
Constrained Laplacian Score for Semi-supervised Feature Selection 205
the goal of data clustering, then using the labels to generate constraints which
could in turns ameliorate the clustering. In this context, “good” features are
those which better describe the geometric structure of data. On the other hand,
semi-supervised data could be used for supervised learning, i.e. classification
or prediction of the unlabeled examples using a classifier constructed from the
labeled examples. In this context, “good” features are those which are better cor-
related with the labels. Subsequently, the use of a filter method makes the feature
selection process independent from the further learning algorithm whether it is
supervised or unsupervised. This is important to eliminate the bias of feature
selection in both cases, i.e. good features in this case would be those which com-
promise between better description of data structure and better correlation with
desired labels.
2 Related Work
In semi-supervised learning, a data set of N data points X = {x1 , ..., xN } consists
of two subsets depending on the label availability: XL = (x1 , ..., xl ) for which
the labels YL = (y1 , ..., yl ) are provided, and XU = (xl+1 , ..., xl+u ) whose labels
are not given. Here data point xi is a vector with m dimensions (features), and
label yi ∈ {1, 2, ..., C} (C is the number of different labels) and l + u = N (N
is the total number of instances). Let F1 , F2 , ..., Fm denote the m features of X
and f1 , f2 , ..., fm be the corresponding feature vectors that record the feature
value on each instance.
Semi-supervised feature selection is to use both XL and XU to identify the
set of most relevant features Fj1 , Fj2 , ..., Fjk of the target concept, where k ≤ m
and jr ∈ {1, 2, ..., m} for r ∈ {1, 2, ..., k}.
where :
⎧
⎨ xi −xj 2
Sij = e− λ if xi and xj are neighbors or (xi , xj ) ∈ ΩM L (5)
⎩0 otherwise
and:
frj if (xi , xj ) ∈ ΩCL
αirj = (6)
μr otherwise
Since the labeled and unlabeled data are sampled from the same population
generated by target concept, the basis idea behind our score is to generalize
the Laplacian score for semi-supervised feature selection. Note that if there are
208 K. Benabdeslem and M. Hindawi
Algorithm 1. CLS
Input: Data set X
1: Construct the constraint set (ΩM L and ΩCL ) from YL
2: Construct graphs Gkn and GCL from (X, ΩM L ) and ΩCL respectively.
3: Calculate the weight matrices S kn , S CL and their Laplacians Lkn , LCL respec-
tively.
for r = 1 to m do
4: Calculate CLSr
end for
5: Rank the features r according to their CLSr in ascending order.
where c = arg mins xi − ws , is the index of the best matching unit of the data
sample xi , . is the distance mesure, typically the Euclidean distance, and t
denotes the time. hsc (t) is the neighborhood function around the winner unit c.
Constrained Laplacian Score for Semi-supervised Feature Selection 211
δsc
In practice, we often use hsc = e− 2T 2 where T represents the neighborhood
raduis in the map. It is decreased from an initial value Tmax to a final value
Tmin .
Subsequently, as explained above, SOM will be applied on the unsupervised
part of data (XU ) for obtaining XU with a size equal to the number of SOM’
nodes (K). Therefore, CLS will be performed on the new obtained data set
(XL + XU ). Note that any other clustering method could be applied over XU ,
but here SOM is chosen for its ability to well preserve the topological relationship
of data and thus the geometric structure of their distribution. Finally, the feature
selection framework is represented in the Figure 1.
5 Results
The experimental results will be presented on three folds. First, we test our
algorithm on data sets whose the relevant features are known. Second, we do
some comparisons with known powerful feature selection methods and finally,
we apply the algorithm on databases with huge number of features. In most
experiments, the λ value is set to 0.1 and k = 10 for building the neighborhood
graph. For the semi-supervised data, we chose the first labeled examples for
all data sets (with different labels). We did no selection neither on the level of
examples to be labeled, nor on the generated constraints.
(a)
1.5
0.5
−0.5
−1
−1.5
−4 −3 −2 −1 0 1 2 3 4
(b) (c)
4.5 2.5
4 2
3.5 1.5
F2
F4
3 1
2.5 0.5
2 0
4 6 8 0 5 10
F1 F3
four features are sorted as (F3, F1, F4, F2). With k ≥ 15, Laplacian score
sorts these four features as (F3, F4, F1, F2). It sorts them as (F4, F3, F1, F2)
when 3 ≤ k < 15. By using CLS, the features are sorted as (F3, F4, F1, F2)
for any value of k (between 1 and 20). For explaining the difference between
the two scores, we chose for this data set, l = 10 generating 45 constraints.
Two of CL-type constraints are constructed from the pairs (73th , 150th ) and
(78th , 111th ) according to the labels of the points Figure.2(a)1 (The concerned
points are represented by rounds). Since, the data points between brackets are
close, with the Laplacian score, the edges e73,150 and e78,111 are constructed in
the associated k-neighborhood graph and affect the feature selection process.
With our method, these edges never exist because of the CL constraint property
even if k is small. For that, the scores obtained by CLS are smaller than the
ones obtained by Laplacian score. We also observed an important gap on scores
between the relevant variables (CLS3 = 1.4 × 10−3 , CLS4 = 2.7 × 10−3 ) and
the irrelevant ones (CLS1 = 1.07 × 10−2 , CLS2 = 1.77 × 10−2 ). In fact, In
the region where the points belong to the two non-linearly separable classes,
Laplacian score is biased by the dissimilarity which could affect the ranking of
features for their selection, while CLS is able to control this problem with the
help of constraints.
The waveform of Brieman data set “Wave” consists of 5000 instances divided
into 3 classes. This data set is composed of 21 relevant features (the first ones)
and 19 noise features with mean 0 and variance 1. Each class is generated from
a combination of 2/3 “base” waves. We tested our feature selection algorithm
1
Figure 2(a) is obtained by PCA.
214 K. Benabdeslem and M. Hindawi
Brieman Wave
0.08
0.07
0.06
0.05
CLS 0.04
0.03
0.02
0.01
0
0 10 20 30 40 50
Features
with l = 8 (28 constraints) and the dimension of the map (26 × 14) for SOM
algorithm. We can see in Figure 3 that the features (21 to 40) have high values
on CLS. The noise represented by these features is clearly detected.
Fig. 5. Accuracy vs. different numbers of labeled data (for Fisher Score) or pairwise
constraints (for CScore and CLS)
Fig. 6. Accuracy vs. different numbers of selected features on gene expression data sets
Fig. 7. Accuracy vs. different numbers of selected features on face-image data sets
6 Conclusion
In this paper, we proposed a filter approach for semi-supervised feature selection.
A new score function was developed to evaluate the relevance of features based
on both, the locally geometrical structure of unlabeled data and the constrains
preserving ability of labeled data. In this way, we combined two powerful scores,
unsupervised and supervised, in a new one which is more generic for a semi-
supervised paradigm. The proposed score function was explained in the spectral
graph theory framework with the study of the complexity of the associated
algorithm. For reducing this complexity we proposed to cluster the unlabeled
part of data by preserving its geometrical structure before feature selection.
Finally, experimental results on five UCI data sets and one microarray database
show that with only a small number of constraints, the proposed algorithm
significantly outperforms other filter based features selection methods.
There are a number of interesting potential avenues for future research. The
choice of (λ, k) is discussed in [20] and [10], we tried to keep the same values that
the authors used in their experiments in order to compare with their results. But
even, the study of the influence of (λ, k) on our function score (with the treatment
of constraints) is still interesting.
Another line of our future work is to study the constraint utility before in-
tegrating them for feature selection. In our proposal, we used the maximum
number of constraints which could be generated from the labeled data. This
could have ill effects over accuracy when constraints are incoherent or inconsis-
tent. It would be thus more interesting to investigate in constraint selection for
more efficient semi-supervised feature selection.
References
1. Jain, A., Zongker, D.: Feature selection: Evaluation, application, and small sam-
ple performance. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 19(2), 153–158 (1997)
2. Frank, A., Asuncion, A.: Uci machine learning repository. Technical report, Uni-
versity of California (2010)
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal
of Machine Learning Research (3), 1157–1182 (2003)
4. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-
based filter solution. In: Proceedings of the Twentieth International Conference on
Machine Learning (2003)
5. Kohavi, R., John, G.: Wrappers for feature subset selection. Artificial Intelli-
gence 97(12), 273–324 (1997)
6. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by local linear em-
bedding. Science (290), 2323–2326 (2000)
7. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analy-
sis 1(3), 131–156 (2000)
8. Dy, J., Brodley., C.E.: Feature selection for unsupervised learning. Journal of Ma-
chine Learning Research (5), 845–889 (2004)
9. Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised learning. The MIT Press,
Cambridge (2006)
218 K. Benabdeslem and M. Hindawi
10. Zhao, Z., Liu, H.: Semi-supervised feature selection via spectral analysis. In: Pro-
ceedings of SIAM International Conference on Data Mining (SDM), pp. 641–646
(2007)
11. Chung, F.: Spectral graph theory. AMS, Providence (1997)
12. Ren, J., Qiu, Z., Fan, W., Cheng, H., Yu, P.S.: Forward semi-supervised feature
selection. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008.
LNCS (LNAI), vol. 5012, pp. 970–976. Springer, Heidelberg (2008)
13. Basu, S., Davidson, I., Wagstaff, K.: Constrained clustering: Advances in algo-
rithms, theory and applications. Chapman and Hall/CRC Data Mining and Knowl-
edge Discovery Series (2008)
14. Xing, E., Ng, A., Jordan, M., Russel, S.: Distance metric learning, with application
to clustering with side-information. In: Advances in Neural Information Processing
Systems, vol. 15, pp. 505–512 (2003)
15. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a mahalanobis metric
from equivalence constraints. Journal of Machine Learning Research 6, 937–965
(2005)
16. Zhang, D., Zhou, Z., Chen, S.: Semi-supervised dimensionality reduction. In: Pro-
ceedings of SIAM International Conference on Data Mining, SDM (2007)
17. Zhang, D., Chen, S., Zhou, Z.: Constraint score: A new filter method for feature
selection with pairwise constraints. Pattern Recognition 41(5), 1440–1451 (2008)
18. Sun, D., Zhan, D.: Bagging constraint score for feature selection with pairwise
constraints. Pattern Recognition 43(6), 2106–2118 (2010)
19. Kalakech, M., Biela, P., Macaire, L., Hamad, D.: Constraint scores for semi-
supervised feature selection: A comparative study. Pattern Recognition Let-
ters 32(5), 656–665 (2011)
20. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in
Neural Information Processing Systems, vol. 17 (2005)
21. Robnik-Sikonja, M., Kononenko, I.: Theoretical and empirical analysis of relief and
relieff. Machine Learning 53, 23–69 (2003)
22. Zhao, Z., Liu, H.: Spectral feature selection for supervised and unsupervised learn-
ing. In: Proceedings of the Twenty Fourth International Conference on Machine
Learning (2007)
23. Kohonen, T.: Self Organizing Map. Springer, Berlin (2001)
24. Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller,
H., Loh, M., Downing, L., Caligiuri, M., Bloomfield, C., Lander, E.: Molecular
classification of cancer: Class discovery and class prediction by gene expression
monitoring. Science 15 286(5439), 531–537 (1999)
25. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.:
Broad patterns of gene expression revealed by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. Natl. Acad. Sci. 96(12),
6745–6750 (1999)
26. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press,
Oxford (1995)
27. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience, Hoboken
(2000)
COSNet: A Cost Sensitive Neural Network for
Semi-supervised Learning in Graphs
1 Introduction
The growing interest of the scientific community in methods and algorithms for
learning network-structured data is motivated by emerging applications in sev-
eral domains, ranging from social to economic and biological sciences [1, 2]. In
D. Gunopulos et al. (Eds.): ECML PKDD 2011, Part I, LNAI 6911, pp. 219–234, 2011.
c Springer-Verlag Berlin Heidelberg 2011
220 A. Bertoni, M. Frasca, and G. Valentini
from labeled examples [8, 18]. Finally, many approaches based on neural net-
works do not distinguish between the node labels and the values of the neuron
states [21], thus resulting in a lower predictive capability of the network.
To address these issues, we propose a cost-sensitive neural algorithm (COSNet ),
based on Hopfield networks, whose main characteristics are the following:
1. Available a priori information is embedded in the neural network and pre-
served by the network dynamics.
2. Labels and neuron states are conceptually separated. In this way a class of
Hopfield networks is introduced, having as parameters the values of neuron
states and the neuron thresholds.
3. The parameters of the network are learned from the data through an efficient
supervised algorithm, in order to take into account the unbalance between
positive and negative node labels.
4. The dynamics of the network is restricted to its unlabeled part, preserving
the minimization of the overall objective function and significantly reducing
the time complexity of the learning algorithm.
In sect. 2 the classification of nodes in networked data is formalized as a semi-
supervised learning problem. Hopfield networks and the main issues related to
this type of recurrent neural network are discussed in Sect. 3 and 4. The problem
of the restriction of network dynamics to a subset of nodes is analyzed in Sect 5,
while the proposed neural network algorithm, COSNet (COst Sensitive neural
Network), is discussed in Sect. 6. In the same section we show that COSNet
covers the main Hopfield networks learning issues, and in particular a statistical
analysis highlights that the network parameters selected by COSNet lead to
significantly lower values of the the energy function w.r.t. the non cost-sensitive
version of the Hopfield network. In Section 7, to test the proposed algorithm on
a classical unbalanced semi-supervised classification problem, we applied COS-
Net to the genome-wide prediction of gene functions in a model organism, by
considering about 200 different functional classes of the FunCat taxonomy [22],
and five different types of biomolecular data. The conclusions end the paper.
3 Hopfield Networks
At each discrete time t each neuron i has a value xi (t) ∈ {sin α, − cos α} accord-
ing to the following dynamics:
The state of the network at time t is the vector x(t) = (x1 (t), x2 (t), . . . , xn (t)).
The main feature of a Hopfield network is the existence of a quadratic state
function, i.e. the energy function:
1
E(x) = − xT W x + xT γ (2)
2
This is a non increasing function w.r.t. the evolution of the network according
to the activation rules (1), i.e.
5 Sub-network Property
Let be H = < W, γ, α > a network with neurons V = {1, 2, . . . , n}, having
the following bipartitions: (U, S) bipartition of V , where up to a permutation,
U = {1, 2, . . . , h} and S = {h + 1, h + 2, . . . , n}; (S + , S − ) bipartition of S;
(U + , U − ) bipartition of U .
224 A. Bertoni, M. Frasca, and G. Valentini
According to (U, S), each network state x can be decomposed in x = (u, s),
where u and s are respectively the states of neurons in U and in S. The energy
function of H can be written by separating the contributions due to U and S:
1 T
E(u, s) = − u Wuu u + sT Wss s + uT Wus s + sT Wus
T
u + uT γ u + sT γ s , (3)
2
Wuu Wus
where W = T and γ = (γ u , γ s ).
Wus Wss
By setting to a given state s̃ the neurons in S, we consider the dynamics
obtained by updating only neurons in U , without changing the state of neurons
in S. Since
1 1
E(u, s̃) = − uT Wuu u + uT (γ u − Wus s̃) − s̃T Wss s̃ + s̃T γ s ,
2 2
the dynamics of neurons in U is described by the subnetwork HU|s̃ =< Wuu , γ u −
Wus s̃, α >. It holds the following:
In our setting, we associate the state x(S + , S − ) with the given bipartition
(S + , S − ) of S:
− sin α if i ∈ S +
xi (S , S ) =
+
− cos α if i ∈ S −
for each i ∈ S. By the sub-network property, if x(S + , S − ) is part of a energy
global minimum of H, we can predict the hidden part relative to neurons U by
minimizing the energy of HU|x(S + ,S − ) .
6 COSNet
In this section we propose COSNet (COst-Sensitive neural Network), a semi-
supervised learning algorithm whose main feature is the introduction of a super-
vised learning strategy which exploits the sub-network property to automatically
estimate the parameters α and γ of the network H =< W, γ, α >. The main steps
of COSNet can be summarized as follows:
INPUT : symmetric connection matrix W : V × V −→ [0, 1], bipartition (U, S)
of V and bipartition (S + , S − ) of S.
OUTPUT : bipartition (U + , U − ) of U .
Step 3. Extend the parameters (α̂, γ̂) to the whole network and run the sub-
network HU|x(S + ,S − ) until an equilibrium state û is reached. The final solu-
tion (U + , U − ) is:
U + = {i ∈ U | ûi = sin α̂}
U − = {i ∈ U | ûi = − cos α̂}.
Below we explain in more details each step of the algorithm.
Fact 6.1. I + is linearly separable from I − if and only if there is a couple (α, γ)
such that x(S + , S − ) is an equilibrium state for the network HS|x(U + ,U − ) .
This fact suggests a method to optimize the parameters α and γ. Let be fα,γ a
straight line in the plane that separates the points Iα,γ
+
= {Δ(k) | fα,γ (Δ(k)) ≥
−
0} from points Iα,γ = {Δ(k) | fα,γ (Δ(k)) < 0}:
fα,γ (y, z) = cos α · y − sin α · z − γ = 0 (4)
Note that we assume that the positive half-plane is “above” the line fα,γ .
To optimize the parameters (α, γ) we adopt the F-score maximization crite-
rion, since it can be shown that Fscore (α, γ) = 1 iff x(S + , S − ) is an equilibrium
state of HS|x(U + ,U − ) . We obtain
(α̂, γ̂) = argmax Fscore (α, γ). (5)
α,γ
where θi = γ̂ − wij xj (S + , S − ). When the position of the positive half-plane is
j∈S
“below” the line, the disequalities (6) need to be reversed: the first one becomes
“sin α̂ if . . . > 0”, and the second “− cos α̂ if . . . < 0”.
The stable state û reached by this dynamics is used to classify unlabeled
data. If the known state x(S + , S − ), with the parameters found according to the
procedure described in Section 6.2, is a part of a global minimum of the energy of
H, and û is an energy global minimum of HU|x(S + ,S − ) , the sub-network property
(Section 5) guarantees that (û, x(S + , S − )) is a energy global minimum of H.
Table 1. Confidence interval estimation for the probabilities Px(α̂) and Px at a confi-
dence level 0.95 (data set PPI-VM)
We distinguish two main cases: a) both the confidence intervals coincide with
the minimum interval [0, 0.0030], case coherent with the prior information; b)
both lower and upper bounds of Px(α̂) are less than the corresponding bounds
of Px . It is worth noting that, in almost all cases, the probability Px(α̂) has an
upper bound smaller than the lower bound of Px . This is particularly evident
for classes “01.03.16.01”, “02.13” and “11.02.01”; in the latter the lower bound
of Px is 0.7761, while the corresponding upper bound of Px(α̂) is 0.
These results, reproduced with similar trends in other data sets (data not
shown), point out the effectiveness of our method in approaching the problem
of the incoherence of the prior knowledge coding.
For PPI data we adopt the scoring function used by Chua et al [36], which
assigns to genes i and j the similarity score
2|Ni ∩ Nj | 2|Ni ∩ Nj |
Sij = ×
|Ni − Nj | + 2|Ni ∩ Nj | + 1 |Nj − Ni | + 2|Ni ∩ Nj | + 1
7.2 Results
We compared COSNet with other semi-supervised label propagation algorithms
and supervised machine learning methods proposed in the literature for the gene
function prediction problem. We considered the classical GAIN algorithm [21],
based on Hopfield networks; LP-Zhu, a semi-supervised learning method based
on label propagation [8]; SVM-l and SVM-g, i.e. respectively linear and gaussian
kernel SVMs with probabilistic output [37]. SVMs had previously been shown to
be among the best algorithms for predicting gene functions in a “flat” setting (that
is without considering the hierarchical relationships between classes) [38, 39].
To estimate the generalization capabilities of the compared methods we
adopted a stratified 10-fold cross validation procedure, by ensuring that each
fold includes at least one positive example for each classification task. Consid-
ering the severe unbalance between positive and negative classes, beyond the
classical accuracy, we computed the F-score for each functional class and for
each considered data set. Indeed in this context the accuracy is only partially
A COSNet for Semi-supervised Learning in Graphs 231
0.7
COSNet COSNet
SVM−l SVM−l
SVM−g SVM−g
0.6
0.5
0.4
0.4
PPI−VM
Pfam−2
0.3
0.3
0.2
0.2
0.1
0.1
0.0
0.0
Fig. 1. Average precision, recall and F-score for each compared method (excluding
GAIN). Left: Pfam-2; Right: PPI-VM.
as a result of a good balancing between them. These results are replicated with
the other data sets, even if with Pfam-1 and Expr data COSNet achieves also
the best average precision and recall (data not shown).
We think that these results come from the COSNet cost-sensitive approach
that allows to automatically find the “near-optimal” parameters of the network
with respect to the distribution of positive and negative nodes (Section 6). It is
worth noting that using only single sources of data COSNet can obtain a rela-
tively high precision, without suffering a too high decay of the recall. This is of
paramount importance in the gene function prediction problem, where “in silico”
positive predictions of unknown genes need to be confirmed by expensive “wet”
biological experimental validation procedures. From this standpoint the experi-
mental results show that our proposed method could be applied to predict the
“unknown” functions of genes, considering also that data fusion techniques could
in principle further improve the reliability and the precision of the results [2, 41].
8 Conclusions
We introduced an effective neural algorithm, COSNet, which exploits Hopfield
networks for semi-supervised learning in graphs. COSNet adopts a cost sensitive
methodology to manage the unbalance between positive and negative labels, and
to preserve and coherently encode the prior information. We applied COSNet
to the genome-wide prediction of gene function in yeast, showing a large im-
provement of the prediction performances w.r.t. the compared state-of-the-art
methods.
By noting that the parameter γ of the neural network may assume different
values for each node, our method could be extended by allowing a different ac-
tivation threshold for each neuron. To avoid overfitting due to the increment of
network parameters, this approach should be paired with proper regularization
techniques. Moreover, by exploiting the supervised learning of network param-
eters, COSNet could be also adapted to combine multiple sources of networked
data: indeed the accuracy of the linear classifier on the labeled portion of the net-
work could be used to “weight” the associated source of data, in order to obtain
a “consensus” network, whose edges are the result of a weighted combination of
multiple types of data.
Acknowledgments. The authors gratefully acknowledge partial support by the
PASCAL2 Network of Excellence under EC grant no. 216886. This publication
only reflects the authors’ views.
References
[1] Zheleva, E., Getoor, L., Sarawagi, S.: Higher-order graphical models for classifi-
cation in social and affiliation networks. In: NIPS 2010 Workshop on Networks
Across Disciplines: Theory and Applications, Whistler BC, Canada (2010)
[2] Mostafavi, S., Morris, Q.: Fast integration of heterogeneous data sources for pre-
dicting gene function with limited annotation. Bioinformatics 26(14), 1759–1765
(2010)
A COSNet for Semi-supervised Learning in Graphs 233
[3] Vazquez, A., et al.: Global protein function prediction from protein-protein inter-
action networks. Nature Biotechnology 21, 697–700 (2003)
[4] Leskovec, J., et al.: Statistical properties of community structure in large social
and information networks. In: Proc. 17th Int. Conf. on WWW, pp. 695–704. ACM,
New York (2008)
[5] Bilgic, M., Mihalkova, L., Getoor, L.: Active learning for networked data. In: Proc.
of the 27th ICML, Haifa, Israel (2010)
[6] Marcotte, E., et al.: A combined algorithm for genome-wide prediction of protein
function. Nature 402, 83–86 (1999)
[7] Oliver, S.: Guilt-by-association goes global. Nature 403, 601–603 (2000)
[8] Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning with gaussian
fields and harmonic functions. In: Proc. of the 20th ICML, Washintgton DC,
USA (2003)
[9] Zhou, D.: et al.: Learning with local and global consistency. In: Adv. Neural Inf.
Process. Syst., vol. 16, pp. 321–328 (2004)
[10] Szummer, M., Jaakkola, T.: Partially labeled classification with markov random
walks. In: NIPS 2001, Whistler BC, Canada, vol. 14 (2001)
[11] Azran, A.: The rendezvous algorithm: Multi- class semi-supervised learning with
Markov random walks. In: Proc. of the 24th ICML (2007)
[12] Belkin, M., Matveeva, I., Niyogi, P.: Regularization and semi-supervised learning
on large graphs. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI),
vol. 3120, pp. 624–638. Springer, Heidelberg (2004)
[13] Delalleau, O., Bengio, Y., Le Roux, N.: Efficient non-parametric function induc-
tion in semi-supervised learning. In: Proc. of the Tenth Int. Workshop on Artificial
Intelligence and Statistics (2005)
[14] Belkin, M., Niyogi, P.: Using manifold structure for partially labeled classification.
In: Adv. Neural Inf. Process. Syst., vol. 15 (2003)
[15] Nabieva, E., et al.: Whole-proteome prediction of protein function via graph-
theoretic analysis of interaction maps. Bioinformatics 21(S1), 302–310 (2005)
[16] Deng, M., Chen, T., Sun, F.: An integrated probabilistic model for functional
prediction of proteins. J. Comput. Biol. 11, 463–475 (2004)
[17] Tsuda, K., Shin, H., Scholkopf, B.: Fast protein classification with multiple net-
works. Bioinformatics 21(suppl 2), ii59–ii65 (2005)
[18] Mostafavi, S., et al.: GeneMANIA: a real-time multiple association network inte-
gration algorithm for predicting gene function. Genome Biology 9(S4) (2008)
[19] Hopfield, J.: Neural networks and physical systems with emergent collective com-
pautational abilities. Proc. Natl Acad. Sci. USA 79, 2554–2558 (1982)
[20] Bengio, Y., Delalleau, O., Le Roux, N.: Label Propagation and Quadratic Crite-
rion. In: Chapelle, O., Scholkopf, B., Zien, A. (eds.) Semi-Supervised Learning,
pp. 193–216. MIT Press, Cambridge (2006)
[21] Karaoz, U., et al.: Whole-genome annotation by using evidence integration in
functional-linkage networks. Proc. Natl Acad. Sci. USA 101, 2888–2893 (2004)
[22] Ruepp, A., et al.: The FunCat, a functional annotation scheme for systematic
classification of proteins from whole genomes. Nucleic Acids Research 32(18),
5539–5545 (2004)
[23] Wang, D.: Temporal pattern processing. In: The Handbook of Brain Theory and
Neural Networks, pp. 1163–1167 (2003)
[24] Liu, H., Hu, Y.: An application of hopfield neural network in target selection of
mergers and acquisitions. In: International Conference on Business Intelligence
and Financial Engineering, pp. 34–37 (2009)
234 A. Bertoni, M. Frasca, and G. Valentini
[25] Zhang, F., Zhang, H.: Applications of a neural network to watermarking capacity
of digital image. Neurocomputing 67, 345–349 (2005)
[26] Tsirukis, A.G., Reklaitis, G.V., Tenorio, M.F.: Nonlinear optimization using gen-
eralized hopfield networks. Neural Comput. 1, 511–521 (1989)
[27] Ashburner, M., et al.: Gene ontology: tool for the unification of biology. the gene
ontology consortium. Nature Genetics 25(1), 25–29 (2000)
[28] Agresti, A., Coull, B.A.: Approximate is better than exact for interval estimation
of binomial proportions. Statistical Science 52(2), 119–126 (1998)
[29] Brown, L.D., Cai, T.T., Dasgupta, A.: Interval estimation for a binomial propor-
tion. Statistical Science 16, 101–133 (2001)
[30] Cesa-Bianchi, N., Valentini, G.: Hierarchical cost-sensitive algorithms for genome-
wide gene function prediction. Journal of Machine Learning Research, W&C Pro-
ceedings, Machine Learning in Systems Biology 8, 14–29 (2010)
[31] Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14(9), 755–763 (1998)
[32] Spellman, P.T., et al.: Comprehensive identification of cell cycle-regulated genes of
the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology
of the Cell 9(12), 3273–3297 (1998)
[33] Gasch, P., et al.: Genomic expression programs in the response of yeast cells to
environmental changes. Mol. Biol. Cell 11(12), 4241–4257 (2000)
[34] Stark, C., et al.: Biogrid: a general repository for interaction datasets. Nucleic
Acids Research 34(Database issue), 535–539 (2006)
[35] von Mering, C., et al.: Comparative assessment of large-scale data sets of protein-
protein interactions. Nature 417(6887), 399–403 (2002)
[36] Chua, H., Sung, W., Wong, L.: An efficient strategy for extensive integration
of diverse biological data for protein function prediction. Bioinformatics 23(24),
3364–3373 (2007)
[37] Lin, H.T., Lin, C.J., Weng, R.: A note on platt’s probabilistic outputs for support
vector machines. Machine Learning 68(3), 267–276 (2007)
[38] Brown, M.P.S., et al.: Knowledge-based analysis of microarray gene expression
data by using support vector machines. Proceedings of the National Academy of
Sciences of the United States of America 97(1), 267–276 (2000)
[39] Pavlidis, P., et al.: Learning gene functional classifications from multiple data
types. Journal of Computational Biology 9, 401–411 (2002)
[40] Wilcoxon, F.: Individual comparisons by ranking methods. Journal of Computa-
tional Biology 1(6), 80–83 (1945)
[41] Re, M., Valentini, G.: Simple ensemble methods are competitive with state-of-
the-art data integration methods for gene function prediction. Journal of Machine
Learning Research, W&C Proceedings, Machine Learning in Systems Biology 8,
98–111 (2010)
Regularized Sparse Kernel
Slow Feature Analysis
1 Introduction
D. Gunopulos et al. (Eds.): ECML PKDD 2011, Part I, LNAI 6911, pp. 235–248, 2011.
c Springer-Verlag Berlin Heidelberg 2011
236 W. Böhmer et al.
The principle of slowness, though not the above definition, was used early on in the context of neural networks [2, 8]. Recent variations of SFA differ either in the objective [4] or in the constraints [7, 24]. For some simplified cases, given an infinite time series and an unrestricted function class, it can be shown analytically that SFA solutions converge to trigonometric polynomials w.r.t. the underlying latent variables [9, 22]. In reality those conditions are never met, and one requires a function class that can be adjusted to the data set at hand.
$$\min_{y \in \mathcal{H}^p} \sum_{i=1}^{p} \mathbb{E}_t\big[\dot{y}_i^2(x_t)\big], \quad \text{s.t.} \quad \mathbb{E}_t\big[y(x_t)\, y(x_t)^\top\big] = I. \qquad (6)$$
where $K_{ij} = \kappa(x_i, x_j)$ is the kernel matrix and $D \in \mathbb{R}^{n \times (n-1)}$ the temporal derivation matrix with all zero entries except $\forall t \in \{1, \ldots, n-1\}: D_{t,t} = -1$ and $D_{t+1,t} = 1$.
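To make the construction concrete, the following minimal NumPy sketch builds the temporal derivation matrix $D$; the function name and the NumPy setting are ours, not part of the original algorithm description.

```python
import numpy as np

def temporal_derivative_matrix(n):
    """Temporal derivation matrix D in R^{n x (n-1)}: all zeros except
    D[t,t] = -1 and D[t+1,t] = 1, so that column t of K @ D is the
    difference k(x_{t+1}) - k(x_t) between consecutive kernel columns."""
    D = np.zeros((n, n - 1))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx + 1, idx] = 1.0
    return D
```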
Sparse Kernel SFA. If one assumes the feature mappings within the span of another set of data $\{z_i\}_{i=1}^m \subset \mathcal{X}$ (e.g. a sparse subset of the training data, often called support vectors), the sparse kernel matrix $K \in \mathbb{R}^{m \times n}$ is defined as $K_{ij} = \kappa(z_i, x_j)$ instead. The resulting algorithm will be called sparse kernel SFA. Note that the representer theorem no longer applies and therefore the solution merely approximates the optimal mappings in $\mathcal{H}$. Both optimization problems have identical solutions if $\forall t \in \{1, \ldots, n\}: \kappa(\cdot, x_t) \in \mathrm{span}(\{\kappa(\cdot, z_i)\}_{i=1}^m)$, e.g. $\{z_i\}_{i=1}^n = \{x_t\}_{t=1}^n$.
Regularized Sparse Kernel SFA. The Hilbert spaces corresponding to some of the most popular kernels are equivalent to the infinite dimensional space of continuous functions [17]. One example is the Gaussian kernel $\kappa(x, x') = \exp\big(-\frac{1}{2\sigma^2}\|x - x'\|_2^2\big)$. Depending on the hyper-parameter $\sigma$ and the data distribution, this can obviously lead to over-fitting. Less obvious, however, is the tendency of kernel SFA to become numerically unstable for large $\sigma$, i.e. to violate the unit variance constraint. Fukumizu et al. [10] have shown this analytically for the related kernel canonical correlation analysis. Note that both problems do not affect sufficiently sparse solutions, as sparsity reduces the function complexity and sparse kernel matrices $KK^\top$ are more robust w.r.t. eigenvalue decompositions.
One countermeasure is to introduce a regularization term to stabilize the sparse kernel SFA algorithm, which thereafter will be called regularized sparse kernel SFA (RSK-SFA). Our approach penalizes the squared Hilbert-norm $\|y_i\|_{\mathcal{H}}^2$ of the selected functions by a regularization parameter $\lambda$. Analogous to K-SFA, the kernel trick can be utilized to obtain the new objective:
$$\min_{A \in \mathbb{R}^{m \times p}} \; \frac{1}{n-1}\, \mathrm{tr}\big(A^\top K D D^\top K^\top A\big) + \lambda\, \mathrm{tr}\big(A^\top \bar{K} A\big), \quad \text{s.t.} \quad \frac{1}{n} A^\top K K^\top A = I, \qquad (8)$$
where $\bar{K}_{ij} = \kappa(z_i, z_j)$ is the kernel matrix of the support vectors.
Zero Mean. To fulfil the zero mean constraint, one centres the support functions $\{g_i\}_{i=1}^m \subset \mathcal{H}$ w.r.t. the data distribution, i.e. $g_i(\cdot) := \kappa(\cdot, z_i) - \mathbb{E}_t[\kappa(x_t, z_i)]\, \mathbf{1}_{\mathcal{H}}(\cdot)$, where $\forall x \in \mathcal{X}: \mathbf{1}_{\mathcal{H}}(x) = \langle \mathbf{1}_{\mathcal{H}}, \kappa(\cdot, x) \rangle_{\mathcal{H}} := 1$ is the constant function in Hilbert space $\mathcal{H}$. Although $\forall y \in \mathrm{span}(\{g_i\}_{i=1}^m): \mathbb{E}_t[y(x_t)] = 0$ already holds, it is of advantage to centre the support functions as well w.r.t. each other [16], i.e. $\hat{g}_i := g_i - \mathbb{E}_j[g_j]$. The resulting transformation of support functions on the training data can be applied directly onto the kernel matrices $K$ and $\bar{K}$:
$$\hat{K} := \big(I - \tfrac{1}{m} \mathbf{1}_m \mathbf{1}_m^\top\big)\, K\, \big(I - \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top\big) \qquad (9)$$
$$\hat{\bar{K}} := \big(I - \tfrac{1}{m} \mathbf{1}_m \mathbf{1}_m^\top\big)\, \bar{K}\, \big(I - \tfrac{1}{m} \mathbf{1}_m \mathbf{1}_m^\top\big) \qquad (10)$$
Unit Variance and Decorrelation. Analogous to linear SFA, we first project into the normalized eigenspace of $\frac{1}{n} \hat{K}\hat{K}^\top =: U \Lambda U^\top$. The procedure is called sphering or whitening and fulfils the constraint in Equation 8, invariant to further rotations $R \in \mathbb{R}^{m \times p}$ with $R^\top R = I$:
$$A := U \Lambda^{-\frac{1}{2}} R \quad \Rightarrow \quad \frac{1}{n} A^\top \hat{K}\hat{K}^\top A = R^\top R = I \qquad (11)$$
Note that an inversion of the diagonal matrix $\Lambda$ requires the removal of zero eigenvalues and corresponding eigenvectors first.
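A minimal sketch of the resulting computation, assuming precomputed kernel matrices $K$ ($m \times n$) and $\bar{K}$ ($m \times m$). Completing the whitened minimization via an eigendecomposition of the regularized objective is our reading of how Equations (8)-(11) fit together, not a verbatim transcription of the paper's implementation.

```python
import numpy as np

def rsk_sfa(K, K_bar, lam, p, eps=1e-12):
    """RSK-SFA sketch: K[i, t] = kappa(z_i, x_t), K_bar[i, j] = kappa(z_i, z_j),
    lam = regularization parameter, p = number of slow features.
    Returns the coefficient matrix A in R^{m x p}."""
    m, n = K.shape
    # Centering as in Eqs. (9)-(10)
    Hm = np.eye(m) - np.ones((m, m)) / m
    Hn = np.eye(n) - np.ones((n, n)) / n
    K_hat = Hm @ K @ Hn
    K_bar_hat = Hm @ K_bar @ Hm
    # Sphering (Eq. 11): eigendecompose (1/n) K_hat K_hat^T, drop ~zero eigenvalues
    Lam, U = np.linalg.eigh(K_hat @ K_hat.T / n)
    keep = Lam > eps
    W = U[:, keep] / np.sqrt(Lam[keep])          # W = U Lam^{-1/2}
    # Temporal derivation matrix D
    D = np.zeros((n, n - 1))
    idx = np.arange(n - 1)
    D[idx, idx], D[idx + 1, idx] = -1.0, 1.0
    # Regularized objective of Eq. (8) in whitened coordinates; the p
    # eigenvectors with smallest eigenvalues give the rotation R
    Kd = K_hat @ D
    B = W.T @ (Kd @ Kd.T / (n - 1) + lam * K_bar_hat) @ W
    _, R = np.linalg.eigh(B)                     # ascending eigenvalues
    return W @ R[:, :p]                          # A = U Lam^{-1/2} R
```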
Grouping the kernel functions of all support vectors together in a column vector, i.e. $\mathbf{k}(x) = [\kappa(z_1, x), \ldots, \kappa(z_m, x)]^\top$, the combined solution $y(\cdot) \in \mathcal{H}^p$ can be expressed more compactly:
The representer theorem guarantees that the optimal feature maps $y_i^* \in \mathcal{H}$ for training set $\{x_t\}_{t=1}^n$ can be found within $\mathrm{span}(\{\kappa(\cdot, x_t)\}_{t=1}^n)$. For sparse K-SFA, however, no such guarantee exists. The quality of such a sparse approximation depends exclusively on the set of support vectors $\{z_i\}_{i=1}^m$.
Without restriction on $y^*$, it is straightforward to select a subset of the training data, indicated by an index vector² $\mathbf{i} \in \mathbb{N}^m$ with $\{z_j\}_{j=1}^m := \{x_{i_j}\}_{j=1}^m$, that minimizes the approximation error
$$\varepsilon_t := \min_{\alpha \in \mathbb{R}^m} \Big\| \kappa(\cdot, x_t) - \sum_{j=1}^m \alpha_j\, \kappa(\cdot, x_{i_j}) \Big\|_{\mathcal{H}}^2 = K_{tt} - K_{t\mathbf{i}}\, (K_{\mathbf{i}\mathbf{i}})^{-1} K_{\mathbf{i}t}$$
for all training samples $x_t$, where $K_{tj} = \kappa(x_t, x_j)$ is the full kernel matrix. Finding an optimal subset is an NP-hard combinatorial problem, but there exist several greedy approximations to it.
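A sketch of this error computation in NumPy, assuming the full kernel matrix fits in memory (function and variable names are ours):

```python
import numpy as np

def approximation_errors(K_full, idx):
    """Approximation error eps_t = K_tt - K_ti (K_ii)^{-1} K_it for every
    sample t, given the full kernel matrix K_full (n x n) and the index
    vector idx of the currently selected support vectors."""
    K_ti = K_full[:, idx]                        # (n x m)
    K_ii = K_full[np.ix_(idx, idx)]              # (m x m)
    # Solve instead of inverting, for numerical stability
    S = np.linalg.solve(K_ii, K_ti.T)            # (m x n)
    return np.diag(K_full) - np.einsum('nm,mn->n', K_ti, S)
```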
Online Maximization of the Affine Hull. A widely used algorithm [5], which we will call online maximization of the affine hull (online MAH) in the absence of a generally accepted name, iterates through the data in an online fashion. At time $t$, sample $x_t$ is added to the selected subset if its approximation error $\varepsilon_t$ is larger than some given threshold $\eta$. Exploitation of the matrix inversion lemma (MIL) allows an online algorithm with computational complexity $O(m^2 n)$ and memory complexity $O(m^2)$. The downside of this approach is the unpredictable dependence of the final subset size $m$ on the hyper-parameter $\eta$. Downsizing the subset therefore requires a complete re-computation with a larger $\eta$. The resulting subset size is not predictable, although it depends monotonically on $\eta$.
² Let “:” denote the index vector of all available indices.
Matching Pursuit for Sparse Kernel PCA. This handicap is addressed by matching pursuit methods [14]. Applied to kernels, some criterion selects the best fitting sample, followed by an orthogonalization of all remaining candidate support functions in Hilbert space $\mathcal{H}$. A resulting sequence of $m$ selected samples therefore contains all sequences up to length $m$ as well. The batch algorithm of Smola and Schölkopf [18] chooses the sample $x_j$ that minimizes³ $\mathbb{E}_t\big[\varepsilon_t^{\mathbf{i} \cup j}\big]$. It was shown later that this algorithm performs sparse PCA in $\mathcal{H}$ [12]. The algorithm, which we will call in the following matching pursuit for sparse kernel PCA (MP KPCA), has a computational complexity of $O(n^2 m)$ and a memory complexity of $O(n^2)$. In practice it is therefore not applicable to large data sets.
$$i_{j+1} := \arg\min_t \big\| \big[\varepsilon_1^{\mathbf{i} \cup t}, \ldots, \varepsilon_n^{\mathbf{i} \cup t}\big] \big\|_\infty \approx \arg\max_t \varepsilon_t \qquad (15)$$
Algorithm 2 iterates between sample selection (Equation 15) and error update (Equation 16). The complexity is $O(m^2 n)$ in time and $O(mn)$ in memory (to avoid re-computations of $K_{(:,\mathbf{i})}$).
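Since the incremental error update (Equation 16) is not reproduced in this excerpt, the following sketch recomputes the approximation errors from scratch in every iteration; it illustrates the selection loop but is slower than the MIL-based variant described above, and all names are ours.

```python
import numpy as np

def mp_selection(K_full, m):
    """Naive matching-pursuit selection loop: repeatedly add the sample
    with the largest current approximation error. The paper's incremental
    error update is replaced here by a full recomputation per step."""
    idx = [int(np.argmax(np.diag(K_full)))]      # with an empty set, eps_t = K_tt
    for _ in range(m - 1):
        K_ti = K_full[:, idx]
        K_ii = K_full[np.ix_(idx, idx)]
        S = np.linalg.solve(K_ii, K_ti.T)
        eps = np.diag(K_full) - np.einsum('nm,mn->n', K_ti, S)
        eps[idx] = -np.inf                       # never re-select a chosen sample
        idx.append(int(np.argmax(eps)))
    return np.array(idx)
```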
4 Empirical Validation
Slow feature analysis is not restricted to any specific type of time series data. To give a proper evaluation of our kernel SFA algorithm, we therefore chose benchmark data from very different domains. The common element, however, is the existence of a low-dimensional underlying cause.
We evaluated all algorithms on two data sets: audio recordings from a vowel classification task and a video depicting a random sequence of two hand-signs. The second task covers high-dimensional image data (40 × 30 pixels), which is a very common setting for SFA. In contrast, mono audio data is one-dimensional. Multiple time steps have to be grouped into a sample to create a high-dimensional space in which the state space is embedded as a manifold (see Takens' theorem [11, 20]). All experiments employ a Gaussian kernel $\kappa(x, x') = \exp\big(-\frac{1}{2\sigma^2}\|x - x'\|_2^2\big)$.
The true latent variables of the examined data are not known. To ensure that any meaningful information is extracted, we measure the test slowness, i.e. the slowness of the learned feature mappings applied to a previously unseen test sequence drawn from the same distribution. The variance of a slow feature on unseen data is not strictly specified, which changes the feature's slowness; for comparison we therefore normalized all test outputs to unit variance before measuring the test slowness.
Audio Data. The “north Texas vowel database”⁵ contains uncompressed audio files with English words of the form h...d, spoken multiple times by multiple persons [1]. The natural task is to predict the central vowels of unseen instances. We selected two data sets: (1) a small set with four training and four test instances for each of the words “heed” and “head”, spoken by the same person; (2) a large training set of four speakers with eight instances per person and each of the
⁵ http://www.utdallas.edu/~assmann/KIDVOW1/North_Texas_vowel_database.html
words “heed” and “head”. The corresponding test set consists of eight instances of each word spoken by a fifth person. The spoken words are provided as mono audio streams of varying length at 48 kHz, i.e. as a series of amplitude readings $\{a_1, a_2, \ldots\}$. To obtain an embedding of the latent variables, one groups a number of amplitude readings into a sample $x_t = [a_{\delta t}, a_{\delta t + \varepsilon}, a_{\delta t + 2\varepsilon}, \ldots, a_{\delta t + (l-1)\varepsilon}]^\top$. We evaluated the parameters $\delta$, $\varepsilon$ and $l$ empirically and chose $\delta = 50$, $\varepsilon = 5$ and $l = 500$. This provided us with 3719 samples $x_t \in [-1, 1]^{500}$ for the small and 25448 samples for the large training set. Although the choice of embedding parameters changes the resulting slowness in a nontrivial fashion, we want to point out that this change appears to be smooth and the presented shapes remain similar over a wide range of embedding parameters. The output of two RSK-SFA features is plotted in Figure 2c.
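The embedding itself is a simple strided indexing operation; a sketch with the chosen parameter values as defaults (the function name is ours):

```python
import numpy as np

def embed_audio(a, delta=50, eps=5, l=500):
    """Time-delay embedding of a 1-d amplitude series `a`:
    x_t = [a[delta*t], a[delta*t + eps], ..., a[delta*t + (l-1)*eps]]."""
    n = (len(a) - (l - 1) * eps - 1) // delta + 1   # number of full samples
    t = np.arange(n)[:, None] * delta + np.arange(l)[None, :] * eps
    return a[t]                                     # (n x l) sample matrix
```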
Figure 1 shows the test slowness of RSK-SFA features⁶ on all three data sets for multiple values of the kernel parameter σ and the regularization parameter λ. The small audio data set (column a) and the video data set (column c) use the complete training set as support vectors, whereas for the large audio data set (column b) a full kernel approach is not feasible. Instead, we selected a subset of size 2500 (based on kernel parameter σ = 2) with the MP MAH algorithm before training.
In the absence of significant sparseness (Figure 1a and 1c), unregularized
kernel SFA (λ = 0, equivalent to K-SFA, Equations 6 and 7) shows both over-
fitting and numerical instability. Over-fitting can be seen at small σ, where the
features fulfil the unit variance constraint (lower right plot), but do not reach
the minimal test slowness (lower left plot). The bad performance for larger σ,
on the other hand, must be blamed on numerical instability, as indicated by
a significantly violated unit variance constraint. Both can be counteracted by
proper regularization. Although optimal regularization parameters λ are quite
small and can reach computational precision, there is a wide range of kernel
parameters σ for which the same minimal test slowness is reachable. E.g. in
Figure 1a, a fitting λ can be found between σ = 0.5 and σ = 20.
A more common case, depicted in Figure 1b, is a large training set from which
a small subset is selected by MP MAH. Here no regularization is necessary and
⁶ Comparison to linear SFA features is omitted due to scale, e.g. test slowness was slightly above 0.5 for both audio data sets. RSK-SFA can therefore outperform linear SFA by up to a factor of 10, a magnitude we observed in other experiments as well.
Fig. 1. (Regularization) Mean test slowness of 200 RSK-SFA features over varying
kernel parameter σ for different regularization parameters λ: (a) Small audio data set
with all m = n = 3719 training samples as support vectors, (b) large audio data set
with m = 2500 support vectors selected out of n = 25448 training samples by MP MAH
and (c) video data set with all m = n = 3600 available support vectors. The lower left
plots magnify the left side of the above plot. Notice the difference in scale. The lower
right plots show the variance of the training output. Significant deviation from one
violates the unit variance constraint and thus demonstrates numerical instability. The
legend applies to all plots.
4.3 Sparsity
To evaluate the behaviour of RSK-SFA for sparse subsets of different sizes, Figure 2a plots the test slowness on the audio data for all discussed sparse subset selection algorithms. As a baseline, we plotted the mean and standard deviation of a random selection scheme. One can observe that all algorithms surpass the random selection significantly but do not differ much from each other. As expected, MP MAH and online MAH perform virtually identically. The novel MP MAH, however, allows unproblematic and fast fine-tuning of the selected subset's size.
5 Discussion
To provide a powerful but easily operated algorithm that performs non-linear
slow feature analysis, we derived a kernelized SFA algorithm (RSK-SFA). The
Fig. 2. (Sparseness) (a) Mean test slowness of 200 RSK-SFA features of the small
audio data set (λ = 10−7 , σ = 2) over sparse subset size, selected by different algo-
rithms. For random selection the mean of 10 trials is plotted with standard deviation
error bars. (b) Examples from the video data set depicting the performed hand-signs.
(c) Two RSK-SFA features applied on the large audio test set (λ = 0, σ = 20). Vertical
lines separate eight instances of “heed” and eight instances of “head”.
novel algorithm is capable of handling small data sets by regularization and large
data sets through sparsity. To select a sparse subset for the latter, we developed
a matching pursuit approach to a widely used algorithm.
As suggested by previous works, our experiments show that for large data
sets no explicit regularization is needed. The implicit regularization introduced
by sparseness is sufficient to generate features that generalize well over a wide
range of Gaussian kernels. The performance of sparse kernel SFA depends on
the sparse subset, selected in a pre-processing step. In this setting, the subset
size m takes the place of regularization parameter λ. It is therefore imperative
to control m with minimal computational overhead.
We compared two state-of-the-art algorithms that select sparse subsets in polynomial time. Online MAH is well suited to processing large data sets, but selects subsets of unpredictable size. A change of m therefore requires a full re-computation without the ability to target a specific size. Matching pursuit for sparse kernel PCA (MP KPCA), on the other hand, returns an ordered list of selected samples. After selection of a sufficiently large subset, lowering m incurs no additional cost. The downside is a quadratic dependency on the training set's size, both in time and memory. Both algorithms showed similar performance and significantly outperformed a random selection scheme.
The subsets selected by the novel matching-pursuit counterpart of online MAH (MP MAH) yielded virtually the same performance as those selected by online MAH. There is no difference in computation time, but the memory complexity of MP MAH depends linearly on the training set's size. However, reducing m works just as with MP KPCA, which makes this algorithm the better choice
if one can afford the memory. If not, Online MAH can be applied several times
with slowly decreasing hyper-parameter η. Although a subset of suitable size will
eventually be found, this approach will take much more time than MP MAH.
The major advancement of our approach over the kernel SFA algorithm of Bray and Martinez [4] is the ability to obtain features that generalize well for small data sets. If one is forced to use a large proportion of the training set as support vectors, e.g. for small training sets of complex data, the solution can violate the unit variance constraint. Fukumizu et al. [10] have shown this analytically for the related kernel canonical correlation analysis and demonstrated the use of regularization. In contrast to their approach, we penalize the Hilbert norm of the selected function rather than regularizing the unit variance constraint. This leads to a faster learning procedure: first one fulfils the constraints as laid out in Section 2.1, followed by repeated optimizations of the objective with slowly increasing λ (starting at 0), until the constraints are no longer violated. The resulting features are numerically stable, but may still exhibit over-fitting. The latter can be reduced by raising λ even further or by additional sparsification. Note that our empirical results in Section 4.2 suggest that for a fitting λ the RSK-SFA solution remains optimal over large regimes of the kernel parameter σ, rendering an expensive parameter search unnecessary.
Our experimental results on audio data suggest that RSK-SFA is a promising pre-processing method for audio detection, description, clustering and many other applications. Applied to large unlabelled natural-language databases, e.g. telephone records or audio books, RSK-SFA in combination with MP MAH or online MAH will construct features that are sensitive to speech patterns of the presented language. If the function class is powerful enough, i.e. provided enough support vectors, those features will encode vowels, syllables or words, depending on the embedding parameters. Based on those features, a small labelled database might be sufficient to learn the intended task. The number of support vectors necessary for a practical task, however, is yet unknown and calls for further investigation.
Although few practical video applications resemble our benchmark data, previous studies of SFA on sub-images show its usefulness in principle [3]. Landmark recognition in camera-based simultaneous localization and mapping (Visual SLAM [6]) algorithms is one possible field of application. To evaluate the potential of RSK-SFA in this area, future work must include comparisons to state-of-the-art feature extractions, e.g. the scale invariant feature transform (SIFT [13]).
The presented results show RSK-SFA to be a powerful and reliable SFA al-
gorithm. Together with Online MAH or MP MAH, it combines the advantages
of regularization, fast feature extraction and large training sets with the perfor-
mance of kernel methods.
References
[1] Assmann, P.F., Nearey, T.M., Bharadwaj, S.: Analysis and classification of a vowel
database. Canadian Acoustics 36(3), 148–149 (2008)
[2] Becker, S., Hinton, G.E.: A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 355(6356), 161–163 (1992)
[3] Berkes, P., Wiskott, L.: Slow feature analysis yields a rich repertoire of complex
cell properties. Journal of Vision 5, 579–602 (2005)
[4] Bray, A., Martinez, D.: Kernel-based extraction of Slow features: Complex cells
learn disparity and translation invariance from natural images. In: Neural Infor-
mation Processing Systems, vol. 15, pp. 253–260 (2002)
[5] Csató, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computation 14(3), 641–668 (2002)
[6] Davison, A.J.: Real-time simultaneous localisation and mapping with a single
camera. In: IEEE International Conference on Computer Vision, pp. 1403–1410
(2003)
[7] Einhäuser, W., Hipp, J., Eggert, J., Körner, E., König, P.: Learning viewpoint in-
variant object representations using temporal coherence principle. Biological Cy-
bernetics 93(1), 79–90 (2005)
[8] Földiák, P.: Learning invariance from transformation sequences. Neural Compu-
tation 3(2), 194–200 (1991)
[9] Franzius, M., Sprekeler, H., Wiskott, L.: Slowness and sparseness leads to place,
head-direction, and spatial-view cells. PLoS Computational Biology 3(8), e166
(2007)
[10] Fukumizu, K., Bach, F.R., Gretton, A.: Statistical consistency of kernel canonical
correlation analysis. Journal of Machine Learning Research 8, 361–383 (2007)
[11] Huke, J.P.: Embedding nonlinear dynamical systems: A guide to Takens' theorem. Technical report, University of Manchester (2006)
[12] Hussain, Z., Shawe-Taylor, J.: Theory of matching pursuit. In: Advances in Neural
Information Processing Systems, vol. 21, pp. 721–728 (2008)
[13] Lowe, D.G.: Object recognition from local scale-invariant features. In: Interna-
tional Conference on Computer Vision, pp. 1150–1157 (1999)
[14] Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE
Transactions On Signal Processing 41, 3397–3415 (1993)
[15] Meyn, S.P., Tweedie, R.L.: Markov chains and stochastic stability. Springer, Lon-
don (1993)
[16] Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
[17] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press, Cambridge (2004)
[18] Smola, A.J., Schölkopf, B.: Sparse greedy matrix approximation for machine learn-
ing. In: Proceedings to the 17th International Conference Machine Learning, pp.
911–918 (2000)
[19] Stone, J.V.: Blind source separation using temporal predictability. Neural Com-
putation 13(7), 1559–1574 (2001)
[20] Takens, F.: Detecting strange attractors in turbulence. Dynamical Systems and
Turbulence, 366–381 (1981)
[21] Wahba, G.: Spline Models for Observational Data. Society for Industrial and Ap-
plied Mathematics, Philadelphia (1990)
[22] Wiskott, L.: Slow feature analysis: A theoretical analysis of optimal free responses.
Neural Computation 15(9), 2147–2177 (2003)
[23] Wiskott, L., Sejnowski, T.: Slow feature analysis: Unsupervised learning of invari-
ances. Neural Computation 14(4), 715–770 (2002)
[24] Wyss, R., König, P., Verschure, P.F.M.J.: A model of the ventral visual system
based on temporal stability and local memory. PLoS Biology 4(5), e120 (2006)
A Selecting-the-Best Method for Budgeted
Model Selection
1 Introduction
Model assessment and selection [6] is one of the oldest topics in statistical modeling, and plenty of parametric and nonparametric approaches have been proposed in the literature. The standard way of performing model assessment in machine learning, especially when dealing with datasets of limited size, relies on a repetition of multiple training and test runs, as in cross-validation. What has emerged in recent years is the need to use model selection in contexts where the number of alternatives is huge and sometimes much larger than the number of samples. This is typically the case in machine learning, where the model designer is confronted with a huge number of model families and paradigms, and in feature selection tasks (e.g. in bioinformatics or text mining), where the problem of selecting the best subset among a large set of inputs can be cast as a model selection problem. In such contexts, given the huge number of alternatives, it is not feasible to carry out an extensive (e.g. leave-one-out) validation procedure for each alternative. It therefore becomes more and more important to design strategies for assessing and comparing a large number of alternatives with a limited number of validation runs, thus making model selection affordable within a budgeted computational time.
Olivier Caelen is now with the Fraud Detection Team, Atos Worldline Belgium.
In a recent paper by Madani et al. [13], this problem has been denoted "budgeted active model selection", where the learner can use a fixed budget of assessments (e.g. leave-one-out errors) to identify which of a given set of model alternatives has the highest expected accuracy. Note that budgeted model selection is an instance of the more general problem of budgeted learning, in which there are fixed limits on the resources (e.g. number of examples or amount of computation) used to build the prediction model. Research on budgeted learning is playing an increasing role in machine learning, as witnessed by a recent ICML workshop¹.
The problem of selecting between alternatives in a stochastic setting and with limited resources has been discussed within two research topics: multi-armed bandits and selecting-the-best. The multi-armed bandit problem, first introduced by [15], is a classical instance of an exploration/exploitation problem in which a casino player has to decide which arm of a slot machine to pull to maximize the total reward in a series of rounds. Each of the arms of the slot machine returns a reward, which is randomly distributed and unknown to the player. Several algorithms deal with this problem, and their application to model selection has already been proven effective [16]. However, the paradigm of the bandit problem is not the only, and probably not the most adequate, manner of interpreting a model selection problem in a stochastic environment. As stressed in [13],[2], the main difference between a budgeted model selection problem and a bandit problem is that in model selection there is a pure exploration phase (characterised by spending the budget to assess the different alternatives) followed by a pure exploitation step (i.e. the selection of the best model). This is not the case in the typical bandit problem, where each exploration action is immediately followed by exploitation (and a consequent reward). This difference led us to focus on another computational framework extensively dealt with by the community of stochastic simulation: the "selecting-the-best" problem [12,11]. Researchers in stochastic simulation proposed a set of strategies (hereafter denoted as selecting-the-best algorithms) to sample alternative options in order to maximize the probability of correct selection in a finite number of steps [9]. A detailed review of the existing approaches to sequential selection, as well as a comparison of selecting-the-best and bandit strategies, is given in [2].
The paper proposes a selecting-the-best strategy to explore the alternatives in order to maximise the gain of a greedy selection, i.e. the selection returning the alternative with the best sampled mean. The idea is to estimate from data the probability that a greedy selection returns the best alternative and consequently to estimate the expected reward once a greedy action is taken. The probability of success of a greedy action is well known in the simulation literature as the probability of correct selection (PCS) [11,3]. Here we use this notion to study how the expected reward of a greedy model selection depends on the number of model assessments. In particular, we derive an optimal sampling rule for maximising the greedy reward in the case of K = 2 normally distributed configurations. We then extend this rule to problems with K > 2 configurations
¹ http://www.dmargineantu.net
the probability that the greedy selection returns the $k$th alternative at the $l$th step. It follows that $p_{k^*}^l$ is then the probability of making the correct selection at the $l$th step. Since $\sum_{k=1}^K p_k^l = 1$, the expected gain of a greedy selection is
$$\mu_g^l = \sum_{k=1}^K p_k^l\, \mu_k \qquad (1)$$
² Boldface denotes random variables.
and the expected regret (i.e. the loss due to a greedy selection) is
$$r = \mu^* - \sum_{k=1}^K p_k^l\, \mu_k. \qquad (2)$$
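In code, Equations (1) and (2) amount to a dot product between the greedy-selection probabilities and the true means; a minimal sketch (names are ours):

```python
import numpy as np

def greedy_gain_and_regret(p, mu):
    """p[k] = probability that the greedy selection returns alternative k,
    mu[k] = true mean of alternative k."""
    gain = float(np.dot(p, mu))          # expected gain, Eq. (1)
    regret = float(np.max(mu) - gain)    # expected regret, Eq. (2)
    return gain, regret
```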
with mean
$$\Gamma = \big(\mu_k - \mu_1,\; \ldots,\; \mu_k - \mu_{k-1},\; \mu_k - \mu_{k+1},\; \ldots,\; \mu_k - \mu_K\big)^\top$$
and covariance matrix $\Sigma$ whose diagonal entry for the row associated with alternative $j \neq k$ is $\frac{\sigma_k^2}{n_k} + \frac{\sigma_j^2}{n_j}$, and whose off-diagonal entries all equal $\frac{\sigma_k^2}{n_k}$.
The first property is that when only two alternatives are in competition, testing either of them invariably increases the expected gain. This is formalized in the following theorem.
Theorem 1. Let $z_1 \sim \mathcal{N}(\mu_1, \sigma_1)$ and $z_2 \sim \mathcal{N}(\mu_2, \sigma_2)$ be two independent Gaussian distributed alternatives and $\mu_g^l$ the expected greedy gain at step $l$. Let $\mu_g^{l+1}(k)$ denote the value of the expected greedy gain at step $l+1$ if the $k$th alternative is tested. Then
$$\forall k \in \{1, 2\}: \quad \mu_g^{l+1}(k) \geq \mu_g^l$$
Proof. Without loss of generality let us assume that $\mu_2 > \mu_1$, i.e. $k^* = 2$. Let us first remark that the statement $\forall k \in \{1,2\}: \mu_g^{l+1}(k) \geq \mu_g^l$ is equivalent to the statement
$$\forall k \in \{1, 2\}: \quad p_2^{l+1}(k) \geq p_2^l,$$
where $p_2^{l+1}(k)$ is the probability of selecting the best alternative (i.e. the second one) at the $(l+1)$th step when the $k$th alternative is sampled. Since by definition
$$p_{k^*}^l = p_2^l = \mathrm{Prob}\big[\hat{\mu}_2^l \geq \hat{\mu}_1^l\big] = \mathrm{Prob}\big[\hat{\mu}_2^l - \hat{\mu}_1^l \geq 0\big] \qquad (4)$$
where $\Phi$ is the normal cumulative distribution function and $\mathrm{erf}()$ is the Gauss error function, whose derivative is known to be always positive. Since $\mu_2 > \mu_1$ by definition, sampling either alternative increases either $n_2(l)$ or $n_1(l)$ and consequently decreases the argument of the erf function (given that the numerator is negative). Since erf is monotonically increasing, this decreases the value of the erf function and consequently increases the probability of making a correct selection.
The second interesting property is that, though sampling either of the two alternatives increases the expected gain, the increase is not identical. An optimal sampling policy (implemented by Algorithm 1) can be derived from the following theorem.
Fig. 1. Distribution of $\hat{\mu}_2^l - \hat{\mu}_1^l$. The grey area represents the probability of correct selection at the $l$th step if $\mu_2 > \mu_1$.
where
$$N_\Delta = n_1 (n_1 + 1)(\sigma_2^2 - \sigma_1^2) + \sigma_1^2 (n_2 + n_1 + 1)(n_1 - n_2) \qquad (6)$$
maximises the value of $\mu_g$ at step $l+1$.
This intermediary result is intuitive, since it proves that the sign of $\mu_g^{l+1}(1) - \mu_g^{l+1}(2)$ is the same as the sign of $p_2^{l+1}(1) - p_2^{l+1}(2)$, or in other terms, that in order to increase the expected gain we have to increase the probability of correct selection (i.e. the probability of selecting 2). Let $V^{l+1}(1)$ and $V^{l+1}(2)$ denote the variances of $\hat{\mu}_2^{l+1} - \hat{\mu}_1^{l+1}$ at the $(l+1)$th step if we sample at step $l$ the
Algorithm 1. SRule(μ₁, σ₁, n₁, μ₂, σ₂, n₂)
1: Input: μ₁, σ₁, n₁: parameters and sample count of the first alternative
         μ₂, σ₂, n₂: parameters and sample count of the second alternative
2: Compute N_Δ by Equation (6)
3: if N_Δ ≤ 0 then
4:   return 1
5: else
6:   return 2
7: end if
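A direct transcription of Algorithm 1 into Python, using $N_\Delta$ from Equation (6); note that $\mu_1$ and $\mu_2$ do not enter the decision — only the variances and sample counts do, so they are kept solely to mirror the algorithm's interface.

```python
def srule(mu1, sigma1, n1, mu2, sigma2, n2):
    """SRule: pick the alternative whose sampling most reduces the variance
    of mu2_hat - mu1_hat, via the sign of N_Delta (Eq. 6)."""
    n_delta = (n1 * (n1 + 1) * (sigma2**2 - sigma1**2)
               + sigma1**2 * (n2 + n1 + 1) * (n1 - n2))
    return 1 if n_delta <= 0 else 2
```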
The difference in variances is
$$\Delta V = V^{l+1}(1) - V^{l+1}(2) = \frac{\sigma_1^2}{n_1 + 1} + \frac{\sigma_2^2}{n_2} - \frac{\sigma_1^2}{n_1} - \frac{\sigma_2^2}{n_2 + 1} = \frac{N_\Delta}{D_\Delta},$$
where the denominator $D_\Delta$ is always positive. It follows that by applying the sampling rule (5) we are guaranteed to reduce the variance of $\hat{\mu}_2^{l+1} - \hat{\mu}_1^{l+1}$, and consequently to increase the probability of correct selection and, finally, the expected gain of a greedy selection.
distribution. Suppose first that we want to compute the mean $\mu_m$ and the variance $\sigma_m^2$ of the random variable $z_m = \max\{z_1, z_2\}$, where $z_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $z_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$. If we define
$$a^2 = \sigma_1^2 + \sigma_2^2 - 2\sigma_1 \sigma_2 \rho_{12}, \qquad z = \frac{\mu_1 - \mu_2}{a},$$
where $\rho_{12}$ is the correlation coefficient between $z_1$ and $z_2$, it is possible to show that
$$\mu_m = \mu_1 \Phi(z) + \mu_2 \Phi(-z) + a\,\phi(z)$$
$$\sigma_m^2 = \big[(\mu_1^2 + \sigma_1^2)\Phi(z) + (\mu_2^2 + \sigma_2^2)\Phi(-z) + (\mu_1 + \mu_2)\, a\, \phi(z)\big] - \mu_m^2,$$
where $\phi$ is the standard normal density function and $\Phi$ the associated cumulative distribution. Assume now that we have a third variable $z_3 \sim \mathcal{N}(\mu_3, \sigma_3^2)$ and that we wish to find the distribution of $z_M = \max\{z_1, z_2, z_3\}$. The Clark approach solves this problem by making the approximation
$$z_M = \max\{z_1, z_2, z_3\} \approx \max\{z_m, z_3\}$$
and iteratively applying the procedure sketched above for two variables.
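The two-variable step translates directly into code; a sketch using SciPy's normal distribution. Treating successive maxima as uncorrelated (ρ = 0 in the reduction loop) is our simplification — Clark's method can also track correlations.

```python
import numpy as np
from scipy.stats import norm

def clark_max(mu1, var1, mu2, var2, rho12=0.0):
    """Clark's moment-matching approximation for max{z1, z2} with
    z1 ~ N(mu1, var1), z2 ~ N(mu2, var2), correlation rho12.
    Returns the approximate mean and variance of the maximum."""
    a = np.sqrt(var1 + var2 - 2.0 * np.sqrt(var1 * var2) * rho12)
    z = (mu1 - mu2) / a
    mu_m = mu1 * norm.cdf(z) + mu2 * norm.cdf(-z) + a * norm.pdf(z)
    var_m = ((mu1**2 + var1) * norm.cdf(z)
             + (mu2**2 + var2) * norm.cdf(-z)
             + (mu1 + mu2) * a * norm.pdf(z) - mu_m**2)
    return mu_m, var_m

def clark_max_many(mus, vars_):
    """Reduce max{z1, ..., zK} to repeated pairwise maxima, treating the
    running maximum as Gaussian."""
    mu_m, var_m = mus[0], vars_[0]
    for mu, var in zip(mus[1:], vars_[1:]):
        mu_m, var_m = clark_max(mu_m, var_m, mu, var)
    return mu_m, var_m
```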
Clark's technique is thus a fast and simple manner of approximating a set of $K - 1$ configurations by their own maximum. This allows us to reduce a $K > 2$ problem to a series of 2-configuration tasks. The resulting selecting-the-best algorithm is detailed in Algorithm 2. During the initialisation, each model is assessed at least $I$ times (lines 5-10). The ordering of the alternatives is made in line 15, where the notation $[k]$ is used to denote the rank of the alternative with the $k$th largest sampled mean (e.g. $[1] = \arg\max_k \hat{\mu}_k$ and $[K] = \arg\min_k \hat{\mu}_k$). Once the ordering is done, the loop in line 16 performs the set of comparisons between the $k$th best alternative and the maximum of the configurations ranging from $k + 1$ to $K$. If the $k$th alternative is the one to be sampled (i.e. $s = 1$), the choice is made and execution exits the loop (line 21). Otherwise we move to the $(k+1)$th configuration until $k = K - 1$. Hereafter we will refer to this algorithm as SELBEST.
Note that in the algorithm, as well as in the following experiments, the number $L$ of assessments does not include the $I \times K$ assessments made during the initialization.
4 Experiments
In order to assess the potential of the proposed technique, we carried out both synthetic and real selection problems. For the sake of benchmarking, we compared the SELBEST algorithm with three reference methods:
1. a greedy algorithm which at each step $l$ samples the model with the highest estimated performance
$$k^l = \arg\max_k \hat{\mu}_k$$
2. an interval estimation algorithm which samples the model with the highest upper value of the confidence interval
$$k^l = \arg\max_k \big[\hat{\mu}_k + t_{0.05,\, n_k(l)-1}\, \hat{\sigma}_k\big],$$
where $t_{0.05,n}$ is the upper 95% critical point of the Student distribution with $n$ degrees of freedom, and
3. the UCB bandit algorithm [1], which implements the following sampling strategy (see the sketch after this list):
$$k^l = \arg\max_k \bigg[\hat{\mu}_k + \sqrt{\frac{2 \log l}{n_k(l)}}\,\bigg]$$
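A sketch of the three reference sampling rules, assuming arrays mu_hat, sigma_hat and n holding the sampled means, standard deviations and assessment counts of the K alternatives (the interface and function names are ours):

```python
import numpy as np
from scipy.stats import t as student_t

def greedy_rule(mu_hat, sigma_hat, n, l):
    # sigma_hat, n and l are unused here; kept for a uniform interface
    return int(np.argmax(mu_hat))

def interval_estimation_rule(mu_hat, sigma_hat, n, l):
    # upper 95% critical point of Student's t with n_k(l) - 1 dofs
    return int(np.argmax(mu_hat + student_t.ppf(0.95, n - 1) * sigma_hat))

def ucb_rule(mu_hat, sigma_hat, n, l):
    return int(np.argmax(mu_hat + np.sqrt(2.0 * np.log(l) / n)))
```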
Note that the initialisation step is the same for all techniques and is described by lines 5-10 of Algorithm 2.
The synthetic experimental setting consists of three parts. In the first one we carried out 5000 experiments, where each experiment is characterised by a number $K$ of independent Gaussian alternatives, with $K$ uniformly sampled in the interval [10, 200]. The values of the means $\mu_k$, $k = 1, \ldots, K$, and standard deviations $\sigma_k$, $k = 1, \ldots, K$, are obtained by sampling the uniform distributions $U(0, 1)$ and $U(0.5, 1)$ respectively. Table 1 reports the average regrets of the four assessed sampling strategies for $I = 3$ and a number $L$ of assessments which takes values in the set {20, 40, ..., 200}. The second part is identical to the first one, with the only difference that the standard deviations $\sigma_k$, $k = 1, \ldots, K$, are obtained by sampling the distribution $U(1, 2.5)$. The results are in Table 2.
The third part aims to assess the robustness of the approach in a non-Gaussian configuration similar to the one often encountered in model selection, where typically a mean squared error has to be minimized. For this reason the alternatives, whose number $K$ is uniformly sampled within the interval [10, 200], have a chi-squared distribution with thirty degrees of freedom, and their means are uniformly sampled within [1, 2]. The average regrets over 5000 experiments are presented in Table 3.
The real-data experimental setting consists of a set of feature selection problems applied to 26 UCI regression tasks [8], reported in Table 4. Here we consider a feature selection task as an instance of model selection. In particular, each alternative corresponds to a different feature set, and the assessment is done by using a wrapper approach, i.e. by using the performance of a given learner (in this case a 5-nearest neighbour) as a measure of the performance of the feature set. Since we want to compare the accuracy of different selection techniques, but it is not possible, unlike in the simulated case, to measure the expected generalization performance from a finite sample, we partition each dataset into two parts. The first part is used for assessment, and the second is used to compute what we consider a reliable estimate of the generalization accuracy $\mu_k$, $k = 1, \ldots, K$, of the $K$ alternatives.
The number $L$ of assessments takes values in the set {20, 40, ..., 200}, $I = 20$, and the number of alternatives is uniformly sampled in the range $K \in [60, 100]$.
Table 4. Datasets used for feature selection and their number of features
Table 5. Real data experiment: average relative regrets over 500 repetitions for L total assessments. Bold notation is used for regrets which are statistically significantly worse (paired permutation test, p-value < 0.05) than the SELBEST performance.
Note that each alternative corresponds to a different feature set with a number of inputs equal to five. The goal of the budgeted model selection is to return, among the $K$ alternatives, the one which has the highest generalization accuracy $\mu_k$ on the test set. The computation of the leave-one-out errors (providing the values $z_k^l$) and of the generalization accuracy (providing the values $\mu_k$) is done using a 5-nearest neighbour. Table 5 reports, for each technique and for different values of $L$, the average relative regret (to be minimized),
$$\frac{\mu^* - \sum_{k=1}^K p_k^l\, \mu_k}{\mu^*},$$
that is, the regret (2) normalized with respect to the maximal gain. The average is taken over the 26 datasets and 500 repetitions (each characterized by a resampling of the training errors).
The results indicate that the SELBEST technique is significantly better than the GREEDY and IE techniques for all ranges of $L$. As far as the comparison with UCB is concerned, the results show that while SELBEST and UCB are comparable for low values of $L$, the bandit technique is outperformed when $L$ increases. This is consistent with the fact that the bandit technique was not designed for budgeted model selection tasks; as a consequence, its effectiveness is reduced when the exploration phase is large enough for selecting-the-best approaches to show their advantage.
5 Conclusion
Model selection is more and more confronted with tasks characterised by a huge number of alternatives relative to the amount of learning data and the time allowed for assessment. This is the case in bioinformatics, text mining and large-scale optimization problems (for instance Monte Carlo tree search in games [10]). These new challenges call for techniques able to compare stochastic configurations more rapidly and with a smaller budget of observations than conventional leave-one-out techniques. Bandit techniques are at the moment the most commonly used approaches, in spite of the fact that they were not conceived for problems where assessment and selection are sequential rather than intertwined as in multi-armed problems. This paper explores an alternative family of approaches which relies on notions of stochastic optimisation and low-variate approximation of a large number of alternatives. The promising results open the way to additional validations in real configurations, like feature selection in datasets with a large feature-to-sample ratio, notably in bioinformatics and text mining.
References
1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed
bandit problem. Machine Learning 47(2/3), 235–256 (2002)
2. Caelen, O.: Sélection Séquentielle en Environnement Aléatoire Appliquée à
l’Apprentissage Supervisé. PhD thesis, ULB (2009)
3. Caelen, O., Bontempi, G.: Improving the exploration strategy in bandit algorithms.
In: Proceedings of Learning and Intelligent OptimizatioN LION II, pp. 56–68 (2007)
4. Caelen, O., Bontempi, G.: On the evolution of the expected gain of a greedy action
in the bandit problem. Technical report, Département d’Informatique, Université
Libre de Bruxelles, Brussels, Belgium (2008)
5. Caelen, O., Bontempi, G.: A dynamic programming strategy to balance exploration
and exploitation in the bandit problem. In: Annals of Mathematics and Artificial
Intelligence (2010)
6. Claeskens, G., Hjort, N.L.: Model selection and model averaging. Cambridge Uni-
versity Press, Cambridge (2008)
7. Clark, C.E.: The greatest of a finite set of random variables. Operations Research,
145–162 (1961)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
9. Inoue, K., Chick, S.E., Chen, C.-H.: An empirical evaluation of several methods
to select the best system. ACM Transactions on Modeling and Computer Simula-
tion 9(4), 381–407 (1999)
10. Iolis, B., Bontempi, G.: Comparison of selection strategies in Monte Carlo tree search for computer poker. In: Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands, BeNeLearn 2010 (2010)
11. Kim, S., Nelson, B.: Selecting the Best System. In: Handbooks in Operations Re-
search and Management Science: Simulation. Elsevier, Amsterdam (2006)
12. Law, A.M., Kelton, W.D.: Simulation Modeling and Analysis, 2nd edn. McGraw-Hill International, New York (1991)
13. Madani, O., Lizotte, D., Greiner, R.: Active model selection. In: Proceedings of the Twentieth Annual Conference on Uncertainty in Artificial Intelligence (UAI 2004), pp. 357–365. AUAI Press (2004)
14. Maron, O., Moore, A.W.: The racing algorithm: Model selection for lazy learners.
Artificial Intelligence Review 11(1-5), 193–225 (1997)
15. Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the
American Mathematical Society 58(5), 527–535 (1952)
16. Schneider, J., Moore, A.: Active learning in discrete input spaces. In: Proceedings
of the 34th Interface Symposium (2002)
A Robust Ranking Methodology Based on
Diverse Calibration of AdaBoost
1 Introduction
In the past, the result lists in Information Retrieval were ranked by probabilistic
models, such as the BM25 measure [16], based on a small number of attributes
(the frequency of query terms in the document, in the collection, etc.). The
parameters of these models were usually set empirically. As the number of useful
Fig. 1. The schematic overview of our approach. In the first level a multi-class method
(AdaBoost.MH) is trained using different hyperparameter settings. Then we calibrate
the multi-class models in many ways to obtain diverse scoring functions. In the last
step we simply aggregate the scoring functions using an exponential weighting.
The proper choice of the prior on the set of conditional distributions obtained by the calibration of AdaBoost is an important decision in practice. In this paper, we use an exponential scheme based on the quality of the rankings implied by the conditional distributions (via their corresponding conditional ranking functions), which is theoretically better founded than the uniformly weighted aggregation used by McRank [12].
Figure 1 provides a structural overview of our system. Our approach belongs to the simplest, pointwise category of learning-to-rank models. It is based on a series of standard techniques: (i) multi-class classification, (ii) output score calibration and (iii) an exponentially weighted forecaster to combine the various hypotheses. In contrast to previous studies, we found our approach to be competitive with the standard methods (AdaRank, ListNET, rankSVM, and rankBoost) of the theoretically more complex pairwise and listwise approaches.
We attribute this surprising result to the presence of label noise in the ranking task and to the robustness of our approach to this noise. Label noise is inherent in web page ranking for multiple reasons. First, the relevance labels depend on human decisions, and so they are subjective and noisy. Second, the common feature representations account only for simple keyword matches (such as query term frequency in the document) or query-independent measures (such as PageRank) that are unable to capture query-document relations with more complex semantics, like the use of synonyms, analogy, etc. Query-document pairs that are not characterized well by the features can be considered as noise from the perspective of the learning algorithm. Our method is especially well suited to practical problems with label noise, due to the robustness guaranteed by the meta-ensemble step that combines a wide variety of hypotheses. As our results demonstrate, it compares favorably to theoretically more complex approaches.
The paper is organized as follows. In Section 2 we provide a brief overview of the related work. Section 3 describes the formal setup. Section 4 is devoted
2 Related Work
Among the plethora of ranking algorithms, our approach is closest to the McRank algorithm [12]. Both use a multi-class classification algorithm at the core (McRank uses gradient boosting whereas we apply AdaBoost.MH). The major novelties in our approach are that we use product base classifiers besides the popular decision tree base classifiers, and that we apply several different calibration approaches. Both elements add more diversity to our models, which we exploit by a final meta-ensemble technique. In addition, McRank's implementation is inefficient in the sense that the number of decision trees trained in each boosting iteration is as large as the number of different classes in the dataset.
Even though McRank is not considered a state-of-the-art method itself, its importance is unquestionable. It can be viewed as a milestone which proved the raison d'être of classification-based learning-to-rank methods. It attracted the attention of researchers working on learning-to-rank to classification-based ranking algorithms. The most remarkable method motivated by McRank is LambdaMART [19], which adapts the MART algorithm to the subset ranking problem. In the Yahoo! Learning-to-Rank Challenge this method achieved the best performance in the first track [4].
In the Yahoo! challenge [4], a general conclusion was that listwise and pairwise methods achieved the best scores in general, but tailor-made pointwise approaches also achieved very competitive results. In particular, the approach presented here is based on our previous work [1]. The main contributions of this work are that we evaluate a state-of-the-art multi-class classification based approach on publicly available benchmark datasets, and that we present a novel calibration approach, namely sigmoid-based class probability calibration (CPC), which is theoretically better grounded than regression-based calibration. We also provide an upper bound on the difference between the DCG value of the Bayes optimal score function and the DCG value achieved by its estimate using CPC.
The LambdaMART paper [19] contains a second interesting contribution, namely a linear combination scheme for two rankers with an $O(n^2)$ algorithm, where $n$ is the number of documents. This method is simply based on a line search optimization over the convex combinations of two rankers. The ranking combination is then used for adjusting the weights of weak learners. The combination method has the appealing property that it gives the optimal convex combination of two rankers. However, it is not obvious how to extend it to more than two rankers, so it is not directly applicable to our setting.
where $x_j^k$ are the real-valued feature vectors that encode the set of documents retrieved for query $Q^k$. The upper index will always refer to the query index. When it is not confusing, we will omit the query index and simply write $x_j$ for the $j$th document of a given query.
The relevance grade of $x_j^k$ is denoted by $y_j^k$. The set of possible relevance grades is $Y = \{\gamma_1, \ldots, \gamma_K\}$. The grades are derived from so-called relevance labels, i.e. integer numbers $\ell = 1, \ldots, K$; the relation between relevance grades and relevance labels is $y_j^k = 2^{\ell_j^k} + 1$, where $\ell_j^k$ is the relevance label of response $j$ given query $Q^k$.
The goal of the ranker is to output a permutation $J = [j_1, \ldots, j_m]$ over the integers $(1, \ldots, m)$. The Discounted Cumulative Gain (DCG) is defined as
$$\mathrm{DCG}(J, [y_{j_i}]) = \sum_{i=1}^m c_i\, y_{j_i}, \qquad (1)$$
where $c_i$ is the discount factor of the $i$th document in the permutation. The most commonly used discount factor is $c_i = \frac{1}{\log(1+i)}$. One can also define the normalized DCG (NDCG) score by dividing (1) by the DCG score of the best permutation.
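For concreteness, a sketch of DCG and NDCG with the discount stated above (natural logarithm, following the text; implementations often use log₂ instead). The grades are assumed to be given in ranked order, and the names are ours:

```python
import numpy as np

def dcg(grades_in_rank_order):
    """Eq. (1) with discount c_i = 1 / log(1 + i), i = 1..m."""
    i = np.arange(1, len(grades_in_rank_order) + 1)
    return float(np.sum(grades_in_rank_order / np.log(1.0 + i)))

def ndcg(grades_in_rank_order):
    """DCG normalized by the DCG of the best (sorted) permutation."""
    ideal = np.sort(grades_in_rank_order)[::-1]
    return dcg(grades_in_rank_order) / dcg(ideal)
```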
We will consider $y_j^k$ as a random variable with discrete probability distribution $P[y_j^k = \gamma \mid x_j^k] = p^*_{y_j^k | x_j^k}(\gamma)$ over the relevance grades for document $j$ and query $Q^k$. Since $y_j^k$ is a random variable, we can define the expected DCG for any permutation $J = [j_1, \ldots, j_{m_k}]$ as
$$\overline{\mathrm{DCG}}(J, [y_{j_i}^k]) = \sum_{i=1}^{m_k} c_i\, \mathbb{E}_{p^*_{y_{j_i}^k | x_{j_i}^k}}\big[y_{j_i}^k\big] = \sum_{i=1}^{m_k} c_i\, v^*(x_{j_i}^k).$$
Let the optimal permutation $J^{*k}$ for query $Q^k$ be the one which maximizes the expected DCG value. According to Theorem 1 of [7], $J^{*k}$ has the property that if $c_i > c_{i'}$ then for the Bayes-scoring function it holds that $v^*(x_{j_i^{*k}}) > v^*(x_{j_{i'}^{*k}})$. Our goal is to estimate $p^*_{y_j^k | x_j^k}(\gamma)$ by $p^A_{y_j^k | x_j^k}(\gamma)$, which defines the following scoring function:
$$v^A(x_j^k) = \mathbb{E}_{p^A_{y_j^k | x_j^k}}\big[y_j^k\big] = \sum_{\gamma \in Y} \gamma\, p^A_{y_j^k | x_j^k}(\gamma), \qquad (2)$$
where the label $A$ refers to the method that generates the probability estimates.
Each feature vector $x_j^k \in \mathbb{R}^d$ encodes a (query, document) pair. Each label vector $z_j^k \in \{+1, -1\}^K$ encodes the relevance label using a one-out-of-$K$ scheme, that is, $z_{j,\ell}^k = 1$ if $\ell_j^k = \ell$ and $-1$ otherwise. We used two well-boostable base learners, i.e. decision trees and decision products [11]. Instead of using uniform weighting for training instances, we up-weighted relevant instances exponentially proportionally to their relevance, so, for example, an instance $x_j^k$ with relevance $\ell_j^k = 3$ was twice as important in the global training cost as an instance with relevance $\ell_j^k = 2$, and four times as important as an instance with relevance $\ell_j^k = 1$. Formally, the initial (unnormalized) weight of the $\ell$th label of instance $x_j^k$ is
$$w_{j,\ell}^k = \begin{cases} 2^{\ell_j^k} & \text{if } \ell_j^k = \ell, \\ 2^{\ell_j^k} / (K-1) & \text{otherwise.} \end{cases}$$
The weights are then normalized to sum to 1. This weighting scheme was motivated by the evaluation metric: the weight of an instance in the NDCG score is exponentially proportional to the relevance label of the instance itself.
AdaBoost.MH outputs a strong classifier $f^{(T)}(x) = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(x)$, where $h^{(t)}(x)$ is a $\{-1, +1\}^K$-valued base classifier ($K$ is the number of relevance label values), and $\alpha^{(t)}$ is its weight. In multi-class classification the elements of $f^{(T)}(x)$ are treated as posterior scores corresponding to the labels, and the predicted label is $\hat{\ell}_j^k = \arg\max_{\ell = 1, \ldots, K} f_\ell^{(T)}(x_j^k)$. When posterior probability estimates are required, in the simplest case the output vector can be shifted into $[0,1]^K$ using
$$\tilde{f}^{(T)}(x) = \frac{1}{2}\bigg(1 + \frac{f^{(T)}(x)}{\sum_{t=1}^T \alpha^{(t)}}\bigg),$$
$$p^{\mathrm{standard}}_{y_j^k | x_j^k}(\gamma_\ell) = \frac{\tilde{f}_\ell^{(T)}(x_j^k)}{\sum_{\ell'=1}^K \tilde{f}_{\ell'}^{(T)}(x_j^k)}. \qquad (3)$$
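A sketch of this 'standard' calibration step, assuming a matrix of strong-classifier outputs; the vectorized NumPy formulation is ours:

```python
import numpy as np

def standard_calibration(F, alphas):
    """Eq. (3): F is the (n x K) matrix of strong-classifier outputs
    f^(T)(x), alphas the base-classifier weights alpha^(t). Shifts scores
    into [0, 1] and normalizes them into a distribution over K labels."""
    F_tilde = 0.5 * (1.0 + F / np.sum(alphas))   # elementwise shift into [0, 1]
    return F_tilde / F_tilde.sum(axis=1, keepdims=True)
```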
The output of this calibration step is a probability distribution $p^{A,f}_{y_i^k | x_i^k}(\cdot)$ on the relevance grades for each document in each query, and a Bayes-scoring function $v^{A,f}(\cdot)$ defined in (2). We will refer to this scheme as class probability-based calibration (CPC). The upper index $A$ refers to the type of the particular TCF, so the ensemble of probability distributions is indexed by the type $A$ and the multi-class classifier $f$ output by AdaBoost.
Given the ensemble of probability distributions $p^{A,f}_{y_i^k | x_i^k}(\cdot)$ and an appropriately chosen prior $\pi(A, f)$, we follow a Bayesian approach and calculate a posterior conditional distribution by
$$p^{\mathrm{posterior}}_{y_i^k | x_i^k}(\cdot) = \sum_{A,f} \pi(A, f)\, p^{A,f}_{y_i^k | x_i^k}(\cdot).$$
$$v^{\mathrm{posterior}}(x_i^k) = \sum_{\ell=1}^K \gamma_\ell\, p^{\mathrm{posterior}}_{y_i^k | x_i^k}(\gamma_\ell) = \sum_{\ell=1}^K \gamma_\ell \sum_{A,f} \pi(A, f)\, p^{A,f}_{y_i^k | x_i^k}(\gamma_\ell),$$
so that
$$v^{\mathrm{posterior}}(x_i^k) = \sum_{A,f} \pi(A, f) \sum_{\ell=1}^K \gamma_\ell\, p^{A,f}_{y_i^k | x_i^k}(\gamma_\ell) = \sum_{A,f} \pi(A, f)\, v^{A,f}(x_i^k). \qquad (5)$$
The proper selection of the prior $\pi(\cdot, \cdot)$ can further increase the quality of the posterior estimation. In Section 5 we will describe a reasonable prior definition borrowed from the theory of experts.
In the simplest case, the TCF can be a sigmoid function $s_\Theta(\cdot)$, whose parameters $\Theta$ are obtained by minimizing
$$L^{LS}(\Theta, f) = -\sum_{k=1}^M \sum_{i=1}^{m_k} \log \frac{s_\Theta\big(f_{\ell_i^k}(x_i^k)\big)}{\sum_{\ell=1}^K s_\Theta\big(f_\ell(x_i^k)\big)}.$$
We refer to this function as the log-sigmoid TCF. The motivation of the log-sigmoid TCF is that the resulting probability distribution minimizes the relative entropy
$$D(p^* \| p) = \sum_{k=1}^M \sum_{i=1}^{m_k} \sum_{\ell=1}^K \Big[ -p^*_{y_i^k | x_i^k}(\gamma_\ell) \log p_{y_i^k | x_i^k}(\gamma_\ell) + p^*_{y_i^k | x_i^k}(\gamma_\ell) \log p^*_{y_i^k | x_i^k}(\gamma_\ell) \Big].$$
The entropy-weighted log-sigmoid (EWLS) TCF additionally weights each term with the entropy of the calibrated distribution:
$$L_C^{EWLS}(\Theta) = -\sum_{k=1}^M \sum_{i=1}^{m_k} \log\Bigg( \frac{s_\Theta\big(f_{\ell_i^k}(x_i^k)\big)}{\sum_{\ell=1}^K s_\Theta\big(f_\ell(x_i^k)\big)} \Bigg) \times H_M\Bigg( \frac{s_\Theta\big(f_1(x_i^k)\big)}{\sum_{\ell=1}^K s_\Theta\big(f_\ell(x_i^k)\big)}, \ldots, \frac{s_\Theta\big(f_K(x_i^k)\big)}{\sum_{\ell=1}^K s_\Theta\big(f_\ell(x_i^k)\big)} \Bigg)^C,$$
where $H_M(p_1, \ldots, p_K) = \sum_{\ell=1}^K p_\ell(-\log p_\ell)$ and $C$ is a hyperparameter. The minimization in the LS and EWLS TCFs can be considered as an attempt to minimize a cost function, the sum of the negative logarithms of the class probabilities.
Usually, there is a cost function for misclassification associated with the learning task. This cost function can be used for defining the expected loss TCF
$$L^{EL}(\Theta) = \sum_{k=1}^M \sum_{i=1}^{m_k} \sum_{\ell=1}^K L(\ell, \ell_i^k)\, \frac{s_\Theta\big(f_\ell(x_i^k)\big)}{\sum_{\ell'=1}^K s_\Theta\big(f_{\ell'}(x_i^k)\big)},$$
where $\ell_i^k$ is the correct label of $x_i^k$, and $L(\ell, \ell_i^k)$ expresses the loss incurred if $\ell$ is predicted instead of the correct label. We used the standard square loss ($L_2$) setup, so $L(\ell, \ell') = (\ell - \ell')^2$.
If the labels have some structure, e.g. they are ordinal as in our case, it is possible to calculate an expected label based on a CPC distribution. In this case we can define the expected label loss TCF
$$L^{ELL}(\Theta) = \sum_{k=1}^M \sum_{i=1}^{m_k} L\Bigg( \sum_{\ell=1}^K \ell\, \frac{s_\Theta\big(f_\ell(x_i^k)\big)}{\sum_{\ell'=1}^K s_\Theta\big(f_{\ell'}(x_i^k)\big)},\; \ell_i^k \Bigg).$$
Here, the goal is to minimize the loss incurred between the expected label and the correct label $\ell_i^k$. We used $L_2$ as the loss function here as well. Note that the definition of $L(\cdot, \cdot)$ might need to be updated if a weighted average of labels is to be calculated, as the weighted average might not be a label at all.
Finally, we can apply the idea of SmoothGrad [6] to obtain a TCF. In SmoothGrad a smooth surrogate function is used to optimize the NDCG metric. In particular, the soft indicator variable can be written as
$$h_{\Theta,\sigma}(x_i^k, x_{i'}^k) = \frac{\exp\Big(-\big(v_{s_\Theta}(x_i^k) - v_{s_\Theta}(x_{i'}^k)\big)^2 / \sigma\Big)}{\sum_{j=1}^{m_k} \exp\Big(-\big(v_{s_\Theta}(x_j^k) - v_{s_\Theta}(x_{i'}^k)\big)^2 / \sigma\Big)}.$$
5 Ensemble of Ensembles
The output of calibration is a set of relevance predictions $v^{A,f}(x, S)$, one for each TCF type $A$ and AdaBoost output $f$. Each relevance prediction can be used as a scoring function to rank the query-document pairs represented by the vectors $x_i^k$. Up to this point the method is a pure pointwise approach, except that the smoothed version of NDCG was optimized in SN. To fine-tune the algorithm and to make use of
where $\omega^{A,f}$ is the NDCG$_{10}$ score of the ranking obtained by using $v^{A,f}(x)$. The parameter $c$ controls the dependence of the weights on the NDCG$_{10}$ values.
A similar ensemble method can be applied to outputs calibrated by regression methods. A major difference between the two types of calibration is that the regression-based scores have to be normalized/rescaled before the exponentially weighted ensemble scheme is applied. We simply rescaled the output of the regression models into [0, 1] before using them in the exponential ensemble scheme.
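Equation (6) itself is not reproduced in this excerpt; the sketch below assumes the natural exponential form $\pi(A,f) \propto \exp(c\, \omega^{A,f})$ suggested by the description, which should be treated as an assumption rather than the paper's exact formula.

```python
import numpy as np

def exp_weighted_ensemble(scores, ndcg10, c):
    """Meta-ensemble step (cf. Eq. 5): `scores` is an (M x n) array of M
    calibrated scoring functions evaluated on n documents, `ndcg10` the
    validation NDCG_10 of each scorer, c the base parameter. The weight
    form exp(c * ndcg10) is assumed, not quoted from the paper."""
    w = np.exp(c * (ndcg10 - np.max(ndcg10)))    # subtract max for stability
    w /= w.sum()
    return w @ scores                            # combined scoring function
```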
The following proposition gives an upper bound on the difference between the DCG value of the Bayes optimal score function and the DCG value achieved by its estimate, in terms of the quality of the relevance probability estimate.
Proposition: Let $p, q \in [1, \infty]$ and $1/p + 1/q = 1$. Then
$$\mathrm{DCG}(J^*, [y_{j_i^*}]) - \mathrm{DCG}(J^v, [y_{j_i^v}]) \leq \bigg( \sum_{i=1}^m \sum_{\gamma \in Y} \big| (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\, \gamma \big|^p \bigg)^{\frac{1}{p}} \bigg( \sum_{i=1}^m \sum_{\gamma \in Y} \big| p_{y_i | x_i}(\gamma) - p^*_{y_i | x_i}(\gamma) \big|^q \bigg)^{\frac{1}{q}}.$$
Proof: For the expected DCG of the permutation $J^v$ induced by the estimated scoring function $v$ we have
$$\mathrm{DCG}(J^v, [y_{j_i^v}]) = \sum_{i=1}^m c_i\, v^*(x_{j_i^v}) \geq \sum_{i=1}^m c_i\, v(x_{j_i^*}) + \sum_{i=1}^m c_i\big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big)$$
$$= \sum_{i=1}^m c_i\, v^*(x_{j_i^*}) + \sum_{i=1}^m c_i\big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big) + \sum_{i=1}^m c_i\big(v(x_{j_i^*}) - v^*(x_{j_i^*})\big)$$
$$= \mathrm{DCG}(J^*, [y_{j_i^*}]) + \sum_{i=1}^m c_i\big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big) + \sum_{i=1}^m c_i\big(v(x_{j_i^*}) - v^*(x_{j_i^*})\big).$$
Here $\sum_{i=1}^m c_i\, v(x_{j_i^v}) \geq \sum_{i=1}^m c_i\, v(x_{j_i^*})$, because $J^v$ is an optimal permutation with respect to $v$. Rearranging, the DCG difference is bounded by the two correction sums, which can be rewritten as
$$\mathrm{DCG}(J^*, [y_{j_i^*}]) - \mathrm{DCG}(J^v, [y_{j_i^v}]) \leq \sum_{i=1}^m \sum_{\gamma \in Y} (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\, \gamma\, \big(p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma)\big),$$
where $\hat{j}_i^v$ and $\hat{j}_i^*$ are the inverse permutations of $j_i^v$ and $j_i^*$. Then the Hölder inequality implies that
$$\sum_{i=1}^m \sum_{\gamma \in Y} (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\, \gamma\, \big(p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma)\big) \leq \bigg(\sum_{i=1}^m \sum_{\gamma \in Y} \big|(c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\,\gamma\big|^p\bigg)^{\frac{1}{p}} \bigg(\sum_{i=1}^m \sum_{\gamma \in Y} \big|p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma)\big|^q\bigg)^{\frac{1}{q}}.$$
Corollary:
$$\mathrm{DCG}(J^*, [y_{j_i^*}]) - \mathrm{DCG}(J^v, [y_{j_i^v}]) \leq C \cdot \bigg( \sum_{i=1}^m \sum_{\gamma \in Y} \big| p_{y_i | x_i}(\gamma) - p^*_{y_i | x_i}(\gamma) \big|^q \bigg)^{\frac{1}{q}},$$
where
$$C = \max_{\hat{j}^v,\, \hat{j}^*} \bigg( \sum_{i=1}^m \sum_{\gamma \in Y} \big| (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\, \gamma \big|^p \bigg)^{\frac{1}{p}},$$
and $\hat{j}^v$, $\hat{j}^*$ range over the permutations of $1, \ldots, m$.
The Corollary shows that as the distance between the "exact" and the estimated conditional distributions over the relevance labels tends to 0, the difference in the DCG values also tends to 0.
7 Experiments
In our experiments we used the Ohsumed dataset taken from LETOR 3.0 and
both datasets of LETOR 4.0. We are only interested in datasets that contain
more than two levels of relevance. On the one hand, this has a technical reason:
calibration for binary relevance labels does not make much sense. On the
other hand, we believe that in this case the difference between various learning
algorithms is more significant. All LETOR datasets we used contain three levels of
relevance. We summarize their main statistics in Table 1.
For each LETOR dataset there is a 5-fold train/valid/test split given. We used
this split except that we divided the official training set by a random 80%-20%
split into training and calibration sets, which were used to adjust the parameters
of the different calibration methods. We did not apply any feature engineering
or preprocessing to the official feature set. The NDCG values we report in this
section have been calculated using the provided evaluation tools.
We compared our algorithm to five state-of-the-art ranking methods whose
outputs are available at the LETOR website for each dataset we used:
AdaRank-MAP, AdaRank-NDCG, ListNet, RankBoost, and RankSVM.
We only tuned the number of iterations T and the base parameter c of the exponential
weighting scheme (6) on the validation set. In the exponential weighting
combination (6) we set the weights using the NDCG_{10} performances of the calibrated
models, and c and T were selected based on the performance of v^{posterior}(·)
in terms of NDCG_{10}. The hyperparameter optimization was performed using a
simple grid search where c ranged from 0 (corresponding to uniform weighting)
to 200 and T from 10 to 10000. Interestingly, the best number of iterations is
very low compared to the ones reported by [11] for classification tasks. For LETOR
3.0 the best number of iterations is T = 100 and for both LETOR 4.0 datasets
T = 50. The best base parameter is c = 100 for all databases. This value is relatively
high considering that it is used in the exponent, but the performances of the
best models were relatively close to each other. We used the fixed parameters C = 2
in the TCF L^{EWLS}_C and σ = 0.01 in L^{SN}_σ.
Fig. 2. NDCG_k values on the LETOR datasets for AdaRank-MAP, AdaRank-NDCG, ListNet, RankBoost, RankSVM, and the exponentially weighted ensemble, plotted against the truncation position k; the NDCG_{10} region is magnified to show the differences.
Table 2. The NDCG values for various ranking algorithms. In the last three rows the
results of our method are shown using only CPC, only RBC, and both.
Our method slightly but consistently outperforms the baseline methods on MQ2008. The picture is not so clear for
MQ2007, because our approach shows some improvement only for low truncation
levels.
Table 2 shows the average NDCG values for LETOR 4.0 along with NDCG_{10}
for LETOR 3.0 (the evaluation tool for LETOR 3.0 does not output the average NDCG).
We also report the performance of our approach where we put only RBC,
only CPC, and both calibration ensembles into the pool of score functions used
in the aggregation step (the three middle rows of Table 2). Our approach
consistently achieves the best performance among all methods.
We also evaluated the original AdaBoost.MH with decision trees and decision
products, i.e., without our calibration and ensemble setup (last two rows in
Table 2). The posteriors were calculated according to (3), and here we validated
(i) the iteration number and (ii) the hyperparameters of the base learners (the number
of leaves and the number of terms) in order to select a single best setting. Thus, these
runs correspond to a standard classification approach using AdaBoost.MH.
These results show that our ensemble learning scheme, in which we calibrate the
individual classifiers, improves the standard AdaBoost.MH ranking setup significantly.
Fig. 3. The p-values for different calibrations obtained by Fisher's method on fold-wise
p-values of the t-test, LETOR 4.0/MQ2007. The calibration is calculated according to (3).
The results in Figure 3 indicate that for a subset of TCFs, the estimated
probability distributions were quite close to each other. Although the TCFs are
rather different, it seems that they approximate a similar distribution with just
small differences. We believe that one reason for the observed effectiveness of
the proposed method is that these small differences within the cluster are due
to estimation noise, so mixing them decreases the level of the noise.
8 Conclusions
In this paper we presented a simple learning-to-rank approach based on multi-
class classification, model calibration, and the aggregation of scoring functions.
We showed that this approach is competitive with more complex methods such as
RankSVM or ListNet on three benchmark datasets. We suggested the use of a
sigmoid-based class probability calibration, which is theoretically better grounded
than regression-based calibration, and thus we expected it to yield better results.
Interestingly, this expectation was confirmed only for the Ohsumed dataset which
is the most balanced set in terms of containing a relatively high number of
highly relevant documents. This suggests that CPC has an advantage over RBC
when all relevance levels are well represented in the data. Nevertheless, the CPC
method was strongly competitive on the other two datasets as well, and it also
has the advantage of coming with an upper bound on the NDCG measure.
Finally, we found AdaBoost to overfit the NDCG score for a low number of
iterations during the validation process. This indicates the presence of label
noise in the learning-to-rank datasets, according to experiments conducted
by [13] using artificial data. We note here that noise might come either from real
noise in the labeling, or from the deficiency of an overly simplistic feature representation
that is unable to capture nontrivial semantics between a query
and a document. As future work, we plan to investigate the robustness of our
method to label noise using synthetic data, since this is an important issue in
learning-to-rank applications: while noise due to labeling might be reduced simply
by improving the consistency of the data, it is less trivial to obtain a significantly
better feature representation.
References
1. Busa-Fekete, R., Kégl, B., Éltető, T., Szarvas, G.: Ranking by calibrated AdaBoost.
In: JMLR W&CP, vol. 14, pp. 37–48 (2011)
2. Cao, Z., Qin, T., Liu, T., Tsai, M., Li, H.: Learning to rank: from pairwise ap-
proach to listwise approach. In: Proceedings of the 24th International Conference
on Machine Learning, pp. 129–136 (2007)
3. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge Uni-
versity Press, New York (2006)
4. Chapelle, O., Chang, Y.: Yahoo! Learning to Rank Challenge Overview. In: Yahoo
Learning to Rank Challenge (JMLR W&CP), Haifa, Israel, vol. 14, pp. 1–24 (2010)
5. Chapelle, O., Metlzer, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for
graded relevance. In: Proceedings of the 18th ACM Conference on Information and
Knowledge Management, pp. 621–630. ACM, New York (2009)
6. Chapelle, O., Wu, M.: Gradient descent optimization of smoothed information
retrieval metrics. Information Retrieval 13(3), 216–235 (2010)
7. Cossock, D., Zhang, T.: Statistical analysis of Bayes optimal subset ranking. IEEE
Transactions on Information Theory 54(11), 5140–5154 (2008)
8. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for
combining preferences. Journal of Machine Learning Research 4, 933–969 (2003)
9. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences 55,
119–139 (1997)
10. Herbrich, R., Graepel, T., Obermayer, K.: Large margin rank boundaries for or-
dinal regression. In: Smola, A.J., Bartlett, P.L., Schölkopf, B., Schuurmans, D. (eds.)
Advances in Large Margin Classifiers, pp. 115–132. MIT Press, Cambridge (2000)
11. Kégl, B., Busa-Fekete, R.: Boosting products of base classifiers. In: International
Conference on Machine Learning, Montreal, Canada, vol. 26, pp. 497–504 (2009)
12. Li, P., Burges, C., Wu, Q.: McRank: Learning to rank using multiple classification
and gradient boosting. In: Advances in Neural Information Processing Systems,
vol. 19, pp. 897–904. The MIT Press, Cambridge (2007)
13. Mease, D., Wyner, A.: Evidence contrary to the statistical view of boosting. Journal
of Machine Learning Research 9, 131–156 (2007)
14. Niculescu-Mizil, A., Caruana, R.: Obtaining calibrated probabilities from boosting.
In: Proceedings of the 21st International Conference on Uncertainty in Artificial
Intelligence, pp. 413–420 (2005)
15. Rissanen, J.: A universal prior for integers and estimation by minimum description
length. Annals of Statistics 11, 416–431 (1983)
16. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and
beyond. Found. Trends Inf. Retr. 3, 333–389 (2009)
17. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated
predictions. Machine Learning 37(3), 297–336 (1999)
18. Valizadegan, H., Jin, R., Zhang, R., Mao, J.: Learning to rank by optimizing NDCG
measure. In: Advances in Neural Information Processing Systems, vol. 22, pp. 1883–
1891 (2009)
19. Wu, Q., Burges, C.J.C., Svore, K.M., Gao, J.: Adapting boosting for information
retrieval measures. Inf. Retr. 13(3), 254–270 (2010)
20. Xu, J., Li, H.: AdaRank: a boosting algorithm for information retrieval. In: SIGIR
2007: Proceedings of the 30th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 391–398. ACM, New York
(2007)
Active Learning of Model Parameters for Influence
Maximization
Tianyu Cao¹, Xindong Wu¹, Tony Xiaohua Hu², and Song Wang¹
¹ Department of Computer Science, University of Vermont, USA
² College of Information Science and Technology, Drexel University, USA
1 Introduction
Social networks have become a hot research topic recently. Popular social networks
such as Facebook and Twitter are widely used. An important application based on so-
cial networks is the so-called “viral marketing”, the core part of which is the influence
maximization problem [5, 11, 9].
A social network is modeled as a graph G = (V, E), where V is the set of users
(nodes) in the network, and E is the set of edges between nodes, representing the con-
nectivity and relationship of users in that network. Under this model, the influence max-
imization problem in a social network is defined as extracting a set of k nodes to target
for initial activation such that these k nodes yield the largest expected spread of in-
fluence, or interchangeably, the largest diffusion size (i.e., the largest number of nodes
activated), where k is a pre-specified positive integer. Two information diffusion
models, i.e., the independent cascade model (IC model) and the linear threshold model (LT
model), are usually used as the underlying information diffusion models. The influence
maximization problem has been investigated extensively recently [3, 2, 4, 1, 14].
To the best of our knowledge, all previous algorithms on influence maximization
assume that the model parameters (i.e., diffusion probabilities in the IC model and
thresholds in the LT model) are given. However, this is rarely true in real world so-
cial networks. In this paper we relax this constraint for the LT model, assuming that
the model parameters are unknown beforehand. Instead, we propose a framework of
active learning to obtain those parameters. In this work, we focus on learning the model
parameters under the LT model since it is relatively simple, and we will investigate the
same problem under the IC model in the future.
Learning information diffusion models has been studied in [7, 13, 12]. However,
there is a problem with the methods from [7, 13, 12], as they assume that a
certain amount of information diffusion data (propagation logs in [7]) on the network
is available. This data is usually held by the social network site and not immediately
available to outsiders. In some cases, it is not available at all due to privacy considerations.
Consider a scenario in which we wish to use "viral marketing" techniques to
market some products on Facebook: most likely we cannot get any information
diffusion data from Facebook due to privacy reasons.
Therefore, we need to actively construct the information diffusion data in order
to learn the diffusion model parameters. This naturally falls into the
framework of active learning. There are two advantages to using the active learning
approach. First, we are no longer restricted by the social network sites' privacy terms.
Second, we have the additional advantage that we can explicitly control what social
influence to measure. For example, we can restrict the influence scope to either "music"
or "computer devices". Based on the scope, we can learn the diffusion model at
a finer granularity, and therefore perform influence maximization on "music" products
and "computer device" products separately. Intuitively, the influential nodes for
"music" products and for "computer devices" should be different.
A simple way to construct the information diffusion data is to send free products
to some users and see how their social neighbors react. All the social neighbors'
reactions constitute the information diffusion data. This is our basic idea for acquiring
the diffusion data. Rephrased in the context of information diffusion: we set
some nodes in a network to be active and then observe which nodes become active at
the following time steps. The observed activation sequences can be used as the information
diffusion data, and we can then infer the model parameters based on
these observed activation sequences.
In this context, we would naturally want to achieve the following two goals: (1) we
would like to send as few free products as possible to learn the information diffusion
model as accurately as possible; (2) we would like to make sure the learned diffusion
model is useful for influence maximization. Ultimately, we would like to make sure
that the set of influential nodes found by using a greedy algorithm on a learned model is
more influential than that found by the greedy algorithm on a randomly guessed model.
Motivated by these two objectives, in this paper we firstly empirically show that the
influence maximization problem is sensitive to model parameters under the LT model.
We define the problem of active learning of the LT model for influence maximization.
In the sequel, we propose a weighted sampling algorithm to solve this active learn-
ing problem. Extensive experiments are conducted to evaluate our proposed algorithm
on five networks. Results show that the proposed algorithm outperforms pure random
sampling under the linear threshold model.
The rest of the paper is organized as follows. Section 2 reviews the preliminaries,
including the LT information diffusion model and the study of parameter sensitivity un-
der the LT model. We also define the problem of finding model parameters as an active
learning problem in this section. Section 3 details our weighted sampling algorithm to
learn the model parameters. Experimental results are shown in section 4. Finally, we
review related work in section 5 and conclude in section 6.
2 Preliminaries

In this section, we first introduce the LT information diffusion model. We then detail
our study on the sensitivity of model parameters for the LT model, which motivates
the necessity to learn the model parameters when they are unknown. We then formally
define the problem of finding the model parameters for influence maximization as an
active learning problem.
In [9], because the threshold of each node is unknown, the influence maximization
problem is defined as finding the set of nodes that can activate the largest expected
number of nodes under all possible threshold distributions. However, in our research,
we assume that each node n_i in the network has a fixed threshold θ_i, which we intend
to learn.

The linear threshold model [8, 9] assumes that a node u will be activated if the
fraction of its activated neighbors is larger than a certain threshold θ_u. In the more
general case, each neighbor v may have a different weight w(u, v) on node u's decision.
In this case, a node u becomes active if the sum of the weights of its activated neighbors
is greater than θ_u. The requirement for a node u to become active can be described by
the following condition:

\sum_{v \text{ active}} w(u, v) \ge \theta_u.
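A minimal sketch of diffusion under this activation rule (our own illustration under the stated model; the dictionaries neighbors, weights and theta are hypothetical inputs):

def lt_diffusion(neighbors, weights, theta, seeds):
    # a node activates once the total weight of its active neighbors
    # reaches its threshold theta[u]
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for u in theta:
            if u in active:
                continue
            s = sum(weights[u][v] for v in neighbors[u] if v in active)
            if s >= theta[u]:
                active.add(u)
                changed = True
    return active

# toy usage: a path a-b-c with normalized weights
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
weights = {"a": {"b": 1.0}, "b": {"a": 0.5, "c": 0.5}, "c": {"b": 1.0}}
theta = {"a": 0.3, "b": 0.4, "c": 0.8}
print(lt_diffusion(neighbors, weights, theta, seeds={"a"}))  # {'a', 'b', 'c'}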
In this section we will empirically check whether the influence maximization problem
is sensitive to the model parameters under the LT model.
To check the sensitivity of model parameters, we assume that there is a true model
with the true parameters. We also have an estimated model. We use a greedy algorithm
on the estimated model and get a set of influential nodes. Denote this set as Sestimate .
We perform the greedy algorithm on the true model, get another set of influential nodes,
Fig. 1. Influence spread against the number of initial seeds for the true model and the guessed model

and denote this set as Strue. We check the influence spread of Sestimate and Strue on
the true model. The sensitivity is then assessed as follows: if the difference
between the influence spreads of Strue and Sestimate is smaller than a given small
number, we can infer that the influence maximization problem is not very sensitive
to the model parameters; otherwise it is sensitive to the model parameters.
To test the sensitivity of the LT model, we assume that the thresholds of the true
model are drawn from a truncated normal distribution with mean 0.5 and standard
deviation 0.25, while all the thresholds in the estimated model are 0.5. Figure 1
shows that the influence spread of Sestimate is significantly lower than that of
Strue. We have conducted similar experiments on other collaboration and citation
networks, and a similar pattern can be found in those networks. These observations
motivate us to find the parameters of the LT model for influence maximization.
Since the influence spread under the LT model is quite sensitive to model parameters,
we now present a formal definition of active model parameter learning for the LT model
for influence maximization. Notations used in our problem definition are presented in
Table 1.
We assume that there is a true fixed threshold θi of each node ni in the social network
G(V, E). Our goal is to learn θ as accurately as possible. In order to make the problem
definition easier, we will actively construct the information diffusion data D over mul-
tiple iterations. In each iteration, we can use at most κ nodes to activate other nodes
in the network. After we acquire the activation sequences in each iteration, the model
parameters can be inferred. More specifically, we can infer the lower bound θlowbd and
the upper bound θupbd of the thresholds of some nodes according to the activation se-
quences D. The details of the inference will be introduced in Section 3. With more and
more iterations, we can bound the thresholds θ more and more tightly, or even hit the
exact threshold values. The activation sequences of different iterations are assumed to
be independent. That means at the beginning of each iteration, none of the nodes are
activated (influenced). The above process can be summarized into the following three
functions.
Table 1. Notations

Symbol: Meaning
G(V, E): the social network
κ: the budget for learning in each iteration
θ: the true thresholds of all nodes
θ̂: the estimated thresholds of all nodes
D: the activation sequences
M(θ): the true model
M(θ̂): the estimated model
Strue: the set of influential nodes found by using the true model M(θ)
Sestimate: the set of influential nodes found by using the estimated model M(θ̂)
f(Strue, M(θ)): the influence spread of the set Strue on M(θ)
f(Sestimate, M(θ)): the influence spread of the set Sestimate on M(θ)
f(Sestimate, M(θ̂)): the influence spread of the set Sestimate on M(θ̂)
f_2 : (G, M(θ), S) → D    (2)

f_3 : (G, D, θ̂, θ_lowbd, θ_upbd) → {θ̂′, θ′_lowbd, θ′_upbd}    (3)
Function (1) is the process of finding which set of nodes to target in each iteration.
Function (2) is the process of acquiring the activation sequences D. Function (3) is the
process of threshold inference based on the activation sequences and the old threshold
estimate. In each iteration these three functions are performed in sequence.
In this setting, there are two questions to ask: (1) How do we select the set S in each
iteration so that the parameters learned are the most accurate? (2) When are the learned
model parameters θ̂ good enough to be useful for the purpose of influence maximization?
More specifically, when will the influential nodes found on the estimated
model provide a significantly higher influence spread than those found on a randomly
guessed model? Our solution is guided by these two questions (or interchangeably, objectives).
However, it is difficult to combine these two questions into one objective function.
Since our final goal is to conduct influence maximization, we rephrase these two
objectives in the context of influence maximization as follows.
The first goal is that the influence spread of a set of nodes on the estimated model is
close to the influence spread of the same set of nodes on the true model. This goal is in
essence a prediction error. If they are close, it implies that the two models are close. The
second goal is that the influential nodes found by using the estimated model will give
an influence spread very close to that found by using the true model. The second goal
measures the quality of the estimated model in the context of influence maximization.
We combine these two goals as follows.
|f(Strue, M(θ)) − f(Sestimate, M(θ))| measures whether the set of influential nodes
determined by using the estimated model gives an influence spread close to the
influence spread of the set of influential nodes determined by using the true model.
|f(Sestimate, M(θ)) − f(Sestimate, M(θ̂))| measures the difference in influence spreads
between the true model and the estimated model; it is an approximation of the
"model distance", which we will define in Section 4.
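Assuming an influence-spread estimator f(S, M) is available (e.g., via Monte Carlo simulation of the LT model), one plausible combination of the two terms is their sum; the sketch below is our own illustration, not necessarily the exact combination used:

def active_learning_objective(f, S_true, S_est, M_true, M_est):
    # term 1: spread gap between the true and estimated seed sets on the true model
    seed_gap = abs(f(S_true, M_true) - f(S_est, M_true))
    # term 2: spread gap of the estimated seed set between the two models
    model_gap = abs(f(S_est, M_true) - f(S_est, M_est))
    return seed_gap + model_gap   # assumed combination; a weighted sum is also possible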
3 The Weighted Sampling Algorithm

In this section we first show the difficulty of the above active learning problem and then
present our algorithmic solution: the weighted sampling algorithm.
The difficulty of the above active learning problem is two-fold. The first difficulty is
that even learning the exact threshold of a single node is quite expensive if the edges of
the network are weighted.

Assume that for each edge in the social network G(V, E) there is an associated weight
w. For simplicity of analysis, we assume ∀w, w ∈ Z⁺. An edge e = {u, v} is
active if either u or v is active. What we can observe from the diffusion process is then
a sequence of node activations (n_i, t_i). In this setting, suppose that at time t the sum of
the weights of the active edges of an inactive node n_i is c_t. At some future time t_k, the node
n_i becomes active, and the sum of the weights of its active edges at time t_k is c_{t_k}. We use w_i
to denote the sum of the weights of all edges that connect to node n_i. We can infer that the
threshold of node n_i lies in [c_t/w_i, c_{t_k}/w_i]. More specifically, if c_{t_k} = c_t + 1, the threshold is
exactly c_{t_k}/w_i; in this case, a binary search can be used to determine
the threshold of a node n_i deterministically, which is detailed as follows.
Assume that the set of edges connected to a node n_i is E_i, with a weight w
associated with each edge in E_i, and let S be the set of weights associated with E_i. Because
w ∈ Z⁺, S is a set of integers. There is a response function F : T → {0, 1} based on
a threshold θ, where T ⊆ S. Here θ = θ_i · Σ(S), where Σ(S) denotes
the sum of the elements of set S:

F(T) = \begin{cases} 1, & \Sigma(T) \ge \theta \\ 0, & \Sigma(T) < \theta \end{cases}    (5)
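A sketch of this binary search under a simplifying assumption (unit edge weights, so that any active-weight sum between 0 and w_i can be presented to the node; respond is a hypothetical oracle reporting whether the node fires):

def find_activation_sum(total_weight, respond):
    # smallest active-edge weight sum s* at which the node fires;
    # the normalized threshold then lies in ((s* - 1)/w_i, s*/w_i]
    lo, hi = 0, total_weight   # respond(0) == False, respond(total_weight) == True assumed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if respond(mid):
            hi = mid
        else:
            lo = mid
    return hi

# toy usage: node with 10 unit-weight edges and hidden normalized threshold 0.37
W, theta = 10, 0.37
s_star = find_activation_sum(W, lambda s: s / W >= theta)
print(s_star, (s_star - 1) / W, s_star / W)   # 4 0.3 0.4

Each call to respond corresponds to one diffusion experiment, so even this deterministic procedure needs on the order of log w_i experiments per node, which illustrates why exact threshold learning is expensive.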
This function maps the target set S, the graph G, the estimated model M(θ̂) and the
true model M(θ) to the expected reduction in threshold uncertainty E(Red) if we set
S as the initial active nodes. Γ(S, G, M(θ̂), M(θ)) measures the gain if we select S as
the initial target nodes. Since we do not know the true model parameters, we cannot
know the activation sequence of a target set S under the true model, and it is
therefore impossible to know the exact value of Γ(S, G, M(θ̂), M(θ)). Moreover,
Γ(S, G, M(θ̂), M(θ)) is not a monotonically non-decreasing function with respect to
the set S, which means that even if we knew its value, a deterministic
greedy algorithm would not be a good solution. However, we still want to choose the set S that
maximizes Γ(S, G, M(θ̂), M(θ)) in each learning iteration. We use weighted sampling
to approximate this goal. In each iteration we sample a set of κ nodes according to one of the
following three probabilities.
p_i ∝ \sum_j I(i, j)    (9)

p_i ∝ \sum_j I(i, j)\, w(i, j)    (10)

p_i ∝ \sum_j I(i, j)\, (θ(j)_{upbd} − θ(j)_{lowbd})    (11)
I(i, j) is an indicator function: it is equal to 1 if there is an edge between i and j and the
threshold of node j is unknown, and 0 otherwise. Essentially, we are trying to sample κ
nodes that connect to the largest number of nodes whose thresholds are most uncertain.
There are different ways to measure the uncertainty of the threshold of a node. Formula
(11) measures the uncertainty by how tight the bounds on the threshold are. In formula
(9) the uncertainty value is 1 if the threshold is unknown and 0 otherwise. Formula
(10) differs from formula (9) in that the weights of the edges w(i, j) are added. We perform
weighted sampling on the nodes without replacement. The hope is that the sampled set
S yields a high value of Γ(S, G, M(θ̂), M(θ)). The pseudo code of our weighted
sampling algorithm is summarized in Algorithm 1.
Steps 5 to 13 show the learning process. We sample κ nodes in each iteration according
to the sampling probability p. We then set these κ nodes as the initial active nodes
and simulate the diffusion process on the true model M(θ). After that, we observe a
series of activation sequences (n_i, t_i) and update the threshold estimates θ̂, θ_lowbd
and θ_upbd accordingly. Finally, we update the sampling probability and the iteration
ends.
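A sketch of one learning iteration (all names are ours: uncertainty(j) stands for one of the uncertainty measures in (9)-(11), simulate plays the role of function (2), and update plays the role of function (3)):

import numpy as np

def sample_seed_set(nodes, neighbors, uncertainty, kappa, rng):
    # p_i proportional to the summed uncertainty of i's neighbors, as in (9)-(11)
    p = np.array([sum(uncertainty(j) for j in neighbors[i]) for i in nodes],
                 dtype=float)
    p = p / p.sum() if p.sum() > 0 else np.full(len(nodes), 1.0 / len(nodes))
    idx = rng.choice(len(nodes), size=kappa, replace=False, p=p)  # no replacement
    return [nodes[i] for i in idx]

def learning_iteration(nodes, neighbors, uncertainty, kappa, simulate, update, rng):
    S = sample_seed_set(nodes, neighbors, uncertainty, kappa, rng)
    D = simulate(S)   # activation sequences (n_i, t_i) observed on the true model
    update(D)         # tighten theta_lowbd / theta_upbd and refresh the probabilities

A reproducible generator such as rng = np.random.default_rng(0) can be passed in for the sampling.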
4 Experimental Evaluation
In this section, we present the network datasets, the experimental setup and experimen-
tal results.
Table 2 lists the datasets that we use. NetHEPT is from [3]. NETSCI, GEOM, LEDERB
and ZEWAIL are from Pajek’s network collections.
Figures 2(a), 2(b), 2(c), 2(d) and 2(e) show that both metut and mctut are better than
random on all the networks in terms of model accuracy. There is little difference
between metut and mctut in all these figures. The difference between metut and
random is more noticeable in Figures 2(a), 2(d) and 2(e). In Figure 2(b), the difference
becomes noticeable when the number of learning iterations approaches 3000. In
Figure 2(c) the difference is largest between iterations 200 and 1000; after that, the
difference between metut and random decreases over the iterations and becomes quite
small as the number of learning iterations approaches 3000. To sum up,
both metut and mctut perform well: they are better than the baseline
method, and the model distance decreases with the iterations.
Figures 3(a), 3(b), 3(c), 3(d) and 3(e) show that metut beats random on all the networks
in terms of the objective function. mctut outperforms random in Figures 3(a),
3(b), 3(c) and 3(d). In Figure 3(e), mctut is worse than random at iteration 500, but
after that mctut is better than random; metut is, in this sense, more stable than mctut. The
absolute difference in the objective function is actually very large: in Figure
3(d) the largest difference between metut and random is about 500.
So far we can conclude that metut is the best of the three in terms of both the objective function
and the model distance.
Finally, we devise a way to measure the quality of the estimated model: if the influence
spread of the solution found on the estimated model is very close to that found
on the true model, we say the estimated model has good quality; otherwise it
has low quality. Figure 4(a) shows the quality of the estimated model
found by metut at learning iterations 0 and 3000. We can see that the initial guessed
model at iteration 0 has extremely low quality: the influence spread of the influential nodes
obtained by the greedy algorithm on the estimated model at iteration 0 is noticeably
less than that obtained by the greedy algorithm on the true model. However,
after 3000 iterations of learning, this gap is narrowed sharply, and the influence spread of
the influential nodes obtained on the estimated model at iteration 3000 is very close to
that obtained on the true model. This shows that the proposed algorithm can indeed narrow
the gap between the influence spread of the influential nodes on the estimated model and on the true model.
Figures 4(b) and 4(c) show the quality of the estimated models found by mctut and
random at learning iterations 0 and 3000.

From Figures 4(a), 4(b) and 4(c) we can also observe that the estimated models found
by metut and mctut have higher quality than the estimated model found by random
at iteration 3000: the influence spread of the solution found by using
the estimated models of metut and mctut is larger than the influence spread of the
solution using the estimated model of random at iteration 3000. This indirectly shows
that metut and mctut learn the model more accurately than random. Similar patterns
can be observed in Figures 4(d), 4(e), 4(f), 4(j), 5(a), 5(b), 5(c), 5(d) and 5(e), although
the differences between metut, mctut and random are not as obvious as in
Figures 4(a), 4(b) and 4(c). Figures 4(g), 4(h) and 4(i) are exceptions: they show that
the learned models of random, metut and mctut have almost identical quality, probably
because NETSCI is a small dataset.
Fig. 2. Comparison of metut, mctut and random in terms of model distance on different datasets (the lower the better); each panel plots model distance against learning iterations (0 to 3000), with panel (e) showing ZEWAIL.
To this end, we have shown two points. First, the learning process can indeed help
narrow the gap between an estimated model and the true model. Second, metut performs
better than random, because the solution found by using metut's
estimated model produces a larger influence spread than the solution found by using
random's estimated model.
Fig. 3. Comparison of metut, mctut and random in terms of the objective function on different datasets (the lower the better); each panel plots the objective function against learning iterations (0 to 3000), with panel (e) showing ZEWAIL.
5 Related Work
We review related works on influence maximization and learning information diffusion
models in this section.
Fig. 4. Comparison of metut, mctut and random on the improvement of influence spread over a guessed model on different networks; each panel plots influence spread against the number of initial seeds for the true model and the estimated models at iterations 0 and 3000.
Fig. 5. Comparison of metut, mctut and random on the improvement of influence spread over a guessed model on different networks; each panel plots influence spread against the number of initial seeds for the true model and the estimated models at iterations 0 and 3000.
First we review works on influence maximization. [9] proved that the influence maximization
problem is NP-hard under both the IC model and the LT model.
[9] defined an influence function f(A), which maps a set of active nodes A to the expected
number of nodes that A activates, and proved that f(A) is submodular under
both models. Based on the submodularity of f(A), they proposed a greedy
algorithm which iteratively selects the node that gives the maximal marginal gain on
f(A). The greedy algorithm gives a good approximation to the optimal solution in
terms of diffusion size. Follow-up works mostly focus on either improving the running
time of the greedy algorithm [10, 3] or providing faster heuristics that give influence
spreads close to the greedy algorithm's [3, 2, 4, 1, 14]. [3] used a Cost-Effective
Lazy Forward method to optimize the greedy algorithm to save some time. [3] proposed
degree discount heuristics that reduce the computational time significantly. [4] and [2]
proposed heuristics based on a most likely propagation path; the heuristics in these two
papers are tunable with respect to influence spread and computational time. [1, 14] used
community structure to partition the network into different communities, find the influential
nodes in each community, and combine them together. In [9], the problem of
calculating the influence function f(A) was left as an open problem. Later, [2] and [4]
proved that it is #P-hard to calculate the influence function f(A) under both the IC and
LT models. [10] performed a bond percolation process on graphs and used strongly connected
component decomposition to save time in the calculation of the influence
function f(A). All these works assume that the model parameters are given.
Next we review research efforts on learning information diffusion models. [7] proposed
both static and time-dependent models for capturing influence from propagation
logs. [13] used the EM algorithm to learn the diffusion probabilities of the IC model. [12]
proposed asynchronous time-delayed IC and LT models, used maximum
likelihood estimation to learn them, and evaluated the models on real-world blog
networks. Interestingly, [6] focused on a different problem, i.e., learning the network
topology on which information diffusion relies; [6] used maximum likelihood estimation
to approximate the most probable network topology.
6 Conclusions
In this paper, we have studied the influence maximization problem under unknown
model parameters, specifically, under the linear threshold model. To this end, we first
showed that the influence maximization problem is sensitive to model parameters un-
der the LT model. Then we defined the problem of finding the model parameters as
an active learning problem for influence maximization. We showed that a deterministic
algorithm is costly for model parameter learning. We then proposed a weighted sam-
pling algorithm to solve the active learning problem. We conducted experiments on five
datasets and compared the weighted sampling algorithm with a naive solution: pure
random sampling. Experimental results showed that weighted sampling achieves
better results than the naive method in terms of both the objective function and the
model accuracy we defined. Finally, we showed that by using the model parameters
learned by the weighted sampling algorithm, we can find influential nodes that give an
influence spread very close to that of the influential nodes found on the true
model, which further justifies the effectiveness of our proposed approach. In the future,
we will investigate how to learn the model parameters under the IC model.
References
[1] Cao, T., Wu, X., Wang, S., Hu, X.: Oasnet: an optimal allocation approach to influence
maximization in modular social networks. In: SAC, pp. 1088–1094 (2010)
[2] Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral market-
ing in large-scale social networks. In: KDD, pp. 1029–1038 (2010)
[3] Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks. In:
KDD, pp. 199–208 (2009)
[4] Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under
the linear threshold model. In: ICDM, pp. 88–97 (2010)
[5] Domingos, P., Richardson, M.: Mining the network value of customers. In: KDD, pp. 57–66
(2001)
[6] Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influ-
ence. In: KDD, pp. 1019–1028 (2010)
[7] Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social net-
works. In: WSDM, pp. 241–250 (2010)
[8] Granovetter, M.: Threshold models of collective behavior. The American Journal of Soci-
ology 83(6), 1420–1443 (1978)
[9] Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence through a
social network. In: KDD, pp. 137–146 (2003)
[10] Kimura, M., Saito, K., Nakano, R.: Extracting influential nodes for information diffusion
on a social network. In: AAAI, pp. 1371–1376 (2007)
[11] Richardson, M., Domingos, P.: Mining knowledge-sharing sites for viral marketing. In:
KDD, pp. 61–70 (2002)
[12] Saito, K., Kimura, M., Ohara, K., Motoda, H.: Selecting information diffusion models over
social networks for behavioral analysis. In: ECML/PKDD (3), pp. 180–195 (2010)
[13] Saito, K., Nakano, R., Kimura, M.: Prediction of information diffusion probabilities for
independent cascade model. In: KES (3), pp. 67–75 (2008)
[14] Wang, Y., Cong, G., Song, G., Xie, K.: Community-based greedy algorithm for mining
top-k influential nodes in mobile social networks. In: KDD, pp. 1039–1048 (2010)
Sampling Table Configurations for the
Hierarchical Poisson-Dirichlet Process
C. Chen, L. Du, and W. Buntine

1 Introduction
In general machine intelligence domains such as image and text modeling, hier-
archical reasoning is fundamental. Bayesian hierarchical modeling of problems
is now widely used with applications including n-gram modeling and smoothing
[1–3], dependency models for grammar [4, 5], data compression [6], clustering in
arbitrary dimensions [7], topic modeling over time [8], and relational modeling
[9]. Bayesian hierarchical n-gram models correspond well to versions of Kneser-Ney
smoothing [1], the state-of-the-art method in applications, and result in competitive
string compression algorithms [6]. These hierarchical Bayesian models
are intriguing from the probability perspective, as well as sometimes being com-
petitive with performance based approaches. Newer methods and applications
are reviewed in [10].
The two-parameter Poisson-Dirichlet process (PDP), also referred to as the
Pitman-Yor process (named so in [11]), is an extension of the Dirichlet process
(DP). Related is a particular interpretation of a marginalized version of the
model known as the Chinese restaurant process (CRP). The CRP gives an ele-
gant analogy of incremental sampling for these models. These provide the basis
of many Bayesian hierarchical modeling techniques. One particular use of the
PDP/DP is in the area of topic models where hierarchical PDPs and hierar-
chical DPs provide elegant machinery for improving the standard simple topic
model [12, 13], for instance, with flexible selection of the number of topics using
HDP-LDA [14], and allowing document structure to be incorporated into the
modeling [15, 16].
This paper proposes a new sampler for the hierarchical PDP based on a new
table representation and is organized as follows: Section 2 gives a brief review
of the hierarchical Poisson-Dirichlet process. Section 3 then reviews methods for
sampling the hierarchical PDP. We present the new table representation for the
HPDP in Section 4. A block Gibbs sampler is developed in Section 5, where we
also apply our block Gibbs sampler to the HDP-LDA model. Experimental results
are reported in Section 6.
Fig. 1. Probability vector hierarchy. This depicts, for instance, that vectors p_1 to p_K should be similar to p_0; so for the j_2-th node branching off node j_1, p_{j_2} ∼ PDP(a_{j_1}, b_{j_1}, p_{j_1}). The root node p_0 could be Dirichlet distributed if it is finite, or could have a PDD distribution if infinite.
Note this does assume a particular ordering of the entries in p. Here our a parameter
is usually called the discount parameter in the literature, and b is called the
concentration parameter. The DP is the special case where a = 0, and it has some
quite distinct properties, such as slower convergence of the sum \sum_{k=1}^{\infty} p_k to one.
General results for the discrete case of the PDP are reviewed in [18].
A suitable definition of a Poisson-Dirichlet process is that it extends the
Poisson-Dirichlet distribution using Formula (1), referred to as PDP(a, b, H(·)).
Thus the PDP is a functional on distributions: it takes as input a base distribu-
tion and yields as output a discrete distribution with a finite or countable set of
possible values on the same domain.
The output distribution of a PDP can subsequently be used as a base dis-
tribution for another PDP, and so-forth, to create a hierarchy of distributions.
This situation is depicted in the graphical model of Figure 1 where the distri-
bution is over vectors p indexed by their positions in the hierarchy. Each vector
represents a discrete probability distribution so is a suitable base distribution
for the next level of PDPs. The hierarchical case for the DP is presented in [14],
and the hierarchical and discrete case of the PDP in [19, 20]. The hierarchical
PDP (HPDP) thus depends on the discrete distribution at the root node, and
on the hyper-parameters a, b used at each node. This hierarchical occurrence of
probability vectors could be a model in itself, as is the case for n-gram models
of strings, or it could occur as part of some larger model, as is the case for the
HDP-LDA model.
Intuitively, the HPDP structure can be well explained using the nested CRP
mechanism, which has been widely used as a component of different topic models
[21]. It goes as follows: a Chinese restaurant has an infinite number of tables,
each of which has infinite seating capacity. Each table serves a dish, and multiple
tables can serve the same dish. In the nested CRP, each restaurant is also linked
to its parent restaurant and its child restaurants in a tree-like structure. A newly
arrived customer can choose to sit at an active table (i.e., a table with at least
one customer) or choose a new table. If a new table is chosen (i.e., activated),
this table is sent as a new customer to the corresponding parent restaurant,
which means a table in any given restaurant reappears as a proxy customer [3]
in its parent restaurant. This procedure is illustrated in Figure 2.

Fig. 2. The nested Chinese restaurant hierarchy across three levels of restaurants (j−1, j, j+1): each table activated in a restaurant reappears as a proxy customer in its parent restaurant.
3 Related Methods
For the hierarchical PDP, the most popular MCMC algorithms are Gibbs
sampling methods based on the Chinese restaurant representation [10, 14], for
instance, the samplers proposed in [14]: the Chinese restaurant franchise sampler,
the augmented Chinese restaurant franchise sampler, and the sampler for
direct assignment. In the CRP representation, each restaurant is represented by
a seating arrangement that contains the total number of customers, the total
number of occupied tables, the customer-table association, the customer-dish
association, and the table-dish association.
With the global probability measure marginalized out, the Chinese restaurant
franchise sampler keeps track of the customer-table association (i.e., it records
the table assignments of all customers), which results in an extra storage space
requirement. Its extension to the HPDP is a Gibbs sampler called "sampling for
seating arrangements" by Teh [19]. Another sampler for the HDP, termed "posterior
sampling with an augmented representation", introduces an auxiliary variable
to construct the global measure for its children DPs so that these DPs can be
decoupled. A further extension of the augmented sampler gives the sampler for
"direct assignment", in which each data point is directly assigned to one component,
instead of sampling the table at which it sits. This is the sampler used in
Teh's implementation of HDP-LDA [22].
Recently, a new collapsed Gibbs sampling algorithm for the HPDP was proposed
in [15, 18]. It sums out all the seating arrangements by introducing a constrained
latent variable, called the table count, that represents the number of tables serving
the same dish, similar to the representation in the direct assignment sampler.
Du et al. [16] have applied it to a first-order Markov chain.
Besides the sampling-based algorithms, there also exist variational inference
algorithms for the HPDP. For example, Wang et al. proposed a variational
algorithm for the nested Chinese restaurant process [23] and recently
proposed an online variational inference algorithm for the HDP [24]. As a compromise,
Teh et al. developed a sampling-variational hybrid algorithm for the HDP in [25].
In this paper, we propose a new table representation for the HPDP by introducing
another auxiliary latent variable, called the table indicator, to track
at which level in the hierarchy each data item (or customer) has contributed a table
count (i.e., the creation of a new table). The aforementioned table count variable
can easily be constructed from the table indicator variables by summation, which
yields the exchangeability of the proposed representation. To apply the new
representation, we develop a block Gibbs sampling algorithm to jointly sample
the dish and the table indicator for each customer.
4 The New Table Representation

Definition 2 (Node index j). In the tree structure, each node (i.e., a restau-
rant) is indexed with an integer starting from 0 in a top-down and left to right
manner. The root of the tree is indicated by j = 0. Based on this definition, we
define a mapping d : Z⁺ → Z⁺ which maps the node j to its level d(j) in the
tree; here Z⁺ denotes the non-negative integers.
¹ We will use these terminologies interchangeably in this paper.
Definition 4 (Table indicator u_l). The table indicator u_l for each data item
l (i.e., a customer) is an auxiliary latent variable which indicates up to which
level in the tree l has contributed a table count (i.e., activated a new table). If
u_l = 0, the data item l takes the responsibility of creating a new table
in each node between the root and the current node.
where n⁰_jk is the number of actual data points in j with z_l = k (k ∈ {1, ..., K}),
t_jk is the number of tables serving k, n_jk is the total number of data points
(including those sent by the child nodes of j) with z_l = k, T_j is the total number
of tables, and N_j is the total number of data points; D(j) is the set of data
points attached to j, C(j) is the set of child nodes of j, and T(j) is its closure,
the set of nodes in the sub-tree rooted at j. Obviously, the multiplicity for each
distinct value, i.e., each dish, can be constructed from the table indicator variables.
\Pr(z_1, z_2, \cdots, z_N, t_1, \cdots, t_K) = \frac{(b|a)_T}{(b)_N} \prod_{k=1}^{K} H(z_k^*)^{t_k}\, S^{n_k}_{t_k, a}    (4)
where S^N_{M,a} is the generalized Stirling number², (x|y)_N denotes the Pochhammer
symbol with increment y, and T = \sum_{i=1}^{K} t_i.
Making use of Lemma 1 of [18], we can derive, based on the new table representation
of the HPDP, the joint probability of the samples z_{1:J} and the table indicator
variables u_{1:J}:

Theorem 1. Given the base distribution H_0 for the root node, the joint posterior
distribution of z_{1:J} and u_{1:J} for the HPDP in a tree structure is³

\Pr(z_{1:J}, u_{1:J} \mid H_0) = \prod_{j \ge 0} \frac{(b_j|a_j)_{T_j}}{(b_j)_{N_j}} \prod_{k} S^{n_{jk}}_{t_{jk}, a_j}\, \frac{t_{jk}!\,(n_{jk} - t_{jk})!}{n_{jk}!}.    (5)
Proof (sketch). Consider just one restaurant (i.e., one node in the tree) at a
time. The set t = {t_1, ..., t_K} is the table configuration, i.e., the number
of tables serving each dish; and the set u = {u_1, ..., u_N} is the table indicator
configuration, i.e., the table indicators attached to the data items. Clearly, only one
table configuration can be reconstructed from a table indicator configuration,
as shown by Eq. (3). But given one table configuration, one can yield
\prod_k \frac{n_k!}{t_k!\,(n_k - t_k)!} possible table indicator configurations. Thereby,
the joint posterior distribution of z and t is computed as

\Pr(z, t) = \prod_k \frac{n_k!}{t_k!\,(n_k - t_k)!}\, \Pr(z, u).    (6)
Proof (sketch). Follows by inspection of the posterior and that the statistics used
are all sums over data items.
² A generalized Stirling number is given by the linear recursion [15, 16, 18, 19] S^{N+1}_{M,a} = S^N_{M-1,a} + (N - Ma)\, S^N_{M,a} for M ≤ N; it is 0 otherwise, and S^N_{0,a} = δ_{N,0}. These numbers rapidly become very large, so computation needs to be done in log space using logarithmic addition.
³ Note that t_{0k} ≤ 1 since this node is the PDD defined in Definition 1.
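A sketch of the log-space computation suggested in the footnote (our own illustration; it fills a small table of log S^N_{M,a} with the recursion, using numpy's logaddexp for the logarithmic addition):

import numpy as np

def log_stirling_table(n_max, a):
    # logS[n, m] = log S^n_{m,a}; S^0_{0,a} = 1 and S^n_{m,a} = 0 for m > n
    logS = np.full((n_max + 1, n_max + 1), -np.inf)
    logS[0, 0] = 0.0
    for n in range(n_max):
        for m in range(1, n + 2):
            val = logS[n, m - 1]                  # S^n_{m-1,a} term
            if m <= n and n - m * a > 0:          # (n - m a) S^n_{m,a} term
                val = np.logaddexp(val, np.log(n - m * a) + logS[n, m])
            logS[n + 1, m] = val
    return logS

# toy check: S^4_{2,0.5} = 3.75, so the entry below is log(3.75)
print(log_stirling_table(4, 0.5)[4, 2])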
5 The Block Gibbs Sampler

In this section, we elaborate on our block Gibbs sampling algorithm for the new
table representation, and show, as an example, how it can be applied to the
HDP-LDA model.
n'_{jk} = \begin{cases} n_{jk}, & \text{if } l \notin D(j) \text{ and } u_l > d(j)+1 \\ n_{jk} - 1, & \text{if } (l \notin D(j) \text{ and } u_l \le d(j)+1, \text{ or } l \in D(j)) \text{ and } z_l = k \end{cases}    (7)
After adding l with table indicator u_l:

T'_j = \begin{cases} T_j, & u_l > d(j) \\ T_j + 1, & u_l \le d(j) \end{cases}, \qquad t'_{jk} = \begin{cases} t_{jk}, & u_l > d(j) \\ t_{jk} + 1, & u_l \le d(j) \text{ and } z_l = k \end{cases}

N'_j = \begin{cases} N_j, & l \notin D(j) \text{ and } u_l > d(j)+1 \\ N_j + 1, & l \notin D(j) \text{ and } u_l \le d(j)+1, \text{ or } l \in D(j) \end{cases}

n'_{jk} = \begin{cases} n_{jk}, & l \notin D(j) \text{ and } u_l > d(j)+1 \\ n_{jk} + 1, & (l \notin D(j) \text{ and } u_l \le d(j)+1, \text{ or } l \in D(j)) \text{ and } z_l = k \end{cases}    (8)
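A sketch of this bookkeeping when adding a data item (our own illustration; NodeStats is a hypothetical record bundling T_j, N_j, t_jk and n_jk for node j, and path is given root-first):

from dataclasses import dataclass, field

@dataclass
class NodeStats:
    T: int = 0                                  # total number of tables
    N: int = 0                                  # total number of data points
    t: dict = field(default_factory=dict)       # t_jk per dish k
    n: dict = field(default_factory=dict)       # n_jk per dish k

def add_data_item(path, d, leaf, z, u, stats):
    # apply the updates in (8) along the path from the root to the node holding l
    for j in path:
        s = stats[j]
        if u <= d(j):                           # l contributes a table count at node j
            s.T += 1
            s.t[z] = s.t.get(z, 0) + 1
        if j == leaf or u <= d(j) + 1:          # a real or proxy customer enters node j
            s.N += 1
            s.n[z] = s.n.get(z, 0) + 1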
With respect to the above analysis, the full joint conditional probability for z_l
and u_l is proportional, at each node j on the path of l, to the ratio of the corresponding
terms of the joint posterior (5) evaluated with the updated (primed) and the current
statistics; in particular, the factorial terms in (5) contribute the factor

\frac{(t'_{jk})^{\delta[t'_{jk} \ne t_{jk}]}\,(n'_{jk} - t'_{jk})^{\delta[(n'_{jk} - t'_{jk}) \ne (n_{jk} - t_{jk})]}}{(n'_{jk})^{\delta[n'_{jk} \ne n_{jk}]}}.    (9)
As discussed in the previous section, when we add or remove a data item from
the tree, the statistics associated with this data item cannot be changed unless
some conditions are satisfied. For removing a data item l from the tree, the
related statistics can be changed if one of the following conditions holds:
When l is added back to the tree, its table indicator cannot always take any
value from 0 to L; it should be constrained to some range. For instance, if l is added
to menu item k with current table count t_k = 0, l will create a new table and
contribute one table count to the current node (i.e., t_k + 1), and u_l will be set to
d(j) such that u_l < L. Thus, the value of u_l should lie in the following interval:
u_l \in [u^{min}_l, u^{max}_l],

where u^{min}_l denotes the minimum value of the table indicator for j, and u^{max}_l
the maximum value. These are given by

u^{max}_l = \begin{cases} \min\{d(j) : j \in path(l),\; t_{jk} = 0\}, & \text{if } \exists j,\; t_{jk} = 0 \\ L, & \text{if } \forall j,\; t_{jk} > 0 \end{cases}

u^{min}_l = \begin{cases} 0, & \text{if } t_{0k} = 0 \\ 1, & \text{otherwise} \end{cases}    (10)
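A sketch of the computation of these bounds (our own illustration; t[j] maps dishes to table counts t_jk, path(l) is given root-first so path[0] is the root, and d(j) returns the level of node j):

def indicator_bounds(path, d, t, k, L):
    # admissible range [u_min, u_max] for the table indicator of a data item
    # assigned dish k, following (10)
    zero_levels = [d(j) for j in path if t[j].get(k, 0) == 0]
    u_max = min(zero_levels) if zero_levels else L
    u_min = 0 if t[path[0]].get(k, 0) == 0 else 1   # path[0] is the root (j = 0)
    return u_min, u_max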
3. If t_jk = 0 and t_0k ≠ 0,
6 Experiments
We compared the proposed algorithm with Teh et al.'s [14] "posterior sampling
by direct assignment" sampler⁴ as well as Buntine and Hutter's collapsed sampler
[18] on five datasets, namely the Health, Person, Obama, NIPS and Enron
datasets. All three algorithms are implemented in C
and were run on a desktop with an Intel(R) Core(TM) Quad CPU (2.4 GHz), although
our code is not multi-threaded.
The Obama dataset came from a collection of 7M blogs (from ICWSM 2009,
posts from Aug-Sep 2008) by issuing the query "obama debate" under Lucene.
We performed fairly standard tokenization, created a vocabulary of terms that
occurred at least five times after excluding stopwords, and then built the bag
of words. The Health dataset is similar except that the source is 1M news articles
(LDC Gigaword) using the query "health medicine insurance", and words needed
to occur at least twice. The Person dataset has as source 805k news articles
(Reuters RCV1), using the query "person" and words that occurred at least
four times. The Enron and NIPS datasets were obtained as preprocessed
bagged data from UCI, where the data has been used in several papers. We list
some statistics of these datasets in Table 1.
⁴ The "posterior sampling by direct assignment" sampler is preferred over the other two
samplers in [14] due to its straightforward bookkeeping, as suggested by Teh et al.
The algorithms tested are: our proposed block Gibbs sampler for HDP-LDA,
Sampling by Table Configurations, denoted STC; Teh et al.'s "Sampling by
Direct Assignment" algorithm [14], denoted SDA; the Collapsed Gibbs
Table Sampler by Buntine et al. [18], denoted CTS; and, finally, a variant of
the proposed STC that initializes the word topics with SDA and samples the tables for
each document using STC, denoted SDA+STC. The reason for using the fourth
algorithm is to isolate the impact of the new sampler.
Turning to the evaluation criteria for topic models, there are many different evaluation methods, such as importance sampling methods, the harmonic mean method, the "left-to-right" algorithm, etc.; see [27, 28] for a complete survey. In this paper, we adopt the "left-to-right" algorithm to calculate the test perplexities because it is unbiased [27]. This algorithm calculates the perplexity over words in a "left-to-right" manner for each document, and is defined as [27, 28]:
$$\Pr(\mathbf{w} \mid \Phi, \alpha\mathbf{m}) = \prod_{n} \Pr(w_n \mid \mathbf{w}_{<n}, \Phi, \alpha\mathbf{m}) \tag{15}$$
where w is all the words in the documents, Φ is the topic distribution matrix, α is
the concentration parameter of the Dirichlet prior over topics, and m is its base
measure. The perplexity is computed in the log space. Since table counts are
also latent variables in the formulation of perplexity, we force all table counts
to be less than or equal to one so that the unbiased method can be applied
directly. This condition has been observed to hold generally when the PDP
hyperparameters are well optimized/fit.
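To illustrate the computation in Equation (15), the sketch below scores one document word by word under a fixed topic matrix Φ and prior αm, accumulating in log space as described above. It is a simplified, single-particle variant of the "left-to-right" algorithm (the full algorithm of [27, 28] averages over several particles and resamples the topics of earlier words); all names are illustrative.

```python
import numpy as np

def left_to_right_perplexity(doc, phi, alpha_m, rng=np.random.default_rng(0)):
    """doc: word ids; phi: K x V topic-word matrix (rows sum to 1);
    alpha_m: length-K vector alpha * m. Returns per-word perplexity."""
    K = phi.shape[0]
    counts = np.zeros(K)                 # topic counts over words seen so far
    log2p = 0.0
    for n, w in enumerate(doc):
        theta = (counts + alpha_m) / (n + alpha_m.sum())
        p_w = float(theta @ phi[:, w])   # Pr(w_n | w_<n, Phi, alpha m)
        log2p += np.log2(p_w)            # accumulate in log space
        post = theta * phi[:, w]
        post /= post.sum()               # posterior over the topic of w_n
        counts[rng.choice(K, p=post)] += 1
    return 2.0 ** (-log2p / len(doc))
```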
Fig. 3. Test log2(perplexities) evolved with training time; I denotes the initial number of topics. Panels: (a) Health data with I = 100; (b) Health data with I = 200; (c) Person data with I = 1000; (d) Person data with I = 2000; (e) Obama data with I = 1000; (f) Obama data with I = 2000; (g) Enron data with I = 1000; (h) NIPS data with I = 1000; (i) NIPS data with I = 2000. Each panel plots testing perplexity against training time (s) for SDA, STC, CTS and SDA+STC.
6.3 Perplexities
We first give the experimental results for the four algorithms on the five datasets in terms of testing perplexity using the "left-to-right" algorithm [27, 28]. We also need to set the initial number of topics for each of these algorithms, though the final values can be sampled by the algorithms themselves. Generally, to accelerate convergence, we initialized large datasets with more topics than small datasets. Specifically, we initialized the Health dataset with 100 and 200 topics, the Enron dataset with 500 and 1000 topics5, and the other three datasets with 1000 and 2000 topics, respectively. Furthermore, we used 2000 major cycles for burn-in6. Table 2 summarizes the test perplexities for these algorithms.
From Table 2, we can see that the proposed fully exchangeable block Gibbs sampler STC obtains significantly better results than the "sampling by direct assignment" sampler in most cases7, while SDA+STC, which uses SDA to burn in and then samples tables with the proposed corrected sampler, obtains consistently better results than SDA. One interesting observation is that the collapsed Gibbs sampler does not perform as well as expected; we attribute this to poor mixing of the Gibbs sampler, since the table counts are updated separately from the topics.
7 Conclusion
In this paper, we have proposed a new table representation that inherits the full
exchangeability from the two-parameter Poisson-Dirichlet processes or Dirichlet
Processes to their hierarchical extensions, and developed a block Gibbs sam-
pling algorithm for doing inference on this representation. Meanwhile, we have
applied the proposed algorithm to the HDP-LDA model, and used the block Gibbs sampler to jointly sample the topic and table indicator configuration for each word.
5 This dataset is too large to be initialized with a large number of topics.
6 For some large datasets, a 2000-cycle burn-in is impractical; we therefore set a maximum burn-in time of 24 hours.
7 In one case, the Obama dataset, STC is not as good as SDA; this may be due to the initialization.
Experimental results showed that, with proper initialization and sampling only tables, SDA+STC yields consistently better results than the "sampling by direct assignment" algorithm SDA in terms of testing perplexity, and the block Gibbs sampler STC with full exchangeability consistently outperforms SDA. Furthermore, we have demonstrated that, though the proposed model is more complicated than the original one, its convergence is consistently faster than that of the "sampling by direct assignment" algorithm. Interestingly, an earlier alternative collapsed Gibbs sampler performed poorly, and we expect this is because STC allows better mixing of the HPDP/HDP parts of the model with the other parts of the model.
We claim that our new representation and the performance improvements should extend to many of the other Gibbs algorithms now in use for different models embedding the HPDP or HDP. Thus our methods can be applied quite broadly within the nonparametric Bayesian community. The advantages of our method over the various CRP-based approaches mentioned in Section 3 and over our earlier collapsed sampler are:
– The introduction of the table indicator variable guarantees full exchangeabil-
ity. This is important to eliminate sequential effects from the Gibbs sampler.
– Tracking the table contribution can reduce the information loss that may result from summing out all the seating arrangements. This makes the Gibbs sampler mix more rapidly.
References
1. Teh, Y.W.: A hierarchical Bayesian language model based on Pitman-Yor pro-
cesses. In: ACL 2006, pp. 985–992 (2006)
2. Goldwater, S., Griffiths, T., Johnson, M.: Interpolating between types and tokens
by estimating power-law generators. In: NIPS 2006, pp. 459–466 (2006)
3. Mochihashi, D., Sumita, E.: The infinite Markov model. In: NIPS 2008, pp. 1017–
1024 (2008)
4. Johnson, M., Griffiths, T., Goldwater, S.: Adaptor grammars: A framework for
specifying compositional nonparametric Bayesian models. In: NIPS 2007, pp. 641–
648 (2007)
5. Wallach, H., Sutton, C., McCallum, A.: Bayesian modeling of dependency trees
using hierarchical Pitman-Yor priors. In: Proceedings of the Workshop on Prior
Knowledge for Text and Language (in Conjunction with ICML/UAI/COLT), pp.
15–20 (2008)
6. Wood, F., Archambeau, C., Gasthaus, J., James, L., Teh, Y.: A stochastic memoizer for sequence data. In: ICML 2009, pp. 1129–1136 (2009)
7. Rasmussen, C.: The infinite Gaussian mixture model. In: NIPS 2000, pp. 554–560
(2000)
8. Pruteanu-Malinici, I., Ren, L., Paisley, J., Wang, E., Carin, L.: Hierarchical
Bayesian modeling of topics in time-stamped documents. TPAMI 32, 996–1011
(2010)
9. Xu, Z., Tresp, V., Yu, K., Kriegel, H.P.: Infinite hidden relational models. In: UAI
2006, pp. 544–551 (2006)
10. Teh, Y.W., Jordan, M.I.: Hierarchical Bayesian nonparametric models with appli-
cations. In: Bayesian Nonparametrics: Principles and Practice (2010)
11. Ishwaran, H., James, L.: Gibbs sampling methods for stick-breaking priors. Journal
of ASA 96, 161–173 (2001)
12. Buntine, W., Jakulin, A.: Discrete components analysis. In: Subspace, Latent
Structure and Feature Selection Techniques (2006)
13. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
14. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes.
Journal of the ASA 101, 1566–1581 (2006)
15. Du, L., Buntine, W., Jin, H.: A segmented topic model based on the two-parameter
Poisson-Dirichlet process. Mach. Learn. 81, 5–19 (2010)
16. Du, L., Buntine, W., Jin, H.: Sequential latent Dirichlet allocation: Discover un-
derlying topic structures within a document. In: ICDM 2010, pp. 148–157 (2010)
17. Pitman, J., Yor, M.: The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals Prob. 25, 855–900 (1997)
18. Buntine, W., Hutter, M.: A Bayesian review of the Poisson-Dirichlet process. Tech-
nical Report arXiv:1007.0296, NICTA and ANU, Australia (2010)
19. Teh, Y.: A Bayesian interpretation of interpolated Kneser-Ney. Technical Report
TRA2/06, School of Computing, National University of Singapore (2006)
20. Buntine, W., Du, L., Nurmi, P.: Bayesian networks on Dirichlet distributed vectors.
In: PGM 2010, pp. 33–40 (2010)
21. Blei, D.M., Griffiths, T.L., Jordan, M.I.: The nested Chinese restaurant process and
Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 1–30 (2010)
22. Teh, Y.: Nonparametric Bayesian mixture models - release 2.1. Technical Report, University College London (2004), http://www.gatsby.ucl.ac.uk/~ywteh/research/software.html
23. Wang, C., Blei, D.: Variational inference for the nested Chinese restaurant process.
In: NIPS 2009, pp. 1990–1998 (2009)
24. Wang, C., Paisley, J., Blei, D.: Online variational inference for the hierarchical
Dirichlet process. In: AISTATS 2011 (2011)
25. Teh, Y., Kurihara, K., Welling, M.: Collapsed variational inference for HDP. In:
NIPS 2007 (2007)
26. Blunsom, P., Cohn, T., Goldwater, S., Johnson, M.: A note on the implementation
of hierarchical Dirichlet processes. In: ACL 2009, pp. 337–340 (2009)
27. Buntine, W.: Estimating likelihoods for topic models. In: Zhou, Z.-H., Washio, T.
(eds.) ACML 2009. LNCS, vol. 5828, pp. 51–64. Springer, Heidelberg (2009)
28. Wallach, H., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for
topic models. In: ICML 2009, pp. 672–679 (2009)
Preference-Based Policy Iteration: Leveraging
Preference Learning for Reinforcement Learning
Abstract. This paper makes a first step toward the integration of two
subfields of machine learning, namely preference learning and reinforce-
ment learning (RL). An important motivation for a “preference-based”
approach to reinforcement learning is a possible extension of the type of
feedback an agent may learn from. In particular, while conventional RL
methods are essentially confined to dealing with numerical rewards, there
are many applications in which this type of information is not naturally
available, and in which only qualitative reward signals are provided in-
stead. Therefore, building on novel methods for preference learning, our
general goal is to equip the RL agent with qualitative policy models, such
as ranking functions that allow for sorting its available actions from most
to least promising, as well as algorithms for learning such models from
qualitative feedback. Concretely, in this paper, we build on an existing
method for approximate policy iteration based on roll-outs. While this
approach is based on the use of classification methods for generalization
and policy learning, we make use of a specific type of preference learning
method called label ranking. Advantages of our preference-based policy
iteration method are illustrated by means of two case studies.
1 Introduction
environment itself but, say, by a human expert (e.g., "In this situation, action a would have been better than a′"), is typically of a qualitative nature, too.
In order to make RL more amenable to qualitative feedback, we build upon
formal concepts and methods from the rapidly growing field of preference learn-
ing [5]. Roughly speaking, we consider the RL task as a problem of learning
the agent’s preferences for actions in each possible state, that is, as a problem
of contextualized preference learning (with the context given by the state). In
contrast to the standard approach to RL, the agent’s preferences are not nec-
essarily expressed in terms of a utility function. Instead, more general types of
preference models, as recently studied in preference learning, can be envisioned,
such as total and partial order relations.
Interestingly, this approach is in a sense in-between the two extremes that
have been studied in RL so far, namely learning numerical utility functions for
all actions (as in Q-learning [15]) and, on the other hand, directly learning a
policy which predicts a single best action in each state [11]. One may argue that
the former approach is unnecessarily complex, since precise utility degrees are
actually not necessary for taking optimal actions, whereas the latter approach
is not fully effectual, since a prediction in the form of a single action neither suggests alternative actions nor offers any means for a proper exploration. An order relation on the set of actions seems to provide a reasonable compromise, as it supports the exploitation of acquired knowledge, i.e., the selection of (presumably) optimal actions, as well as the exploration of alternatives, i.e., the selection of suboptimal but still promising actions.
In this paper, we make a first step toward the integration of preference learn-
ing and reinforcement learning. We build upon a policy learning approach called
approximate policy iteration, which will be detailed in Section 2, and propose
a preference-based variant of this algorithm (Section 3). While the original ap-
proach is based on the use of classification methods for generalization and pol-
icy learning, we employ label ranking algorithms for incorporating preference
information. Advantages of our preference-based policy iteration method are
illustrated by means of two case studies presented in Sections 4 and 5.
is often defined as maximizing the expected sum of rewards (given the initial
state s), with future rewards being discounted by a factor γ ∈ [0, 1]:
$$V^{\pi}(s) = \mathbf{E}\left[\, \sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \;\Big|\; s_0 = s \right] \tag{1}$$
Several methods for label ranking have already been proposed in the literature;
we refer to [14] for a comprehensive survey. The idea of learning by pairwise
comparison (LPC) [8] is to train a separate model Mi,j for each pair of labels
(yi , yj ) ∈ Y × Y , 1 ≤ i < j ≤ k; thus, a total number of k(k − 1)/2 models is
needed. At classification time, a query x is submitted to all models, and each
prediction Mi,j (x) is interpreted as a vote for a label. More specifically, assuming
scoring classifiers that produce normalized scores fi,j = Mi,j (x) ∈ [0, 1], the
weighted voting technique interprets $f_{i,j}$ and $f_{j,i} = 1 - f_{i,j}$ as weighted votes for classes $y_i$ and $y_j$, respectively, and predicts the class $y^*$ with the highest sum of weighted votes, i.e., $y^* = \arg\max_i \sum_{j \neq i} f_{i,j}$. We refer to [8] for a more detailed
description of LPC in general and a theoretical justification of the weighted
voting procedure in particular.
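A minimal sketch of LPC prediction by weighted voting follows, assuming the k(k−1)/2 pairwise models have already been trained and that each exposes a (hypothetical) score method returning the normalized score f_ij ∈ [0, 1] for a query x:

```python
import numpy as np

def lpc_predict(x, models, k):
    """models[(i, j)], i < j, returns f_ij, the normalized score of label
    y_i against y_j; f_ij and 1 - f_ij act as weighted votes."""
    votes = np.zeros(k)
    for i in range(k):
        for j in range(i + 1, k):
            f_ij = models[(i, j)].score(x)
            votes[i] += f_ij
            votes[j] += 1.0 - f_ij
    ranking = np.argsort(-votes)   # label ranking, most preferred first
    return ranking[0], ranking     # y* = arg max_i sum_{j != i} f_ij
```

Returning the full ranking, rather than only the top label, is precisely what the preference-based policy iteration below exploits.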
than all remaining actions.1 For the preference-based approach, on the other
hand, it suffices if only two possible actions yield a clear preference in order to
obtain (partial) training information about that state. Note that a corresponding
comparison may provide useful information even if both actions are suboptimal.
In Section 5, an example will be shown in which actions are not necessarily
comparable, since the agent seeks to optimize multiple criteria at the same time
(and is not willing to aggregate them into a one-dimensional target). In general,
this means that, while at least some of the actions will still be comparable in a
pairwise manner, a unique optimal action does not exist.
Regarding the type of prediction produced, it was already mentioned earlier
that a ranking-based reinforcement learner can be seen as a reasonable compro-
mise between the estimation of a numerical utility function (like in Q-learning)
and a classification-based approach which provides only information about the
optimal action in each state: The agent has enough information to determine
the optimal action, but can also rely on the ranking in order to look for alter-
natives, for example to steer the exploration towards actions that are ranked
higher. We will briefly return to this topic at the end of the next section. Before
that, we will discuss the experimental setting in which we evaluate the utility of
the additional ranking-based information.
For each task and method, we tried five numbers of state samples s ∈ {10, 20,
50, 100, 200}, five maximum numbers of roll-outs r ∈ {10, 20, 50, 100, 200}, and
three levels of significance c ∈ {0.025, 0.05, 0.1}. Each of the 5 × 5 × 3 = 75
parameter combinations was evaluated ten times, such that the total number of
experiments per learning task was 750. We tested both domains, mountain car
and inverted pendulum, with a ∈ {3, 5, 9, 17} different actions each.
Our prime evaluation measure is the success rate (SR), i.e., the percentage
of learned sufficient policies. Following [3], we plot a cumulative distribution of
the success rates of all different parameter settings over a measure of learning
complexity, where each point (x, y) indicates the minimum complexity x needed
to reach a success rate of y. However, while [3] simply use the number of roll-outs
(i.e., the number of sampled states) as a measure of learning complexity, we use
the number of performed actions over all roll-outs, which is a more fine-grained
complexity measure. The two would coincide if every roll-out performed a constant number of actions; however, this is typically not the case, as some roll-outs may stop earlier than others. Thus, we generated graphs by sorting all successful runs over all parameter settings (i.e., runs which yielded a sufficient policy) in increasing order of the number of applied actions and by plotting these runs along the x-axis with a y-value corresponding to their cumulative success rate.
This visualization can be interpreted roughly as the development of the success
rate in dependence of the applied learning complexity.
Fig. 1. Comparison of API, PAPI and PBPI for the inverted pendulum task (left) and the mountain car task (right); the number of actions increases from top to bottom. Each panel plots the success rate against the number of actions (log scale).
Fig. 2. Comparison of complete state evaluation (PBPI) with partial state evaluation in three variants (PBPI-1, PBPI-2, PBPI-3). The panels plot the success rate against the number of actions (left) and the number of preferences (right).
in some domains, it may also fail to find optimal solutions in other cases [11].
Selecting a pair of actions and following the better one may be a simple but
effective way of trading off exploration and exploitation for state sampling. We
are currently working on a more elaborate investigation of this issue.
Fig. 3. Illustration of the simulation model showing the patient's status (tumor size and toxicity) during the treatment. The initial tumor size is 1.5 and the initial toxicity is 0.5. The x-axis shows the month, with the corresponding dosage level the patient receives given in parentheses; the dosage levels are selected randomly.
$$\pi \succeq \pi' \;\Longleftrightarrow\; \left(C_X^{\pi} \le C_X^{\pi'}\right) \text{ and } \left(C_S^{\pi} \le C_S^{\pi'}\right) \tag{3}$$
It is important to remark that $\succeq$, thus defined, as well as the induced strict order $\succ$, are only partial order relations. In other words, it is entirely possible that two policies are incomparable. For our preference learning framework, this means that fewer pairwise comparisons may be generated as training examples. However,
in contrast to standard RL methods as well as the classification approach of [11],
this is not a conceptual problem. In fact, since these approaches are based on a
numerical reward function and, therefore, implicitly assume a total order among
policies (and actions in a state), they are actually not applicable in the case of
a partial order.
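To make the incomparability concrete, here is a minimal sketch of the pairwise test induced by (3), with each policy represented by its criterion pair (C_X, C_S) and smaller values preferred; all names are illustrative.

```python
def weakly_preferred(c_a, c_b):
    """True iff policy a is weakly preferred to policy b in the sense of (3):
    every criterion of a is <= the corresponding criterion of b."""
    return all(x <= y for x, y in zip(c_a, c_b))

def compare(c_a, c_b):
    a_pref, b_pref = weakly_preferred(c_a, c_b), weakly_preferred(c_b, c_a)
    if a_pref and not b_pref:
        return "a preferred"          # strict preference
    if b_pref and not a_pref:
        return "b preferred"
    return "equal" if a_pref else "incomparable"   # only a partial order

print(compare((1.2, 0.5), (1.0, 0.9)))   # -> incomparable: no training pair generated
```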
Fig. 4. Illustration of patients' status under different treatment policies. The x-axis shows the tumor size after 6 months; the y-axis shows the highest toxicity during the 6 months. From top to bottom: extreme dose level (1.0), high dose level (0.7), random dose level, learned dose level, medium dose level (0.4), low dose level (0.1). The values are averaged over 200 patients.
6 Conclusions
The goal of this work is to make first steps towards lifting conventional rein-
forcement learning into a qualitative setting, where reward is not available on
an absolute, numerical scale, but where comparative reward functions can be
used to decide which of two actions is preferable in a given state. To cope with
this type of training information, we proposed a preference-based extension of
approximate policy iteration. Whereas the original approach essentially reduces
reinforcement learning to classification, we tackle the problem by means of a
preference learning method called label ranking. In this setting, a policy is rep-
resented by a ranking function that maps states to total orders of all available
actions.
To demonstrate the feasibility of this approach, we performed two case studies.
In the first study, we showed that additional training information about lower-
ranked actions can be successfully used for improving the learned policies. The
second case study demonstrated one of the key advantages of a qualitative policy
iteration approach, namely that a comparison of pairs of actions is often more
feasible than the quantitative evaluation of single actions.
The work reported in this paper provides a point of departure for extensions
along several lines. For example, while the setting we assumed is not uncommon
in the literature, the existence of a generative model is a strong assumption.
In future work, we will therefore focus on generalizing our approach toward an
on-line learning setting with on-policy updates.
References
1. Barto, A.G., Sutton, R.S., Anderson, C.: Neuron-like elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 835–846 (1983)
2. Bhatnagar, S., Sutton, R.S., Ghavamzadeh, M., Lee, M.: Natural actor-critic algo-
rithms. Automatica 45(11), 2471–2482 (2009)
3. Dimitrakakis, C., Lagoudakis, M.G.: Rollout sampling approximate policy itera-
tion. Machine Learning 72(3), 157–171 (2008)
4. Fern, A., Yoon, S.W., Givan, R.: Approximate policy iteration with a policy language bias: Solving relational Markov decision processes. Journal of Artificial Intelligence Research 25, 75–118 (2006)
5. Fürnkranz, J., Hüllermeier, E. (eds.): Preference Learning. Springer, Heidelberg
(2010)
6. Gabillon, V., Lazaric, A., Ghavamzadeh, M.: Rollout allocation strategies for
classification-based policy iteration. In: Auer, P., Kaski, S., Szepesvàri, C. (eds.)
Proceedings of the ICML 2010 Workshop on Reinforcement Learning and Search
in Very Large Spaces (2010)
7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. SIGKDD Explorations 11(1), 10–18 (2009)
8. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning
pairwise preferences. Artificial Intelligence 172, 1897–1916 (2008)
9. Kersting, K., Driessens, K.: Non-parametric policy gradients: a unified treatment
of propositional and relational domains. In: Cohen, W.W., McCallum, A., Roweis,
S.T. (eds.) Proceedings of the 25th International Conference on Machine Learning
(ICML 2008), pp. 456–463. ACM, Helsinki (2008)
10. Konda, V.R., Tsitsiklis, J.N.: On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143–1166 (2003)
11. Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: Leveraging
modern classifiers. In: Fawcett, T.E., Mishra, N. (eds.) Proceedings of the 20th
International Conference on Machine Learning (ICML 2003), pp. 424–431. AAAI
Press, Washington, DC (2003)
12. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine
Learning 3, 9–44 (1988)
13. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods
for reinforcement learning with function approximation. In: Solla, S.A., Leen, T.K.,
Müller, K.-R. (eds.) Advances in Neural Information Processing Systems 12 (NIPS-
1999), pp. 1057–1063. MIT Press, Denver (1999)
14. Vembu, S., Gärtner, T.: Label ranking algorithms: A survey. In: Fürnkranz and
Hüllermeier [5], pp. 45–64.
15. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
16. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist
reinforcement learning. Machine Learning 8, 229–256 (1992)
17. Zhao, Y., Kosorok, M., Zeng, D.: Reinforcement learning design for cancer clinical
trials. Statistics in Medicine 28, 3295–3315 (2009)
Learning Recommendations in Social Media Systems by
Weighting Multiple Relations
Boris Chidlovskii
1 Introduction
Social media sharing sites like Flickr and YouTube contain billions of images and videos uploaded and annotated by millions of users. Tagging media objects has proven to be a powerful mechanism capable of improving media sharing and search facilities [19]. Tags play the role of metadata; however, they come in a free form reflecting the individual user's choice. Despite this freedom of tag choice, common usage topics can emerge when people agree on the semantic description of a group of objects.
The wealth of annotated and tagged objects on social media sharing sites can form a solid base for reliable tag recommendation [10]. There are two frequent modes of using recommendation systems on media sharing sites. In the bootstrap mode, the recommender system suggests the most relevant tags for newly uploaded media objects by observing the characteristics of the objects as well as their social context. In the query mode, an object is annotated with one or more tags and the system attempts to suggest to the user how to extend the tag set. In both modes, the system assists the user in the annotation and helps expand the coverage of tags on objects.
Modern social media sharing sites attract users' attention by offering a multitude of valuable services. They organize user activities around entities of different types, such as images, tags, users and groups, and relations between them. Tag recommendation for images or videos is one possible scenario in such a relational world; other scenarios may concern user contact recommendation, group recommendation, etc. This paper addresses recommendation tasks in the multi-relational setting, and our target is to determine the optimal combination of the available relations for a given recommendation task.
In the multi-relational setting, entities of different types get connected and form two
types of relations. The first type is the relations between entities of the same type,
2 Prior Art
Existing methods for tag recommendation often target the social network as a source
of collective knowledge. Recommender systems based on collective social knowledge
have been proven to provide relevant suggestions [9,15,18]. Some of these systems ag-
gregate the annotations used in a large collection of media objects independently of the
users that annotate them [18]. In other systems, recommendations can be personalized
by using the annotations for the images of a single user [9]. Both approaches come with
their advantages and drawbacks. When the recommendations are based on collective
knowledge the system can make good recommendations on a broad range of topics,
but is likely to miss some recommendations that are particularly relevant in a personal
context.
Random walks on weighted graphs are a well-established model [8,13]. Random walks have been used for ranking Flickr tags in [14], where the walks are executed on one relation only, such as an image-to-tag graph for tag recommendation. Similar techniques have been used in [2] for analysing the user-video graph on the YouTube site and for providing personalized suggestions.
Many recommendation algorithms are inspired by the analysis of the MovieLens
collection and Netflix competition; they focus only on using ratings information, while
disregarding information about the context of the recommendation process. With the
growth of social sites, several methods have been proposed for predicting tags and
adding contextual information from social networks to improve the performance of
recommendation systems.
One common approach is to address the problem as multi-label classification in a (multi-)relational graph. A technique for label propagation on the relational graph is proposed in [17], which develops an iterative algorithm for inference and learning in multi-label, multi-relational classification. Inference is performed iteratively by
propagating scores according to the multi-relational structure of the data. The method
extends the techniques of collective classification [11] in order to handle multiple rela-
tions and to perform multi-label classification in multi-graphs.
The concept of relational graph for integrating contextual information is used in [3].
It makes it straightforward to include different types of contextual information. The
recommendation algorithm in [3] models the browsing process of a user on a movie
database website by taking non-weighted random walks over the relational graph.
The approach closest to ours is [7]; it tries to automatically learn a ranking function for searching in typed (entity-relation) graphs. User input is in the form of a partial preference order between pairs of nodes, associated with a query. The node pairs are instances in learning the ranking function. For each pair, the method assigns a label representing the relative relevance. It then trains a classification model with the labelled data, making use of existing classification methodologies like SVM, Boosting, etc. [5].
Pairwise approaches to learning to rank are known to have several important limitations [6]. First, the objective of learning is formalized as minimizing errors in the classification of node pairs, rather than minimizing errors in node ranking. Second, the training process is computationally costly, as the number of document pairs is very large. Third, the number of generated node pairs varies greatly from query to query; this results in training a model biased toward queries with more node pairs.
All three issues are particularly critical in the social networking environment. One
alternative to the pairwise approach is based on the listwise approach, where node lists
and not node pairs are used as instances in learning [6].
What we propose in this paper is a probabilistic method to calculate the listwise loss function for the recommendation tasks. We transform both the scores of nodes in the relational graph assigned by a ranking function and the node annotations by humans into probability distributions. We can then utilize any metric between probability distributions as the loss function. In other words, our approach represents a listwise alternative to the pairwise one in [7] and shows its application to recommendation tasks on social media sites.
Finally, weight learning for random walks in social networks has recently been addressed in [1]. In the link prediction setting, the authors try to infer which interactions among existing network members are likely to occur in the near future. They develop
an algorithm based on supervised random walks that combines the information from the
network structure with node and edge level attributes. To guide a random walk on the
graph, they formulate a supervised learning task where the goal is to learn a function
that assigns strengths to edges in the network such that a random walker is more likely
to visit the nodes to which new links will be created in the future.
3 Relational Graph
The relational graph aims at representing all available entity types and relations between
them in one uniform way. The graph is given by G = (E, R), where an entity type
ek ∈ E is represented as a node and relation rkl ∈ R between entities of types ek and
el is represented as a edge between the two nodes.
The relational graph for (a part of) the Flickr web site1 is sketched in Figure 1. Nodes represent five entity types, E = {image, user, tag, group, comment}. Edges represent relations between entities of the same or different types. One example is the relation tagged_with between image and tag entities, indicating which images are tagged with which tags. Another example is the relation contact between user entities, which encodes the list of user contacts. The Flickr example reports one relation between entities of e_k and e_l; in the general case, the model can accommodate any number of relations between any two entity types. Another advantage of the relational graph is its capacity to accommodate any new type of contextual information.
The relational setting reflects the variety of user activities and services on Flickr and similar sites. Users can upload their images and share them with other users, participate in different interest groups, browse images of other users, comment and tag them, navigate through the image collection by tags, groups, etc.
Each individual relation r_kl ∈ R is expected to be internally homogeneous; in other words, bigger values in the relation tend to indicate higher importance. By contrast, different relations may have different importance for a given recommendation task. For example, the relation annotated_with(image, tag) is expected to be more important for the tag recommendation task than the relation member(user, group). On the other hand, for user contact recommendation, the importance of these two relations may be the opposite. In the following, we model the importance of a relation toward a given recommendation task with a non-negative weight.
Every relation r_kl ∈ R is unfolded (instantiated) in the form of a matrix $A_{kl} = \{a_{ij}^{kl}\}$, $i = 1, \dots, |e_k|$, $j = 1, \dots, |e_l|$, where $a_{ij}^{kl}$ indicates the relation between entity i ∈ e_k and entity j ∈ e_l. Values are binary (for example, in the tagged_with relation, $a_{ij} = 1$ if image i is tagged with tag j, and 0 otherwise); in the general case, the $a_{ij}$ are non-negative real values, so any matrix $A_{kl}$ is non-negative. For the needs of random walks, we assume $A_{kl}$ is a probability transition matrix, which can be obtained by row normalization.
1
http://www.flickr.com
Random Walks
We combine the relations to induce a probability distribution over entities by learning a Markov chain model such that its stationary distribution is a good model for a specific prediction task. Constructing Markov chains whose stationary distributions
are informative has been used in multiple applications, including the Google PageRank
algorithm [16] and HITS-like algorithms [4].
A Markov chain over a set of states S is specified by an initial distribution $P_0$ over S and a set of state transition probabilities $P(S_t \mid S_{t-1})$. A Markov chain defines a distribution over sequences of states via a generative process in which the initial state $S_0$ is first sampled according to distribution $P_0$, and then states $S_t$ (for t = 1, 2, ...) are sampled according to the transition probabilities. The stationary distribution of the Markov chain is given by $\pi(s) = \lim_{t \to \infty} P(S_t = s)$, if the limit exists.
To ensure that the Markov chain has a unique stationary distribution, the process can
be reset with a probability α > 0 according to the initial state distribution P0 . In practice
this prevents the chain from getting stuck in nodes having no transitions and small loops.
Having the Markov chain S0 , S1 , . . . with the initial state S0 distributed according to P0 ,
state transitions given by P and resetting probability α, it is straightforward to express
the stationary distribution π as follows:
$$\pi = \alpha \sum_{t=0}^{\infty} (1 - \alpha)^t\, P_0 P^t. \tag{1}$$
Equation (1) can be used to compute π efficiently. Because terms corresponding to large t have very little weight $(1 - \alpha)^t$, the series may be truncated after the first few (on the order of 1/α) terms without incurring significant error.
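A minimal sketch of this truncated computation of Equation (1), with P0 the initial distribution as a row vector and P the transition matrix; the cut-off of a few multiples of 1/α follows the remark above, and all names are illustrative.

```python
import numpy as np

def stationary_distribution(P0, P, alpha=0.1, n_terms=None):
    """Approximate pi = alpha * sum_{t>=0} (1 - alpha)^t P0 P^t by truncation."""
    if n_terms is None:
        n_terms = 3 * int(np.ceil(1.0 / alpha))   # a few multiples of 1/alpha
    pi = np.zeros_like(P0)
    term = P0.copy()                              # holds P0 P^t, starting at t = 0
    for t in range(n_terms):
        pi += alpha * (1.0 - alpha) ** t * term
        term = term @ P
    return pi / pi.sum()                          # renormalize away the truncated tail
```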
In this section we extend the Markov chain model to the unfolded relational graph.
Assume the relational graph includes b entity types, $e_1, \dots, e_b$. The total number of entities is denoted $N = \sum_{k=1}^{b} |e_k|$. The unfolded relational graph is composed of $b^2$ blocks, one block for each $(e_k, e_l)$ pair, k, l = 1, ..., b. Available relations fill up some blocks; other blocks can be left empty or filled up with composed relations using
the relation transitivity rule Akl = Akm Aml , where Akm and Aml are basic or other
composed relations. Note that there might exist several ways to compose a relation; a
particular choice often depends on the recommendation task.
In the Flickr relational graph (Figure 1), there are seven basic relations, which fill up the corresponding blocks and can be used to compose other relations. The tag co-occurrence relation is an example of a composed relation: if matrix $A_{IT}$ describes the relation tagged_with(image, tag), the tag co-occurrence matrix can be obtained as $A_{TT} = A_{IT}^{\top} A_{IT}$. Higher values in $A_{TT}$ indicate that more images are tagged with a given tag pair.
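In practice such compositions are sparse matrix products. The sketch below (illustrative sizes, using scipy.sparse) forms the tag co-occurrence block from the image-to-tag block, assuming the usual transposed co-occurrence product, and row-normalizes the result for the walk.

```python
import numpy as np
from scipy import sparse

n_images, n_tags = 1000, 300                     # illustrative sizes
A_IT = sparse.random(n_images, n_tags, density=0.01, random_state=0, format="csr")

# Tag co-occurrence: entry (t, t') grows with the number of images
# carrying both tags (assuming the A_IT^T A_IT composition).
A_TT = (A_IT.T @ A_IT).tocsr()

# Row-normalize to obtain a state-transition block for the random walk.
row_sums = np.asarray(A_TT.sum(axis=1)).ravel()
row_sums[row_sums == 0] = 1.0                    # avoid division by zero
A_TT = sparse.diags(1.0 / row_sums) @ A_TT
```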
When a random walk moves through relation r_kl ∈ R between entities of types e_k and e_l, the contribution of r_kl to the walk is expressed by a non-negative weight $w_{kl}$. The random walk is therefore performed over the matrix $A = \sum_{kl} w_{kl} A_{kl}$, a weighted sum over relations. Matrix A is ensured to be a probability transition matrix if $\sum_{l} w_{kl} = 1$, k = 1, ..., b. Also, $\pi_j$ denotes the projection of the stationary distribution π onto entity type j.
To initiate the random walk, the initial distribution $P_0$ is composed of b vectors $\delta_j$, j = 1, ..., b, with all elements relevant to the query. For the tag recommendation task, we compose three vectors $\delta_I$, $\delta_U$ and $\delta_T$, for images, users and tags. When recommending tags for image i, the i-th element in the image vector is 1, with all other elements $\delta_{Ij}$, j ≠ i, set to 0. Similarly, in the user vector $\delta_U$ only the owner u of image i is set to 1. The default choice for the tag vector $\delta_T$ is the all-ones vector. We however prefer to add a bias toward the user's tag vocabulary and preferences: we set the t-th element of vector $\delta_T$ to the log of the frequency with which tag t is used by user u in the collection, $\delta_{Tt} = \log(A_{UT}(u, t) + 1)$. Then the initial distribution $P_0$ is defined as the normalization of the composed vector $(\delta_1, \delta_2, \dots, \delta_b)$.
If the weights $w_{kl}$ are known or recommended by an expert, equation (1) can be used to estimate the stationary distribution π and its projection $\pi_j$. If the weights are unknown a priori, we propose a method that determines the values of the weights $w_{kl}$ that minimize a loss function on the training set.
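To make the weighted mixture concrete, the following sketch assembles A = Σ_kl w_kl A_kl as a block matrix over the b entity types; the blocks and weights are illustrative, each block is assumed row-normalized, and each row of W is assumed to sum to one as required above.

```python
import numpy as np

def mix_blocks(blocks, W, sizes):
    """blocks[k][l]: row-normalized |e_k| x |e_l| transition block, or None;
    W[k, l]: relation weight, each row of W summing to 1; sizes: [|e_1|, ..., |e_b|].
    Returns the N x N matrix A = sum_kl w_kl A_kl over the unfolded graph."""
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    A = np.zeros((offsets[-1], offsets[-1]))
    for k in range(len(sizes)):
        for l in range(len(sizes)):
            if blocks[k][l] is not None:
                A[offsets[k]:offsets[k+1], offsets[l]:offsets[l+1]] = \
                    W[k, l] * blocks[k][l]
    return A   # row-stochastic when each block is and each weight row sums to 1
```

Feeding this A together with a query-specific P0 into the truncated walk above yields the projection π_j used for ranking candidate tags or contacts.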
4 Weight Learning
To learn relation weights in the random walk, we approximate the stationary distribution π with the truncated version and look for an instantiation of the Markov model whose weights minimize a prediction error on a training set T. We express the optimization problem on the weights $w_{kl}$ as the minimization of a loss function on the training set.
The weighted random walk defined by a Markov chain query produces a probability distribution. Nodes having more links (with higher weights) to query nodes will accumulate more probability than nodes having fewer links of lower weights.
probability of i, and let p be its estimation by H. The price we pay when predicting p in place of y is defined by a loss function l(y, p). We use the square loss2 between y and p in the following form:
Without loss of generality, in the following sections we assume we cope with tag recommendation. Assume the tag set includes L tags. For a given image, let $Y_B$ denote a binary vector $Y_B = (y_1, \dots, y_L)$ where $y_i$ is 1 if the image is tagged with tag i, and 0 otherwise, i = 1, ..., L. The probability distribution over the tag set is $Y = (y_1, \dots, y_L)$ where $y_i$ is 0 or $1/|Y_B|$, i = 1, ..., L.
Let P denote an estimated tag probability distribution, $P = (p_1, \dots, p_L)$, where $\sum_{i=1}^{L} p_i = 1$. To measure the loss of using the estimated distribution P in place of the true distribution Y, we use a loss function which is symmetric in y and p and equals 0 only if y = p. The best candidate is the multi-label square function defined as follows
For the square loss function $L_{sq}$, we take its gradient as follows:

$$\nabla L_{sq}(Y, P) = \left( \frac{\partial}{\partial p_i}\, l_{sq}(y_i, p_i) \right)_{i=1,\dots,L} = \big( 2\,(y_i - p_i) \big)_{i=1,\dots,L}.$$
Given a training set T of images with tag probability distributions Y, we seek a scoring function H that minimizes the empirical loss over T, defined as follows:

$$Loss(H) = \frac{1}{|T|} \sum_{j \in T} L_{sq}(Y_j, P_j), \tag{4}$$
where $Y_j$ is the true probability vector for image j and $P_j$ is the predicted probability distribution. The weighted sum over relations of the b distinct entity types is $A = \sum_{kl} w_{kl} A_{kl}$. Bigger values of $w_{kl}$ indicate a higher importance of the relation between entities $e_k$ and $e_l$ for the task. We assume that the matrix $A_{kl}$ for relation $r_{kl}$ is normalized, with each row forming a state transition distribution. The mixture matrix A satisfies the same condition if the
2 Other loss functions are also tested in the evaluation section.
constraint $\sum_{l} w_{kl} = 1$, $w_{kl} \ge 0$ holds. The matrix A, however, is not required to be symmetric, so $w_{kl} \neq w_{lk}$ in the general case. Thus we obtain the following optimization problem:

$$\min_{w_{kl}}\ Loss(H) \quad \text{s.t.} \quad 0 \le w_{kl} \le 1, \qquad \sum_{l} w_{kl} = 1,\ k = 1, \dots, b. \tag{5}$$
$$\frac{\partial Loss(H)}{\partial w_{kl}} = \frac{1}{|T|} \sum_{j \in T} \nabla L_{sq}(Y_j, P_j)\, \frac{\partial P_j}{\partial w_{kl}}, \tag{6}$$

where $P_j = \alpha \sum_{t=1}^{k} (1 - \alpha)^t P_{0j} A^t$ and $P_{0j}$ is the initial probability distribution for image j.
The power series $A^t$, t = 1, 2, ..., are the only terms in $P_j$ depending on $w_{kl}$. To compute the derivative of this composite function, we use the chain rule for matrices and obtain a recursion for the first derivatives at step t:

$$\frac{\partial A^t}{\partial w_{kl}} = \frac{\partial A^{t-1}}{\partial w_{kl}}\, A + A^{t-1} A_{kl}. \tag{7}$$
Algorithm 1 presents meta-code for the loss function Loss(H) and its gradient ∇Loss(H), needed for solving the optimization problem (5) with a quasi-Newton method.
3 A regularization term on $w_{kl}$ can be added to the objective function.
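Since Algorithm 1 itself is not reproduced in this excerpt, the following is a hedged sketch of what such a loss/gradient evaluation can look like, propagating the derivative recursion (7) alongside the truncated walk; it handles a single training image for brevity, stores each relation as a full-size matrix that is zero outside its block, and all names are illustrative.

```python
import numpy as np

def loss_and_grad(P0, Y, blocks, W, alpha=0.1, n_terms=20):
    """blocks[(k, l)]: N x N matrix holding relation A_kl in its block,
    zero elsewhere; W[k, l]: relation weights. Returns the square loss of
    the truncated walk P against Y, and d loss / d w_kl via recursion (7)."""
    A = sum(W[k, l] * blocks[(k, l)] for (k, l) in blocks)
    At = np.eye(A.shape[0])                        # A^t, starting at t = 0
    dA = {kl: np.zeros_like(A) for kl in blocks}   # dA^t / dw_kl
    P = alpha * P0.copy()                          # t = 0 term of the series
    dP = {kl: np.zeros_like(P0) for kl in blocks}
    for t in range(1, n_terms):
        dA = {kl: dA[kl] @ A + At @ blocks[kl] for kl in blocks}  # rule (7)
        At = At @ A
        c = alpha * (1.0 - alpha) ** t
        P += c * (P0 @ At)
        for kl in blocks:
            dP[kl] += c * (P0 @ dA[kl])
    loss = float(np.sum((Y - P) ** 2))
    grad = {kl: float(np.sum(-2.0 * (Y - P) * dP[kl])) for kl in blocks}
    return loss, grad
```

A pair (loss, gradient) of this kind is what a constrained quasi-Newton routine (e.g., scipy.optimize.minimize with SLSQP and the simplex constraints of (5)) consumes.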
Hessian Matrix
The Hessian matrix H of the loss function may help the quasi-Newton method converge faster to an optimum point. The matrix requires the mixed derivatives with respect to the weights $w_{kl}$ and $w_{k'l'}$:

$$\frac{\partial^2 Loss(H)}{\partial w_{kl}\, \partial w_{k'l'}} = \frac{1}{|T|} \sum_{j \in T} \nabla L_{sq}(Y_j, P_j)\, \frac{\partial^2 P_j}{\partial w_{kl}\, \partial w_{k'l'}}, \tag{8}$$
The second derivatives in the Hessian matrix H can be developed by using the chain
rule for matrices, similarly to (7). We obtain a recursive formula where the values at
iteration t depend on the function values and its gradient on the previous step t − 1 of
the random walk.
$$\frac{\partial^2 A^t}{\partial w_{kl}\, \partial w_{k'l'}} = \frac{\partial}{\partial w_{k'l'}}\left( \frac{\partial A^{t-1}}{\partial w_{kl}}\, A + A^{t-1} A_{kl} \right) = \frac{\partial^2 A^{t-1}}{\partial w_{kl}\, \partial w_{k'l'}}\, A + \frac{\partial A^{t-1}}{\partial w_{kl}}\, A_{k'l'} + \frac{\partial A^{t-1}}{\partial w_{k'l'}}\, A_{kl}. \tag{9}$$
Algorithm 1 can be extended with the evaluation of the Hessian matrix in a straightforward way. The extension requires expanding lines 6 to 8 of Algorithm 1 with the evaluation of the second derivatives $\partial^2 P_j / (\partial w_{kl}\, \partial w_{k'l'})$ for a given object j using rule (9). Then lines 12 to 15 should be expanded to obtain the Hessian matrix for the entire set T using formula (8).
The power iteration for the gradient and Hessian can be prohibitive for large full matrices. Luckily, the matrices $A_{kl}$ are all sparse, and the cost of a matrix product is proportional to the number of non-zero elements in the matrix. This ensures reasonable speed when performing truncated random walks on the relational graph.
5 Evaluation
In this section we describe the real dataset used in the experiments, the evaluation set-
ting and the results of experiments for two different recommendation tasks.
Flickr dataset. In all evaluations, we use a set of Flickr data downloaded from the Flickr site with the help of social-network connectors (the Flickr API) [12]. The API gives access to entities and relational data, including users, groups of interest, and images with associated comments and tags.
We test the method of relation weight learning for the random walks described in Section 4 on three entity types, E = {image, tag, user}. The three core relations associated with these types are the image-to-tag relation R_IT = tagged_with(image, tag), the user-to-image relation R_UI = owner(user, image), and the user-to-user relation R_UU = contact(user, user).
We use a fragment of the Flickr dataset with 100,000 images; these images are owned by 1,951 users and annotated with 127,182 different tags (113,426 tags after normalization). The matrices of the three core relations are sparse; their elements follow the power-law distribution that is very common in social networks. In the image-to-tag matrix, an image has between 1 and 132 tags, with an average of 5.65 tags per image. The user-to-image matrix contains 1 to 384 images per user, with an average of 27.32 images. The number of contacts in the user-to-user matrix is between 0 and 43, with an average of 1.24 contacts.4
We run a series of experiments on the dataset with two tasks: tag recommendation for images and contact recommendation for users. In either task, we use the core relations to compose other relations. The way the composed relations are generated depends on the recommendation task:
Tag recommendation: the image-to-image matrix is composed as $A_{II} = A_{IT} A_{IT}^{\top}$. Other composed relations are the tag-to-tag matrix $A_{TT} = A_{IT}^{\top} A_{IT}$ and the user-to-tag matrix $A_{UT} = A_{UI} A_{IT}$, and their inverses.
User contact recommendation: the image-to-image matrix is composed as $A_{II} = A_{UI}^{\top} A_{UI}$, and the user-to-tag matrix is given by $A_{UT} = A_{UI} A_{IT}$.
In all cases, the matrix A is block-wise; the optimization problem (5) is solved for $b^2$ weights $w_{kl}$.
Image tag recommendation runs either in the bootstrap or the query mode. In bootstrap mode, the task is to predict tags for a newly uploaded image. In query mode, an image may already have some tags and the task is to extend them. In both modes, we measure the performance of predicting the top 5 and top |size| tags, where the number |size| of tags varies from image to image but is known in advance (and equals the size of the test tag set). The user contact recommendation task has been tested in the query mode only.
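These top-5 and top-|size| measurements reduce to standard precision and recall at a cut-off k; the following is a minimal illustrative helper, not the authors' code:

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of suggestions; relevant: held-out test set."""
    top = list(recommended)[:k]
    hits = sum(1 for t in top if t in relevant)
    return hits / k, hits / len(relevant)

# Top-|size| evaluation sets k to the (known) size of the held-out tag set:
p, r = precision_recall_at_k(["beach", "sea", "sun"], {"sea", "sky"}, k=2)
```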
4 The multi-relational Flickr dataset is available from the authors upon request.
Fig. 2. Image tag recommendation in the bootstrap mode for top 5 tags
Fig. 3. Recall and precision values in the query mode: a) Top 5 tags; b) Top |size| tags
Fig. 4. User contact recommendation in the query mode: precision and recall for the top 5 contacts
Resetting Coefficient
Figure 5 shows the impact of resetting coefficient α on the performance of the weight
learning. It reports the relative precision values for the tag recommendation in the boot-
strap mode. Three cases for 1K, 50K and 100K images are presented. As the figure
shows, there exists a convenient range of values α between 0.05 and 0.2 where the
maximum precision is achieved in all cases.
Number of Iterations
Another evaluation item concerns the convergence of the random walks in Equation (1).
Figure 6 shows the impact of truncating after varying number of iterations on the per-
formance of weight learning. It reports the precision and recall values for the tag recom-
mendation in the bootstrap mode. Two cases of 1,000 and 50,000 images are presented,
when the random walk is truncated after 1, 2, ..., 15 iterations. As the figure suggests, both precision and recall achieve their top values after 5 to 7 iterations in the first case and
after 10 to 12 iterations in the second case.
Finally, we tracked the evaluation of the Hessian matrix (9) in addition to the gradient in Algorithm 1. For small datasets (fewer than 5,000 images), using the Hessian helps converge faster to the local optimum, with a saving of 25% for the evaluation on 1000 images. The situation changes drastically for large datasets, where the evaluation of Hessian matrices becomes an important handicap. For this reason, Hessians were excluded from the evaluation for 10,000 images and more.
6 Conclusion
We presented the relational graph representation for multiple entity types and relations available in a social media system. We implemented the Markov chain model and presented its weighted extension to the relational graph. We have shown how to learn the relation weights by minimizing the loss between predicted and observed probability distributions. We reported the evaluation results of the image tag and user contact recommendation tasks on the Flickr dataset. The results of experiments confirm that the relation weights learned from the training set provide a consistent gain over the unweighted methods.
References
1. Backstrom, L., Leskovec, J.: Supervised random walks: Predicting and recommending links in social networks. In: Proc. ACM WSDM (2011)
2. Baluja, S., Seth, R., Sivakumar, D., Jing, Y., Yagnik, J., Kumar, S., Ravichandran, D., Aly, M.: Video suggestion and discovery for YouTube: Taking random walks through the view graph. In: Proc. WWW 2008, pp. 895–904 (2008)
3. Bogers, T.: Movie recommendation using random walks over the contextual graph. In: Proc.
2nd Workshop on Context-Aware Recommender Systems, CARS (2010)
4. Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P.: Link analysis ranking: algorithms,
theory, and experiments. ACM Trans. Internet Technol. 5(1), 231–297 (2005)
5. Burges, C.J.C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender,
G.N.: Learning to rank using gradient descent. In: Proc. ICML, pp. 89–96 (2005)
6. Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., Li, H.: Learning to rank: from pairwise approach to
listwise approach. In: Proc. ICML, pp. 129–136 (2007)
7. Chakrabarti, S., Agarwal, A.: Learning parameters in entity relationship graphs from ranking
preferences. In: Proc. PKDD, pp. 91–102 (2006)
8. Coppersmith, D., Doyle, P., Raghavan, P., Snir, M.: Random walks on weighted graphs and
applications to on-line algorithms. J. ACM 40(3), 421–453 (1993)
9. Garg, N., Weber, I.: Personalized, interactive tag recommendation for flickr. In: Proc. ACM
RecSys 2008, pp. 67–74 (2008)
10. Gupta, M., Li, R., Yin, Z., Han, J.: Survey on social tagging techniques. ACM SIGKDD
Explorations Newsletter 12, 58–72 (2010)
11. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational classifi-
cation. In: Proc. ACM KDD 2004, pp. 593–598 (2004)
12. Ko, M.N., Cheek, G.P., Shehab, M., Sandhu, R.: Social-networks connect services. IEEE
Computer 43(8), 37–43 (2010)
13. Toutanova, K., Manning, C.D., Ng, A.Y.: Learning random walk models for inducing word
dependency distributions. In: Proc. ICML (2004)
14. Liu, D., Hua, X.-S., Yang, L., Wang, M., Zhang, H.-J.: Tag ranking. In: Proc. WWW 2009,
pp. 351–360 (2009)
15. Overell, S., Sigurbjörnsson, B., van Zwol, R.: Classifying tags using open content resources.
In: Proc. ACM WSDM 2009, pp. 64–73 (2009)
16. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order
to the Web (1998)
17. Peters, S., Denoyer, L., Gallinari, P.: Iterative annotation of multi-relational social networks.
In: Proc. ASONAM 2010, pp. 96–103 (2010)
18. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective knowledge.
In: Proc. WWW 2008, pp. 327–336 (2008)
19. Tian, Y., Srivastava, J., Huang, T., Contractor, N.: Social multimedia computing. IEEE Com-
puter 43, 27–36 (2010)
Clustering Rankings in the Fourier Domain
1 Introduction
techniques have also been developed to handle input data that are themselves in the form of rankings, for a broad range of purposes: computation of centrality measures such as consensus or median rankings (see [MPPB07] or [CJ10] for instance), modelling/estimation/simulation of distributions on sets of rankings (see [Mal57], [FV86], [LL03], [MM09] and [LM08] among others), and ranking based on preference data (refer to [HFCB08], [CS01] or [dEW06]).
It is the main goal of this paper to consider the issue of clustering rank data
from a novel perspective, taking the specific nature of the observations into ac-
count. We point out that the method promoted in this paper is by no means the sole applicable technique. The most widely used approach to this problem consists in viewing rank data (and, more generally, ordinal data), when renormalized in an appropriate manner, as standard numerical data and applying state-of-the-art clustering techniques; see Chapter 14 in [HTF09]. Alternative
procedures of probabilistic type, relying on mixture modeling of probability dis-
tributions on a set of (partial) rankings, can also be considered, following in
the footsteps of [FV88]. Rank data are here considered as probability distribu-
tions on the symmetric group Sn , n ≥ 1 denoting the number of objects to
be (tentatively) ranked and our approach crucially relies on the Fourier trans-
form on the set of mappings f : Sn → R. Continuing the seminal contribution
of [Dia89], spectral analysis of rank data has been recently considered in the
machine-learning literature for a variety of purposes with very promising re-
sults, see [KB10], [HGG09] or [HG09]. This paper pursues this line of research.
It aims at showing that, in the manner of spectral analysis in signal processing,
Fourier representation is good at describing properties of distributions on Sn
in a sparse manner, ”sparse” meaning here that a small number (compared to
n!) of Fourier coefficients carry most of the significant information, from the
perspective of clustering especially. As shown in [Dia88], the main appeal of
spectral analysis in this context lies in the fact that Fourier coefficients encode
structural properties of ranking distributions in a very interpretable fashion.
Here we propose to use these coefficients as features for defining clusters. More
precisely, we shall embrace the approach developed in [WT10] (see also [FM04]),
in order to find clusters on an adaptively-chosen subset of features in the Fourier
domain.
The article is organized as follows. Section 2 describes the statistical framework, sets out the main notations and recalls the key notions of Fourier representation in the context of distributions on the symmetric group that will be used in the subsequent analysis. Preliminary arguments assessing the efficiency of the
Fourier representation for discrimination purposes are next sketched in Section
3. In particular, examples illustrating the capacity of parsimonious truncated
Fourier expansions to approximate efficiently a wide variety of distributions on
Sn are exhibited. In Section 4, the rank data clustering algorithm we propose,
based on subsets of spectral features, is described at length. Numerical results
based on artificial and real data are finally displayed in Section 5. Technical
details are deferred to the Appendix.
Example 2. (Preference data.) One may also consider the case where a collection of objects, drawn at random, are ranked by degree of preference. The events observed are then of the form $E = \{\sigma \in S_n : \sigma(i_1) < \dots < \sigma(i_m)\}$, with m ∈ {1, ..., n} and $(i_1, \dots, i_m)$ an m-tuple of the set of objects {1, ..., n}.
Example 3. (Bucket orders.) The top-k list model can be extended in the following way, in order to account for situations where preferences are aggregated. One observes a random partition $B_1, \dots, B_J$ of the set of instances for which: for all 1 ≤ j < l ≤ J and for any $(i, i') \in B_j \times B_l$, $\sigma(i) < \sigma(i')$.
$$W(\mathcal{C}) = \sum_{m=1}^{M} \; \sum_{1 \le i < j \le n} D(f_i, f_j) \cdot \mathbb{I}\{(f_i, f_j) \in \mathcal{C}_m^2\}, \tag{1}$$
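In code, this criterion is a double sum over same-cluster pairs. A minimal sketch follows, with D an arbitrary dissimilarity between objects and labels an assignment of the f_i to the M clusters (illustrative names):

```python
import numpy as np

def within_cluster_dissimilarity(fs, labels, D):
    """W(C) = sum_m sum_{i<j} D(f_i, f_j) * 1{f_i and f_j both lie in C_m}."""
    n = len(fs)
    return sum(D(fs[i], fs[j])
               for i in range(n) for j in range(i + 1, n)
               if labels[i] == labels[j])

# Toy example: an L1 dissimilarity between probability vectors over rankings.
D = lambda f, g: float(np.abs(np.asarray(f) - np.asarray(g)).sum())
print(within_cluster_dissimilarity([(0.5, 0.5), (0.6, 0.4), (0.1, 0.9)],
                                   labels=[0, 0, 1], D=D))   # -> 0.2
```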
group $\mathbb{R}$ of real numbers, the group $\mathbb{Z}$ of integers or the group $\mathbb{Z}/N\mathbb{Z}$ of integers modulo $N$. The elements of the abelian group $G$ act on functions $f : G \to \mathbb{C}$ by translation. Recall that, for any $g \in G$, the translation by $g$ is defined by $T_g(f) : x \in G \mapsto f(x - g)$. A crucial property of the Fourier transform $\mathcal{F}$ is that it diagonalizes all translation operators simultaneously: $\forall g \in G$,
$$\mathcal{F}(T_g(f))(\xi) = \chi_g(\xi) \cdot \mathcal{F}f(\xi),$$
where $\chi_g(\xi) = \exp(2i\pi g\xi)$ and $\xi$ belongs to the dual group $\widehat{G}$ (being $\mathbb{R}$, $\mathbb{R}/\mathbb{Z}$ and $\mathbb{Z}/N\mathbb{Z}$ when $G$ is $\mathbb{R}$, $\mathbb{Z}$ and $\mathbb{Z}/N\mathbb{Z}$ respectively). Consequently, the Fourier transform provides a sparse representation of all operators that are spanned by the collection of translations, such as convolution operators.
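To make this diagonalization property concrete, the following minimal sketch (Python with NumPy, not part of the original paper) checks the identity numerically for $G = \mathbb{Z}/N\mathbb{Z}$; note that NumPy's DFT convention carries a minus sign in the exponent of $\chi_g$.

```python
# A minimal numerical check (not from the paper) that the Fourier transform
# diagonalizes translations on the abelian group G = Z/NZ.  NumPy's DFT
# convention puts a minus sign in the exponent: chi_g(xi) = exp(-2i*pi*g*xi/N).
import numpy as np

N, g = 8, 3
f = np.random.default_rng(0).standard_normal(N)

Tg_f = np.roll(f, g)                                  # T_g(f): x -> f(x - g)
lhs = np.fft.fft(Tg_f)                                # F(T_g f)
chi_g = np.exp(-2j * np.pi * g * np.arange(N) / N)    # characters of Z/NZ
rhs = chi_g * np.fft.fft(f)                           # chi_g(xi) * F f(xi)

assert np.allclose(lhs, rhs)   # translation acts diagonally in the Fourier domain
```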
The diagonalization view on Fourier analysis extends to the case of non-
commutative groups such as Sn . However, in this case, the related eigenspaces
are not necessarily of dimension 1 anymore. In brief, in this context Fourier
transform only "block-diagonalizes" translations, as shall be seen below.
The Group Algebra $\mathbb{C}[S_n]$. The set $\mathbb{C}[S_n]$ is a linear space on which $S_n$ acts linearly as a group of translations $T_\sigma : f \in \mathbb{C}[S_n] \mapsto T_\sigma(f) = \sum_{\nu \in S_n} f(\nu \circ \sigma^{-1})\,\delta_\nu$. For clarity's sake, we recall the following notion.
Definition 1 (Convolution Product). Let $(f, g) \in \mathbb{C}[S_n]^2$. The convolution product of $g$ with $f$ (with respect to the counting measure on $S_n$) is the function defined by $f * g : \sigma \in S_n \mapsto \sum_{\nu \in S_n} f(\nu)\,g(\nu^{-1} \circ \sigma) \in \mathbb{C}$.
Remark 1. Notice that one may also write, for any $(f, g) \in \mathbb{C}[S_n]^2$ and all $\sigma \in S_n$, $(f * g)(\sigma) = \sum_{\nu \in S_n} f(\sigma \circ \nu^{-1})\,g(\nu)$. The convolution product $f * \delta_\sigma = \tau \in S_n \mapsto f(\tau \circ \sigma^{-1})$ reduces to the right translation of $f$ by $\sigma$, namely $T_\sigma f$. Observe in addition that, for $n > 2$, the convolution product is not commutative. For instance, $\delta_\sigma * \delta_\tau = \delta_{\sigma \circ \tau} \neq \delta_{\tau \circ \sigma} = \delta_\tau * \delta_\sigma$ when $\sigma \circ \tau \neq \tau \circ \sigma$.
The set C[Sn ] equipped with the pointwise addition and the convolution product
(see Definition 1 above) is referred to as the group algebra of Sn .
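As an illustration of these definitions, here is a minimal sketch (plain Python; the helper names are ours) of the group algebra $\mathbb{C}[S_n]$ with the convolution product of Definition 1, checking the non-commutativity noted in Remark 1.

```python
# A minimal sketch (plain Python; helper names are ours) of the group algebra
# C[S_n]: elements are dicts mapping permutation tuples to coefficients, and
# the convolution product follows Definition 1.
from itertools import permutations

def compose(s, t):
    """(s o t)(i) = s(t(i)); permutations stored as tuples of images."""
    return tuple(s[t[i]] for i in range(len(s)))

def inverse(s):
    inv = [0] * len(s)
    for i, si in enumerate(s):
        inv[si] = i
    return tuple(inv)

def convolve(f, g, n):
    """(f * g)(sigma) = sum_nu f(nu) g(nu^{-1} o sigma)."""
    return {sigma: sum(f.get(nu, 0.0) * g.get(compose(inverse(nu), sigma), 0.0)
                       for nu in permutations(range(n)))
            for sigma in permutations(range(n))}

# Non-commutativity (Remark 1): delta_sigma * delta_tau = delta_{sigma o tau}.
sigma, tau = (1, 0, 2), (0, 2, 1)        # two transpositions of S_3
fg = convolve({sigma: 1.0}, {tau: 1.0}, 3)
gf = convolve({tau: 1.0}, {sigma: 1.0}, 3)
assert fg[compose(sigma, tau)] == 1.0 and gf[compose(tau, sigma)] == 1.0
assert fg != gf                          # sigma o tau != tau o sigma
```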
Canonical Decomposition. In the group algebra formalism introduced above, a function $f$ is an eigenvector for all the right translations (simultaneously) whenever $\forall \sigma \in S_n$, $\delta_\sigma * f = \chi_\sigma f$, where $\chi_\sigma \in \mathbb{C}$ for all $\sigma \in S_n$. For instance, the function $f = \sum_{\sigma \in S_n} \delta_\sigma \equiv 1$ is easily seen to be such an eigenvector with $\chi_\sigma \equiv 1$. In addition, denoting by $\epsilon_\sigma$ the signature of any permutation $\sigma \in S_n$ (recall that it is equal to $(-1)^{I(\sigma)}$, where $I(\sigma)$ is the number of inversions of $\sigma$, i.e. the number of pairs $(i, j)$ in $\{1, \ldots, n\}^2$ such that $i < j$ and $\sigma(i) > \sigma(j)$), the function $f = \sum_{\sigma \in S_n} \epsilon_\sigma \delta_\sigma$ is also an eigenvector for all the right translations, with $\chi_\sigma = \epsilon_\sigma$. If one could find $n!$ such linearly independent eigenvectors, one would be able to define a notion of Fourier transform with properties very similar to those of the Fourier transform of functions defined on $\mathbb{Z}/N\mathbb{Z}$. Unfortunately, due to the lack of commutativity of $S_n$, the functions mentioned above are the only eigenvectors common to all right translation operators, up to a multiplicative constant. Switching from the notion of eigenvectors to that of irreducible subspaces makes it possible to define the Fourier transform, see [Ser88].
Hence, the Fourier transform of any function $f \in \mathbb{C}[S_n]$ is of the form: $\forall \xi \in \mathcal{R}_n$,
$$\mathcal{F}f(\xi) = \sum_{\sigma \in S_n} f(\sigma)\,\rho_\xi(\sigma),$$
where $\rho_\xi$ is a function on $S_n$ that takes its values in the set of unitary matrices with complex entries of dimension $d_\xi \times d_\xi$. Note that $\sum_\xi d_\xi^2 = n!$. For
clarity, we recall the following result, which summarizes some crucial properties of the spectral representation on $S_n$, analogous to those of the standard Fourier transform. We refer to [Dia88] for a nice account of the linear representation theory of the symmetric group $S_n$ as well as some of its statistical applications.
Example 4. For illustration purposes, Fig. 1 below displays the spectral analysis of the Mallows distribution (cf. [Mal57]) when $n = 5$, given by
$$f_{\sigma_0,\gamma}(\sigma) = \left\{ \prod_{j=1}^{n} \frac{1 - \exp\{-\gamma\}}{1 - \exp\{-j\gamma\}} \right\} \cdot \exp\{-\gamma \cdot d_\tau(\sigma, \sigma_0)\} \quad \text{for all } \sigma \in S_n,$$
denoting by $d_\tau(\sigma, \nu) = \sum_{1 \le i < j \le n} \mathbb{I}\{\sigma \circ \nu^{-1}(i) > \sigma \circ \nu^{-1}(j)\}$ the Kendall $\tau$ distance, for several choices of the location and concentration parameters $\sigma_0 \in S_n$ and $\gamma \in \mathbb{R}_+^*$. Precisely, the cases $\gamma = 0.1$ and $\gamma = 1$ have been considered. As shown by the plots of the coefficients, the more spread the distribution (i.e. the smaller $\gamma$), the more concentrated the Fourier coefficients.
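For concreteness, here is a minimal sketch (plain Python, not from the paper) of the Mallows probability mass function of Example 4, checking that the stated normalization constant makes it sum to one over $S_5$.

```python
# A minimal sketch (plain Python) of the Mallows pmf of Example 4, checking
# that the closed-form normalization makes it sum to one over S_5.
import math
from itertools import permutations

def kendall_tau(s, nu):
    """Number of discordant pairs between two permutations (the Kendall tau
    distance d_tau(s, nu) recalled in the text)."""
    n = len(s)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (s[i] - s[j]) * (nu[i] - nu[j]) < 0)

def mallows_pmf(sigma, sigma0, gamma):
    n = len(sigma0)
    norm = math.prod((1 - math.exp(-gamma)) / (1 - math.exp(-j * gamma))
                     for j in range(1, n + 1))
    return norm * math.exp(-gamma * kendall_tau(sigma, sigma0))

sigma0, gamma = (0, 1, 2, 3, 4), 1.0
total = sum(mallows_pmf(s, sigma0, gamma) for s in permutations(range(5)))
assert abs(total - 1.0) < 1e-9
```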
Fig. 1. Coefficients of the Mallows distribution and of its Fourier transform for several choices of the pair of parameters $(\sigma_0, \gamma)$
and/or denoised representations of rank data in certain situations, that are useful for discrimination purposes.
with $M = \#\mathcal{R}_n$, $d_{\xi^{(m)}}$ denoting the dimension of the irreducible space related to the frequency $\xi^{(m)}$.
3. Keeping the $K$ coefficients with largest second order moment, invert the Fourier transform, producing the approximant
$$F_K(\sigma) = \frac{1}{n!} \sum_{k=1}^{K} d_{\xi^{(k)}} \left\langle \rho_{\xi^{(k)}}(\sigma),\, \mathcal{F}F(\xi^{(k)}) \right\rangle_{d_{\xi^{(k)}}}. \qquad (2)$$
[Figure: distortion of the truncated approximant as a function of the number of retained coefficients, for $\gamma = 10$, $\gamma = 1$ and $\gamma = 0.1$ (three panels; y-axis: distortion).]
Roughly speaking, inequality (3) above says in particular that if the Fourier representation is sparse, i.e. $\mathcal{F}f(\xi)$ is zero at many frequencies $\xi$, then $f(\sigma)$ is not.
De-noising in $\mathbb{C}[S_n]$. The following example shows how to achieve noise suppression through the Fourier representation in some specific cases. Let $A \neq \emptyset$ be a subset of $S_n$ and consider the uniform distribution on it: $f_A = (1/\#A) \cdot \sum_{\sigma \in A} \delta_\sigma$. We suppose that it is observed with noise and shall study the noise effect in the Fourier domain. The noise is modeled as follows. Let $T$ denote a transposition drawn at random in the set $\mathcal{T}$ of all transpositions in $S_n$ ($\mathbb{P}\{T = \tau\} = 2/(n(n-1))$ for all $\tau \in \mathcal{T}$); the noisy observation is $f = f_A * \delta_T$. Notice that the operator modeling the noise here is considered in [RKJ07] for a different purpose.
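A minimal sketch of this noise model (plain Python, not from the paper): a transposition $T$ is drawn uniformly at random and, since a transposition is its own inverse, the noisy observation satisfies $(f_A * \delta_T)(\sigma) = f_A(\sigma \circ T)$.

```python
# A minimal sketch (plain Python) of the noise model: a transposition T is
# drawn uniformly at random; since T is its own inverse, the noisy observation
# satisfies (f_A * delta_T)(sigma) = f_A(sigma o T).
import random
from itertools import permutations

def compose(s, t):
    return tuple(s[t[i]] for i in range(len(s)))

def noisy_observation(f_A, n, rng=random):
    i, j = rng.sample(range(n), 2)        # uniform transposition T = (i j)
    T = list(range(n)); T[i], T[j] = T[j], T[i]
    T = tuple(T)
    return {sigma: f_A.get(compose(sigma, T), 0.0)
            for sigma in permutations(range(n))}

n = 4
A = [tuple(range(n))]                     # here A = {identity}
f_A = {sigma: 1.0 / len(A) for sigma in A}
f = noisy_observation(f_A, n)
# The mass of f_A is translated onto {sigma o T : sigma in A}:
print({s: v for s, v in f.items() if v > 0})
```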
We first recall the following result, proved by means of the Murnaghan-
Nakayama rule ([Mur38]), that provides a closed analytic form for the expected
Fourier transform of the random distribution f .
The proposition above deserves some comments. Notice first that the map $\xi \in \mathcal{R}_n \mapsto a_\xi$ is antisymmetric (i.e. $a_{\xi'} = -a_\xi$ for any $\xi$ in $\mathcal{R}_n$, $\xi'$ denoting the conjugate diagram) and one can show that $a_\xi$ is a decreasing function for the natural partial order on Young diagrams [Dia88]. As shown in [RS08], $|a_\xi| = O(n^{-1/2})$ for diagrams $\xi$ satisfying $r(\xi), c(\xi) = O(\sqrt{n})$. For the two lowest frequencies, namely $(n)$ and $(n-1, 1)$, the proposition shows that $a_{(n)} = 1$ and $a_{(n-1,1)} = (n-3)/(n-1)$. Roughly speaking, this means that the noise leaves the highest and lowest frequencies almost untouched (up to a change of sign), while attenuating moderate frequencies. Looking at extreme frequencies should thus allow one to recover/identify $A$.
Remark 5 (A more general noise model). The model above can be extended in several manners, by considering, for instance, noisy observations of the form $f = f_A * \delta_{S_m}$, where $S_m$ is picked at random among permutations that can be decomposed as a composition of $m \ge 1$ transpositions (and no less). One may then show that $\mathbb{E}[\mathcal{F}f(\xi)] = a_\xi^{(m)} \cdot \mathcal{F}f_A(\xi)$ for all $\xi \in \mathcal{R}_n$, where the $a_\xi^{(m)}$'s satisfy the following property: for all frequencies $\xi \in \mathcal{R}_n$ whose row number and column number are both less than $c_1\sqrt{n}$ for some constant $c_1 < \infty$, $|a_\xi^{(m)}| \le c_2 \cdot n^{-m/2}$, where $c_2$ denotes some finite constant. See [RS08] for further details.
$$M(\mathcal{C}) = \sum_{l=1}^{L} \sum_{1 \le i,j \le N} \|f_i - f_j\|^2 \cdot \mathbb{I}\{(f_i, f_j) \in \mathcal{C}_l^2\} = \frac{1}{n!} \sum_{\xi \in \mathcal{R}_n} d_\xi \sum_{l=1}^{L} \sum_{\substack{1 \le i,j \le N:\\ (f_i, f_j) \in \mathcal{C}_l^2}} \|\mathcal{F}f_i(\xi) - \mathcal{F}f_j(\xi)\|^2_{HS(d_\xi)},$$
switching to the Fourier domain by using the Parseval relation (see Proposition 1). Given the high dimensionality of rank data (cf. Remark 3), such a rigid way of measuring dissimilarity may prevent the optimization procedure from identifying the clusters, the main features possibly responsible for the differences being buried in the criterion. As shown in the previous section, in certain situations a few well-chosen spectral features may suffice to exhibit similarities or dissimilarities between distributions on $S_n$. Following in the footsteps of [WT10], we propose to achieve a sparse clustering of the rank data by considering the following optimization problem: $\lambda > 0$ being a tuning parameter, minimize
$$M_\omega(\mathcal{C}) = \sum_{\xi \in \mathcal{R}_n} \frac{d_\xi}{n!}\, \omega_\xi \sum_{l=1}^{L} \sum_{\substack{1 \le i,j \le N:\\ (f_i, f_j) \in \mathcal{C}_l^2}} \|\mathcal{F}f_i(\xi) - \mathcal{F}f_j(\xi)\|^2_{HS(d_\xi)} \qquad (5)$$
$$\sum_{\xi \in \mathcal{R}_n} \sum_{m=1}^{d_\xi} \sum_{l=1}^{L} \sum_{\substack{1 \le i,j \le N:\\ (f_i, f_j) \in \mathcal{C}_l^2}} \omega_{\xi,m}\,\big(\mathcal{F}f_i(\xi)_m - \mathcal{F}f_j(\xi)_m\big)^2.$$
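As a rough illustration of this criterion, the following sketch (NumPy; the variable names are ours) computes a within-cluster dissimilarity in which a nonnegative weight per spectral feature modulates its contribution; a sparse weight vector then amounts to clustering on a selected subset of Fourier features, in the spirit of [WT10].

```python
# A rough illustration (NumPy; names are ours) of a weighted within-cluster
# dissimilarity in the spirit of criterion (5): each spectral feature carries
# a nonnegative weight, and a sparse weight vector amounts to clustering on a
# selected subset of Fourier features, as in [WT10].
import numpy as np

def weighted_within_cluster(X, labels, w):
    """X: (N, p) spectral features, labels: cluster index per row,
    w: (p,) nonnegative feature weights."""
    total = 0.0
    for l in np.unique(labels):
        C = X[labels == l]
        diffs = C[:, None, :] - C[None, :, :]        # pairwise differences
        total += np.sum(w * np.sum(diffs ** 2, axis=(0, 1)))
    return total

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))
labels = np.array([0] * 5 + [1] * 5)
w = np.array([1.0, 1.0, 0.0, 0.0, 0.0])              # only two features count
print(weighted_within_cluster(X, labels, w))
```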
5 Numerical Experiments
This section presents experiments on artificial and real data which demonstrate that sparse clustering on the Fourier representation recovers clustering information on rank data. For each of the three studied datasets, a hierarchical clustering is performed.
[Figure: dendrograms obtained for distributions on $S_7$, for (a) $\gamma = 0.1$, (b) $\gamma = 1$ and (c) $\gamma = 10$; leaves are labeled by permutations such as [6517423].]
Number of coefficients:

Representation           | Total | Selected (γ = 0.1) | Selected (γ = 1) | Selected (γ = 10)
Probability distribution | 5,040 | 8                  | 3                | 1
Fourier transform        | 5,040 | 257                | 54               | 6
A first step toward this goal is to group users with similar behavior, that is, to group users based on the top-k rankings associated with their past purchases. We consider the 149 users who have purchased at least 5 products among the 8 most purchased products. The sparse hierarchical clustering approach receives as input the 6,996 smallest frequency coefficients and selects 5 of them. The corresponding dendrogram (cf. Figure 5) clearly shows 4 clusters among the users. On 7 independent splits of the dataset into two parts of equal size, the criterion optimized by sparcl varies from 46.7 to 51.3, with a mean value of 49.1 and a standard deviation of 1.4. The stability of the criterion increases the confidence in this clustering of examples.
6 Conclusion
$$\|\mathcal{F}f(\xi)\|_{HS(d_\xi)} = \Big\| \sum_{\sigma \in S_n} f(\sigma)\rho_\xi(\sigma) \Big\|_{HS(d_\xi)} \le \sqrt{d_\xi}\, \|f\|_1 \le \sqrt{d_\xi}\, (\#\mathrm{supp}(f))^{1/2} \|f\| \le \sqrt{d_\xi}\, (\#\mathrm{supp}(f))^{1/2} \left( \frac{1}{n!} \sum_{\xi' \in \mathrm{supp}(\mathcal{F}f)} d_{\xi'} \|\mathcal{F}f(\xi')\|^2_{HS(d_{\xi'})} \right)^{1/2},$$
where the last two bounds result from the Cauchy-Schwarz inequality and the Plancherel relation respectively, with $\|f\|_1 \stackrel{\mathrm{def}}{=} \sum_{\sigma \in S_n} |f(\sigma)|$. The desired bound immediately follows.
References
[CJ10] Clémençon, S., Jakubowicz, J.: Kantorovich distances between rankings
with applications to rank aggregation. In: Proceedings of ECML 2010
(2010)
[CS01] Crammer, K., Singer, Y.: Pranking with ranking. In: NIPS (2001)
[CV09] Clémençon, S., Vayatis, N.: Tree-based ranking methods. IEEE Transac-
tions on Information Theory 55(9), 4316–4336 (2009)
[dEW06] desJardins, M., Eaton, E., Wagstaff, K.: Learning user preferences for sets
of objects. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg, A.,
Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503, pp. 273–280.
Springer, Heidelberg (2007)
[Dia88] Diaconis, P.: Group representations in probability and statistics. Institute
of Mathematical Statistics, Hayward (1988)
[Dia89] Diaconis, P.: A generalization of spectral analysis with application to
ranked data. The Annals of Statistics 17(3), 949–979 (1989)
[DS89] Donoho, D., Stark, P.: Uncertainty principles and signal recovery. SIAM
J. Appl. Math. 49(3), 906–931 (1989)
[FISS03] Freund, Y., Iyer, R.D., Schapire, R.E., Singer, Y.: An efficient boosting
algorithm for combining preferences. JMLR 4, 933–969 (2003)
[FM04] Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes.
JRSS 66(4), 815–849 (2004)
[FV86] Fligner, M.A., Verducci, J.S.: Distance based ranking models. JRSS Series
B (Methodological) 48(3), 359–369 (1986)
[FV88] Fligner, M.A., Verducci, J.S.: Multistage ranking models. JASA 83(403),
892–901 (1988)
[HFCB08] Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by
learning pairwise preferences. Artificial Intelligence 172, 1897–1917 (2008)
[HG09] Huang, J., Guestrin, C.: Riffled independence for ranked data. In: Pro-
ceedings of NIPS 2009 (2009)
[HGG09] Huang, J., Guestrin, C., Guibas, L.: Fourier theoretic probabilistic infer-
ence over permutations. JMLR 10, 997–1070 (2009)
[HTF09] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learn-
ing, 2nd edn., pp. 520–528. Springer, Heidelberg (2009)
[Kör89] Körner, T.: Fourier Analysis. Cambridge University Press, Cambridge (1989)
358 S. Clémençon, R. Gaudel, and J. Jakubowicz
[KB10] Kondor, R., Barbosa, M.: Ranking with kernels in Fourier space. In: Pro-
ceedings of COLT 2010 (2010)
[KLR95] Kahane, J.P., Lemarié-Rieusset, P.G.: Fourier series and wavelets. Rout-
ledge, New York (1995)
[Kon06] Kondor, R.: Snob: a C++ library for fast Fourier transforms on the symmetric group (2006), http://www.its.caltech.edu/~risi/Snob/
[LL03] Lebanon, G., Lafferty, J.: Conditional models on the ranking poset. In:
Proceedings of NIPS 2003 (2003)
[LM08] Lebanon, G., Mao, Y.: Non-parametric modeling of partially ranked data.
JMLR 9, 2401–2429 (2008)
[Mal57] Mallows, C.L.: Non-null ranking models. Biometrika 44(1-2), 114–130
(1957)
[MM09] Mandhani, B., Meila, M.: Tractable search for learning exponential models
of rankings. In: Proceedings of AISTATS 2009 (2009)
[MPPB07] Meila, M., Phadnis, K., Patterson, A., Bilmes, J.: Consensus ranking under
the exponential model. Proceedings of UAI 2007, 729–734 (2007)
[MS73] Matolcsi, T., Szücs, J.: Intersection des mesures spectrales conjuguées. C. R. Acad. Sci. Sér. I Math. (277), 841–843 (1973)
[Mur38] Murnaghan, F.D.: The Theory of Group Representations. The Johns Hop-
kins Press, Baltimore (1938)
[PTA+ 07] Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., Salakoski, T.:
Learning to rank with pairwise regularized least-squares. In: Proceedings
of SIGIR 2007, pp. 27–33 (2007)
[RBEV10] Richard, E., Baskiotis, N., Evgeniou, T., Vayatis, N.: Link discovery using
graph feature tracking. In: NIPS 2010, pp. 1966–1974 (2010)
[RKJ07] Howard, A., Kondor, R., Jebara, T.: Multi-object tracking with representations of the symmetric group. In: Proceedings of ICML 2007 (2007)
[RS08] Rattan, A., Sniady, P.: Upper bound on the characters of the symmetric
groups for balanced Young diagrams and a generalized Frobenius formula.
Adv. in Math. 218(3), 673–695 (2008)
[Ser88] Serre, J.P.: Algebraic groups and class fields. Springer, Heidelberg (1988)
[TWH01] Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters
in a data set via the gap statistic. J. Royal Stat. Soc. 63(2), 411–423 (2001)
[WT10] Witten, D.M., Tibshirani, R.: A framework for feature selection in clus-
tering. JASA 105(490), 713–726 (2010)
[WX08] Wünsch, D., Xu, R.: Clustering. IEEE Press, Wiley (2008)
PerTurbo: A New Classification Algorithm
Based on the Spectrum Perturbations of the
Laplace-Beltrami Operator
1 Introduction
Fig. 1. The training examples (points in red) are used to estimate a low-dimensional space where the data live (in light blue). Then, classically, once a new sample (in green) has to be labelled, it can be projected into this embedding, where an embedded metric is used (a). In the case of PerTurbo (b), the deformation of the manifold is evaluated thanks to the Laplace-Beltrami operator.
2 State of the Art
Let us consider a Riemannian manifold $M$ (i.e. differentiable everywhere). Its geometric structure can be expressed by defining the Laplace-Beltrami operator $\Delta(\cdot)$, which can be seen as a generalization of the Laplacian operator to Riemannian manifolds. It completely defines the manifold up to an isometry [5]. Performing an eigenanalysis of this operator makes it possible to conduct various tasks on the manifold. Let $u_i(x)$ be the eigenfunctions and $\lambda_i$ the corresponding eigenvalues of this operator; they are the solutions of the associated eigenvalue equation.
whereas Kernel PCA considers the Gram matrix $G(T)$, whose $(i,j)$-th element is $q(x_i, x_j)$, where $q$ is a kernel function. If the kernel is the Euclidean dot product, i.e. if $q_{i,j} = \langle x_i, x_j \rangle$, then $G(T)$ turns out to have the same eigenvectors and eigenvalues as $\mathrm{Cov}(T)$ (up to a square root). Finally, the projection $\tilde{x}_\mathcal{Y}$ of any test sample $\tilde{x}$ in the latent space $\mathcal{Y}$ is given by the following linear combination:
$$\tilde{x}_\mathcal{Y} = \sum_{\ell=1}^{\dim(\mathcal{Y})} \left( \sum_{i=1}^{|T|} \alpha_i^\ell \cdot q(\tilde{x}, x_i) \right) \cdot e_\ell \qquad (5)$$
where $\mathbb{1}_{ij} = 1$. As can be seen in [10], finding the space spanned by the eigenvectors of the $\dim(\mathcal{Y})$ smallest eigenvalues of $L$ is equivalent to finding the space spanned by the eigenvectors of the $\dim(\mathcal{Y})$ largest eigenvalues of $T$. Then, performing a Kernel PCA with kernel $q(x_i, x_j) = T_{ij}$ is equivalent to applying GLE. The main interest of GLE with respect to generic kernel PCA is to provide a more interpretable feature space. On the other hand, as pointed out in [10], the projection of a test sample onto $\mathcal{Y}$ is not possible, as there is no analytical form for the kernel: the consideration of a new sample would mean redefining $G$ as well as $q$, and re-performing the kernel PCA. Thus, GLE cannot be used for classification.
In [8], the manifold $M$ practically corresponds to the 3D surface of an object that one wants to reconstruct from a mesh $T$ which samples $M$. To make sure that the reconstruction is as accurate as possible while minimizing the number of points in the mesh, it is necessary to add or remove samples from $T$. To evaluate the interest of a point in the mesh, the authors propose to estimate the modification of the Laplace-Beltrami spectrum induced by the adjunction of this point. Their work is based on the interpretation of this perturbation estimation in a kernel machine learning setting. In this paper, we propose to interpret the perturbation measure from [8] as a dissimilarity criterion. We show that the latter makes it possible to conduct classification according to the dimensionality reduction performed in GLE. We also establish that this measure provides an efficient tool to design queries in an active learning scheme.
$$\|r(\tilde{x})\|^2 = \|\Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x})\|^2 = \big(\Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x})\big)^T \cdot \Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x})$$
$$= \phi(\tilde{x})^T\Phi\big((\Phi^T\Phi)^{-1}\big)^T\Phi^T\Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x}) = \big(\phi(\tilde{x})^T\Phi\big)\big((\Phi^T\Phi)^T\big)^{-1}\big(\Phi^T\phi(\tilde{x})\big) \qquad (7)$$
which, with the kernel notation $k_{\tilde{x}} = \Phi^T\phi(\tilde{x})$ and noting that $K = \Phi^T\Phi$, gives $\|r(\tilde{x})\|^2 = k_{\tilde{x}}^T K^{-1} k_{\tilde{x}}$. Finally, remarking that for a Gaussian kernel $\|\phi(\tilde{x})\|^2 = k(\tilde{x}, \tilde{x}) = 1$, the final perturbation measure reads [8]:
$$\tau(\tilde{x}, M) = 1 - k_{\tilde{x}}^T K^{-1} k_{\tilde{x}}.$$
Fig. 2. Illustration of the measure on a spiral toy dataset: (a-b) the measure $\tau$ is shown for two different values of the parameter $\sigma$, $\sigma = 0.1$ and $\sigma = 1$. (c) The regularized $\tau$ measure according to equation (11), with $\sigma = 1$ and $\alpha = 0.1$.
In practice, this measure belongs to $[0, 1]$. If $\tau(\tilde{x}, M) = 0$, then the new point does not modify the manifold, whereas if $\tau(\tilde{x}, M) = 1$, the modification is maximal, as illustrated in Figure 3. From this measure, a natural class-wise
measure arises, where a dissimilarity to each class $\ell$ is derived from the perturbation of the manifold $M_\ell$ associated with class $\ell$. Each new test sample $\tilde{x}$ is then associated with the class inducing the least perturbation, i.e. $\operatorname{argmin}_\ell \tau(\tilde{x}, M_\ell)$.
$$\tilde{K} = K + \alpha I, \qquad (11)$$
where $I$ is the identity matrix and $\alpha \in \mathbb{R}_+^*$. In such a case, it appears that the perturbation measure $\tau(\cdot)$ shares important similarities with the kernel Mahalanobis distance derived from the regularized covariance operator [20]. The main difference derives from the fact that the measure $\tau$ is normalized and thus allows for comparisons between several classes. Hence, despite the data not being centered, the perturbation measure can be pictured as a Mahalanobis distance in the feature space. Let us note that, as indicated in [20], the kernel Mahalanobis distance is not affected by kernel centering.
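The following minimal sketch (NumPy; the function names are ours, not from [8]) puts the pieces together: the regularized per-class perturbation measure $\tau_\ell(\tilde{x}) = 1 - k_{\tilde{x}}^T (K_\ell + \alpha I)^{-1} k_{\tilde{x}}$ with a Gaussian kernel, and classification by least induced perturbation.

```python
# A minimal sketch (NumPy; function names are ours, not from [8]) of the
# regularized per-class perturbation measure with a Gaussian kernel,
# tau_l(x) = 1 - k_x^T (K_l + alpha I)^{-1} k_x, and of classification by
# least induced perturbation.
import numpy as np

def gaussian_kernel(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def tau(x, X_class, sigma=1.0, alpha=0.1):
    K = gaussian_kernel(X_class, X_class, sigma) + alpha * np.eye(len(X_class))
    k_x = gaussian_kernel(X_class, x[None, :], sigma).ravel()
    return 1.0 - k_x @ np.linalg.solve(K, k_x)   # k(x, x) = 1 for a Gaussian kernel

def classify(x, class_sets, **kw):
    # assign x to the class whose manifold is least perturbed by its adjunction
    return min(class_sets, key=lambda l: tau(x, class_sets[l], **kw))

rng = np.random.default_rng(0)
classes = {0: rng.normal(0, 1, (30, 2)), 1: rng.normal(4, 1, (30, 2))}
print(classify(np.array([3.8, 4.1]), classes))   # expected: class 1
```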
Fig. 3. Illustration of the classification procedure on two different datasets: (a-b-c) two intertwined spirals and (d-e-f) a mixture of 3 Gaussians. In both cases, the left-hand images (a-d) present the classification with the original PerTurbo measure, the center images (b-e) present the spectrum cut (95% of the eigenvalues were kept), and the right-hand images (c-f) present the classification with the regularized PerTurbo measure. In the two examples, $\sigma = 1$ and $\alpha = 0.1$.
to derive methods to detect these particular test examples [21]. To do so, rather intuitive and efficient strategies are based on defining particular regions of interest in $\mathcal{X}$ and querying the test samples which live in them. More precisely, it is particularly efficient to query near the expected boundaries of the classes. Following the same idea, several other strategies can be defined, as described in [21].
Depending on the classification algorithm, defining a region of interest around the expected borders of the classes may be rather complicated. In this respect, PerTurbo is particularly interesting, as such a region of interest can easily be defined according to the level sets of the implicit surfaces derived from the potentials of the $\tau_\ell(\tilde{x})$, $\forall \ell$. In the case where there are only two classes, the border corresponds to the zeros of $|\tau_1(\cdot) - \tau_2(\cdot)|$, and the region to query should be defined around this border.
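A minimal sketch of this query rule (NumPy, reusing the tau function and class sets of the previous sketch): in the two-class case, the most ambiguous pool samples are those minimizing $|\tau_1 - \tau_2|$.

```python
# A minimal sketch of the query rule (NumPy; reuses tau and the class sets of
# the previous sketch): in the two-class case, query the unlabeled samples
# minimizing |tau_1 - tau_2|, i.e. those closest to the estimated border.
import numpy as np

def select_queries(X_pool, class_sets, n_queries=5, **kw):
    gaps = np.array([abs(tau(x, class_sets[0], **kw) - tau(x, class_sets[1], **kw))
                     for x in X_pool])
    return np.argsort(gaps)[:n_queries]   # indices of the most ambiguous samples

X_pool = np.random.default_rng(2).uniform(-2, 6, (200, 2))
print(select_queries(X_pool, classes))
```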
4 Experimental Assessment
In this part, we provide experimental comparisons between PerTurbo and the state of the art. First, we only consider off-line classification, without any active learning method. Second, we focus on the relevance of the query strategies.
Fig. 4. Illustration of the estimation of the location of the border (top), and selection of the region to query (bottom), for toy examples: a two-spiral case (left) and a three-Gaussian case (right).
the performances can be computed. On the other hand, for datasets with predefined training and testing sets (such as Hill-valley), we keep those provided so that the results are reproducible.
Three versions of PerTurbo are considered. In PerTurbo (full), we do not reduce the dimensionality of $K^{-1}$. In PerTurbo (gle), on the contrary, we consider only the highest eigenvalues of $K^{-1}$, so that they sum up to 95% of its trace, in a GLE-inspired way. Although it is possible, here we do not try to optimize the dimensionality reduction according to the dataset. Finally, in PerTurbo (reg), we apply a regularization to $K$. The cases where $K$ is not invertible are mentioned in the experiments; theoretically this should not happen, unless two samples are very close to each other, or due to numerical precision issues. Finally, the tuning of the $\sigma$ parameter (from the Gaussian kernel) and of the $\alpha$ parameter, when needed, is obtained by a classical cross-validation procedure operated on the learning set. These parameters were set once for all the classes in a given dataset, although one value per class could have been obtained.
Table 1. Description of the GMM simulated datasets and of the datasets from UCI
Table 2. Comparison of the accuracy rates (percentages) with the various classification algorithms. For simulated datasets, the mean and standard deviation over ten-fold repetition are given. N.I. stands for "not invertible", which refers to the inverse of $K$.
Accuracy rates are given in Table 2. Let us note that, in this table, k-NN accuracies are over-estimated, as the $k$ parameter is not estimated by cross-validation, but directly on the test set. This explains why its performances are sometimes higher than those of the SVM. On the contrary, SVM performances are optimized with cross-validation, as the SVM is much too sensitive to (hyper-)parameter tuning. Finally, the PerTurbo evaluations are loosely optimized, and a common tuning for all classes is used. Hence, the comparison is demanding for PerTurbo and lenient for the reference methods. From the accuracy rates, it appears that, except for the Wine dataset, where SVM completely outperforms both PerTurbo and k-NN, there is no strict and major dominance of one method over the others. Moreover, the relative interest of each method strongly depends on the dataset, as suggested by the No Free Lunch theorem [24]. Numerous datasets promote PerTurbo (SimData-1, SimData-2, SimData-3, Glass, hill-valley1, hill-valley2), whereas some others promote SVM (Diabetes, Blood-transfusion, Wine, Parkinsons), while the remaining ones (Ionosphere, Ecoli, Letter-recognition) do not significantly favor any method, as the performances are similar. Hence, even if it is impossible to assess from these tests the superiority of PerTurbo, it appears that this framework is able to provide classification methods comparable to the state of the art, and we are confident that, in the future, more elaborate and finely tuned versions of PerTurbo will provide even more accurate results.
Concerning the variations of PerTurbo (full, gle, reg), it appears that the best variant is not always the same, and that accuracies strongly depend on the dataset. Hence, further investigations are required to find the most suitable variant, as well as a methodology to tune the corresponding parameters.
Beyond these tests, we have observed that, in general, PerTurbo is less robust/accurate than SVM when:
(1) values of some variables are missing or there are integer-valued variables (binary variables being completely unsuited),
(2) the number of examples is very small, as the manifold is then difficult to characterize (however, the cross-validation for the tuning of the numerous parameters of SVM is also not possible with too few examples).
On the other hand, we have observed that when the number of classes is large, PerTurbo is more efficient than SVM. Moreover, in the presence of numerous noisy variables, dimensionality reduction or regularization techniques provide a fundamental advantage to PerTurbo with respect to SVM. Finally, several other elements promote PerTurbo: first, its extreme algorithmic simplicity, comparable to that of a PCA; second, its simplicity of use and the restricted number of parameters; third, its high computational efficiency, for both training and testing. As a matter of fact, training is so efficient that active learning procedures requiring several iterations of the training are easily manageable.
Fig. 5. Comparison of the active learning strategy (red, upper trajectory) with the reference one (blue, lower trajectory), for the SimData-1 (top), Ecoli (middle) and Wine (bottom) datasets. The classification algorithm is either PerTurbo (right column) or k-NN (left column). The thick line represents the mean trajectory, and the shaded strip around the mean represents the volatility.
5 Conclusion
manifold where the perturbation is the smallest. This makes it possible to derive an interesting strategy for sample queries in an active learning scenario. The method is very simple, easy to implement, and involves very few extra parameters to tune. The experiments conducted on toy examples and real-world datasets showed performances comparable to or slightly better than classical state-of-the-art methods, including SVM. Future work will focus on a more systematic study of the parameters of the algorithm, as well as on comparisons with more powerful kernel-based methods.
References
1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Chichester
(2001)
2. Chavel, I.: Eigenvalues in Riemannian geometry. Academic Press, Orlando (1984)
3. Lafon, S., Lee, A.B.: Diffusion maps and coarse-graining: A unified framework for
dimensionality reduction, graph partitioning, and data set parameterization. IEEE
Transactions on Pattern Analysis and Machine Intelligence 28, 1393–1403 (2006)
4. Nadler, B., Lafon, S., Coifman, R., Kevrekidis, I.: Diffusion maps, spectral cluster-
ing and eigenfunctions of fokker-planck operators. In: NIPS (2005)
5. Reuter, M., Wolter, F.-E., Peinecke, N.: Laplace-Beltrami spectra as "shape-DNA" of surfaces and solids. Computer-Aided Design 38(4), 342–366 (2006)
6. Rustamov, R.: Laplace-beltrami eigenfunctions for deformation invariant shape
representation. In: Proc. of the Fifth Eurographics Symp. on Geometry Processing,
pp. 225–233 (2007)
7. Knossow, D., Sharma, A., Mateus, D., Horaud, R.: Inexact matching of large and
sparse graphs using laplacian eigenvectors. In: Torsello, A., Escolano, F., Brun, L.
(eds.) GbRPR 2009. LNCS, vol. 5534, pp. 144–153. Springer, Heidelberg (2009)
8. Öztireli, C., Alexa, M., Gross, M.: Spectral sampling of manifolds. ACM, New York
(2010)
9. Coifman, R.R., Lafon, S.: Diffusion maps. Applied and Computational Harmonic
Analysis 21(1), 5–30 (2006)
10. Ham, J., Lee, D., Mika, S., Schölkopf, B.: A kernel view of the dimensionality reduc-
tion of manifolds. In: Proc. of the International Conference on Machine learning,
ICML 2004, pp. 47–57 (2004)
11. Belkin, M., Sun, J., Wang, Y.: Constructing Laplace operator from point clouds in $\mathbb{R}^d$. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, pp. 1031–1040. Society for Industrial and Applied Mathematics, Philadelphia (2009)
12. Dey, T., Ranjan, P., Wang, Y.: Convergence, stability, and discrete approximation
of laplace spectra. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms,
SODA 2010, pp. 650–663 (2010)
13. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4), 395–416 (2007)
14. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
15. Neumaier, A.: Solving ill-conditioned and singular linear systems: A tutorial on
regularization. Siam Review 40(3), 636–666 (1998)
16. Lee, J.A., Verleysen, M.: Nonlinear dimensionality reduction. Springer, Heidelberg
(2007)
374 N. Courty, T. Burger, and J. Laurent
17. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computing 10(5), 1299–1319 (1998)
18. Aizerman, M.A., Braverman, E.M., Rozonoèr, L.: Theoretical foundations of the
potential function method in pattern recognition learning. Automation and remote
control 25(6), 821–837 (1964)
19. Meyer, C.: Matrix Analysis and Applied Linear Algebra. Society for Industrial and
Applied Mathematics, Philadelphia (2000)
20. Haasdonk, B., Pȩkalska, E.: Classification with Kernel Mahalanobis Distance Clas-
sifiers. In: Advances in Data Analysis, Data Handling and Business Intelligence,
Studies in Classification, Data Analysis, and Knowledge Organization, pp. 351–361
(2008)
21. Settles, B.: Active learning literature survey. Computer Sciences Technical Report
1648, University of Wisconsin–Madison (2009)
22. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
23. Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab–an S4 package for
kernel methods in R. Journal of Statistical Software 11(9) (2004)
24. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation 1(1), 67–82 (1997)
Datum-Wise Classification:
A Sequential Approach to Sparsity
1 Introduction
Feature Selection is one of the main contemporary problems in Machine Learning
and has been approached from many directions. One modern approach to feature
selection in linear models consists in minimizing an L0 regularized empirical risk.
This particular risk encourages the model to have a good balance between a
low classification error and high sparsity (where only a few features are used for
classification). As the $L_0$ regularized problem is combinatorial, many approaches such as the LASSO [1] address it by using more tractable norms such as $L_1$. These approaches have been developed with two main goals in mind: restricting the number of features to improve classification speed, and limiting the features used to the most useful ones to prevent overfitting.
These classical approaches to sparsity aim at finding a sparse representation of
the features space that is global to the entire dataset.
This work was partially supported by the French National Agency of Research (Lam-
pada ANR-09-EMER-007).
We propose a new approach to sparsity where the goal is to limit the number of features per datapoint, hence datum-wise sparse classification (DWSC). This means that our approach allows the choice of features used for classification to vary for each datapoint: datapoints that are easy to classify can be inferred upon without looking at very many features, while more difficult datapoints can be classified using more features. The underlying motivation is that, while classical approaches balance accuracy and sparsity at the dataset level, our approach optimizes this balance at the individual datum level, thus resulting in equivalent accuracy at higher overall sparsity. This kind of sparsity is interesting for several reasons: First, simpler explanations are always to be preferred, as per Occam's Razor. Second, in the knowledge extraction process, such datum-wise sparsity is able to provide unique information about the underlying structure of the data space. Typically, if a dataset is organized into two different subspaces, the datum-wise sparsity principle allows the model to automatically choose to classify using only the features of one or the other subspace.
DWSC considers feature selection and classification as a single sequential deci-
sion process. The classifier iteratively chooses which features to use for classifying
each particular datum. In this sequential decision process, datum-wise sparsity
is obtained by introducing a penalizing reward when the agent chooses to in-
corporate an additional feature into the decision process. The model is learned
using an algorithm inspired by Reinforcement Learning [2].
The contributions of the paper are threefold: (i) We propose a new approach where classification is seen as a sequential process in which one has to choose which features to use depending on the input being inferred upon. (ii) This new approach results in a model that obtains good classification performance while maximizing datum-wise sparsity, i.e. the mean number of features used for classifying the whole dataset. It also naturally handles multi-class classification problems, solving them by using as few features as possible for all classes combined. (iii) We perform a series of experiments on 14 different corpora and compare the model with those obtained by the LARS [3] and an $L_1$-regularized SVM, thus providing a qualitative study of the behaviour of our algorithm.
The paper is organized as follows: First, we define the notion of datum-wise sparse classifiers and explain the interest of such models in Section 2. We then describe our sequential approach to classification and detail the learning algorithm and its complexity in Section 3. We describe how this approach can be extended to multi-class classification in Section 4. We detail experiments on 14 datasets and give a qualitative analysis of the behaviour of this model in Section 6. Related work is discussed in Section 7.
the set of parameters learned from a training set composed of input/output pairs
Train = {(xi , yi )}i∈[1..N ] . These parameters are commonly found by minimizing
the empirical risk defined by:
$$\theta^* = \operatorname*{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^{N} \Delta(f_\theta(x_i), y_i), \qquad (1)$$
where $\Delta$ is the loss associated with a prediction error.
This empirical risk minimization problem does not consider any prior as-
sumption or constraint concerning the form of the solution and can result in
overfitting models. Moreover, when facing a very large number of features, ob-
tained solutions usually need to perform computations on all the features for
classifying any input, thus negatively impacting the model’s classification speed.
We propose a different risk minimization problem where we add a penalization term that encourages the obtained classifier to use, on average, as few features as possible. In comparison to classical $L_0$ or $L_1$ regularized approaches, where the goal is to constrain the number of features used at the dataset level, our approach enforces sparsity at the datum level, allowing the classifier to use
different features when classifying different inputs. This results in a datum-wise
sparse classifier that, when possible, only uses a few features for classifying
easy inputs, and more features for classifying difficult or ambiguous ones.
We consider a different type of classifier function that, in addition to predicting a label $y$ given an input $x$, also provides information about which features have been used for classification. Let us denote $\mathcal{Z} = \{0, 1\}^n$. We define a datum-wise classification function $f$ of parameters $\theta$ as:
$$f_\theta : \mathcal{X} \to \mathcal{Y} \times \mathcal{Z}, \qquad f_\theta(x) = (y, z),$$
where $y$ is the predicted output and $z$ is an $n$-dimensional vector $z = (z^1, \ldots, z^n)$, where $z^i = 1$ implies that feature $i$ has been taken into consideration for computing label $y$ on datum $x$. By convention, we denote the predicted label by $y_\theta(x)$ and the corresponding $z$-vector by $z_\theta(x)$. Thus, if $z_\theta^i(x) = 1$, feature $i$ has been used for classifying $x$ into category $y_\theta(x)$.
This definition of datum-wise classifiers has two main advantages: First, as we will see in the next section, because $f_\theta$ can explain its use of features through $z_\theta(x)$, we can add constraints on the features used for classification. This allows us to encourage datum-wise sparsity, which we define below. Second, while this is not the main focus of our article, analysis of $z_\theta(x)$ gives a qualitative explanation of how the classification decision has been made, which we study in Section 6. Note that the way we define datum-wise classification is an extension of the usual definition of a classifier.
$$\theta^* = \operatorname*{argmin}_{\theta} \frac{1}{N} \sum_{i=1}^{N} \Delta(y_\theta(x_i), y_i) + \lambda \frac{1}{N} \sum_{i=1}^{N} \|z_\theta(x_i)\|_0. \qquad (2)$$
The term $\|z_\theta(x_i)\|_0$ is the $L_0$ norm of $z_\theta(x_i)$, i.e. the number of features selected for classifying $x_i$, that is, the number of elements in $z_\theta(x_i)$ equal to 1. In the general case, the minimization of this new risk results in a classifier that on average selects only a few features for classifying, but may use a different set of features depending on the input being classified. We consider this to be the crux of the DWSC model: the classifier takes each datum into consideration differently during the inference process.
Fig. 1. The sequential process for a problem with 4 features (f1 , ..., f4 ) and 3 possible
categories (y1 , ..., y3 ). Left: The gray circle is the initial state for one particular input
x. Small circles correspond to terminal states where a classification decision has been
made. In this example, the classification (bold arrows) has been made by sequentially
choosing to acquire feature 3 then feature 2 and then to classify x in category y1 . The
bold (red) arrows correspond to the trajectory made by the current policy. Right:
The value of zθ (x) for the different states are illustrated. The value on the arrows
corresponds to the immediate reward received by the agent assuming that x belongs to
category y1 . At the end of the process, the agent has received a total reward of 0 − 2λ.
Note that the optimization of the loss defined in equation (2) is a combinatorial problem that cannot be easily solved. In the next section, we propose an original way to deal with this problem, based on a Markov Decision Process.
Policy. We define a parameterized policy $\pi_\theta$ which, for each state $(x, z)$, returns the best action as defined by a scoring function $s_\theta(x, z, a)$:
$$\pi_\theta : \mathcal{X} \times \mathcal{Z} \to \mathcal{A} \quad \text{and} \quad \pi_\theta(x, z) = \operatorname*{argmax}_{a} s_\theta(x, z, a).$$
The policy $\pi_\theta$ decides which action to take by applying the scoring function to every action possible from state $(x, z)$ and greedily taking the highest scoring action. The scoring function reflects the overall quality of taking action $a$ in state $(x, z)$, which corresponds to the total reward obtained by taking action $a$ in $(x, z)$ and thereafter following policy $\pi_\theta$⁴:
$$s_\theta(x, z, a) = r(x, z, a) + \sum_{t=1}^{T} \big( r_\theta^t \mid (x, z), a \big).$$
Here $(r_\theta^t \mid (x, z), a)$ corresponds to the reward obtained at step $t$ after having started in state $(x, z)$ and followed the policy with parameterization $\theta$ for $t$ steps. Taking the sum of these rewards gives us the total reward from state $(x, z)$ until the end of the episode. Since the policy is deterministic, we may refer to a parameterized policy simply by $\theta$. Note that the optimal parameterization $\theta^*$ obtained after learning (see Sec. 3.3) is the one that maximizes the expected reward in all state-action pairs of the process.
⁴ This corresponds to the classical Q-function in Reinforcement Learning.
Reward. The reward function reflects the immediate quality of taking action $a$ in state $(x, z)$ relative to the problem at hand. We define a reward function over the training set $(x_i, y_i) \in \mathcal{T}$, $R : \mathcal{X} \times \mathcal{Z} \times \mathcal{A} \to \mathbb{R}$, which reflects how good a decision it is, relative to our classification task, to take action $a$ in state $(x_i, z)$ for input $x_i$. This reward is defined as follows⁵:
– If $a$ corresponds to a feature selection action, then the reward is $-\lambda$.
– If $a$ corresponds to a classification action, i.e. $a = y$, we have: $r(x_i, z, y) = 0$ if $y = y_i$, and $r(x_i, z, y) = -1$ if $y \neq y_i$,
where $\pi_\theta(x_i, z_\theta^{(t)})$ is the action taken at time $t$ by the policy $\pi_\theta$ for the training example $x_i$.
⁵ Note that we can add $-\lambda \cdot \|z\|_0$ to the reward at the end of the episode, and give a constant intermediate reward of 0. These two approaches are interchangeable.
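A minimal sketch of this reward (plain Python; encoding feature actions as integers and classification actions as labels is our own convention, not the paper's), together with the total reward of the trajectory of Fig. 1.

```python
# A minimal sketch (plain Python; encoding feature actions as integers and
# classification actions as labels is our own convention) of the reward above.
def reward(y_true, action, lam=0.1):
    if isinstance(action, int):            # feature-selection action f_j
        return -lam
    return 0.0 if action == y_true else -1.0   # classification action a = y

# Trajectory of Fig. 1: acquire features 3 and 2, then classify correctly,
# for a total reward of 0 - 2*lambda.
total = reward("y1", 3) + reward("y1", 2) + reward("y1", "y1")
assert abs(total - (-2 * 0.1)) < 1e-12
```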
Due to the infinite number of possible inputs x, the number of states is also
infinite. Moreover, the reward function r(x, z, a) is only known for the values
of x that are in the training set and cannot be computed for any other input.
For these two reasons, it is not possible to compute the score function for all
state-action pairs in a tabular manner, and this function has to be approximated.
The scoring function that underlies the policy, $s_\theta(x, z, a)$, is approximated with a linear model⁶:
$$s_\theta(x, z, a) = \langle \Phi(x, z, a), \theta \rangle,$$
and the policy defined by such a function consists in taking, in state $(x, z)$, the action $a$ that maximizes the scoring function, i.e. $a = \operatorname*{argmax}_{a \in \mathcal{A}} \langle \Phi(x, z, a), \theta \rangle$.
Due to their infiniteness, the state-action pairs are represented in a feature space. We denote by $\Phi(x, z, a)$ the featurized representation of the $((x, z), a)$ state-action pair. Many definitions may be used for this feature representation, but we propose a simple projection: we restrict the representation of $x$ to only the selected features. Let $\mu(x, z)$ be the restriction of $x$ according to $z$:
$$\mu(x, z)^i = \begin{cases} x^i & \text{if } z^i = 1 \\ 0 & \text{otherwise.} \end{cases}$$
In $\Phi(x, z, a)$, the block $\phi(x, z)$ is at position $i_a \cdot |\phi(x, z)|$, where $i_a$ is the index of action $a$ in the set of all possible actions. Thus, $\phi(x, z)$ is offset by an amount dependent on the action $a$.
⁶ Although non-linear models such as neural networks may be used, we have chosen to restrict ourselves to a linear model in order to properly compare performance with that of other state-of-the-art linear sparse models.
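A minimal sketch (NumPy) of this state-action representation follows; the concrete choice $\phi(x, z) = (z, \mu(x, z))$ is one simple possibility, not necessarily the exact one used by the authors.

```python
# A minimal sketch (NumPy) of the state-action representation; the concrete
# choice phi(x, z) = (z, mu(x, z)) is one simple possibility, not necessarily
# the exact one used by the authors.
import numpy as np

def mu(x, z):
    return np.where(z == 1, x, 0.0)        # keep x^i only where z^i = 1

def phi_state(x, z):
    return np.concatenate([z, mu(x, z)])

def Phi(x, z, a, n_actions):
    block = phi_state(x, z)
    out = np.zeros(n_actions * block.size)
    out[a * block.size:(a + 1) * block.size] = block   # offset by action index
    return out

x, z = np.array([0.5, -1.2, 3.0]), np.array([1, 0, 1])
print(Phi(x, z, a=2, n_actions=5))
```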
3.3 Learning
The goal of the learning phase is to find an optimal policy parameterization $\theta^*$ which maximizes the expected reward, thus minimizing the datum-wise regularized loss defined in (2). As explained in Section 3.2, we cannot exhaustively explore the state space during training, and we therefore use a Monte-Carlo approach to sample example states from the learning space. We use the Approximate Policy Iteration (API) algorithm with rollouts [6]. Sampling state-action pairs according to a previous policy $\pi_{\theta^{(t-1)}}$, API consists in iteratively learning a better policy $\pi_{\theta^{(t)}}$ by way of the Bellman equation. The API with rollouts algorithm is composed of three main steps that are iteratively repeated:
1. The algorithm begins by sampling a set of random states: the $x$ vector is sampled from a uniform distribution over the training set, and $z$ is sampled from a uniform binomial distribution.
2. For each state in the sampled set, the policy $\pi_{\theta^{(t-1)}}$ is used to compute the expected reward of choosing each possible action from that state. We now have a feature vector $\Phi(x, z, a)$ for each state-action pair in the sampled set, and the corresponding expected reward, denoted $R_{\theta^{(t-1)}}(x, z, a)$.
3. The parameters $\theta^{(t)}$ of the new policy are then computed by classical linear regression on the set of states — $\Phi(x, z, a)$ — and the corresponding expected rewards — $R_{\theta^{(t-1)}}(x, z, a)$ — obtained previously. The generalizing capacity of the regressor gives an estimated score to state-action pairs even if they have never been visited.
After a certain number of iterations, the parameterized policy converges to a
final policy π which is used for inference.
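A minimal sketch of one such iteration (NumPy; sample_states, actions, featurize and rollout are hypothetical helpers standing for the components described above):

```python
# A minimal sketch (NumPy; sample_states, actions, featurize and rollout are
# hypothetical helpers standing for the components described above) of one
# iteration of Approximate Policy Iteration with rollouts.
import numpy as np

def api_iteration(theta, sample_states, actions, featurize, rollout, n_states=2000):
    X, y = [], []
    for (x, z) in sample_states(n_states):       # step 1: Monte-Carlo sampling
        for a in actions(x, z):
            X.append(featurize(x, z, a))         # Phi(x, z, a)
            y.append(rollout(theta, x, z, a))    # step 2: estimated total reward
    # step 3: linear regression yields the parameters of the improved policy
    return np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
```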
where $s_\theta(x, z, a) = s_\theta(z, a)$ implies that the score is computed using only the values of $z$ and $a$ — $x$ is ignored. This corresponds to having two different types of state-action feature vectors $\Phi$ depending on the type of action:
$$\Phi(x, z, a) = \begin{cases} (0, \ldots, 0, z, 0, \ldots, 0) & \text{if } a \in \mathcal{A}_f \\ (0, \ldots, 0, z, \phi(x, z), 0, \ldots, 0) & \text{if } a \in \mathcal{A}_y. \end{cases} \qquad (4)$$
Fig. 2. Difference between the base Unconstrained Model (DWSM-Un) and the Constrained Model (DWSM-Con) described in Section 4. The figure shows, for 4 different inputs $x_1, \ldots, x_4$, the features selected by the classifiers before classification. One can see that the Constrained Model chooses the features in the same order for all the inputs.
Although this constraint forces DWSC to choose the features in the same order, it still automatically learns the best order in which to choose the features, and when to stop adding features and classify. However, it avoids choosing very different feature sets for classifying different inputs (the first features chosen are common to all the inputs being classified) and thus avoids the overfitting problem.
5 Complexity Analysis
Learning complexity: As explained in Section 3.3, the learning method is based on Reinforcement Learning with rollouts. Such an approach is expensive in terms of computation because it needs — at each iteration of the algorithm — to simulate trajectories in the decision process, and then to learn the scoring function $s_\theta$ based on these trajectories. Without giving the details of the computation, the complexity of each iteration is $O(N_s \cdot (n^2 + c))$, where $N_s$ is the number of states used for rollouts (which in practice is proportional to the number of training examples), $n$ is the number of features and $c$ is the number of possible categories. This implies a learning method that is quadratic w.r.t. the number of features; the proposed approach is thus not able to deal with problems with thousands of possible features. Breaking this complexity is an active research direction with some promising leads.
6 Experiments
Experiments were run on 14 different datasets obtained from the LibSVM Web-
site7 . Ten of these datasets correspond to a binary classification task, four to
a multi-class problem. The datasets are described in Table 1. For each dataset,
we randomly sampled different training sets by taking from 5% to 75% of the
examples as training examples, with the remaining examples being kept for test-
ing. We performed experiments with three different models: L1-SVM was used
as a baseline linear model with L1 regularization8. LARS was used to obtain
the optimal solution of the LASSO problem for all values of the regularization
coefficient λ at once9 . Datum-Wise Sequential Model (DWSM) was tested
with the two versions presented above: (i) DWSM-Un is the original uncon-
strained model and (ii) DWSM-Con is the constrained model for preventing
overfitting.
For the evaluation, we used a classical accuracy measure which corresponds
to 1 − error rate on the test set of each dataset. We perform 3 training/testing
set splits of a given dataset to obtain averaged figures. The sparsity has been
measured as the proportion of features not used for L1 -SVM and LARS in binary
classification, and the mean proportion of features not used to classify testing
examples in DWSM. For multi-class problems where one LARS/SVM model
is learned for each category, the sparsity is the proportion of features that have
not been used in any of the models.
For the sequential experiments, the number of rollout states (step 1 of the
learning algorithm) has been set to 2,000 and the number of policy iterations
has been fixed to 10. Note that experiments with more rollout states and/or more
⁷ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
⁸ Using LIBLINEAR [7].
⁹ We use the implementation from the authors of the LARS, available in R.
Fig. 3. Accuracy w.r.t. sparsity. In both plots, the left side of the x-axis corresponds to low sparsity, while the right side corresponds to high sparsity. The performances of the models usually decrease as sparsity increases, except in the case of overfitting.
iterations give similar results. Experiments were run using an alpha-mixture policy with $\alpha = 0.9$ to ensure the stability of the learning process. We tested the different models with different values of $\lambda$, which controls the sparsity. Note that even with $\lambda = 0$, contrary to the baseline models, the DWSM model does not use all of the features for classification.
6.1 Results
For each corpus and each training size, we have computed sparsity/accuracy
curves showing the performance of the different models w.r.t. to the sparsity of
the solution. Only two representative curves are given in Figure 3. To summarize
the performances over all the datasets, we give the accuracy of the different
models for three levels of sparsity in tables 2 and 3. Due to a lack of space,
these tables do not present the LARS’ performance, which are equivalent to
the performances of the L1 -SVM. Note that in order to obtain the accuracy
for a given level of sparsity, we have computed a linear interpolation on the
different curves obtained for each corpus and each training size. This linear
interpolation allows us to compare the baseline sparsity methods — that choose
Table 2. This table contains the accuracy of each model on the binary classification
problems depending on three levels of sparsity (80%, 60%, and 40%) using different
training sizes. The accuracy has been linearly interpolated from curves like the ones
given in Figure 3.
Table 3. This table contains the accuracy of each model on the multi-class classification
problems depending on three levels of sparsity (80%, 60%, and 40%) using different
training sizes.
[Figure 4: histograms over the proportion of inputs — left: fraction of inputs using each feature; right: fraction of inputs classified with a given number of features.]
When using small training sets with some datasets — sonar or ionosphere — where overfitting is observed (accuracy decreases as more features are used), DWSM-Con seems to be a better choice than the unconstrained version, and is thus the version of the algorithm best suited to a small number of learning examples.
Concerning the multi-class problems, similar effects can be observed (see Table 3). The model seems particularly interesting when the number of categories is high, as in segment and vowel. This is due to the fact that the average sparsity is optimized by the sequential model over the whole multi-class problem, while $L_1$-SVM and LARS, which need to learn one model per category, perform separate sparsity optimizations for each class.
Figure 4 gives some qualitative results. First, from the left histogram, one can see that some features are used in 100% of the decisions. This illustrates the ability of the model to detect important features that must be used for the decision. Note that many of these features are also used by the $L_1$-SVM and the LARS models. The sparsity gain in comparison to the baseline models is obtained through features 1 and 9, which are only used in about 20% of decisions. From the right histogram, one can see that the DWSM model mainly classifies using 1, 2, 3 or 10 features, showing that the model is able to adapt its behaviour to the difficulty of classifying a particular input. This is confirmed by the green and violet histograms, which show that for incorrect decisions (i.e. very difficult inputs) the classifier almost always acquires all the features before classifying. These difficult inputs seem to have been identified, but the available set of features is not sufficient to classify them correctly. This behaviour opens appealing research directions concerning the acquisition and creation of new features (see Section 8).
7 Related Work
Feature selection comes in three main flavors [8]: wrapper, filter, or embedded
approaches. Wrapper approaches involve searching the feature space for an
optimal subset of features that maximize classifier performance. The feature
selection step wraps around the classifier, using the classifier as a black-box
evaluator of the selected feature subset. Searching the entire feature space very quickly becomes intractable, and therefore various approaches have been proposed to restrict the search (see [9,10]). The advantage of wrapper approaches is that the feature subset decision can take feature interdependencies into consideration and avoid redundant features; however, the problem of the exponential size of the search space remains. Filter approaches rank the features by some scoring function independent of their effect on the associated classifier. Since the choice of features is not influenced by classifier performance, filter approaches rely purely on the adequacy of their scoring functions. Filtering methods are liable to retain redundant features and to miss feature interdependencies (since each feature is scored individually). Filter approaches are
however easier to compute and more statistically stable relative to changes in
the dataset. Embedded approaches include feature selection as part of the
learning machine. These include algorithms solving the LASSO problem [1] and other linear models involving a regularizer based on a sparsity-inducing norm ($\ell_p$-norms with $p \in [0;1]$ [11], group LASSO, ...). Kernel machines provide a mixture of feature selection and construction as part of the classification problem. Decision trees are also considered embedded approaches, although they are also similar to filter approaches in their use of heuristic scores for tree construction. The main critique of embedded approaches is two-fold: they are susceptible to including redundant features, and not all the techniques described are easily applied to multi-class problems. In brief, both filtering and embedded approaches have their
drawbacks in terms of their ability to select the best subset of features, whereas
wrapper methods have their main drawback in the intractability of searching the
entire feature space. Furthermore, all existing methods perform feature selection
based on the whole training set, the same set of features being used to represent
any data.
Our sequential decision problem defines both feature selection and classifi-
cation tasks. In this sense, our approach resembles an embedded approach. In
practice, however, the final classifier for each single datapoint remains a sepa-
rate entity, a sort of black-box classifying machine upon which performance is
evaluated. Additionally, the learning algorithm is free to navigate over the en-
tire combinatorial feature space. In this sense our approach resembles a wrapper
method.
There has been some work using similar formalisms [12], but with different
goals and lacking in experimental results. Sequential decision approaches have
been used for cost-sensitive classification with similar models [13]. There have
also been applications of Reinforcement Learning to optimize anytime classifica-
tion [14]. We have previously looked at using Reinforcement Learning for finding
a stopping point in feature quantity during text classification [15].
Finally, in some sense, DWSC has some similarity with decision trees as each
new datapoint that is labeled is following a different path in the feature space.
However, the underlying mechanism is quite different both in term of inference
procedure and learning criterion. There has been some work in using RL for
generating decision trees [16], but that approach is still tied to decision tree
construction heuristics and the end product remains a decision tree.
8 Conclusion
In this article we introduced the concept of datum-wise classification, where we
learn both a classifier and a sparse representation of the data that is adaptive
to each new datum being classified. We took an approach to sparsity that con-
siders the combinatorial space of features, and proposed a sequential algorithm
inspired by Reinforcement Learning to solve this problem. We showed that find-
ing an optimal policy for our Reinforcement Learning problem is equivalent to
minimizing the L0 regularized loss of our classification problem. Additionally we
showed that our model works naturally on multi-class problems, and is easily
extended to avoid overfitting on datasets where the number of features is larger
than the number of examples. Experimental results on 14 datasets showed that
our approach is indeed able to increase sparsity while maintaining equivalent
classification accuracy.
References
1. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (January 1994)
2. Sutton, R., Barto, A.: Reinforcement Learning. MIT Press, Cambridge (1998)
3. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least-angle regression. Annals
of statistics 32(2), 407–499 (2004)
4. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. Wiley, Chichester (1994)
5. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification: A new approach to
multiclass classification. Algorithmic Learning Theory, 1–11 (2002)
6. Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: Leveraging
modern classifiers. In: ICML 2003 (2003)
7. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large
linear classification. JMLR 9, 1871–1874 (2008)
8. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal
of Machine Learning Research 3(7-8), 1157–1182 (2003)
9. Girgin, S., Preux, P.: Feature discovery in reinforcement learning using genetic
programming. In: O’Neill, M., Vanneschi, L., Gustafson, S., Esparcia Alcázar, A.I.,
De Falco, I., Della Cioppa, A., Tarantino, E. (eds.) EuroGP 2008. LNCS, vol. 4971,
pp. 218–229. Springer, Heidelberg (2008)
10. Gaudel, R., Sebag, M.: Feature Selection as a One-Player Game. In: ICML (2010)
11. Xu, Z., Zhang, H., Wang, Y., Chang, X., Liang, Y.: L1/2 regularization. Science
China Information Sciences 53(6), 1159–1169 (2010)
12. Ertin, E.: Reinforcement learning and design of nonparametric sequential decision
networks. In: Proceedings of SPIE, pp. 40–47 (2002)
13. Ji, S., Carin, L.: Cost-sensitive feature acquisition and classification. Pattern Recog-
nition 40(5), 1474–1485 (2007)
14. Póczos, B., Abbasi-Yadkori, Y., Szepesvári, C., Greiner, R., Sturtevant, N.: Learn-
ing when to stop thinking and do something! In: ICML 2009, pp. 1–8 (2009)
15. Dulac-Arnold, G., Denoyer, L., Gallinari, P.: Text Classification: A Sequential
Reading Approach. In: ECIR, pp. 411–423 (2011)
16. Preda, M.: Adaptive building of decision trees by reinforcement learning. In: Pro-
ceedings of the 7th WSEAS, pp. 34–39 (2007)
Manifold Coarse Graining for Online
Semi-supervised Learning
1 Introduction
Semi-supervised learning is a topic of recent research that effectively addresses
the problem of limited labeled data [1]. In order to use unlabeled data in the learning
process efficiently, certain assumptions on the relation between the possible la-
beling functions and the underlying geometry should hold [2]. In many real-world
classification problems, data points lie on a low-dimensional manifold. The manifold
assumption states that the labeling function varies smoothly with respect to the
underlying manifold [3]. The manifold structure is modeled by the neighborhood
graph of the data points. SSL methods based on the manifold assumption prove
effective in many applications, including image segmentation [4], handwritten digit
recognition and text classification [5].
Online classification of data is required in common applications such as object
tracking [6], face recognition in surveillance systems [11], and image retrieval [7].
In Section 4 we introduce coarse graining in exact and approximate modes and explain how it helps us to
preserve LP and manifold structure while reducing the number of data points. In
Section 5 experimental results are provided, after which the paper is concluded
in Section 6.
P (i, j) can be interpreted as the effect of f (j) on f (i). The algorithm is stated
as follows:
1. Propagation: $f^{(t+1)} \leftarrow P f^{(t)}$
2. Clamping: $f_l = y$
where $f^{(t)}$ is the estimated label at step $t$. If we decompose $W$ and $P$ according
to labeled and unlabeled parts,
$$W = \begin{pmatrix} W_{uu} & W_{ul} \\ W_{lu} & W_{ll} \end{pmatrix}, \qquad P = \begin{pmatrix} P_{uu} & P_{ul} \\ P_{lu} & P_{ll} \end{pmatrix}, \qquad (3)$$
then under appropriate conditions [15], the solution of LP converges to the HS,
is independent of the initial value (i.e., $f^{(0)}$), and may be written as
$$f_u = (I - P_{uu})^{-1} P_{ul}\, y, \qquad f_l = y. \qquad (4)$$
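As an illustration, the following is a minimal NumPy sketch of both the iterative scheme and the closed form (4). It assumes the labeled nodes occupy the last block of indices, matching the block decomposition in (3); names and conventions are ours, not the authors' implementation.

```python
import numpy as np

def label_propagation(W, y_l, n_labeled, tol=1e-8, max_iter=1000):
    """Iterative LP: alternate propagation f <- P f and clamping f_l = y."""
    n = W.shape[0]
    u = n - n_labeled                          # number of unlabeled nodes
    P = W / W.sum(axis=1, keepdims=True)       # row-stochastic matrix
    f = np.zeros((n, y_l.shape[1]))
    f[u:] = y_l                                # clamp the labeled block
    for _ in range(max_iter):
        f_new = P @ f                          # 1. propagation
        f_new[u:] = y_l                        # 2. clamping
        if np.abs(f_new - f).max() < tol:
            break
        f = f_new
    return f

def label_propagation_closed_form(W, y_l, n_labeled):
    """Closed form of Eq. (4): f_u = (I - P_uu)^{-1} P_ul y."""
    n = W.shape[0]
    u = n - n_labeled
    P = W / W.sum(axis=1, keepdims=True)
    P_uu, P_ul = P[:u, :u], P[:u, u:]
    return np.linalg.solve(np.eye(u) - P_uu, P_ul @ y_l)
```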
With the absorbing stochastic matrix the entire process of LP may be rewritten
as
$$f^{(t+1)} = Q f^{(t)}, \qquad (6)$$
where the initial value is $f^{(0)} = (f_u^{(0)}; y)$, and $f_u^{(0)}$ may be arbitrary. In this new
formulation estimated labels are computed as $\lim_{n\to\infty} Q^n f^{(0)}$. Defining $Q^\infty$ as
$$Q^\infty \equiv \lim_{n\to\infty} Q^n,$$
we can write $(f_u; f_l) = Q^\infty f^{(0)}$. Since the result is independent of the initial states
of unlabeled data, $f_u(j)$ can be rewritten as
$$f_u(j) = \sum_{k=u+1}^{l+u} Q^\infty(j,k)\, y(k). \qquad (7)$$
the magnitude of all eigenvalues of $P_{uu}$ is less than one, due to the fact
that $P_{uu}^n \to 0$ as $n \to \infty$ [15]. Therefore, $\lambda = 1$ has multiplicity $l$, and
the magnitude of all other eigenvalues of $Q$ is less than one and real. It is
straightforward to show that the eigenvalues of a stochastic matrix and of the new
variation are all real.
For the last part, it can be verified that
$$\begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix} \begin{pmatrix} P_{uu} & P_{ul} \\ 0 & I \end{pmatrix} = \begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix}.$$
Therefore, the rows of $\begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix}$ are the left eigenvectors of $Q$ associated with
$\lambda = 1$.
Definition 1. From now on, we refer to eigenvectors corresponding to eigenvalues
equal to one as unitary eigenvectors; these are different from unit eigenvectors,
which have unit norm.
$$A = V_R D V_L^T \quad \text{and} \quad V_L^T V_R = I,$$
where $D$ is the diagonal matrix of eigenvalues, and the columns of $V_R$ and $V_L$ are the
right and left eigenvectors of $A$, respectively.
Corollary 1. Unfolding the above decomposition yields another expression for the
spectral decomposition,
$$A = \sum_{i=1}^{n} \lambda_i\, p_i u_i^T,$$
where $\lambda_i$, $p_i$ and $u_i$ are the $i$-th eigenvalue, right eigenvector and left eigenvector,
respectively.
Now we are ready to prove the main result of this part.
In the limit, the eigenvectors with eigenvalue less than one disappear and only the
unitary eigenvalues and eigenvectors remain:
$$Q^\infty = \sum_{i=u+1}^{l+u} p_i u_i^T.$$
Substituting this into (7) yields
$$f_u(j) = \sum_{k=u+1}^{l+u} p_k(j)\, y(k). \qquad (8)$$
Fig. 1. Merging two vertices 1 and 2 would not disturb label propagation
It is straightforward to see that $q'_{03} = q_{13} = q_{23}$ and $q'_{04} = q_{14} = q_{24}$, so the
columns in the first two rows of the above equations are equivalent. Also, since
after merging we have $q_{31} + q_{32} = q'_{30}$ and $q_{41} + q_{42} = q'_{40}$, the columns of the
last two rows impose the same effect on nodes 3 and 4. Thus, if nodes 1 and 2
are unlabeled, $f^{(t)}(1) = f^{(t)}(2) = f'^{(t)}(0)$, $f^{(t)}(3) = f'^{(t)}(3)$, and $f^{(t)}(4) =
f'^{(t)}(4)$ in all steps of LP in the original and reduced graph.
This process can be modeled by the transformation $Q' = LQR$, where
$$L = \begin{pmatrix} \frac{d_1}{d_1+d_2} & \frac{d_2}{d_1+d_2} & 0 & \cdots & 0 \\ 0 & 0 & & & \\ \vdots & \vdots & & I_{n-2} & \\ 0 & 0 & & & \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & I_{n-2} & \\ 0 & & & \end{pmatrix} \qquad (9)$$
and $d_i = \sum_j W(i,j)$. One can see that the transformation simply merges the rows
and columns of $Q$ corresponding to nodes 1 and 2 such that all rows remain
normalized. For an undirected graph, the stochastic matrix has the property that
its first left eigenvector is proportional to $(d_1, \ldots, d_n)$. It is easy to see that
this also holds for the unlabeled part of the absorbing stochastic matrix, $Q_{uu}$,
which can be viewed as the scaled stochastic matrix of an undirected graph.
Since the first $u$ elements of the eigenvectors of $Q$ are equal to the eigenvectors
of $Q_{uu}$, this is true for unlabeled nodes. For unlabeled nodes $d_i = u_1(i)$; and since
only unlabeled nodes are coarsened, $u_1(i)$ may alternatively be used in (9).
This transformation has interesting properties and is well studied in [17], which
presents a similar algorithm based on random walks in complex networks. In the
general case, $R$ and $L$ can be defined such that the transformation merges all the
nodes with the same neighbors; more than two nodes may have
similar neighbors and can thus be merged.
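As a concrete illustration of the transformation in (9), the sketch below builds $L$ and $R$ for merging the first two nodes and returns $Q' = LQR$; the indexing convention (merged nodes first) and all names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def coarse_grain_pair(Q, d):
    """Merge nodes 0 and 1 of a row-stochastic matrix Q via Q' = L Q R (Eq. 9).

    Q : (n, n) row-stochastic matrix of the graph.
    d : (n,) degrees d_i = sum_j W(i, j) (or u_1(i) for unlabeled nodes).
    """
    n = Q.shape[0]
    L = np.zeros((n - 1, n))
    L[0, 0] = d[0] / (d[0] + d[1])   # first row: degree-weighted average
    L[0, 1] = d[1] / (d[0] + d[1])   # of the two merged rows
    L[1:, 2:] = np.eye(n - 2)        # remaining rows untouched
    R = np.zeros((n, n - 1))
    R[0, 0] = R[1, 0] = 1.0          # merged columns are summed,
    R[2:, 1:] = np.eye(n - 2)        # so rows of Q' stay normalized
    return L @ Q @ R
```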
We will proceed using spectral analysis which will help us in later sections
where we introduce non-exact merging of nodes. The next lemma relates spectral
analysis to CG.
Lemma 3. Rows $i$ and $j$ of $Q$ are equal if and only if $p_k(i) = p_k(j)$ for all $k$
with $\lambda_k \neq 0$, where $p_k$ is a right eigenvector of $Q$.
Proof. Immediate from the definition of eigenvectors and Corollary 1 of the
spectral decomposition.
Lemma 3 states that nodes to be merged can be selected via eigenvectors of
the absorbing stochastic matrix Q, instead of comparing the rows of Q itself.
We decide to merge nodes if their corresponding elements are equal along all
the eigenvectors with nonzero eigenvalue. We will see how this spectral view of
merging helps us develop and analyze the non-exact case where we merge nodes
even if they aren’t exactly identical along eigenvectors.
Before proceeding, we fix some notation. A prime will be used to indicate objects
after CG; i.e., $Q'$, $p'$, $u'$, $n'$ are the stochastic matrix, right and left eigenvectors,
and number of nodes after CG. Let $S_1, \ldots, S_{n'}$ be the $n'$ clusters of nodes found
after CG. Also let $S$ be the set of all nodes that are merged into some cluster.
We wish to use ideas from section 3 to provide a spectral view of coarsening
stated so far in this section. We need the following lemma.
Lemma 4 ([17]). If the conditions of Lemma 3 hold, $Lp$ is the right unitary eigenvector
of $Q'$ with the same eigenvalue as $p$, where $p$ is the right unitary eigenvector
of $Q$.
First note that $Lp$ simply retains the elements of $p$ that are not merged and removes
the repetition of identical elements for nodes that are merged. So after CG, the
right eigenvectors of $Q$ and the associated eigenvalues are preserved. Recall from the
previous section that the right eigenvectors are directly related to the result of
LP. We are now ready to prove the following theorem.
Theorem 2. The LP solution is preserved for nodes or clusters of nodes in exact CG,
i.e., when we merge nodes whose corresponding elements are the same along all
right eigenvectors with nonzero eigenvalues.
Proof. Consider equation (8) from the previous section for computing labels based
upon right eigenvectors,
$$f_u(j) = \sum_{k=u+1}^{u+l} p_k(j)\, y(k). \qquad (10)$$
Considering $(Lp_k)(j') = p_k(j)$, we get the result, $f_u(j) = f'_u(j')$. This means that
the labels of unlabeled nodes are preserved under CG.
This kind of data reduction preserves the LP results on the data manifold and, as a
consequence, the manifold structure in the reduced graph; this is elaborated upon
in the next subsections. Equality along all eigenvectors with nonzero eigenvalues
is a restrictive constraint for CG. In the next section we will see how this criterion
may be relaxed.
Since
$$(RLp_i)(1) = \frac{u_1(1)}{\sum_{j=1}^{m} u_1(j)}\, p_i(1) + \cdots + \frac{u_1(m)}{\sum_{j=1}^{m} u_1(j)}\, p_i(m), \qquad (12)$$
we may write
$$|\varepsilon_i(j)| = \big|p_i(j) - (RLp_i)(j)\big| \le \eta_i \Big(1 - \frac{u_1(j)}{\sum_{r\in S_1} u_1(r)}\Big). \qquad (13)$$
The last inequality is due to the fact that, in each cluster, the differences between
elements along the $i$-th eigenvector are no more than $\eta_i$. Inequality (13) bounds the
difference between the elements of the eigenvectors corresponding to a node before CG
and the desired value after CG. Note that $\varepsilon_i$ is zero if CG is exact or for a node
that is not merged.
Suppose $p'$ is the true right eigenvector of $Q'$. However, we would like to have
$Lp$ as its right eigenvector so as to better preserve the manifold structure. Here
$$D = \sum_{i=1}^{k} \lambda_i^2 \sum_{j=1}^{n} u_1(j)\, \varepsilon_i(j)^2 \qquad (16)$$
and $\varepsilon_i = p_i - RLp_i$, where the $p_i$ and $\lambda_i$ are the right eigenvectors and associated
eigenvalues along which CG is performed. (As this bound hints, CG need not be
performed along all eigenvectors; we will explain this point shortly.) It is worth
noting that the bound (15) is a general bound for any coarsening algorithm.
It was originally stated for the stochastic matrix of undirected graphs such as $P$;
however, as stated in Section 4.1, the unlabeled part of $Q$ can be considered as such a
matrix.
Considering (14) and (15), if $\lambda \gg \sqrt{D}$, then $Lp$ is a good approximation of $p'$.
Given the eigenvectors that must be preserved, we can determine how to choose
the $\eta_i$ for a good approximation: the inequality $\lambda_l \ge \sqrt{D}\,\omega(1)$ should be satisfied.
For example, we may seek sufficient conditions to satisfy
$$\lambda_l \ge \sqrt{D}\, n \qquad (17)$$
for every eigenvector $p_l$ that we wish to preserve. Using equation (16), we want
to find $\eta_i$ for all $i$ such that (17) holds.
For simplicity, consider the cluster $S_1 = \{1, \ldots, m\}$. Using inequality (13),
$$\begin{aligned}
\sum_{j\in S_1} u_1(j)\,\varepsilon_i(j)^2 &\le \sum_{j\in S_1} u_1(j)\,\eta_i^2 \Big(1 - \frac{u_1(j)}{\sum_{r\in S_1} u_1(r)}\Big)^2 \\
&= \eta_i^2 \Big( \sum_{j\in S_1} u_1(j) - 2\sum_{j\in S_1} \frac{u_1(j)^2}{\sum_{r\in S_1} u_1(r)} + \sum_{j\in S_1} \frac{u_1(j)^3}{\big(\sum_{r\in S_1} u_1(r)\big)^2} \Big) \\
&= \eta_i^2 \Big( \sum_{j\in S_1} u_1(j) - 2\sum_{j\in S_1} \frac{u_1(j)^2}{\sum_{r\in S_1} u_1(r)} + \frac{\big(\sum_{r\in S_1} u_1(r)\big)^3}{\big(\sum_{r\in S_1} u_1(r)\big)^2} - C \Big) \\
&\le 2\,\eta_i^2 \sum_{j\in S_1} u_1(j), \qquad (18)
\end{aligned}$$
where $C = \big[\big(\sum_{r\in S_1} u_1(r)\big)^3 - \sum_{j\in S_1} u_1(j)^3\big]\big/\big(\sum_{r\in S_1} u_1(r)\big)^2 \ge 0$.
Now we are ready to find an appropriate value for ηi to satisfy (17):
$$D = \sum_{i=1}^{k} \lambda_i^2 \sum_{j=1}^{n} u_1(j)\,\varepsilon_i(j)^2 \le 2 \sum_{i=1}^{k} \lambda_i^2\, \eta_i^2 \sum_{j\in U} u_1(j). \qquad (20)$$
Let $M = \sum_{j\in U} u_1(j)$. For $\lambda_l \ge \sqrt{D}\, n$ to be satisfied for every $l$, requiring that
$$2 \sum_{i=1}^{k} \lambda_i^2\, \eta_i^2\, M \le \frac{\lambda_l^2}{n^2} \qquad (21)$$
be true for every $l$ is a sufficient condition; this ensures that $Lp_l$ is almost surely
preserved, i.e., if $\lambda_{\min}$ is the minimum eigenvalue among the eigenvectors that
must be preserved, then
$$\eta_i^2 \le \frac{1}{2kM}\, \frac{\lambda_{\min}^2}{\lambda_i^2\, n^2}. \qquad (23)$$
The bound derived in (23) shows how $\eta_i$ should be chosen to ensure that $Lp_i$ is
similar to a right eigenvector of $Q'$.
Fig. 2. Process of CG on a toy dataset with 800 nodes. Two labeled nodes, shown as
red asterisks, are provided at the head and tail of the spiral. Green circle and blue square
nodes represent the different classes. The area of each circle is proportional to the number
of nodes that reside in the corresponding cluster. After CG, 255 nodes remain, a
reduction of 68%.
5 Experiments
We evaluate our method empirically on 3 real-world datasets: letter, digit, and
image classification. The first is the UCI letter recognition dataset [18]. The next is
USPS digit recognition, where we reduce the dimension of each data point to 64 with PCA.
The Caltech dataset [19] is used for image classification; features are extracted using
CEDD [20]. Adjacency matrices are constructed using 5-NN with the bandwidth
set to the mean standard deviation of the data. 20 data points are labeled. In
addition to these 20 unitary eigenvectors, the 5 other top eigenvectors are selected for
CG. $\eta_i$ is set so as to divide the values along the $i$-th eigenvector into $I$ groups, where $I$ is the
final parameter that is varied to obtain different reduction sizes (a sketch of this
grouping step follows below). In all experiments on digits and letters, the average
accuracy over the 10 pairwise problems is reported. On Caltech we use 2 specific
classes. Four experiments are designed to evaluate our method.
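The grouping step can be sketched as follows: each selected eigenvector is quantized into $I$ equal-width bins of width $\eta_i$, and nodes sharing the same bin signature across all selected eigenvectors are merged into one cluster. This is a minimal illustration under our reading of the setup, not the authors' exact implementation.

```python
import numpy as np

def cluster_along_eigenvectors(V, n_groups):
    """V: (n_nodes, n_eigvecs) selected right eigenvectors of Q.
    Returns a cluster index per node."""
    signature = np.empty_like(V, dtype=int)
    for c, v in enumerate(V.T):
        eta = (v.max() - v.min()) / n_groups          # bin width eta_i
        bins = np.floor((v - v.min()) / max(eta, 1e-12))
        signature[:, c] = np.minimum(bins, n_groups - 1).astype(int)
    # nodes with identical signatures are merged into one cluster
    _, labels = np.unique(signature, axis=0, return_inverse=True)
    return labels
```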
Table 1. Eigenvalue and eigenvector preservation in CG for the top ten eigenvectors
along which CG is performed

  i                                    1       2       3       4       5       6       7        8       9       10
  $\lambda_i$                        1.0000  1.0000  1.0000  1.0000  1.0000  0.9999  -0.9999  0.9999  0.9999  0.9997
  $\lambda'_i$                       1.0000  1.0000  1.0000  1.0000  1.0000  0.9999  -0.9999  0.9999  0.9998  0.9997
  $(Lp_i)^T p'_i / (\|Lp_i\|\,\|p'_i\|)$  0.9967  0.9925  0.9971  0.9910  0.9982  0.9964  0.9999  0.9909  0.8429  0.9982
[Figure 3: three panels plotting accuracy against time step (200 to 800) for the baseline, coarse graining, and graph quantization methods, on (a) letter recognition, (b) digit recognition, and (c) image classification.]
Fig. 3. Online classification. Data arrives sequentially and the maximum buffer size is 200.
[Figure 4: panels (a) and (b) plot accuracy against the number of clusters (100 to 300) for coarse graining and graph quantization; panel (c) plots accuracy against the outlier ratio (0.02 to 0.04).]
Fig. 4. (a,b): Capability of the methods to preserve manifold structure. 500 nodes are
coarse grained and the classification accuracy is averaged over 500 separately added new
data points. (c): Comparison of robustness to outliers on USPS.
6 Conclusion
In this paper, a novel semi-supervised CG algorithm is proposed to reduce the
number of data points while preserving the manifold structure. To this end, a new
formulation of LP is used to derive a new spectral view of the HS. We show that
the manifold structure is closely related to the eigenvectors of a variation of the
stochastic matrix, and that this structure is well preserved by any algorithm which
guarantees small distortions in the corresponding eigenvectors. Exact and approximate
coarse graining algorithms are provided, alongside a theoretical analysis of
how well the LP properties are preserved. The proposed method is evaluated on
three real-world datasets and outperforms the state-of-the-art CG in three
scenarios: online classification, manifold preservation, and robustness
against outliers. The performance of our method is comparable to that of an
algorithm that utilizes all the data in a simulated online scenario.
A theoretical analysis of robustness against noise, extending the spectral viewpoint
to other manifold learning methods, and deriving tighter error bounds on
CG are, among others, interesting problems that remain for future work.
References
1. Zhu, X.: Semi-Supervised Learning Literature Survey. Technical Report 1530, De-
partment of Computer Sciences, University of Wisconsin Madison (2005)
2. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised Learning. MIT Press, Cam-
bridge (2006)
3. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold Regularization: a Geometric Frame-
work for Learning from Labeled and Unlabeled Examples. Journal of Machine
Learning Research 7, 2399–2434 (2006)
4. Duchenne, O., Audibert, J., Keriven, R., Ponce, J., Segonne, F.: Segmentation by
Transduction. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2008, pp. 1–8 (2008)
5. Belkin, M., Niyogi, P.: Using Manifold Structure for Partially Labeled Classifica-
tion. Advances in Neural Information Processing Systems 15, 929–936 (2003)
6. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Ro-
bust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I.
LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
7. He, X.: Incremental Semi-Supervised Subspace Learning for Image Retrieval. In:
Proceedings of the ACM Conference on Multimedia (2004)
8. Moh, Y., Buhmann, J.M.: Manifold Regularization for Semi-Supervised Sequential
Learning. In: ICASSP (2009)
9. Goldberg, A., Li, M., Zhu, X.: Online Manifold Regularization: A New Learning
Setting and Empirical Study. In: Proceeding of ECML (2008)
10. Dasgupta, S., Freund, Y.: Random Projection Trees and Low Dimensional Mani-
folds. Technical Report CS2007-0890, University of California, San Diego (2007)
406 M. Farajtabar et al.
11. Valko, M., Kveton, B., Ting, D., Huang, L.: Online Semi-Supervised Learning
on Quantized Graphs. In: Proceedings of the 26th Conference on Uncertainty in
Artificial Intelligence, UAI (2010)
12. Lafon, S., Lee, A.B.: Diffusion Maps and Coarse-Graining: A Unified Framework
for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization.
IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1393–1403
(2006)
13. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local and
global consistency. Neural Information Processing Systems (2004)
14. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-Supervised Learning Using Gaussian
Fields and Harmonic Functions. In: ICML (2003)
15. Zhu, X., Ghahramani, Z.: Learning from Labeled and Unlabeled Data with Label
Propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University
(2002)
16. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes,
The Art of Scientific Computing, 3rd edn. Cambridge University Press, Cambridge
(2007)
17. Gfeller, D., De Los Rios, P.: Spectral Coarse Graining of Complex Networks. Phys-
ical Review Letters 99, 3 (2007)
18. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010)
19. Fei, L., Fergus, R., Perona, P.: Learning Generative Visual Models From Few Train-
ing Examples: An Incremental Bayesian Approach Tested on 101 Object Cate-
gories. In: IEEE CVPR 2004, Workshop on Generative Model Based Vision (2004)
20. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and Edge Directivity Descrip-
tor: A Compact Descriptor for Image Indexing and Retrieval. In: ICVS, pp. 312–322
(2008)
Learning from Partially Annotated Sequences
1 Introduction
The problem of labeling, annotating, and segmenting observation sequences
arises in many applications across various areas such as natural language process-
ing, information retrieval, and computational biology; exemplary applications
include named entity recognition, information extraction, and protein secondary
structure prediction.
Traditionally, sequence models such as hidden Markov models [26,14] and
variants thereof have been applied to label sequence learning [9] tasks. Learning
procedures for generative models adjust the parameters such that the joint like-
lihood of training observations and label sequences is maximized. By contrast,
from an application point of view, the true benefit of a label sequence predictor
corresponds to its ability to find the correct label sequence given an observation
sequence. Thus, many variants of discriminative sequence models have been ex-
plored, including maximum entropy Markov models [20], perceptron re-ranking
[7,8], conditional random fields [16,17], structural support vector machines [2,34],
and max-margin Markov models [32].
Learning discriminative sequential prediction models requires ground-truth
annotations, and compiling a corpus that allows for state-of-the-art performance
on a novel task is expensive not only financially but also in terms of the time it requires.
Table 1. Different interpretations for "I saw her duck under the table" [15]
I saw [NP her] [VP duck under the table]. → She ducked under the table.
I [VP saw [NP her duck] [PP under the table]]. → The seeing is done under the table.
I saw [NP her duck [PP under the table]]. → The duck is under the table.
2 Related Work
Learning from partially annotated sequences has been studied by [30] who extend
HMMs to explicitly exclude states for some observations in the estimation of the
models. [22] propose to incorporate domain-specific ontologies into HMMs to
provide labels for the unannotated parts of the sequences, [10] cast learning an
HMM for partially labeled data into a large-margin framework and [33] present
an extension of maximum entropy Markov models (MEMMs) and conditional
random fields (CRFs). The latent-SVM [36] allows for the incorporation of latent
variables in the underlying graphical structure; the additional variables implicitly
act as indicator variables and conditioning on their actual value eases model
adaptation because it serves as an internal clustering.
The generalized perceptron for structured output spaces is introduced by [7,8].
Altun et al. [2] extend this approach to support vector machines and explore
label sequence learning tasks with implicit 0/1 loss. McAllester et al. [19] propose
to incorporate loss functions into the learning process of perceptron-like algo-
rithms. Transductive approaches for semi-supervised structured learning are for
instance studied in [17,1,35,18,37], where the latter is the closest to our approach
as the authors study transductive support vector machines with completely la-
beled and unlabeled examples.
Generating fully annotated corpora from Wikipedia has been studied by
[24,27,21]. While [21] focus on English and exploit the semi-structured content
of the info-boxes, [24] and [27] propose heuristics to assign tags to Wikipedia
entries by manually defined patterns.
3 Preliminaries
The task in label sequence learning [9] is to find a mapping from a sequential
input x = x1 , . . . , xT to a sequential output y = y1 , . . . , yT , where yt ∈ Σ;
i.e., each element of x is annotated with an element of the output alphabet Σ
which denotes the set of tags. We denote the set of all possible labelings of x by
Y(x).
The sequential learning task can be modeled in a natural way by a Markov
random field where we have edges between neighboring labels and between
[Figure 1: a chain-structured graphical model with hidden labels y1, y2, ..., yT connected to observations x1, x2, ..., xT.]
Fig. 1. A Markov random field for label sequence learning. The $x_t$ denote observations
and the $y_t$ their corresponding hidden class variables.
The feature map exhibits a first-order Markov property and, as a result, decoding
can be performed by a Viterbi algorithm [11,31] in $O(T|\Sigma|^2)$, so that, once
optimal parameters $w^*$ have been found, these are used as plug-in estimates to
compute the prediction for a new and unseen sequence $x'$.
The optimal function $f(\cdot; w^*)$ minimizes the expected risk $E[\ell(y, f(x; w^*))]$, where
$\ell$ is a task-dependent, structural loss function. In the remainder, we will focus
on the 0/1 and the Hamming loss to compute the quality of predictions,
$$\ell_{0/1}(y, \tilde{y}) = \mathbf{1}_{[y \neq \tilde{y}]}; \qquad \ell_h(y, \tilde{y}) = \sum_{t=1}^{|y|} \mathbf{1}_{[y_t \neq \tilde{y}_t]}. \qquad (2)$$
Note that in case $\hat{y}^t = y^t$ the model is not changed, that is, $w^{t+1} \leftarrow w^t$. After
an update, the model favors $y^t$ over $\hat{y}^t$ for the input $x^t$, and a simple extension
of Novikoff's theorem [25] shows that the structured perceptron is guaranteed to
converge to a zero-loss solution (if one exists) in at most $t \le (r/\tilde{\gamma})^2 \|w^*\|^2$ steps,
where $r$ is the radius of the smallest hypersphere enclosing the data points and
$\tilde{\gamma}$ is the functional margin of the data [8,2].
[Figure 2: a trellis over the states σ1, σ2, ..., σk at time steps t−1, t, and t+1, with edges between all pairs of consecutive states.]
Fig. 2. The constrained Viterbi decoding (emissions are not shown). If time $t$ is annotated
with $\sigma_2$, the light edges are removed before decoding to guarantee that the
optimal path passes through $\sigma_2$.
The constrained Viterbi decoding guarantees that the optimal path passes through
the already known labels by removing unwanted edges, see Figure 2. Assuming
that a labeled token is at position $1 < t < T$, the number of removed edges
is precisely $2(k-1)k$, where $k = |\Sigma|$. Algorithmically, the constrained decoding
splits sequences at each labeled token into two halves, which are then treated
independently of each other in the decoding process; a sketch follows below.
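The following sketch implements this constrained decoding in log-space. Clamping an annotated position to its known state by masking all other states with $-\infty$ is equivalent to removing the corresponding edges; function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np

def constrained_viterbi(log_emit, log_trans, constraints):
    """log_emit: (T, k) emission scores; log_trans: (k, k) transition scores;
    constraints: dict {position: state} of already annotated tokens."""
    T, k = log_emit.shape
    scores = log_emit.copy()
    for t, s in constraints.items():             # clamp annotated positions
        mask = np.full(k, -np.inf)
        mask[s] = 0.0
        scores[t] = scores[t] + mask
    delta = scores[0].copy()
    back = np.zeros((T, k), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_trans        # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + scores[t]
    path = [int(delta.argmax())]                 # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```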
Given the pseudo labeling $y^p$ for an observation $x$, the update rule of the
loss-augmented perceptron can be used to complement the transductive perceptron.
The inner loop of the resulting algorithm is shown in Table 2. Note that
augmenting the loss function into the computation of the argmax (step 2) gives
$y^p = \hat{y}$ if and only if the implicit loss-rescaled margin criterion is fulfilled for all
alternative output sequences $\tilde{y}$; a sketch of this loss-augmented step follows below.
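For the Hamming loss, the loss-augmented argmax can be sketched by adding $+1$ to the score of every state that disagrees with the pseudo labeling $y^p$ before an unconstrained decoding pass (reusing the `constrained_viterbi` sketch above). This is an illustrative reading of the described step, not the authors' code.

```python
import numpy as np

def loss_augmented_decode(log_emit, log_trans, y_p):
    """Hamming-loss-augmented argmax: states disagreeing with the pseudo
    labeling y_p earn an extra +1 before Viterbi decoding."""
    T = log_emit.shape[0]
    augmented = log_emit + 1.0                  # +1 loss for every state ...
    augmented[np.arange(T), y_p] -= 1.0         # ... except the pseudo label
    return constrained_viterbi(augmented, log_trans, constraints={})
```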
with appropriately chosen α’s that act as virtual counters, detailing how many
times the prediction ŷ has been decoded instead of the pseudo-output y p for
an observation x. Thus, the dual perceptron has virtually exponentially many
parameters, however, these are initialized with αx (y, y ) = 0 for all triplets
(x, y, y ) so that the counters only need to be instantiated once the respective
triplet is actually seen. Using Eq. (4), the decision function depends only on
inner products of joint feature representations which can then be replaced by
appropriate kernel functions k(x, y, x , y ) = φ(x, y) φ(x , y ).
5 Empirical Results
In this section, we show that (i) one can effectively learn from partial annotations
and that (ii) our approach is superior to the standard semi-supervised setting.
We thus compare the transductive loss-augmented perceptron to its supervised
and semi-supervised counterparts. Experiments with CoNLL data use the original
splits of the respective corpora into training, holdout, and test sets, where
parameters are adjusted on the holdout sets. We report averages over
$3 \times 4 = 12$ repetitions, involving 3 perceptrons and 4 data sets, to account for
the random effects in the algorithm and data generation; error bars indicate the
standard error.
Due to the different nature of the algorithms, we need to provide different
ground-truth annotations for the algorithms. While the transductive perceptron
is simply trained on arbitrarily (e.g., partially) labeled sequences, the supervised
baseline needs completely annotated sentences and the semi-supervised percep-
tron allows for the inclusion of additional unlabeled examples. In each setting, we
use the same observation sequences for all methods and only change the distri-
bution of the labels so that it meets the requirements of the respective methods;
however note that the number of labeled tokens is identical for all methods. We
describe the generation of the training sets in greater detail in the following
subsections. All perceptrons are trained for 100 epochs.
[Figure 3: F1 score (72 to 84) against the percentage of annotated tokens (10 to 100%) for the partially labeled, semi-supervised, and supervised perceptrons.]
Figure 3 shows F1 scores for different ratios of labeled and unlabeled tokens.
Although the baselines are more likely to capture transitions well because the
labeled tokens form complete annotations, they are significantly outperformed
by the transductive perceptron in case only 10-50% of the tokens are labeled. For
60-100% all three algorithms perform equally well, which is still notable because
the partial annotations are inexpensive and easier to obtain. By contrast, the
semi-supervised perceptron performs worst and is not able to benefit from many
unlabeled examples.
We now study the impact of the amount of additional labeled tokens. In Fig-
ure 4, we fix the amount of completely annotated sentences at 20% (left figure)
and 50% (right), respectively, and vary the amount of additional partially an-
notated tokens. The supervised and semi-supervised baselines are constant as
they cannot deal with the additional data where the semi-supervised perceptron
treats the remaining 80% and 50% of the data as unlabeled sentences. Notice
that the semi-supervised baseline performs poorly; as in the previous experiment,
the additional unlabeled data seemingly harms the training process. Similar
observations have for instance been made by [4,23], and particularly for structural
semi-supervised learning by [37]. By contrast, the transductive perceptron shows
in both figures an increasing performance for the partially labeled setting when
the amount of labeled tokens increases. The gain in predictive accuracy is highest
for settings with only a few completely labeled examples (Figure 4, left).
[Figure 4: F1 score against the percentage of additional labels (0 to 100%) for the partially labeled, semi-supervised, and supervised perceptrons; left panel with 20% completely labeled examples, right panel with 50%.]
Fig. 4. Varying the amount of additional labeled tokens with 20% (left) and 50% (right)
completely labeled examples
Table 3. An exemplary partially labeled sentence extracted from Wikipedia. The coun-
try Hungary is labeled as a location (LOC) due to the majority vote, while Bukkszek
could not be linked to a tagged article and remains unlabeled.
One of the major goals in the data generation is to render human interaction
unnecessary, or to keep it at least as low as possible. In the following we briefly describe a simple
way to automatically annotate Wikipedia data using existing resources.
Atserias et al. [3] provide a tagged version of the English Wikipedia that
preserves the link structure. We collect the tagged entities in the text that are
linked to a Wikipedia article. In case the tagged entity does not perfectly match
the hyperlinked text we treat it as untagged. This gives us a distribution of
tags for each Wikipedia article as the tagging is noisy and depends highly on the
context.4 The linked entities referring to Wikipedia articles are now re-annotated
with the most frequent tag of the referenced Wikipedia article. Table 3 shows
an example of an automatically annotated sentence. Words that are not linked
to a Wikipedia article (e.g., small) as well as words corresponding to Wikipedia
articles which have not yet been tagged (e.g., Bukkszek) remain unlabeled.
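A minimal sketch of this majority-vote heuristic is given below; the data structures (a per-article tag counter and tokenized sentences with optional link targets) are illustrative assumptions, not the authors' pipeline.

```python
from collections import Counter

def annotate(sentences, article_tags):
    """sentences: list of lists of (token, linked_article_or_None);
    article_tags: dict article -> Counter of tags observed for that article.
    Linked tokens receive the most frequent tag of the referenced article;
    everything else remains unlabeled (None)."""
    labeled = []
    for sent in sentences:
        out = []
        for token, article in sent:
            if article is not None and article in article_tags:
                tag = article_tags[article].most_common(1)[0][0]
            else:
                tag = None                 # unlinked or not-yet-tagged article
            out.append((token, tag))
        labeled.append(out)
    return labeled
```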
Table 4 shows some descriptive statistics of the extracted data. Since the
automatically generated data is only partially annotated, the average number of
4 For instance, a school could be either tagged as a location or an organization, depending on the context.
Table 4. Descriptive statistics of the extracted CoNLL and Wikipedia data

                        CoNLL      Wikipedia
  tokens                203,621    1,205,137,774
  examples              14,041     58,640,083
  tokens per example    14.5       20.55
  entities              23,499     22,632,261
  entities per example  1.67       0.38
  MISC                  14.63%     18.17%
  PER                   28.08%     19.71%
  ORG                   26.89%     30.98%
  LOC                   30.38%     31.14%
[Figure 5: F1 score against the number of added Wikipedia tokens, for the mono-lingual (left, up to 6×10⁶ tokens) and cross-lingual (right, up to 8×10⁶ tokens) experiments.]
Fig. 5. Results for mono-lingual (left) and cross-lingual (right) Wikipedia experiments
entities in sentences is much lower compared to that of CoNLL. That is, there are
potentially many unidentified and missed entities in the data. By looking at the
numbers one could assume that particularly persons (PER) are underrepresented
in the Wikipedia data while organizations (ORG) and others (MISC) are slightly
overrepresented. Locations (LOC) are seemingly well captured.
The experimental setup is as follows. We use all sentences contained in the
CoNLL training set as completely labeled examples and add randomly drawn
partially labeled sentences that are automatically extracted from Wikipedia.
Figure 5 (left) shows F1 scores for varying numbers of additional data. The
leftmost point coincides with the supervised perceptron that only processes the
labeled CoNLL data. Adding partially labeled data shows a slight but significant
improvement over the supervised baseline. Interestingly, the observed improve-
ment increases with the number of partially labeled examples although these
come from a different distribution as shown in Table 4.
                        CoNLL      Wikipedia
  tokens                264,715    257,736,886
  examples              8,323      9,500,804
  tokens per example    31.81      27.12
  entities              18,798     8,520,454
  entities per example  2.26       0.89
  MISC                  11.56%     27.64%
  PER                   22.99%     23.71%
  ORG                   39.31%     32.63%
  LOC                   26.14%     16.02%
[Figure: training time in minutes (4.6 to 6.0) against the percentage of additional labels (0 to 100%).]
6 Conclusion
In this paper, we showed that surprisingly simple methods, such as the devised
transductive perceptron, allow for learning from sparse and partial labelings.
Our empirical findings show that a few randomly distributed labels often lead to
better models than the standard supervised and semi-supervised settings based
on completely labeled ground truth; the transductive perceptron was observed
to be always better than or on par with its counterparts trained on the same amount
of labeled data. Immediate consequences arise for data collection: while the
standard semi-supervised approach requires completely labeled editorial data, we
can effectively learn from partial annotations that have been generated automatically
and without manual interaction; using additional, automatically labeled
data from Wikipedia led to a significant increase in performance in mono- and
cross-lingual named entity recognition tasks. We emphasize that these improvements
come at practically no additional labeling cost.
Future work will extend our study towards larger scales. It will certainly be
of interest to extend the empirical evaluation to other sequential tasks and output
structures. As the developed transductive perceptron is a relatively simple algorithm,
more sophisticated ways of dealing with partially labeled data are also
interesting research areas.
References
1. Altun, Y., McAllester, D., Belkin, M.: Maximum margin semi–supervised learning
for structured variables. In: Advances in Neural Information Processing Systems
(2006)
2. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector ma-
chines. In: Proceedings of the International Conference on Machine Learning
(2003)
3. Atserias, J., Zaragoza, H., Ciaramita, M., Attardi, G.: Semantically annotated
snapshot of the english wikipedia. In: European Language Resources Association
(ELRA), editor, Proceedings of the Sixth International Language Resources and
Evaluation (LREC 2008), Marrakech, Morocco (May 2008)
4. Baluja, S.: Probabilistic modeling for face orientation discrimination: Learning
from labeled and unlabeled data. In: Advances in Neural Information Processing
Systems (1998)
5. Cao, L., Chen, C.W.: A novel product coding and recurrent alternate decoding
scheme for image transmission over noisy channels. IEEE Transactions on Com-
munications 51(9), 1426–1431 (2003)
6. Chapelle, O., Schölkopf, B., Zien, A.: Semi–supervised Learning. MIT Press, Cam-
bridge (2006)
7. Collins, M.: Discriminative reranking for natural language processing. In: Pro-
ceedings of the International Conference on Machine Learning (2000)
8. Collins, M.: Ranking algorithms for named-entity extraction: Boosting and the
voted perceptron. In: Proceedings of the Annual Meeting of the Association for
Computational Linguistics (2002)
9. Dietterich, T.G.: Machine learning for sequential data: A review. In: Proceedings
of the Joint IAPR International Workshop on Structural, Syntactic, and Statisti-
cal Pattern Recognition (2002)
10. Do, T.-M.-T., Artieres, T.: Large margin training for hidden Markov models with
partially observed states. In: Proceedings of the International Conference on Ma-
chine Learning (2009)
11. Forney, G.D.: The Viterbi algorithm. Proceedings of IEEE 61(3), 268–278 (1973)
12. Hammersley, J.M., Clifford, P.E.: Markov random fields on finite graphs and lat-
tices (1971) (unpublished manuscript)
13. Joachims, T.: Transductive inference for text classification using support vector
machines. In: Proceedings of the International Conference on Machine Learning
(1999)
14. Juang, B., Rabiner, L.: Hidden Markov models for speech recognition. Techno-
metrics 33, 251–272 (1991)
Learning from Partially Annotated Sequences 421
15. King, T.H., Dipper, S., Frank, A., Kuhn, J., Maxwell, J.: Ambiguity management
in grammar writing. In: Proceedings of the ESSLLI 2000 Workshop on Linguistic
Theory and Grammar Implementation (2000)
16. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic
models for segmenting and labeling sequence data. In: Proceedings of the Inter-
national Conference on Machine Learning (2001)
17. Lafferty, J., Zhu, X., Liu, Y.: Kernel conditional random fields: representation
and clique selection. In: Proceedings of the International Conference on Machine
Learning (2004)
18. Lee, C., Wang, S., Jiao, F., Greiner, R., Schuurmans, D.: Learning to model
spatial dependency: Semi-supervised discriminative random fields. In: Advances
in Neural Information Processing Systems (2007)
19. McAllester, D., Hazan, T., Keshet, J.: Direct loss minimization for structured
perceptrons. In: Advances in Neural Information Processing Systems (2011)
20. McCallum, A., Freitag, D., Pereira, F.: Maximum entropy Markov models for
information extraction and segmentation. In: Proceedings of the International
Conference on Machine Learning (2000)
21. Mika, P., Ciaramita, M., Zaragoza, H., Atserias, J.: Learning to tag and tagging
to learn: A case study on wikipedia. IEEE Intelligent Systems 23, 26–33 (2008)
22. Mukherjee, S., Ramakrishnan, I.V.: Taming the unstructured: Creating structured
content from partially labeled schematic text sequences. In: Chung, S. (ed.) OTM
2004. LNCS, vol. 3291, pp. 909–926. Springer, Heidelberg (2004)
23. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Text classification from
labeled and unlabeled documents using EM. Machine Learning 39(2/3), 103–134
(2000)
24. Nothman, J., Murphy, T., Curran, J.R.: Analysing wikipedia and gold-standard
corpora for ner training. In: EACL 2009: Proceedings of the 12th Conference
of the European Chapter of the Association for Computational Linguistics, pp.
612–620. Association for Computational Linguistics, Morristown (2009)
25. Novikoff, A.B.: On convergence proofs on perceptrons. In: Proceedings of the
Symposium on the Mathematical Theory of Automata (1962)
26. Rabiner, L.: A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE 77(2), 257–285 (1989)
27. Richman, A.E., Schone, P.: Mining wiki resources for multilingual named entity
recognition. In: Proceedings of ACL 2008: HLT, pp. 1–9. Association for Compu-
tational Linguistics, Columbus (2008)
28. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: language-
independent named entity recognition. In: COLING-2002: Proceedings of the 6th
Conference on Natural Language Learning, pp. 1–4. Association for Computa-
tional Linguistics, Morristown (2002)
29. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared
task: Language-independent named entity recognition. In: Proceedings of CoNLL
2003, pp. 142–147 (2003)
30. Scheffer, T., Wrobel, S.: Active hidden Markov models for information extrac-
tion. In: Proceedings of the International Symposium on Intelligent Data Analysis
(2001)
31. Schwarz, R., Chow, Y.L.: The n-best algorithm: An efficient and exact procedure
for finding the n most likely hypotheses. In: Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (1990)
32. Taskar, B., Guestrin, C., Koller, D.: Max–margin Markov networks. In: Advances
in Neural Information Processing Systems (2004)
422 E.R. Fernandes and U. Brefeld
33. Truyen, T.T., Bui, H.H., Phung, D.Q., Venkatesh, S.: Learning discriminative
sequence models from partially labelled data for activity recognition. In: Pro-
ceedings of the Pacific Rim International Conference on Artificial Intelligence
(2008)
34. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. Journal of Machine Learning
Research 6, 1453–1484 (2005)
35. Xu, L., Wilkinson, D., Southey, F., Schuurmans, D.: Discriminative unsupervised
learning of structured predictors. In: Proceedings of the International Conference
on Machine Learning (2006)
36. Yu, C.-N., Joachims, T.: Learning structural svms with latent variables. In: Pro-
ceedings of the International Conference on Machine Learning (2009)
37. Zien, A., Brefeld, U., Scheffer, T.: Transductive support vector machines for
structured variables. In: Proceedings of the International Conference on Machine
Learning (2007)
38. Zinkevich, M., Weimer, M., Smola, A., Li, L.: Parallelized stochastic gradient
descent. In: Advances in Neural Information Processing Systems, vol. 23 (2011)
The Minimum Transfer Cost Principle for
Model-Order Selection
1 Introduction
Clustering and dimensionality reduction are highly valuable concepts for exploratory
data analysis, frequently used in pattern recognition, vision, data mining, and
other fields. Both problem domains require specifying the complexity of solutions.
When partitioning a set of objects into clusters, we must select an
appropriate number of clusters. Learning a low-dimensional representation of a set of
objects, for example by learning a dictionary, involves choosing the number of atoms
or codewords in the dictionary. More generally speaking, learning the parameters of a
model given some measurements requires selecting the number of parameters, i.e. one
must select the model-order.
In this paper we address the general issue of model-order selection for unsuper-
vised learning problems and we develop and advocate the principle of minimal transfer
costs (MTC). Our method generalizes classical cross-validation known from supervised
learning. It is applicable to a broad class of model-order selection problems even when
no labels or target values are given. In essence, MTC can be applied whenever a cost
function is defined. The MTC principle can be easily explained in abstract terms: A
good choice of the model-order based on a given dataset should also yield low costs on
a second dataset from the same distribution. We learn models of various model-orders
from a given dataset X(1) . These models with their respective parameters are then used
to interpret a second data set X(2) , i.e., to compute its costs. The principle selects the
model-order that achieves lowest transfer cost, i.e. the solution that generalizes best to
the second dataset. Too simple models underfit and achieve high costs on both datasets;
too complex models overfit to the fluctuations of X(1) which results in high costs on
X(2) where the fluctuations are different.
The conceptually challenging part of this procedure is related to the transfer of the
solution inferred from the objects of the first dataset to the objects of the second dataset.
This transfer requires a mapping function which generalizes the conceptually straight-
forward assignments in supervised learning. For several applications, we demonstrate
how to map two datasets to each other when no labels are given.
Our main contribution is to propose and describe the minimum transfer cost principle
(MTC) and to demonstrate its broad applicability on a set of different applications. We
select well-known methods such as singular-value decomposition (SVD), maximum-likelihood
inference, k-means, Gaussian mixture models, and correlation clustering because
the understandability of our principle should not be limited by long explanations of the
complicated models it is applied to. In our real-world applications (image denoising,
role mining, and error detection in access-control configurations) we investigate
the reliability of the model-order selection scheme, i.e., whether for a predetermined
method (such as SVD) our principle finds the model-order that performs best
on a second test data set.
In the remainder of the paper we first explain the principle of minimal transfer costs
and we address the conceptual question of how to map a trained model to a previously
unseen dataset. In Sections 3, 6, and 7 we invoke the MTC principle to select
a plausible ("true") number of centroids for the widely used Gaussian mixture model,
and the optimal number of clusters for correlation clustering and for the k-means algorithm.
In Section 4, we apply MTC to SVD for image denoising and detecting errors in access-
control configurations. In Section 5, we use MTC for selecting the number of factors
for Boolean matrix factorization on role mining data.
model parameters which are learned through an inference procedure. The number k
quantifies the number of model parameters and thereby identifies the model order. In
clustering, for instance, k would be the number of clusters of the solution s(X).
A cost function imposes a partial order on all possible solutions given the data. Since
usually the measurements are contaminated by noise, one aims at finding solutions that
are robust against the noise fluctuations and thus generalize well to future data. Learning
theory demands that a well-regularized model explains not only the dataset at hand,
but also new datasets generated from the same source and thus drawn from the same
probability distribution.
Let $s^{(1)}$ be the solution (e.g., model parameters) learned from a given set of
objects $O^{(1)} = \{i : 1 \le i \le N_1\}$ and the corresponding measurements $X^{(1)}$. Let the set
$O^{(2)} = \{i' : 1 \le i' \le N_2\}$ represent the objects of a second dataset $X^{(2)}$ drawn from
the same distribution as $X^{(1)}$. In a supervised learning scenario, the given class labels of
both datasets guide a natural and straightforward mapping of the trained solution from
the first to the second dataset: the model should assign objects of both sets with the same
labels to the same classes. However, when no labels are available, it is unclear how to
transfer a solution. To enable the use of cross-validation, we propose to compute the
costs of a learned solution on a new dataset in the following way. We start with defining
a mapping $\psi$ from objects of the second dataset to objects of the first dataset:
$$\psi : O^{(2)} \times \mathcal{X} \times \mathcal{X} \to O^{(1)}, \qquad (i', X^{(1)}, X^{(2)}) \mapsto \psi(i', X^{(1)}, X^{(2)}) \qquad (1)$$
This mapping function aligns each object from the second dataset with its nearest neighbor
in $O^{(1)}$; we have to compute such a mapping in order to transfer a solution. Let us
assume, for the moment, that the given cost function is a sum over independent partial costs,
$$R(s, X, k) = \sum_{i=1}^{N} R_i(s(i), x_i, k). \qquad (2)$$
$R_i(s(i), x_i, k)$ denotes the partial costs of object $i$, and $s(i)$ denotes the structure part of
the solution that relates to object $i$; for a parametric centroid-based clustering model,
$s(i)$ would be the centroid object $i$ is assigned to. Using the object-wise mapping
function $\psi$ to map objects $i' \in O^{(2)}$ to objects in $O^{(1)}$, we define the transfer costs
$R^T(s^{(1)}, X^{(2)}, k)$ of a solution $s$ with model-order $k$ as follows:
$$R^T(s^{(1)}, X^{(2)}, k) := \frac{1}{N_2} \sum_{i'=1}^{N_2} \sum_{i=1}^{N_1} R_i\big(s^{(1)}(i), x_{i'}^{(2)}, k\big)\, \mathbb{I}_{\{\psi(i', X^{(1)}, X^{(2)}) = i\}}. \qquad (3)$$
For each object $i' \in O^{(2)}$ we compute the costs of $i'$ with respect to the learned solution
$s(X^{(1)})$. The mapping function $\psi(i', X^{(1)}, X^{(2)})$ ensures that the cost function treats the
measurement $x_{i'}^{(2)}$ with $i' \in O^{(2)}$ as if it were the object $i \equiv \psi(i', X^{(1)}, X^{(2)}) \in O^{(1)}$.
In the limit of many observations N2 , the transfer costs converge to E[R(s(1) , X, k)],
the expected costs of the solution s(1) with respect to the probability distribution of the
measurements. Minimizing this quantity, with respect to the solution is what we are
ultimately interested in. The minimum transfer cost principle (MTC) selects the model-
order k with lowest transfer costs. MTC disqualifies models with a too high complexity
that perfectly explain X(1) but fail to fit X(2) (overfitting), as well as models with too
low complexity which insufficiently explain both of them (underfitting).
We would like to emphasize the relation of our method to cross-validation in su-
pervised learning which is frequently used in classification or regression. In supervised
learning a model is trained on a set of given observations X(1) and labels (or output
variables) $y^{(1)}$. Usually, we assume i.i.d. training and test data in classification and,
therefore, the transfer problem disappears.
A variant and a special case of the mapping function: In the following, we describe
two other mapping variants. In many problems such as clustering, a solution is a set
of structures where the objects inside a structure are statistically indistinguishable by
the algorithm. Therefore, the objects $O^{(2)}$ can directly be mapped to the structures
inferred from $X^{(1)}$ rather than to individual objects, since the objects in each structure
are unidentifiable. In this way, the mapping function assigns the objects $O^{(2)}$ to the
solution $s(X^{(1)}) \in \mathcal{S}$:
$$\psi^s : O^{(2)} \times \mathcal{S} \times \mathcal{X} \to \mathcal{S}(O^{(1)}), \qquad (i', s(X^{(1)}), X^{(2)}) \mapsto \psi(i', s(X^{(1)}), X^{(2)}). \qquad (4)$$
The generative mapping, another variant of the $\psi$ function, is obtained in a natural way
by data construction. Given the true model parameters, we randomly sample pairs of
data items. This gives the identity mapping between the pairs in $O^{(1)}$ and $O^{(2)}$ and can
be used whenever the data is artificially generated:
$$\psi^G : O^{(2)} \to O^{(1)}, \qquad i' \mapsto \psi(i') = i'. \qquad (5)$$
In practice, however, the data is usually generated in an unknown way. One has a single
dataset $X$ and subdivides it (possibly multiple times) into random subsets $X^{(1)}, X^{(2)}$
which are not necessarily of equal cardinality. The nearest-neighbor mapping is
obtained by assigning each object $i' \in O^{(2)}$ to the structure or object for which the costs
$R(s, X(O^{(1)} \cup i'), k)$ are minimized. In the cases where multiple objects or structures
satisfy this condition, $i'$ is randomly assigned to one of them.
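As a concrete instance of the principle, the sketch below applies MTC to k-means: for a centroid-based model the nearest-neighbor mapping reduces to assigning each object of $X^{(2)}$ to its closest learned centroid, and the transfer cost is the resulting mean squared distance. Library choices (NumPy, scikit-learn) and names are our own illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_transfer_cost(centroids, X2):
    """Transfer cost (Eq. 3) of a k-means solution on a second dataset."""
    d2 = ((X2[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()     # cost at each object's nearest centroid

def select_k_by_mtc(X, candidate_ks, n_splits=10, seed=0):
    """MTC model-order selection: pick the k with lowest transfer cost."""
    rng = np.random.RandomState(seed)
    mean_costs = {}
    for k in candidate_ks:
        costs = []
        for _ in range(n_splits):
            idx = rng.permutation(len(X))
            X1, X2 = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
            model = KMeans(n_clusters=k, n_init=10).fit(X1)  # learn s(X^(1))
            costs.append(kmeans_transfer_cost(model.cluster_centers_, X2))
        mean_costs[k] = np.mean(costs)
    return min(mean_costs, key=mean_costs.get)
```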
$$R(\mu, \Sigma, X, k) = -\sum_{i=1}^{N} \ln \sum_{t=1}^{k} \pi_t\, \mathcal{N}(x_i \mid \mu_t, \Sigma_t) \qquad (6)$$
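For the Gaussian mixture cost (6) the transfer is particularly simple, since the cost of an object depends only on its measurement. A sketch using scikit-learn (our assumption, not the authors' implementation):

```python
from sklearn.mixture import GaussianMixture

def gmm_transfer_cost(X1, X2, k):
    """Negative log-likelihood (Eq. 6) of a k-component GMM fitted on
    X^(1) and evaluated on X^(2); MTC picks the k minimizing this score."""
    gmm = GaussianMixture(n_components=k, n_init=5).fit(X1)
    return -gmm.score_samples(X2).sum()
```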
Fig. 1. Selecting the number of Gaussians k. Data is generated from 3 Gaussians; going from the
upper to the lower row, their overlap increases. For very high overlap, BIC and MTC select
k = 1. The lower row illustrates the smallest overlap for which BIC selects k < 3.
We extract $N = 4096$ patches of size $D = 8 \times 8$ from the image and arrange each of
them in one row of a matrix $X$. We randomly split this matrix along the rows into two
sub-matrices $X^{(1)}$ and $X^{(2)}$ and select the rank $k$ that minimizes the transfer costs
$$R^T(s, X, k) = \frac{1}{N_2} \left\| \psi_{NN}(X^{(1)}, X^{(2)}) \circ X^{(2)} - U_k^{(1)} S_k^{(1)} V_k^{(1)T} \right\|_2^2. \qquad (7)$$
The mapping $\psi_{NN}(X^{(1)}, X^{(2)})$ reindexes all objects of the test set with the indices of
their nearest neighbors in the training set. We illustrate the results for the Lenna image
in Figure 2 by color-coding the peak-SNR of the image reconstruction. As one can
see, there is a crest ranging from low standard deviation of the added Gaussian noise
and maximal rank (k = 64) down to the region with high noise and low optimal rank
(k = 1). The top of the crest marks the optimal rank at a given noise level (dashed magenta
line). The rank selected by MTC is highlighted by the solid black line (dashed lines
are three times the standard deviation). The selected rank is always very close to the
optimum. At low noise, where the crest is rather broad, the deviation from the optimum
is largest; there the selection problem is most difficult. However, in this parameter
range the choice of the rank has little influence on the error. For high noise, where a
deviation from the optimum has a higher influence, our method finds the optimal rank.
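A sketch of this rank selection for patch matrices follows; the nearest-neighbor mapping $\psi_{NN}$ is realized by reindexing each test row to its closest training row. All names are illustrative assumptions.

```python
import numpy as np

def svd_transfer_cost(X1, X2, k):
    """Transfer cost of Eq. (7): compare nearest-neighbor-mapped test rows
    against the rank-k reconstruction of the training matrix."""
    U, s, Vt = np.linalg.svd(X1, full_matrices=False)
    X1_k = (U[:, :k] * s[:k]) @ Vt[:k]               # rank-k approximation
    # squared distances between test and training rows (psi_NN)
    d2 = (X2**2).sum(1)[:, None] - 2 * X2 @ X1.T + (X1**2).sum(1)[None, :]
    nn = d2.argmin(axis=1)
    return ((X2 - X1_k[nn]) ** 2).sum() / len(X2)

def select_rank_by_mtc(X, candidate_ks, seed=0):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    X1, X2 = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    return min(candidate_ks, key=lambda k: svd_transfer_cost(X1, X2, k))
```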
Fig. 2. PSNR (logarithmic) of the denoised image as a function of the added noise and the rank
of the SVD approximation of the image patches. The crest of this error marks the optimal rank at
a given noise level and is highlighted (dashed magenta). The rank selected by MTC (solid black)
is close to this optimum.
access-control configurations. Such a configuration indicates which user has the permis-
sion to access which resources and it is encoded in a Boolean matrix X, where a 1-entry
means that the permission is granted to the user. In practice, a given user-permission as-
signment is often noisy, meaning that some individual user-permission assignments do
not correspond to the regularities of the data and should thus be regarded as excep-
tions or might even be errors. Such irregularities pose not only a security-relevant risk
but they also constitute a problem when such direct access control systems are to be
migrated to role based access control (RBAC) via so-called role mining methods [14].
As most existing role mining methods today are very sensitive to noise [9], they could
benefit a lot from denoising as a preprocessing step. In [16], SVD and other contin-
uous factorization techniques for denoising X are proposed. Molloy et al. compute a
rank-k approximation Uk Sk VkT of X. Then, a function g maps all individual entries
higher than 0.5 to 1 and the others to 0. The distance of the resulting denoised matrix
X̃k = g(Uk Sk VkT ) to the error-free matrix X∗ depends heavily on k. The authors pro-
pose two methods for selecting the rank k. The first method takes the minimal rank such
that the approximation X̃k covers 80% of the entries of X (this heuristic originates from
the rule of thumb that 20% of the entries of X are corrupted). The second method selects
the smallest rank that decreases the relative approximation increment
$\|\tilde{X}_k - \tilde{X}_{k+1}\|_1 / \|X\|_1$ below 0.001.
We also compare with the rank selected by the Bi-crossvalidation method for SVD
presented by Owen and Perry [19]. This method, which we will term OP-CV, divides
the n × d input matrix X1:n,1:d into four submatrices, X1:p,1:q , X1:p,q+1:d , Xp+1:n,1:q ,
and Xp+1:n,q+1:d with p < n and q < d. Let M† be the Moore-Penrose inverse of the
matrix M. OP-CV learns the truncated SVD X̂(k)p+1:n,q+1:d from Xp+1:n,q+1:d and
Fig. 3. Denoising four different access-control configurations via rank-limited SVD. The ranks
selected by transfer costs and OP-CV are significantly closer to the optimal rank than the ranks
selected by the originally proposed methods [16].
computes the error score ε(k) = ‖X1:p,1:q − X1:p,q+1:d (X̂(k)p+1:n,q+1:d)† Xp+1:n,1:q‖. In our
experiments, we compute ε(k) for 20 permutations of the input matrix and select the rank
with the lowest median error.
We compare the rank selected by the described approaches to the rank selected
by MTC with nearest-neighbor mapping and Hamming distance. The four different
datasets are taken from [16]. The first dataset ’University’ is the access control configu-
ration of a department, the other three are artificially created, each with differing gener-
ation processes as described in [16]. The sizes of the datasets are (users×permissions)
493 × 56, 500 × 347, 500 × 101, and 500 × 190. We display the results in Figure 3.
The optimal rank for denoising is plotted as a big red square. The statistics of the ranks
selected by MTC are plotted as small bounded squares. We report the median over 20
random splits of the dataset. As one can see, the minimum transfer cost rank is always
significantly closer to the optimal rank than the ranks selected by the originally pro-
posed methods. The performance of the 80-20 rule is very poor, and the performance of
the increment threshold depends strongly on the dataset. The Bi-crossvalidation method
by Owen and Perry (OP-CV) finds good ranks, although not as reliably as MTC. It has
been reported that, for smaller validation sets, OP-CV tends to overfit. We could observe
this effect in some of our experiments and also on the University dataset. However, on
the Tree dataset it is actually the method with the larger validation set that overfits.
In this section, we use MTC to select the number of factors in Boolean matrix factor-
ization for role mining [14]. A real-world access-control matrix X with 3000 users and
500 permissions defines the data set for role mining applications. We factorize this user-
permission matrix into a user-role assignment matrix Z and a role-permission assign-
ment matrix U by maximizing the likelihood derived in [22]. Five-fold cross-validation
is performed with 2400 users in the training set and 600 users in the test set. As in the
last section, the mapping function uses the nearest-neighbor rule with Hamming metric.
Here, the MTC score in Eq. (3) measures the number of bits in $x^{(2)}_{i'}$ that do not
match the corresponding prediction of the factorization transferred from the training set.
j and a clustering solution s, the set of edges between two clusters u and v is defined as
Eu,v = {(i, j) ∈ E : s(i) = u ∧ s(j) = v}, where s(i) is the cluster index of object
i. Eu,v with v ≠ u are inter-cluster edges and Eu,u are intra-cluster edges. Given the noise
parameter p and the complexity parameter q, the correlation graph is generated in the
following way:
Fig. 5. Transfer costs and instability for various noise levels p; the rightmost panel shows p = 0.95. The complexity q is kept fixed at 0.30.
1. Construct a perfect graph, i.e. assign the weight +1 to all intra-cluster edges and
−1 to all inter-cluster edges.
2. Change the weight of each inter-cluster edge in Eu,v, v ≠ u, to +1 with probability
q, increasing the structure complexity.
3. With probability p, replace the weight of each edge (in Eu,v, v ≠ u, and in Eu,u) by a
random weight.
Let N and k be the number of objects and the number of clusters, respectively. The cost
function counts the number of disagreements, i.e. the number of negative intra-cluster
edges plus the number of positive inter-cluster edges:
\[
R(s, X, k) = -\frac{1}{2} \sum_{1 \le u \le k} \sum_{(i,j) \in E_{u,u}} (X_{ij} - 1) + \frac{1}{2} \sum_{1 \le u \le k} \sum_{1 \le v < u} \sum_{(i,j) \in E_{u,v}} (X_{ij} + 1). \qquad (8)
\]
To transfer the clustering solution s(1) to the second dataset X(2), we use the Hamming
distances between objects i′ from O(2) and the clusters inferred from X(1). The cluster
index of object i′ is determined by:
\[
H(i', s_v^{(1)}) = -\frac{1}{2} \sum_{j \in s_v} (X_{i'j} - 1) + \frac{1}{2} \sum_{1 \le u \le k,\, u \ne v} \sum_{j \in s_u} (X_{i'j} + 1), \qquad (10)
\]
\[
R_T(s^{(1)}, X^{(2)}, k) = \frac{1}{N_2} \sum_{i'=1}^{N_2} \sum_{i=1}^{N_1} d\big(\mu^{(1)}_{c(i)}, x^{(2)}_{i'}\big)\, \mathbb{I}\{\psi(i', X^{(1)}, X^{(2)}) = i\}
\approx \frac{1}{N_2} \sum_{i'} \sum_{t} d\big(\mu^{(1)}_{t}, x^{(2)}_{i'}\big)\, \mathbb{I}\{\psi_s(i', s^{(1)}, X^{(2)}) = t\}, \qquad (11)
\]
Fig. 6. Costs and transfer costs (computed with mappings: nearest-neighbor, generative, soft) for
k-means clustering of three Gaussians. Solid lines indicate the median and dashed lines are the
25th and 75th percentiles. The right panel shows the clustering result selected by soft mapping
MTC. Top: equidistant centers and equal variance. Middle: heterogeneous distances between
centers (hierarchical). Bottom: heterogeneous distances and variances.
The setup of the experiment is as follows: We sample 200 objects from three bi-
variate Gaussian distributions (see for instance Figure 6 top right). The task is to find
the appropriate number of clusters. By altering the variances and the pairwise distances
of the centers, we control the difficulty of this problem and especially tune it such
that selecting the number of clusters is hard. We investigate the selection of k by the
The Minimum Transfer Cost Principle for Model-Order Selection 435
nearest-neighbor mapping of the objects from the second dataset to the centroids μ(1)
as well as by the generative mapping where the two data subsets are aligned by con-
struction. We report the statistics over 20 random repetitions of generating the data.
Our findings for three different problem difficulties are illustrated in Figure 6. As
expected, the costs on the training dataset monotonically decrease with k. When the
mapping is given by the generation process of the data (generative mapping), MTC
provides the true number of clusters in all cases. However, recall that the generative
mapping requires knowledge of the true model parameters and leaks information about
the true number of clusters to the costs. Interestingly, MTC with a nearest-neighbor
mapping follows almost exactly the same trend as the original costs on the first dataset
and therefore proposes selecting the highest model-order that we offer to MTC. The
higher the number of clusters is, the closer the centroids of the nearest neighbors are to
each object. This reduces the transfer costs for high k. The only difference between the
original costs and the transfer costs stems from the average distance between nearest neigh-
bors (the data granularity). Only when the pairwise centroid distances become smaller
than this distance do the transfer costs increase again. Ultimately, the favored solution is
a vector quantization at the level of the data granularity. This is the natural behavior of
k-means, as its cost function does not model variances. As we have seen in the first experiments
with Gaussian mixture models, fitting Gaussian data with MTC imposes no particular
difficulties when the appropriate model (here GMM) is used. The k-means behavior is
due to a model mismatch.
Probabilistic Mapping: A variant of MTC can be used to still make k-means applica-
ble to estimating the true model order of Gaussian data. In the following, we extend the
notion of a strict mapping to a probabilistic mapping between objects. Let $p_{i'i} :=
p(\psi(i', X^{(1)}, X^{(2)}) = i)$ be the probability that ψ maps object i′ from the second
dataset to object i of the first dataset. We define $p_{i'i}$ as
\[
p_{i'i} := Z^{-1} \exp\!\big({-\beta\, d(x^{(1)}_i, x^{(2)}_{i'})}\big), \qquad Z = \sum_i \exp\!\big({-\beta\, d(x^{(1)}_i, x^{(2)}_{i'})}\big). \qquad (12)
\]
We fix the inverse temperature by the costs of the data with respect to a single cluster:
β = 0.75 ∗ R(s(1) , X(1) , 1)−1 . This choice defines the dynamic range of the model-
order selection problem. When fixing β roughly at the costs of one cluster, the resolution
of individual pairwise distances resembles the visual situation where one looks at the
entire data cloud as a whole.
Results of probabilistic mapping MTC: The probabilistic mapping finds the true num-
ber of clusters when the variances of the Gaussians are roughly the same, even for a
substantial overlap of the Gaussians (Figure 6, top row). Please note that although the
differences of the transfer costs are within the plotted percentiles, the rank-order of the
number of clusters in each single experiment is preserved over the 20 repetitions, i.e.
the variance mainly results from the data and not from the selection of k.
When the problem scale varies on a local level, fixing the temperature at the k = 1
solution does not resolve the dynamic range of the costs. We illustrate this by two hard
problems: The middle problem in Figure 6 has a hierarchical structure, i.e. the pair-
wise distances between centers vary a lot. In the bottom problem in Figure 6, both the
distances and the individual variances of the Gaussians vary. In both cases the number
of clusters is estimated too low. When inspecting the middle plot, this choice seems
reasonable, whereas in the bottom plot clearly three clusters would be desirable. The
introduction of a computational temperature simulates the role of the variances in Gaus-
sian mixture models. However, as the temperature is the same for all clusters, it fails to
mimic situations where the variances of the Gaussians substantially differ. A Gaussian
mixture model would be more appropriate than modeling Gaussian data with k-means.
8 Related Work
In this section we point to related work on model selection for unsupervised learning.
Models that assume an explicit parametric form are often controlled by a model com-
plexity penalty (a regularizer). Akaike information criterion (AIC) [2] and Bayesian
information criterion (BIC) [21] both trade off the goodness of fit measured in terms of
a likelihood function against the number of model parameters used. In [18], the model
evidence for probabilistic PCA is maximized with respect to the number of components.
Introducing approximations, this score equals BIC. In [12] the number of principal com-
ponents is selected by integrating over the sensitivity of the likelihood to the model pa-
rameters. Minimum description length (MDL) [20] selects the lowest model order that
can explain the data. It essentially minimizes the negative log posterior of the model
and is thus formally identical to BIC [13]. It is unclear how to generalize model-based
criteria like [2,21,18,12] to non-probabilistic methods such as, for instance, correlation
clustering, which is specified by a cost function instead of a likelihood.
For selecting the rank of truncated SVD, probably the most related approach is the
cross-validation method proposed in [19]. It is a generalization of the method in [11]
and was also applied to NMF. We explain it and compare with it in Section 4.2. A
method with single hold-out entries (i, j) is proposed in [7]. It trains an SVD on the
input matrix without row i and another one without column j. Then it combines U
from one SVD and V from the other and averages their singular values to obtain an
SVD which is independent of (i, j). The method in [7] has been reviewed in [19].
In [17], the authors abandon cross-validation for Boolean matrix factorization. They
found that i) the method in [19] is not applicable and ii) using the rows of the second
matrix of the factorization (here U in Section 5) to explain the hold-out data leads to
overfitting. From our experience, cross-validation fails when only the second matrix
is fixed and the first matrix is adapted to the new data. With a predefined mapping to
transfer both matrices to the new data without adapting them, cross-validation works
for Boolean matrix factorization as demonstrated in Section 5.
For the specific problem of selecting the number of clusters, the gap statistic has been
proposed in [23]. Stability analysis has also shown promising results [6,15]. Stability,
however, does not account for the informativeness of solutions. An information theoretic model
validation principle has been proposed in [4] to determine the tradeoff between stability
and informativeness based on an information theoretic criterion called approximation
capacity. So far, this principle has been applied to clustering [5] and SVD [10].
9 Conclusion
We defined the minimum transfer cost principle (MTC) and proposed several variants
of how to apply it. Our method extends the cross-validation principle to unsupervised
learning problems as it solves the problem of transferring a learned model from one
dataset to another one when no labels are given. We demonstrated how to apply the
principle to different problems such as maximum likelihood inference, k-means clustering,
correlation clustering, Gaussian mixture models, and rank-limited SVD, highlighting its
broad applicability. For each problem, we explained the appropriate mapping function
between datasets and we demonstrated how the principle can be employed with respect
to the specifications of the particular tasks. In all cases, MTC makes a sensible choice of
the model order. It finds the optimal rank for image denoising with SVD and for error
correction in access-control configurations. Future work will cover the application of
our principle to other models as well as to other tasks such as feature selection.
References
1. Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: Ranking and
clustering. Journal of the ACM 55, 23:1–23:27 (2008)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control 19(6), 716–723 (1974)
3. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning 56(1-3), 89–113
(2002)
4. Buhmann, J.M.: Information theoretic model validation for clustering. In: ISIT 2010 (2010)
5. Buhmann, J.M., Chehreghani, M.H., Frank, M., Streich, A.P.: Information theoretic model
selection for pattern analysis. In: JMLR: Workshop and Conference Proceedings, vol. 7, pp.
1–8 (2011)
6. Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number
of clusters in a dataset. Genome biology 3(7) (2002)
7. Eastment, H.T., Krzanowski, W.J.: Cross-validatory choice of the number of components
from a principal component analysis. Technometrics 24(1), 73–77 (1982)
8. Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned
dictionaries. IEEE Transactions on Image Processing 15(12), 3736–3745 (2006)
9. Frank, M., Buhmann, J.M., Basin, D.: On the definition of role mining. In: SACMAT, pp.
35–44 (2010)
10. Frank, M., Buhmann, J.M.: Selecting the rank of truncated SVD by Maximum Approxima-
tion Capacity. In: IEEE International Symposium on Information Theory, ISIT (2011)
11. Gabriel, K.R.: Le biplot – outil d'exploration de données multidimensionnelles. Journal de la
Société Française de Statistique 143, 5–55 (2002)
12. Hansen, L.K., Larsen, J.: Unsupervised learning and generalization. In: IEEE Intl. Conf. on
Neural Networks, pp. 25–30 (1996)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, New York (2001)
14. Kuhlmann, M., Shohat, D., Schimpf, G.: Role mining – revealing business roles for security
administration using data mining technology. In: SACMAT 2003, p. 179 (2003)
15. Lange, T., Roth, V., Braun, M.L., Buhmann, J.M.: Stability-based validation of clustering
solutions. Neural Computation 16(6), 1299–1323 (2004)
16. Molloy, I., et al.: Mining roles with noisy data. In: SACMAT 2010, pp. 45–54 (2010)
17. Miettinen, P., Vreeken, J.: Model Order Selection for Boolean Matrix Factorization. In:
SIGKDD International Conference on Knowledge Discovery and Data Mining (2011)
18. Minka, T.P.: Automatic choice of dimensionality for PCA. In: NIPS, p. 514 (2000)
19. Owen, A.B., Perry, P.O.: Bi-cross-validation of the SVD and the nonnegative matrix factor-
ization. Annals of Applied Statistics 3(2), 564–594 (2009)
20. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
21. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6(2), 461–464 (1978)
22. Streich, A.P., Frank, M., Basin, D., Buhmann, J.M.: Multi-assignment clustering for Boolean
data. In: ICML 2009, pp. 969–976 (2009)
23. Tibshirani, R., Walther, G., Hastie, T.: Estimating the Number of Clusters in a Dataset via
the Gap Statistic. Journal of the Royal Statistical Society, Series B 63, 411–423 (2000)
A Geometric Approach to Find Nondominated Policies
to Imprecise Reward MDPs
1 Introduction
Markov Decision Processes (MDPs) can be seen as a core to sequential decision prob-
lems with nondeterminism [2]. In an MDP, transitions among states are Markovian and
evaluation is done through a reward function. Many decision problems can
be modelled by an MDP with imprecise knowledge. This imprecision can be stated as
partial observability regarding states [7], intervals of probability transitions [12] or a set
of potential reward functions [13].
Scenarios where reward functions are imprecise are quite common in preference
elicitation processes [4,6]. Preference elicitation algorithms guide a process of sequential
queries to a user so as to elicit his/her preference based on his/her answers. Even if the
process is guided so as to improve the knowledge about the user's preference, after a finite
sequence of queries an imprecise representation must be used [3,9]. The user's preference
may be modelled, for example, as a reward function [10].
This work was conducted under project LogProb (FAPESP proc. 2008/03995-5). Valdinei
F. Silva thanks FAPESP (proc. 09/14650-1) and Anna H. R. Costa thanks CNPq (proc.
305512/2008-0).
In this paper we tackle the problem of Imprecise Reward MDPs (IRMDPs). If a deci-
sion must be taken in an IRMDP, the optimal action must be properly defined. The minimax
regret approach evaluates decisions relative to the worst case, providing a balanced de-
cision. When evaluating a decision, an adversary is considered that chooses the
reward function that minimises the value of the decision. The regret step compares the
chosen decision with the best adversarial option for each feasible reward func-
tion, and the decision with the least regret in the worst case is selected.
Since optimisation must range over both reward functions and adversarial policies, the
problem cannot be solved by a single linear program. Efficient solutions consider an iterative
process in which decisions are chosen using linear programming and adversarial choices are
made through mixed integer programming [13,10] or linear programs based on non-
dominated policies [11]. In the latter, the πWitness algorithm was used to generate non-
dominated policies. Nondominated policies are optimal policies for some instantiation
of the imprecise reward function.
Even if the set of nondominated policies is much smaller than the set of
deterministic policies, its cardinality is still a burden to deal
with. Although πWitness is able to calculate the set of nondominated policies, nothing
is said about efficiently choosing a small subset of nondominated policies. Using a small
subset helps to choose the best minimax-regret decision properly and quickly.
We propose the πHull algorithm to calculate an efficient small subset of nondom-
inated policies. In order to compare it with πWitness, we also modify πWitness to
generate a small subset of nondominated policies.
The paper is organised as follows. Section 2 introduces theory and notation used and
section 3 describes our modified version of πWitness. Section 4 presents our main con-
tribution, the πHull algorithm. Finally, experiments are given in section 5 and section 6
presents our conclusions.
2 Theoretic Background
In this section we summarise some significant theoretical background regarding MDPs
and Imprecise Reward MDPs.
– the system dynamics is such that s0 ∈ S is drawn from distribution β(s) and if the
process is in the state i at time t and action a is chosen, then: the next state j is cho-
sen according to the transition probabilities Pa (i, j) and a payoff with expectation
r(i, a) is incurred.
A solution for an MDP consists in a policy π : S × A → [0, 1], i.e., at any time t,
π(s, a) indicates the probability of executing action at = a after observing state st = s.
A policy π is evaluated according to its expected accumulated discounted reward, i.e.,
\[
V(\pi) = \mathbb{E}_{s_0 \sim \beta,\, a_t \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big].
\]
A policy π induces a discounted occupancy frequency f π (s, a) for each pair (s, a), or
in vector notation f π , i.e., the accumulated expected occurrences of each pair discounted
in time. Let F be the set of valid occupancy frequencies for a given MDP, then for any
f ∈ F it is valid:
\[
[\mathbb{1} - \gamma P]^\top f = \beta,
\]
where P is a |S||A| × |S| matrix indicating Pa(s, s′), 𝟙 is a |S||A| × |S| matrix with ones at
self-transitions, i.e., 𝟙((s, a), s) = 1 for all a ∈ A, and β is a |S| vector indicating
β(s).
Consider the reward function with a vector notation, i.e., r. In this case, the value of
a policy π is given by V (π) = f π · r. An optimal occupancy frequency f ∗ can be found
by solving:
\[
\max_{f} \; f \cdot r \qquad \text{subject to:} \quad [\mathbb{1} - \gamma P]^\top f - \beta = 0, \quad f \ge 0. \qquad (1)
\]
Given an optimal occupancy frequency f ∗ the optimal policy can be promptly defined.
For any s ∈ S and a ∈ A an optimal policy π ∗ is defined by:
\[
\pi^*(s, a) =
\begin{cases}
\dfrac{f(s, a)}{\sum_{a' \in A} f(s, a')}, & \text{if } \sum_{a' \in A} f(s, a') > 0 \\[2ex]
\dfrac{1}{|A|}, & \text{if } \sum_{a' \in A} f(s, a') = 0
\end{cases} \qquad (2)
\]
An imprecise reward MDP (IRMDP) is an MDP in which the reward func-
tion is not precisely defined [13,10,11]. This can occur due to a preference elicita-
tion process, lack of knowledge regarding the evaluation of policies, or reward functions
comprising the preferences of a group of people. An IRMDP is defined by a tuple
⟨S, A, Pa(·), γ, β, R⟩, where R is a set of feasible reward functions. We consider that
a reward function is imprecisely determined by nR strict linear constraints. Given an
nR × |S||A| matrix A and an nR × 1 vector b, the set of feasible reward functions is
defined by R = {r | Ar ≤ b}.
In an IRMDP, it is also necessary to define how to evaluate decisions. The minimax
regret evaluation makes a trade-off between the best and the worst cases. Consider a
feasible occupancy frequency f ∈ F. One can calculate the regret Regret(f , r) of
taking the occupancy frequency f relative to a reward function r as the difference
between the value of f and that of the optimal occupancy frequency under r, i.e.,
\[
\mathrm{Regret}(f, r) = \max_{g \in F} \; g \cdot r - f \cdot r.
\]
Since any reward function r can be chosen from R, the maximum regret MR(f, R)
evaluates the occupancy frequency f, i.e.,
\[
\mathrm{MR}(f, R) = \max_{r \in R} \; \mathrm{Regret}(f, r).
\]
Then, the best policy should minimise the maximum regret criterion:
\[
\mathrm{MMR}(R) = \min_{f \in F} \; \mathrm{MR}(f, R).
\]
In order to calculate MMR(R) efficiently, some works use nondominated policies, i.e.,
policies that are optimal for some feasible reward functions [11]. Formally, a policy f
is nondominated with respect to R iff
\[
\exists\, r \in R \text{ such that } f \cdot r \ge f' \cdot r, \quad \forall f' \ne f \in F. \qquad (3)
\]
Given any occupancy frequency f ∈ F, define its corresponding policy πf (see Equa-
tion 2). Let f[s] be the occupancy frequency obtained by executing policy πf with deter-
ministic initial state s0 = s, i.e., the occupancy frequency of policy πf with the initial
state distribution β′(s) = 1.
Let f s:a be the occupancy frequency in the case that if s is the initial state then action
a is executed and policy πf is followed thereafter, i.e.,
\[
f^{s:a} = \beta(s)\, e^{s:a} + \gamma \sum_{s' \in S} f[s']\, P_a(s, s') + \sum_{s' \ne s} f[s']\, \beta(s'),
\]
where e^{s:a} is an |S||A| vector with 1 in position (s, a) and zeros elsewhere¹.
The occupancy frequency f s:a can be used to find nondominated policies in two
steps: (i) choose arbitrarily rinit and find an optimal occupancy frequency frinit with
respect to rinit , keeping each optimal occupancy frequency in a set Γ ; (ii) for each
occupancy frequency f ∈ Γ , (iia) find s ∈ S, a ∈ A and r ∈ R such that:
(iib) calculate the respective optimal occupancy frequency fr , and add it into Γ . The
algorithm stops when no reward function can be found such that equation 4 is true. The
reward function in equation 4 is a witness that at least one more nondominated
policy remains to be found.
Although the set of nondominated policies is much smaller than the set of all deter-
ministic policies, it can still be very large, making the calculation of the minimax
regret costly. It is desirable to find a small set of policies that efficiently approximates the
set of nondominated policies. By efficient we mean that an occupancy frequency fΓ
chosen within a small subset Γ is nearly as good as the exact minimax regret decision, i.e.,
¹ We changed the original formula [11]:
\[
f^{s:a} = \beta(s)\, e^{s:a} + \gamma \sum_{s' \in S} f[s']\, P_a(s, s') + (1 - \beta(s))\, f.
\]
In this case the occupancy frequency f s:a has the meaning of executing action a when starting
in state s with probability β(s) + πf(s, a)(1 − β(s)).
MR(fΓ, R) − MMR(R) ≈ 0. Consider a witness rw and its respective optimal occupancy
frequency f w. The difference Δ(f w, Γ)
can be used to define the gain when adding f w to the set Γ. If a small subset of nondom-
inated policies is desired, Δ(f w, Γ) may indicate a priority on which policies are added
to Γ. Instead of adding to Γ every occupancy frequency f w related to nondominated
policies, it is necessary to choose carefully among witnesses f w, and to add only the
witness that maximizes Δ(f w, Γ).
It is worth noticing that each iteration of πWitness makes at least |S||A| calls to
findWitnessReward(·), and if one succeeds, findBest(·) is also called. The num-
ber of policies in the agenda can also increase fast, increasing the burden of calls to
findWitnessReward(·). In the next section we consider the hypothesis that the re-
ward function is defined with a small set of features, and we take this into account
to define a new algorithm with better run-time performance.
The problem of finding nondominated policies is similar to the problem of finding the
convex hull of a set of points; here the points are occupancy frequency vectors.
We consider reward functions defined in terms of features, thus we can work in a space
of reduced dimensions. Our algorithm is similar to the Quickhull algorithm [1], but the
set of points is not known a priori.
Even with no information about the reward functions, but considering that they are de-
scribed by k features, we can analyse the corresponding IRMDP in the feature vector
space. The advantage of such an analysis is that a conventional metric space can be con-
sidered. This is possible because the expected feature vector of a policy
accumulates all the necessary knowledge about transitions in an MDP.
In this section we show through two theorems that if we take the set of all feasible
expected feature vectors M = {μπ | π ∈ Π} and define its convex hull M = co(M)²,
then the vertices V of the polytope M represent the expected feature vectors of spe-
cial deterministic policies. Such special policies are the nondominated policies under
imprecise reward functions where the set R is free of constraints.
Theorem 1. Let Π be the set of stochastic policies and let M = {μπ |π ∈ Π} be the
set of all expected feature vectors defined by Π. The convex hull of M determines a
polytope M = co(M ), where co(·) stands for the convex hull operator. Let V be the set
of vertices of the polytope M, then for any vertex μ ∈ V there exists a weight vector
wµ such that:
\[
w_\mu \cdot \mu > w_\mu \cdot \mu' \quad \text{for any } \mu' \ne \mu \in M.
\]
\[
w_\mu = \begin{cases} w, & \text{if } w \cdot \mu' - w \cdot \mu < 0 \\ -w, & \text{if } w \cdot \mu' - w \cdot \mu > 0 \end{cases}.
\]
Theorem 2. Let Π, M , M and V be defined as in the previous theorem. Let Γ be the set
of nondominated policies of an IRMDP where R = {r(s, a)|w ∈ [−1, 1]k and r(s, a) =
w · φ(s, a)}. Let MΓ = {μπ |π ∈ Γ }, then V = MΓ .
² The operator co(·) stands for the convex hull operator.
Proof. Consider that there exists a policy π such that wH,M · μπ > wH,M · μH. Be-
cause of the definition of wH,M, μπ is beyond³ the hyperplane H in the direction wH,M.
³ A vector μ is beyond a hyperplane H with respect to the direction w if, for any vector x ∈ H,
it is true that ⟨w, μ⟩ > ⟨w, x⟩.
Fig. 1. Constructing the set of feasible expected feature vectors in two dimensions. a) The initial
polytope (polygon) M has vertices that minimise and maximise each feature separately. b) Ex-
ample of a hyperplane (edge) of the polytope M which is not in the polytope M. c) Example of
a hyperplane (edge) of the polytope M which is in the polytope M.
feature vector $\mu^{\pi^*_{w_H}}$.
– If $\mu^{\pi^*_{w_H}}$ is beyond H′, then $\mu^{\pi^*_{w_H}}$ can be added to V (Figure 1b).
– Otherwise the hyperplane H′ constrains the set M (Figure 1c).
– If all the hyperplanes that constrain the current polytope also constrain the set M,
then the current polytope equals M.
– The end of the process is guaranteed since the cardinality of the set of nondomi-
nated policies is finite.
In the first step, instead of solving an MDP and finding a feasible expected feature
vector, which requires optimisation within |S| constraints, we work with potential fea-
ture vectors, i.e., vectors in the feature space under relaxed constraints. By doing so, we can
approximate such a solution by a linear optimisation within k constraints in the feature
space. First, in the feature space, the lower and upper bounds on each axis can be found
beforehand. We solve MDPs for weight vectors in every possible direction, obtaining
respectively upper and lower scalar bounds $\mu_i^{top}$ and $\mu_i^{bottom}$ for every feature i. Sec-
ond, when looking for vectors beyond a facet, only constraints applied to such a facet
should be considered. Then, given a facet constructed from expected feature vectors
μ1, . . . , μk with corresponding weight vectors w1, . . . , wk and the orthogonal vector
wort, the farthest feature vector μfar is obtained from:
\[
\begin{aligned}
\max_{\mu^{far}} \;\; & w^{ort} \cdot \mu^{far} \\
\text{subject to:} \;\; & w_i \cdot \mu^{far} < w_i \cdot \mu_i, \quad i = 1, \dots, k \\
& \mu_i^{bottom} \le \mu_i^{far} \le \mu_i^{top}, \quad i = 1, \dots, k
\end{aligned} \qquad (6)
\]
The second step verifies whether there exists a witness that maximises μfar compared to
μ1, . . . , μk. Note that μfar may not be a feasible expected feature vector (whereas f s:a is ac-
tually feasible in πWitnessBound). But μfar indicates an upper limit on the
distance of the farthest feasible feature vector.
The third step solves an MDP and finds a new expected feature vector to be added to
V. This step is the most expensive, and we will exploit the second step to avoid running
into it unnecessarily. If a limited number of nondominated policies is required, not all
the policies will be added to V. We can save run-time if we conduct the third step only
when necessary, adopting the second step as an advice.
Note that wort,− must be orthogonal to any ridge formed by feature vectors μi ≠ μj ∈ V,
but at the same time it tries to be opposite to the weight vectors already maximised. A version
wort,+ in the average direction of W is also obtained.
We use two directions when looking for the farthest feature vectors because it is
not possible to know which direction a ridge faces. By using equation 6 we look for
the farthest feature vectors μfar,+ and μfar,−. Witnesses w+ and w− and optimal
expected feature vectors μ+ and μ− are found for both of them. Then, the farther
feature vector of the two is added to V. Here, the distance is measured in the directions
w+ and w− relative to the set V.
This process goes on until |M| = k + 1, when it is possible to construct a polytope.
Table 2 presents the initHull algorithm.
In each iteration the πHull algorithm processes randomly Nnull facets from the set
Hnull, then it processes in the given order Nwit facets from the set Hwit. Finally, it chooses
from Hbest the best facet, i.e., the one with the farthest corresponding expected feature
vector, and adds it to V. Table 3 summarises the πHull algorithm.
The distance to the farthest vector of a facet will become smaller and smaller as new facets
are added to V. Then, in the first iterations, when the number of facets is small, the farthest
vector of all facets will be calculated and kept in Hbest. As the number of iterations grows
and the farthest vector cannot be promptly calculated for all facets, the facets saved in Hbest
can be good candidates to be added to V, if not the best ones.
Another interesting point in favour of the πHull algorithm is its relationship with
MDP solvers. While πWitnessBound relies on small changes to known nondominated
policies, the πHull algorithm relies on the expected feature vectors of known nondomi-
nated policies. For instance, if a continuous-state and continuous-action IRMDP is used,
how would one iterate over states and actions? What would the occupancy frequency be
in this case?
The πHull algorithm allows any MDP solver to be used. For instance, if the MDP
solver finds approximate optimal policies or expected feature vectors, the πHull algo-
rithm would not be affected in the first iterations, where the distance to the farthest feature
vectors is large.
5 Experiments
We performed experiments on synthetic IRMDPs. Apart from the number of features
and the number of constraints, all IRMDPs are randomly drawn from the same distri-
bution. We have |S| = 50, |A| = 5, γ = 0.9 and β is a uniform distribution. Every
state can transit to only 3 other states, drawn randomly, and the transition function is
also drawn randomly. φ(·) is defined in such a way that for any feature i, we have
$\sum_{s,a} \phi_i(s, a) = 10$.
We constructed two groups of 50 IRMDPs. A group with R defined on 5 features
and 3 linear constraints, and another group with R defined on 10 features and 5 linear
constraints.
The first experiment compares πWitnessBound, πHull without time limits (Nnull =
∞ and Nwit = ∞), and πHull with time limits (Nnull = 50 and Nwit = 10). This
experiment was run on the first group of IRMDPs, where k = 5. Figure 2 shows the
results comparing the run-time spent in each iteration and the error of the
recommended decision with respect to its maximum regret, i.e., in each iteration f∗ is chosen
to be the occupancy frequency that minimises the maximum regret with respect to the
current set of nondominated policies, and effectiveness is measured by the error =
MR(f∗, R) − MMR(R)⁴.
The second experiment compares πWitnessBound and πHull with time limits
(Nnull = 100 and Nwit = 20). This experiment was run on the second group of
IRMDPs where k = 10. Figure 3 shows the results.
In both groups the effectiveness of the subset of nondominated policies was sim-
ilar for the πWitnessBound algorithm and the πHull algorithm. In both cases, a few
policies are enough to make decisions under the minimax regret criterion. The time spent
per iteration in πHull with time limits is at least four times smaller than in πWitnessBound.
⁴ We estimate MMR(R) considering the union of the policies found at the end of the experiment
by all of the algorithms: πWitness and πHull.
Fig. 2. Error in MMR and time (s) per iteration versus the number of nondominated policies (first group of IRMDPs, k = 5).
Fig. 3. Error in MMR and time (s) per iteration versus the number of nondominated policies (second group of IRMDPs, k = 10).
In the case that πHull is used without a time limit, the time spent per
iteration increases with the iterations, but in the early iterations it is still smaller than that
of πWitnessBound.
Our experiments were made with small MDPs (50 states and 5 actions). The run-
time of each algorithm depends on the technique used to solve an MDP. However, as
πWitnessBound does not take advantage of a reward description based on features,
it also depends on how large the sets of states and actions are. On the other hand, πHull
depends only on the number of features used. In conclusion, the higher the cardinality of
the sets of states and actions, the greater the advantage of πHull over πWitnessBound.
Although πWitnessBound does not clearly depend on the number of features, we
can see in the second experiment a run-time four times that of the first experiment. When
the number of features grows, the number of witnesses also increases, which requires a
larger number of MDPs to be solved.
6 Conclusion
We presented two algorithms: πWitnessBound and πHull. The first is a slight modifi-
cation of πWitness, while the second is a completely new algorithm. Both are effective
in defining a small subset of nondominated policies to be used for calculating
the minimax regret criterion. πHull shows a better run-time performance in our experi-
ments, mainly due to the large difference between the number of features (k = 5 and k = 10)
and the number of states (|S| = 50), since πWitnessBound depends on the latter.
Although πHull shows a better run-time performance and similar effectiveness, we
have not presented a formal proof that πHull always performs better. Fu-
ture work should seek three formal results related to the πHull algorithm. First, to
prove that our analysis of a facet reaches the farthest feature vector when the constraints
are considered. Second, to establish a formal relation between the number of nondomi-
nated policies and the error in calculating MMR(R). Third, to determine the speed with which
πHull reaches a good approximation of the set V, which is very important given the
exponential growth in the number of facets. We must also examine how the parameters
Nnull and Nwit affect this calculation.
Besides the effectiveness and better run-time performance of πHull compared with
πWitnessBound, there are also qualitative characteristics. Clearly, the πHull algorithm
cannot be used if a feature description is not at hand. However, a reward function is
rarely defined directly on the state-action space. πHull would present problems if the
number of features is too large.
The greatest advantage of πHull concerns the MDP solver to be used. In real
problems an MDP solver would take advantage of the problem's structure, as in fac-
tored MDPs [14], or would approximate solutions in order to remain feasible [8].
πWitnessBound must be adapted somehow to work with such solvers.
It is worth noticing that nondominated policies can be a good indicator for a pref-
erence elicitation process: they give a hint about which policies should be compared. For in-
stance, the small set of nondominated policies can be used when an enumerative analysis
must be done.
References
1. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM
Trans. Math. Softw. 22, 469–483 (1996), http://doi.acm.org/10.1145/235815.235821
2. Bertsekas, D.P.: Dynamic Programming - Deterministic and Stochastic Models. Prentice-
Hall, Englewood Cliffs (1987)
3. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based optimization and
utility elicitation using the minimax decision criterion. Artificial Intelligence 170(8), 686–
713 (2006)
4. Braziunas, D., Boutilier, C.: Elicitation of factored utilities. AI Magazine 29(4), 79–92 (2008)
5. Buchta, C., Müller, J., Tichy, R.F.: Stochastical approximation of convex bodies. Mathema-
tische Annalen 271, 225–235 (1985), doi:10.1007/BF01455988
6. Chajewska, U., Koller, D., Parr, R.: Making rational decisions using adaptive utility elic-
itation. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence
and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 363–369.
AAAI Press / The MIT Press, Austin, Texas (2000)
7. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable
stochastic domains. Artificial Intelligence 101(1-2), 99–134 (1998)
8. Munos, R., Moore, A.: Variable resolution discretization in optimal control. Machine Learn-
ing 49(2/3), 291–323 (2002)
9. Patrascu, R., Boutilier, C., Das, R., Kephart, J.O., Tesauro, G., Walsh, W.E.: New approaches
to optimization and utility elicitation in autonomic computing. In: Proceedings, The Twen-
tieth National Conference on Artificial Intelligence and the Seventeenth Innovative Appli-
cations of Artificial Intelligence Conference, pp. 140–145. AAAI Press / The MIT Press,
Pittsburgh, Pennsylvania, USA (2005)
10. Regan, K., Boutilier, C.: Regret-based reward elicitation for markov decision processes. In:
UAI 2009: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelli-
gence, pp. 444–451. AUAI Press, Arlington (2009)
11. Regan, K., Boutilier, C.: Robust policy computation in reward-uncertain mdps using non-
dominated policies. In: Fox, M., Poole, D. (eds.) AAAI, AAAI Press, Menlo Park (2010)
12. White III, C.C., Eldeib, H.K.: Markov decision processes with imprecise transition probabil-
ities. Operations Research 42(4), 739–749 (1994)
13. Xu, H., Mannor, S.: Parametric regret in uncertain markov decision processes. In: 48th IEEE
Conference on Decision and Control, CDC 2009 (2009)
14. Guestrin, C., Koller, D., Parr, R., Venkataraman, S.: Efficient solution algorithms for factored
MDPs. Journal of Artificial Intelligence Research 19, 399–468 (2003)
Label Noise-Tolerant Hidden Markov Models
for Segmentation: Application to ECGs
1 Introduction
robust to the label noise is proposed. To illustrate the relevance of the proposed
model, artificial electrocardiogram (ECG) signals generated using ECGSYN [4]
and real ECG recordings from the Physiobank database [5] are used in the
experiments. The label noise issue is indeed known to affect the segmentation of
waveform boundaries by experts in ECG signals [6]. Nevertheless, the proposed
model also applies to any kind of sequential data facing the label noise issue, for
example biomedical signals such as EEGs, EMGs and many others.
This paper is organised as follows. Section 2 reviews related work. Section 3 in-
troduces hidden Markov models and two standard inference algorithms. Section
4 derives a new, label noise-tolerant algorithm. Section 5 quickly reviews elec-
trocardiogram signals and details the experimental settings. Finally, empirical
results are presented in Section 6 and conclusions are drawn in Section 7.
2 Related Work
Before presenting the state-of-the-art in classification with label noise, it is first
important to distinguish the label noise issue from the semi-supervised paradigm
where some data points in the training set are completely left unlabelled. Here,
we rather consider the framework where an unknown proportion of the observa-
tions are wrongly labelled. To our knowledge, existing approaches to this problem
are relatively few. These approaches can be divided into three categories: filtering
approaches, model-based approaches and plausibilistic approaches.
Filtering techniques act as a preprocessing of the training set to either remove
noisy observations or correct their labels. These methods involve the use of a
criterion to detect mislabelled observations. For example, [7] uses disagreement
in ensemble methods. Furthermore, [8] introduces an algorithm to iteratively
modify the examples whose class label disagrees with the class labels of most of
their neighbours. Finally, [9] uses information gain to detect noisy labels.
On the other hand, model-based approaches tackle the label noise by incor-
porating the mislabelling process as an integral part of the probabilistic model.
Pioneer work by [1] incorporated a probabilistic noise model in a kernel-Fisher
discriminant for binary classification. Later, [2] extended this model by relaxing
the Gaussian distribution assumption and carried out extensive experiments on
more complex datasets, which convincingly demonstrated the value of explicit
label noise modeling. More recently, the same model has been extended to multi-
class datasets [10]. Bouveyron and Girard also propose a distinct robust mixture dis-
criminant analysis [3], which consists of two steps: (i) learning an unsupervised
Gaussian mixture model and (ii) computing the probability that each cluster
belongs to a given class.
Finally, plausibilistic approaches assume that the experts have explicitly
provided uncertainties over labels. Specific algorithms are then developed to
integrate and to focus on such uncertainties [11].
This work concentrates on model-based approaches to embed the noise pro-
cess into classifiers. Model-based approaches have a sound theoretical foundation
and tackle the noise issue in a more principled and transparent manner without
data. Then, each observation distribution is fitted using the observations labelled
accordingly. This approach has the advantage of being simple to implement and
having a very low computational cost. However, if the labels are not perfect and
polluted by some label noise, the produced HMM may be significantly altered.
The Baum-Welch algorithm is another, unsupervised algorithm [12]. More
precisely, it assumes that the true labels are unknown, i.e. it ignores the ex-
pert annotations. The likelihood of the observations is maximised using an
expectation-maximisation (EM) scheme [13], since no closed-form maximum
likelihood estimator is available in this case. During the E step, the posteri-
ors P (St = i|O1 . . . OT ) and P (St−1 = i, St = j|O1 . . . OT ) are estimated for
each time step t and states i and j. Then, these posteriors are used during the
M step in order to estimate the prior vector q, the transition matrix a and the
observation distributions bi .
The main advantage of Baum-Welch is that wrong expert annotations should
have no impact on the inferred HMM. However, in practice, expert annotations
are used to compute an initial estimate of the HMM parameters, which is nec-
essary for the first E step. Moreover, ignoring expert annotations can also be a
disadvantage: if the expert uses a specific decomposition of the ECG dynamic,
such subtleties may be lost in the unsupervised learning process.
Two algorithms for HMM inference have been introduced in Section 3. However,
neither of them is satisfying when label noise is introduced. On the one hand,
supervised learning is bound to trust blindly the expert annotations. Therefore,
as shown in Section 6, label noise can degrade the segmentation quality. On
the other hand, the Baum-Welch algorithm fails to encode precisely the expert
knowledge. Indeed, as shown experimentally in Section 6, even predictions on
clean, easy-to-segment signals do not match accurately the expert annotations.
This section introduces a new algorithm for HMM inference which lies in-
between supervised learning and the Baum-Welch algorithm: expert annotations
are used, but the label noise is modelled during the inference process in order to
decrease the influence of wrong annotations.
\[
d_{ij} = \begin{cases} 1 - p_i & (i = j) \\[1ex] \dfrac{p_i}{|S| - 1} & (i \ne j) \end{cases} \qquad (1)
\]
where pi is the probability that the expert makes an error in state i and |S| is
the number of possible states. Hence dii = 1 − pi is the probability of correct
annotation in state i. Notice that dij is only used during inference. Here, Y is an
extra layer put on a standard HMM to model the label noise. For segmentation,
only the parameters linked to S and O are used, i.e. q, a, π, μ and Σ.
where the sum spans all possible sequences of true states. As a closed-form
solution does not exist, one can use the EM algorithm which is derived in the
rest of this section. Notice that only approximate solutions are obtained, for EM
algorithms are iterative procedures and may converge to local minima [13].
using the current estimate Θold (E step) and (ii) maximising Q(Θ, Θold ) with
respect to the parameters Θ in order to update their estimate (M step). Since
\[
P(O, Y, S \mid \Theta) = q_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} b_{s_t}(o_t) \prod_{t=1}^{T} d_{s_t y_t}, \qquad (4)
\]
where ot , yt , s1 , st−1 and st are the actual values taken by the random variables
Ot , Yt , S1 , St−1 and St , the expression of Q(Θ, Θold ) becomes
\[
Q(\Theta, \Theta^{old}) = \sum_{i=1}^{|S|} \gamma_1(i) \log q_i + \sum_{t=2}^{T} \sum_{i=1}^{|S|} \sum_{j=1}^{|S|} \xi_t(i, j) \log a_{ij} + \sum_{t=1}^{T} \sum_{i=1}^{|S|} \gamma_t(i) \log b_i(o_t) + \sum_{t=1}^{T} \sum_{i=1}^{|S|} \gamma_t(i) \log d_{i y_t} \qquad (5)
\]
E Step. The γ and ξ variables must be computed in order to evaluate (5), which
is necessary for the M step. In standard HMMs, these quantities are estimated
during the E step by the forward-backward algorithm [12, 14]. Indeed, if forward
variables α, backward variables β and scaling coefficients c are defined as
\[
\alpha_t(i) = P(S_t = i \mid O_{1...t}, Y_{1...t}, \Theta^{old}) \qquad (8)
\]
\[
\beta_t(i) = \frac{P(O_{t+1...T}, Y_{t+1...T} \mid S_t = i, \Theta^{old})}{P(O_{t+1...T}, Y_{t+1...T} \mid O_{1...t}, Y_{1...t}, \Theta^{old})} \qquad (9)
\]
\[
c_t = P(O_t, Y_t \mid O_{1...t-1}, Y_{1...t-1}, \Theta^{old}), \qquad (10)
\]
one eventually obtains
\[
\gamma_t(i) = \alpha_t(i)\, \beta_t(i) \qquad (11)
\]
and
\[
\xi_t(i, j) = \alpha_{t-1}(i)\, c_t^{-1}\, a_{ij}\, b_j(o_t)\, d_{j y_t}\, \beta_t(j). \qquad (12)
\]
Here, the scaling coefficients ct are introduced in order to avoid numerical issues.
Indeed, for sufficiently large T (i.e. 10 or more), the dynamic range of both α
and β will exceed the precision range of any machine. The scaling factors ct
are therefore introduced to keep the values within reasonable bounds [12]. The
incomplete likelihood can be computed using $P(O, Y \mid \Theta^{old}) = \prod_{t=1}^{T} c_t$.
The forward-backward algorithm consists in using the recursive relationship
\[
\alpha_t(i)\, c_t = \begin{cases} q_i\, b_i(o_1)\, d_{i y_1} & (t = 1) \\[1ex] b_i(o_t)\, d_{i y_t} \sum_{j=1}^{|S|} a_{ji}\, \alpha_{t-1}(j) & (t > 1) \end{cases} \qquad (13)
\]
linking the α and c variables, and the recursive relationship
\[
\beta_t(i) = \begin{cases} 1 & (t = T) \\[1ex] \dfrac{1}{c_{t+1}} \sum_{j=1}^{|S|} a_{ij}\, b_j(o_{t+1})\, d_{j y_{t+1}}\, \beta_{t+1}(j) & (t < T) \end{cases} \qquad (14)
\]
linking the β and c variables. The scaling coefficients can be computed using the
constraint $\sum_{i=1}^{|S|} \alpha_t(i) = 1$ jointly with (13).
M Step. The values of γ and ξ computed during the E step can be used to
maximise Q(Θ, Θold). Using (5), one obtains
\[
q_i = \frac{\gamma_1(i)}{\sum_{i'=1}^{|S|} \gamma_1(i')} \qquad (15)
\]
and
\[
a_{ij} = \frac{\sum_{t=2}^{T} \xi_t(i, j)}{\sum_{t=2}^{T} \sum_{j'=1}^{|S|} \xi_t(i, j')} \qquad (16)
\]
for the state prior and transition probabilities. The GMM parameters become
\[
\pi_{il} = \frac{\sum_{t=1}^{T} \gamma_t(i, l)}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad (17)
\]
\[
\mu_{il} = \frac{\sum_{t=1}^{T} \gamma_t(i, l)\, o_t}{\sum_{t=1}^{T} \gamma_t(i, l)} \qquad (18)
\]
and
\[
\Sigma_{il} = \frac{\sum_{t=1}^{T} \gamma_t(i, l)\, (o_t - \mu_{il})^\top (o_t - \mu_{il})}{\sum_{t=1}^{T} \gamma_t(i, l)} \qquad (19)
\]
where
\[
\gamma_t(i, l) = \gamma_t(i)\, \frac{\pi_{il}\, b_{il}(o_t)}{b_i(o_t)}. \qquad (20)
\]
Eventually, the expert error probabilities are obtained using
\[
p_i = \frac{\sum_{t \,\mid\, y_t \ne i} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}. \qquad (21)
\]
5 ECG Segmentation
This section (i) quickly reviews ECG segmentation and the use of HMMs in this
context and (ii) details the methodology used for the experiments in Section 6.
For each algorithm, ECGs are split into training and test sets. The training
set is used to learn the HMM, whereas the test set allows testing the HMM on
independent data. For artificial ECGs, 10% of the signal is used for training,
whereas the remaining 90% is used for test. For real ECGs, 50% of the signal is
used for training, whereas the remaining 50% is used for test. This way, the sizes
of the training sets are roughly equal for both artificial and real ECGs.
6 Experimental Results
This section compares the label noise-tolerant algorithm proposed in Section 4 to
the two standard algorithms described in Section 3. The tests are carried out on
three classes of ECGs, which are altered by two types of label noise. See Section
5 for more details about ECGs and the methodology used in this section.
Table 1. Recalls on original artificial, sinus and arrhythmia ECGs for supervised
learning, Baum-Welch and the proposed algorithm
Table 2. Precisions on original artificial, sinus and arrhythmia ECGs for supervised
learning, Baum-Welch and the proposed algorithm
Fig. 5. Recalls and precisions on artificial ECGs with horizontal noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of the maximum boundary move-
ment (0% to 50% of the modified wave). See text for details.
Fig. 6. Recalls and precisions on sinus ECGs with horizontal noise for supervised learn-
ing (black plain line), Baum-Welch (grey plain line) and the proposed algorithm (black
dashed line), with respect to the percentage of the maximum boundary movement (0%
to 50% of the modified wave). See text for details.
Figures 8, 9 and 10 show the recalls and precisions obtained for artificial, sinus and
arrhythmia ECGs, respectively. The annotations are polluted by a uniform noise,
with a percentage of flipped labels varying from 0% to 20%. For each figure, the
first row shows the recall, whereas the second row shows the precision, both
obtained on test beats. Each ECG signal is noised and segmented 40 times in
Fig. 7. Recalls and precisions on arrhythmia ECGs with horizontal noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of the maximum boundary move-
ment (0% to 50% of the modified wave). See text for details.
Fig. 8. Recalls and precisions on artificial ECGs with uniform noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of flipped labels (0% to 20%). See
text for details.
order to evaluate the variability of the results. The curves in the first column
average the results of all runs for all ECGs, whereas the curves in the second and
third columns average the results of all runs for two selected ECGs. For the two
last plots of each row, the error bars show the 95% confidence interval around
the mean on the 40 runs. The error bars shown on the first plot of each line are
the average of the error bars obtained for each ECG.
Fig. 9. Recalls and precisions on sinus ECGs with uniform noise for supervised learning
(black plain line), Baum-Welch (grey plain line) and the proposed algorithm (black
dashed line), with respect to the percentage of flipped labels (0% to 20%). See text for
details.
Fig. 10. Recalls and precisions on arrhythmia ECGs with uniform noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of flipped labels (0% to 20%). See
text for details.
7 Conclusion
In this paper, a variant of the EM algorithm for label noise-tolerant HMM in-
ference is proposed. More precisely, each observed label is assumed to be a noisy
copy of the true, unknown state. The proposed EM algorithm relies on two steps
to automatically estimate the level of noise in the set of available labels. First,
during the E step, the posterior of the hidden state is estimated for each sam-
ple. Next, the M step computes the HMM parameters using the hidden true
states, and not the noisy labels themselves, which results in a model which is
less impacted by label noise.
Experiments are carried out on both healthy and pathological ECG signals artifi-
cially polluted by distinct types of label noise. Three types of inference algorithms
for HMMs are compared: supervised learning, the Baum-Welch algorithm and
the proposed noise-tolerant algorithm. The results show that the performances
of the three approaches are adversely impacted by the level of label noise. How-
ever, the proposed noise-tolerant algorithm can yield better performances than
the other two algorithms, which confirms the benefit of embedding the noise pro-
cess into the inference algorithm. This improvement is particularly pronounced
when the artificial label noise mimics errors made by medical experts, which
suggests that the proposed algorithm could be useful when expert annotations
are less reliable. The recall is improved for any label noise level, and the precision
is improved for large levels of noise.
References
1. Lawrence, N.D., Schölkopf, B.: Estimating a kernel Fisher discriminant in the pres-
ence of label noise. In: Proceedings of the Eighteenth International Conference on
Machine Learning, ICML 2001, pp. 306–313. Morgan Kaufmann Publishers Inc,
San Francisco (2001)
2. Li, Y., Wessels, L.F.A., de Ridder, D., Reinders, M.J.T.: Classification in the pres-
ence of class noise using a probabilistic kernel Fisher method. Pattern Recogni-
tion 40, 3349–3357 (2007)
3. Bouveyron, C., Girard, S.: Robust supervised classification with mixture mod-
els: Learning from data with uncertain labels. Pattern Recognition 42, 2649–2658
(2009)
4. McSharry, P.E., Clifford, G.D., Tarassenko, L., Smith, L.A.: Dynamical model for
generating synthetic electrocardiogram signals. IEEE Transactions on Biomedical
Engineering 50(3), 289–294 (2003)
5. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark,
R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, Phys-
ioToolkit, and PhysioNet: Components of a new research resource for complex
physiologic signals. Circulation 101(23), e215–e220 (2000)
6. Hughes, N.P., Tarassenko, L., Roberts, S.J.: Markov models for automated ECG
interval analysis. In: NIPS 2004: Proceedings of the 16th Conference on Advances
in Neural Information Processing Systems, pp. 611–618 (2004)
7. Brodley, C.E., Friedl, M.A.: Identifying Mislabeled Training Data. Journal of Ar-
tificial Intelligence Research 11, 131–167 (1999)
8. Barandela, R., Gasca, E.: Decontamination of training samples for supervised pat-
tern recognition methods. In: Proceedings of the Joint IAPR International Work-
shops on Advances in Pattern Recognition, pp. 621–630. Springer, London (2000)
9. Guyon, I., Matic, N., Vapnik, V.: Discovering informative patterns and data clean-
ing. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.)
Advances in Knowledge Discovery and Data Mining, pp. 181–203 (1996)
10. Bootkrajang, J., Kaban, A.: Multi-class classification in the presence of labelling
errors. In: Proceedings of the 19th European Conference on Artificial Neural Net-
works, pp. 345–350 (2011)
11. Côme, E., Oukhellou, L., Denoeux, T., Aknin, P.: Mixture model estimation with
soft labels. In: Proceedings of the 4th International Conference on Soft Methods
in Probability and Statistics, pp. 165–174 (2008)
12. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
13. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete
Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B
(Methodological) 39(1), 1–38 (1977)
14. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, Heidelberg (2006)
15. Clifford, G.D., Azuaje, F., McSharry, P.: Advanced Methods And Tools for ECG
Data Analysis. Artech House, Inc., Norwood (2006)
16. Hughes, N.P., Roberts, S.J., Tarassenko, L.: Semi-supervised learning of probabilis-
tic models for ECG segmentation. In: IEMBS 2004: Proceedings of the 26th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society,
vol. 1, pp. 434–437 (2004)
Building Sparse Support Vector Machines for
Multi-Instance Classification
Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang
1 Introduction
Multi-instance (MI) classification is a paradigm in supervised learning first in-
troduced by Dietterich [1]. In a MI classification problem, training examples are
presented in the form of bags associated with binary labels. Each bag contains
a collection of instances whose labels are not known a priori. The key as-
sumption for MI classification is that a positive bag contains at least one positive
instance and a negative bag contains only negative instances.
We focus on SVM-based methods for MI classification. Existing methods in-
clude the MI-kernel [2], MI-SVM [3], mi-SVM [3], a regularization approach to
MI [4], MILES [5], MissSVM [6], and KI-SVM [7]. Most existing approaches are
based on the modification of standard SVM models and kernels in the single
instance (SI) case to adapt to the MI setting. The SVM classifiers obtained from
these methods take an analytic form similar to that of standard kernel SVMs. The pre-
diction function can be expressed by a generalized linear model with predictor
variables given by the kernel values evaluated for the Support Vectors (SVs),
training instances with nonzero coefficients in the expansion. The speed of SVM
prediction is proportional to the number of SVs. The sparser the SVM model
(i.e. the smaller the number of SVs), the more efficient the prediction. The issue
is more serious for MI classification for two reasons. Firstly, the number of SVs
in the trained classifier largely depends on the number of training instances.
Even with a moderate-size MI problem, a large number of instances may be
encountered, consequently leading to a non-sparse SVM classifier with many
SVs. Secondly, in MI classification, the prediction function is usually defined at
instance-level either explicitly [3] or implicitly [2]. In order to predict the bag
label, one needs to apply the classifier to each instance in the bag. Removing a single SV from the prediction function therefore yields savings proportional to the size of the bag, making the removal of redundant SVs all the more important. For practical MI applications where fast
prediction speed is required, it is highly desirable to have sparse SVM prediction
models. Existing SVM solutions to MI classification are unlikely to produce a
sparse prediction model in this respect. MILES [5] is the only exception, which
produces a sparser solution than the other methods by using an L1 norm on the coefficients to encourage sparsity. However, it cannot explicitly control the sparsity of the learned classifier.
In this paper, we propose a principled approach to learning sparse SVM for MI
classification. The proposed approach can achieve controlled sparsity while main-
taining competitive predictive performance as compared to existing non-sparse
SVM models for MI classification. To the best of our knowledge, this is the
first explicit formulation for sparse SVM learning in the MI setting. In the con-
text of SI classification, sparse SVMs have attracted much attention in the re-
search community [8–12]. The major difficulty in extending sparse learning to the MI scenario is the greater complexity of MI classification problems compared to SI ones.
The MI assumption has to be explicitly taken into account in SVM modeling.
This eventually leads to complicated SVM formulations with extra non-convex
constraints and hence difficult optimization problems. Sparse MI learning adds yet more constraints, further complicating the modeling and leading to intractable optimization problems. Alternatively,
one can use some pre- [9] or post-processing [10] schemes to empirically learn
a sparse MI SVM. These schemes do not consider the special nature of an MI
classification problem and are likely to yield inferior performance for prediction.
The method most closely related to ours was proposed by Wu et al. [12] for the SI case. Nevertheless, their method cannot be directly applied to the MI
case since the base model they considered is the standard SVM classifier. It
is well-known that learning a standard SVM is equivalent to solving a convex
quadratic problem [10] with the guarantee of a unique global minimum. This is the
pre-requisite for the optimization strategy employed in [12]. In contrast, existing
SVM solutions to MI classification are either non-convex [3, 4, 6] or based on
complex convex approximations [7]. It is non-trivial to extend the sparse learning
framework [12] to MI classification based on the existing formulations. To this
end, we adopt a simple “label-mean” formulation which takes the average value
of instance label predictions for bag-level prediction. This is contrary to most existing formulations, which make bag-level predictions by taking the maximum over instance-level predictions.
The first and second terms in the above objective function correspond to the
regularization and data terms respectively, with parameter C controlling
the trade-off between them. $\ell(y_i, F(B_i))$ is the bag-level loss function penalizing the discrepancy between bag label $y_i$ and prediction $F(B_i)$ as given by Equation 3. We stick to the 2-norm SVM throughout the paper, which uses the following squared Hinge loss function

$$\ell(y_i, F(B_i)) = \max\big(0,\ 1 - y_i F(B_i)\big)^2 \qquad (6)$$
The purpose of the formulation in Equation 5 is to learn an instance-level pre-
diction function f which gives good performance on bag-level prediction based
on the “label-mean” relation in Equation 3. The objective function is continu-
ously differentiable and convex with a unique global minimum. Both properties are
required for the successful application of the optimization strategy we used for
solving the sparse SVMs proposed in the next section based on the “label-mean”
formulation. The standard 1-norm SVM with the non-differentiable Hinge loss
can also be used here, as was done by Wu et al. [12] for solving sparse SVMs in SI learning. However, in this case, the optimization problem has to be
solved in the dual formulation. This is computationally much more expensive
than the primal formulation as we will discuss later.
The main difference between the proposed formulation and existing ones
[3, 4, 6] is in the relation between F (Bi ), bag-level prediction, and f (xi,p )’s,
the prediction values for instances. We have made a small but important modifi-
cation here by substituting the constraints in Equation 2 with those in Equation
3. This may look like a violation of the MI assumption at first sight. Nev-
ertheless, the rationale for taking the average will become self-evident once we
reveal the relationship between the proposed formulation and the existing MI
kernel method [2]. Specifically, we have the following lemma.
By making use of the Lagrangian and KKT conditions, we can derive the dual of the above problem in the following¹

$$\max_{\alpha \ge 0}\ \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i y_i \alpha_j y_j K_{i,j} \qquad (8)$$
$$\text{s.t.}\quad \sum_i y_i \alpha_i = 0$$

with $K_{i,j} = k_{set}(B_i, B_j) + \frac{1}{2C}\delta_{i,j}$. This is exactly a 2-norm SVM for bags with kernel specified by $k_{set}$.
With the above lemma, we can proceed to show the following theorem, which
justifies the use of the mean operation for bag-level prediction in Equation 3.
Theorem 1. If positive and negative instances are separable with margin with
respect to feature map φ in the RKHS induced by kernel k, then for sufficiently
large integer r, positive and negative bags are separable with margin using the
bag-level prediction function in Equation 3 with instance kernel $k^r$.
Proof. The proof follows directly from Lemma 4.2 in [2], which states that if
positive and negative instances are separable with margin with respect to kernel
k, then positive and negative bags can be separated with the same margin by
the following MI kernel
$$k_{MI}(B_i, B_j) = \frac{1}{|B_i||B_j|}\sum_{p=1}^{|B_i|}\sum_{q=1}^{|B_j|} k^r(x_{i,p}, x_{j,q}) \qquad (9)$$

This is basically the normalized set kernel for kernel $k^r$ at instance level. The
conclusion is then established by the equivalence between the set kernel and the
proposed “label-mean” formulation, which takes the average of instance labels
for bag prediction, as shown in the previous lemma.
¹ The detailed steps are quite standard and similar to the derivation of the dual of 2-norm SVMs, and are thus omitted here.
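For illustration, a small sketch of the normalized set kernel of Equation 9; the Gaussian instance kernel and all names are our own assumptions rather than choices prescribed here.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mi_set_kernel(bag_i, bag_j, gamma, r=1):
    """Normalized set kernel of Equation 9: the average of the powered
    instance kernel k^r over all instance pairs of the two bags."""
    return (gaussian_kernel(bag_i, bag_j, gamma) ** r).mean()
```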
$$w = \sum_{k=1}^{N_{xv}} \beta_k \phi(z_k) \qquad (14)$$
where zk ’s are Expansion Vectors (XV) 2 whose feature maps φ(zk )’s form the
basis of linear expansion for w, and Nxv denotes the number of XVs. Let β =
[β1 , . . . , βNxv ] be the vector of expansion coefficients and Z = [zT1 , . . . , zTNxv ]T
the concatenation of XVs in a column vector. They can be solved by solving the
following optimization problem
N 2
xv 1
ni
(β, Z) = arg min βk φ(zk ) − αi yi φ(xi,p ) (15)
β,Z ni
k=1 i p=1
3.1 Model
Intuitively, we can view the above formulation as searching for the optimal so-
lution that minimizes the bag-level loss in a subspace spanned by φ(zk )’s in the
RKHS H induced by instance kernel k instead of the whole RKHS. By directly
specifying the number of XVs Nxv in w, the optimal solution is guaranteed to
² We have adopted the same terminology here as in [12] for the same reason. Technically, an XV, which can be an arbitrary point in the instance space, is different from an SV, which must be chosen from an existing instance in the training set.
reside in a subspace whose dimension is no larger than Nxv . The above formu-
lation can be regarded as a joint optimization problem that alternates between searching for the optimal subspace and for the optimal solution within that subspace. With $N_{xv} \ll N_{sv}$, we can build a much sparser model for MI prediction.
Let KZ denote the Nxv ×Nxv Gram matrix for XVs zk ’s, with the (i, j)th entry
given by k(zi , zj ), KBi ,Z denote the ni × Nxv Gram matrix between instances
in bag i and XVs, with the (p, j)th entry given by k(xi,p , zj ), and 1ni be the
$n_i$-dimensional column vector with value 1 for each element. We can rewrite the optimization problem in Equation 16 in terms of $\beta$ by substituting Equation 14 into the cost function of Equation 16. This leads to the following objective function, which we solve for the sparse SVM model for MI classification:

$$\min_{\beta, b, Z}\ Q(\beta, b, Z) = \frac{1}{2} \beta^T K_Z \beta + C \sum_i \ell\Big(y_i,\ \frac{1}{n_i} \mathbf{1}_{n_i}^T K_{B_i,Z} \beta + b\Big) \qquad (17)$$
g(Z), the new objective function, is special in the sense that it is the optimal
value of Q optimized over variables (β, b). The evaluation of g(Z) at a fixed point
Z is equivalent to training a 2-norm SVM given the XVs and computing the cost
of the trained model in Equation 17. To train the 2-norm SVM, we minimize
function Q(β, b, Z) over β and b by fixing Z. This can be done easily with vari-
ous numerical optimization routines. In our work, we used the limited memory
BFGS (L-BFGS) algorithm for its efficiency and super-linear convergence rate.
The implementation of L-BFGS requires the cost Q(β, b, Z) and the gradient
information below
$$\frac{\partial Q}{\partial \beta} = K_Z \beta + C \sum_{i=1}^{m} \frac{1}{n_i}\, \ell'_{f_i}(y_i, f_i)\, K_{B_i,Z}^T \mathbf{1}_{n_i} \qquad (19)$$

$$\frac{\partial Q}{\partial b} = C \sum_{i=1}^{m} \frac{1}{n_i}\, \ell'_{f_i}(y_i, f_i) \qquad (20)$$
where the partial derivative of the squared Hinge loss function with respect to $f_i$ is given by

$$\ell'_{f_i}(y_i, f_i) = \begin{cases} 2(f_i - y_i) & y_i f_i < 1 \\ 0 & y_i f_i \ge 1 \end{cases} \qquad (21)$$
$$g(Z) = \frac{1}{2} \beta^T K_Z \beta + C \sum_i \ell\Big(y_i,\ \frac{1}{n_i} \mathbf{1}_{n_i}^T K_{B_i,Z} \beta + b\Big) \qquad (22)$$
Note that we assume the Gram matrix $K_Z$ to be positive definite, which is always the case with the Gaussian kernel for distinct $z_k$'s³. $Q$ is then a strictly convex function with respect to $\beta$ and $b$, and the optimal solution at each $Z$ is unique. This makes $g(Z)$ a proper function with a unique value for each $Z$. Moreover, the uniqueness of the optimal solution also makes the derivative analysis of $g(Z)$ possible.
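As an illustration of this inner solve, the following hedged sketch evaluates g(Z) by minimizing Q over (β, b) with SciPy's L-BFGS routine; K_Z and the per-bag Gram matrices are assumed precomputed for the current Z, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def eval_g(K_Z, K_BZ, y, C):
    """Evaluate g(Z): minimize Q(beta, b, Z) of Equation 17 over (beta, b).
    K_Z: N_xv x N_xv Gram matrix of the XVs; K_BZ: list of n_i x N_xv
    bag-to-XV Gram matrices; y: bag labels in {-1, +1}."""
    n_xv = K_Z.shape[0]
    means = [K.mean(axis=0) for K in K_BZ]   # (1/n_i) 1^T K_{B_i,Z}

    def cost_grad(theta):
        beta, b = theta[:n_xv], theta[-1]
        f = np.array([m @ beta + b for m in means])            # bag predictions
        margin = y * f
        loss = np.where(margin < 1, (1.0 - margin) ** 2, 0.0)  # squared Hinge
        dloss = np.where(margin < 1, 2.0 * (f - y), 0.0)       # Equation 21
        cost = 0.5 * beta @ K_Z @ beta + C * loss.sum()
        g_beta = K_Z @ beta + C * sum(d * m for d, m in zip(dloss, means))
        g_b = C * dloss.sum()   # derivative of the data term w.r.t. b
        return cost, np.append(g_beta, g_b)

    res = minimize(cost_grad, np.zeros(n_xv + 1), jac=True, method='L-BFGS-B')
    return res.fun, res.x[:n_xv], res.x[-1]   # g(Z), beta, b
```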
The existence and computation of the derivative of an optimal value function have been well studied in the optimization literature. Specifically, Theorem 4.1 of [15] provides sufficient conditions for the existence of the derivative of $g(Z)$. According to the theorem, the differentiability of $g(Z)$ is guaranteed by the uniqueness of the optimal solution $\beta$ and $b$, as discussed earlier, and by the differentiability of $Q(\beta, b, Z)$ with respect to $\beta$ and $b$, which is ensured by the squared Hinge loss function we adopted. Moreover, the derivative of $g(Z)$ can be computed at each given $Z$ by substituting the minimizers $\beta$ and $b$ into Equation 22 and taking the derivative as if $g(Z)$ did not depend on $\beta$ and $b$:
$$\frac{\partial g}{\partial z_k} = \sum_{i=1}^{N_{xv}} \beta_i \beta_k \frac{\partial k(z_i, z_k)}{\partial z_k} + C \sum_{i=1}^{m} \frac{1}{n_i} \sum_{p=1}^{n_i} \ell'(y_i, f_i)\, \beta_k\, \frac{\partial k(x_{i,p}, z_k)}{\partial z_k} \qquad (23)$$
The derivative terms in the above equation depend on the specific choice of kernel. In our work, we have adopted the following Gaussian kernel

$$k(x, y) = \exp\big(-\gamma \|x - y\|^2\big)$$

where $\gamma$ is the scale parameter of the kernel. Note that a power of a Gaussian kernel is still a Gaussian kernel. Hence, according to Theorem 1 in Section 2, if instances are separable with a Gaussian kernel with scale parameter $\gamma$, then bags are separable with a Gaussian kernel with scale parameter $r\gamma$ for some $r$.

³ In practice, we can enforce positive definiteness by adding a small value to the diagonal of $K_Z$.
The derivative of this kernel with respect to $z_k$ is

$$\frac{\partial k(x, z_k)}{\partial z_k} = 2\gamma (x - z_k)\, k(x, z_k).$$
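Putting Equation 23 together with this kernel derivative, a sketch of the gradient with respect to a single expansion vector might look as follows (illustrative names; dloss holds the per-bag values of Equation 21):

```python
import numpy as np

def grad_zk(k_idx, Z, beta, bags, dloss, gamma, C):
    """Gradient of g(Z) w.r.t. the expansion vector z_k (Equation 23)."""
    zk, bk = Z[k_idx], beta[k_idx]
    # Regularization term: sum_i beta_i * beta_k * dk(z_i, z_k)/dz_k
    kz = np.exp(-gamma * ((Z - zk) ** 2).sum(axis=1))
    g = 2.0 * gamma * bk * ((beta * kz)[:, None] * (Z - zk)).sum(axis=0)
    # Data term: loss-weighted kernel derivatives over bags and instances
    for X, d in zip(bags, dloss):            # X: n_i x dim instances of bag i
        kx = np.exp(-gamma * ((X - zk) ** 2).sum(axis=1))
        g += C * bk * (d / len(X)) * 2.0 * gamma * (
            kx[:, None] * (X - zk)).sum(axis=0)
    return g
```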
With g(Z) and its derivative given in Equations 22 and 23, we can develop a
gradient descent approach to solve the overall optimization problem for sparse
SVM in Equation 17. The detailed steps are outlined in Algorithm 1. Besides
input data and Nxv , additional input parameters include λ, the initial step size,
smax , the maximum number of line searches allowed, and tmax , the maximum
number of iterations. For line search, we implemented a strategy similar to back-
tracking, without imposing the condition on sufficient decrease. Any step size
resulting in a decrease in the function value is immediately accepted. This is more efficient than more sophisticated line search strategies, significantly reducing the number of expensive function evaluations.
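A hedged sketch of this outer loop, assuming eval_g and grad_g wrap the routines sketched above and close over the training data (the halving schedule is our own illustrative choice):

```python
def sparse_mi_train(Z0, eval_g, grad_g, lam=1.0, s_max=10, t_max=50):
    """Gradient descent on the XVs with a backtracking-style line search
    that accepts any step decreasing g(Z), as described above."""
    Z = Z0.copy()
    g_val, beta, b = eval_g(Z)
    for _ in range(t_max):
        G = grad_g(Z, beta, b)
        step, accepted = lam, False
        for _ in range(s_max):               # no sufficient-decrease condition
            g_new, beta_new, b_new = eval_g(Z - step * G)
            if g_new < g_val:                # any decrease is accepted
                Z, g_val, beta, b = Z - step * G, g_new, beta_new, b_new
                accepted = True
                break
            step *= 0.5
        if not accepted:                     # no decreasing step found
            break
    return Z, beta, b
```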
The optimization scheme we used has also been adopted previously for sparse
kernel machines [12] and simple Multiple Kernel Learning (MKL) [14]. The major
difference here is that the SVM problem is solved directly in its primal formula-
tion when evaluating the optimal value function g(Z), whereas in [12] and [14],
a dual formulation of SVM is solved instead by invoking standard SVM solvers.
This is mainly due to the use of squared Hinge loss in the cost function, which
makes the primal formulation continuously differentiable with respect to classi-
fier parameters. In contrast, both the sparse kernel machine [12] and simple MKL [14] used the non-differentiable Hinge loss for the basic SVM model, which has to be
treated in the dual form to guarantee the differentiability of the optimal value
function. Solving the primal problem has a great advantage in computational complexity over the dual. For our formulation, it is cheaper to solve the primal problem for the SVM since it involves only $N_{xv} + 1$ variables. In contrast, the
complexity of the dual problem is much higher. It involves the computation and
cache of the kernel matrix in solving the SVM. Moreover, the gradient evalua-
tion with respect to the XVs is also much more costly, as it needs to aggregate
over the gradient value for each entry in the kernel matrix, which is the sum
of inner products between vectors of kernel function values over instances and
XVs. The complexity of the gradient computation scales as $O(N_{xv} N^2)$, where $N$ is the total number of instances in the training set. This is much higher than the $O(N_{xv} N)$ complexity in the primal case.
The proposed algorithm can also be extended to deal with multi-class MI
problems. A multi-class problem can be decomposed into several binary problems using various decomposition schemes, but applying the binary version of the algorithm directly would introduce a different set of XVs for each binary problem. Therefore, the XVs have to be learned jointly for multi-class problems to ensure that all binary classifiers share the same XVs. Consider M binary
problems and let β c , bc denote the weight and bias for the cth classifier respec-
tively (c = 1, . . . , M ), the optimization problem is only slightly different from
Equation 17
$$\min_{\beta, b, Z}\ Q(\beta, b, Z) = \sum_{c=1}^{M} Q_c(\beta^c, b^c, Z) \qquad (24)$$

$$= \sum_{c=1}^{M} \left[ \frac{1}{2} \beta^{cT} K_Z \beta^c + C \sum_i \ell\Big(y_i^c,\ \frac{1}{n_i} \mathbf{1}_{n_i}^T K_{B_i,Z} \beta^c + b^c\Big) \right]$$
The same strategy can be adopted for the optimization of the above equation by
introducing g(Z) as the optimal value function of Q over β c ’s and b’s. The first
step of each iteration is the same, only that M SVM classifiers are trained instead
of one. These M classifiers can be trained separately by minimizing Qc (β c , bc , Z)
for c = 1, . . . , M. The argument for the existence of the derivative of g(Z) still holds, and the derivative can be computed with a minor modification via
$$\frac{\partial g}{\partial z_k} = \sum_{c=1}^{M} \sum_{i=1}^{N_{xv}} \beta_i^c \beta_k^c \frac{\partial k(z_i, z_k)}{\partial z_k} + C \sum_{c=1}^{M} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{p=1}^{n_i} \ell'(y_i^c, f_i^c)\, \beta_k^c\, \frac{\partial k(x_{i,p}, z_k)}{\partial z_k} \qquad (25)$$
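Since the binary classifiers share Z, the multi-class gradient of Equation 25 can reuse the single-classifier gradient sketched earlier; a minimal illustration:

```python
def grad_zk_multiclass(k_idx, Z, betas, bags, dlosses, gamma, C):
    """Equation 25: sum the per-classifier gradients; betas[c] and
    dlosses[c] belong to the cth binary classifier, all sharing Z."""
    return sum(grad_zk(k_idx, Z, betas[c], bags, dlosses[c], gamma, C)
               for c in range(len(betas)))
```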
Fig. 1. Demonstration of the proposed sparse SVM classifier on synthetic (a) binary
and (b) 3-class MI data sets
4 Experimental Results
4.1 Synthetic Data Examples
In our experiments, we first show two examples on synthetic data to demon-
strate the interesting properties of the proposed sparse SVM algorithm for MI
classification. Figure 1(a) shows an example of binary MI classification, where
each positive bag contains at least one instance in the center, while each neg-
ative bag contains only instances on the ring. Figure 1(b) shows an MI classification problem with three classes. Instances are generated from four Gaussians $N(\mu_i, \sigma^2)$, $i = 1, \ldots, 4$, with $\mu_1 = [-2, 2]^T$, $\mu_2 = [2, 2]^T$, $\mu_3 = [2, -2]^T$, $\mu_4 = [-2, -2]^T$, and $\sigma = 0.25$. Bags from class 3 contain only instances randomly drawn from $N(\mu_2, \sigma^2)$ and $N(\mu_4, \sigma^2)$, whereas each bag from class 1 contains at least one instance drawn from $N(\mu_1, \sigma^2)$ and each bag from class 2 contains at least one instance drawn from $N(\mu_3, \sigma^2)$. For the binary example, a single
XV in the center is sufficient to discriminate between bags from two classes.
For the 3-class example, two XVs are needed for discriminant purposes whose
optimal locations should overlap with μ1 and μ3 , the centers of class 1 and 2.
For both cases, we initialize our algorithm with poor XV locations as shown
by the circles on the first column of Figure 1. The next two columns show
the data overlayed with XVs updated after the first and final iterations respec-
tively. Based on the trained classifier, we can also partition the data space into
predicted regions for each class. The partition is also shown in our plots and
regions for different classes are described by different shades. It can be seen
that, despite the poor initialization, the algorithm is able to find good XVs
and make the right decision. The convergence is quite fast, with the first itera-
tion already making a pronounced improvement over the initialization and XV
Building Sparse Support Vector Machines for MI Classification 483
locations are refined over the remaining iterations. This is further demonstrated
by the monotonically decreasing function values over iterations on the rightmost
column of Figure 1, with significant decrease in the first few iterations.
From the results, we can see that the proposed SparseMI is quite promising, striking a good balance between performance and sparsity. Compared to existing SVM methods like MI-Kernel and MILES, SparseMI has comparable accuracy rates yet with significantly fewer XVs in the prediction function. Compared to alternative sparse implementations like RS and RSVM, SparseMI achieves better performance in the majority of cases. The gap in performance is more evident with a smaller number of XVs, as can be observed for the case of Nxv = 10. With increasing values of Nxv, the performance of SparseMI improves further. For Nxv = 100, SparseMI performs comparably with MI-Kernel, which can be regarded as the dense version of SparseMI. As a result, MI-Kernel produces more complex prediction models with far more SVs than SparseMI. Compared with MILES, which has better sparsity than the other SVM methods, SparseMI not only achieves better performance but also obtains sparser models. This is especially true in the case of multiclass MI classification, for reasons we have discussed in the previous section. Moreover, with SparseMI, we can explicitly control the sparsity by specifying the number of XVs, while the sparsity of MILES can only be controlled implicitly through the regularization parameter.
5 Conclusions
In this paper, we have proposed the first explicit formulation for sparse SVM
learning in MI classification. Our formulation descends from sparse kernel machines [12] and builds on an equivalent formulation of the MI kernel [2].
Unlike MI kernel, the proposed technique can produce a sparse prediction model
with controlled complexity while still performing competitively compared to MI
kernel and other non-sparse methods.
References
1. Dietterich, T.G., Lathrop, R.H., Lozano-perez, T.: Solving the multiple-instance
problem with axis-parallel rectangles. Artificial Intelligence 89, 31–71 (1997)
2. Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: Intl.
Conf. Machine Learning, pp. 179–186 (2002)
3. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for
multiple-instance learning. In: Advances in Neural Information Processing Sys-
tems, pp. 561–568 (2003)
4. Cheung, P.M., Kwok, J.T.Y.: A regularization framework for multiple-instance
learning. In: Intl. Conf. Machine Learning (2006)
5. Chen, Y., Bi, J., Wang, J.: MILES: Multiple-instance learning via embedded instance
selection. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(12), 1931–
1947 (2006)
6. Zhou, Z.H., Xu, J.M.: On the relation between multi-instance learning and semi-
supervised learning. In: Intl. Conf. Machine Learning (2007)
7. Li, Y.F., Kwok, J.T., Tsang, I., Zhou, Z.H.: A convex method for locating regions of
interest with multi-instance learning. In: European Conference on Machine Learn-
ing (2009)
8. Smola, A.J., Schölkopf, B.: Sparse greedy matrix approximation for machine learn-
ing. In: Intl. Conf. Machine Learning, pp. 911–918 (2000)
9. Lee, Y.J., Mangasarian, O.L.: RSVM: Reduced support vector machines. In: SIAM
Conf. Data Mining (2001)
10. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Reg-
ularization, Optimization, and Beyond. The MIT Press, Cambridge (2002)
11. Keerthi, S., Chapelle, O., DeCoste, D.: Building support vector machines with
reduced classifier complexity. Journal of Machine Learning Research 7, 1493–1515
(2006)
12. Wu, M., Schölkopf, B., Bakir, G.: A direct method for building sparse kernel learn-
ing algorithms. Journal of Machine Learning Research 7, 603–624 (2006)
13. Bezdek, J.C., Hathaway, R.J.: Convergence of alternating optimization. Journal
Neural, Parallel & Scientific Computations 11(4) (2003)
14. Rakotomamonjy, A., Bach, F.R., Canu, S., Grandvalet, Y.: SimpleMKL. Journal of
Machine Learning Research 9, 2491–2521 (2008)
15. Bonnans, J.F., Shapiro, A.: Optimization problems with perturbations: A guided
tour. SIAM Review 40(2), 202–227 (1998)
16. Tzanetakis, G., Cook, P.: Music genre classification of audio signals. IEEE Trans.
Speech and Audio Processing 10(5), 293–302 (2002)
Lagrange Dual Decomposition for Finite Horizon
Markov Decision Processes
T. Furmston and D. Barber
Fig. 1. Influence diagrams for the finite horizon MDP: (a) the non-stationary case, with a separate policy πt at each time-point; (b) the stationary case, where a single policy π is shared by all time-points.
where $p(s_t, a_t|\pi_{1:t})$ is the marginal of the joint state-action trajectory distribution

$$p(s_{1:H}, a_{1:H}|\pi) = p(a_H|s_H, \pi_H)\left[\prod_{t=1}^{H-1} p(s_{t+1}|s_t, a_t)\, p(a_t|s_t, \pi_t)\right] p_1(s_1). \qquad (2)$$
Given an MDP, the learning problem is to find a policy π (in the stationary case) or a set of policies $\pi_{1:H}$ (in the non-stationary case) that maximizes (1). That is, we wish to find

$$\pi_{1:H}^{*} = \arg\max_{\pi_{1:H}} U(\pi_{1:H}).$$
chain structure of the corresponding influence diagram fig(1a) [4]. The optimal
policies resulting from this procedure are deterministic, where for each given
state all the mass is placed on a single action.
where $p(s_t, a_t|\pi)$ is the marginal of the trajectory distribution, which is given by

$$p(s_{1:t}, a_{1:t}|\pi) = \pi(a_t|s_t)\left[\prod_{\tau=1}^{t-1} p(s_{\tau+1}|a_\tau, s_\tau)\,\pi(a_\tau|s_\tau)\right] p_0(s_1).$$
Looking at the influence diagram of the stationary finite horizon MDP problem
fig(1b) it can be seen that restricting the policy to be stationary causes the
influence diagram to lose its linear chain structure. Indeed the stationary policy
couples all time-points together and a dynamic programming solution no longer
exists, making the problem of finding the optimal policy π* much more complex.
Another way of viewing the complexity of stationary policy MDPs is in terms of Bellman's principle of optimality [3]. A control problem is said to satisfy the principle of optimality if, given a state and time-point, the optimal action is independent of the trajectory preceding that time-point. In order to construct a dynamic programming algorithm it is essential that the control problem satisfies this principle. It is easy to see that while MDPs with non-stationary policies satisfy this principle of optimality, hence permitting a dynamic programming solution, this is not true for stationary policy MDPs.
¹ In practice, the large state-action space can make the $O(S^2AH)$ complexity impractical, an issue that is not addressed in this study.
2 Dual Decomposition
Our task is to solve the stationary policy finite-horizon MDP. Here our intuition
is to exploit the fact that solving the unconstrained (non-stationary) MDP is
easy, whilst solving the constrained MDP is difficult. To do so we use dual
decomposition and iteratively solve a series of unconstrained MDPs, which have
a modified non-stationary reward structure, until convergence. Additionally our
dual decomposition provides an upper bound on the optimal reward U(π*).
The main idea of dual decomposition is to separate a complex primal opti-
misation problem into a set of easier slave problems (see appendix(A) and e.g.
[9, 10]). The solutions to these slave problems are then coordinated to generate a
new set of slave problems in a process called the master problem. This procedure
is iterated until some convergence criterion is met, at which point a solution to
the original primal problem is obtained.
As we mentioned in section(1) the stationary policy constraint results in a
highly connected influence diagram which is difficult to optimise. It is there-
fore natural to decompose this constrained MDP by relaxing this stationarity
constraint. This can be achieved through Lagrange relaxation where the set of
non-stationary policies, π1:H = (π1 , . . . , πH ), is adjoined to the objective func-
tion
$$U(\pi_{1:H}, \pi) = \sum_{t=1}^{H} \sum_{s_t, a_t} R(s_t, a_t)\, p(s_t, a_t|\pi_{1:t}), \qquad (4)$$
with the additional constraint that πt = π, ∀t ∈ {1, ..., H}. Under this constraint,
we see that (4) reduces to the primal objective (3). We also impose that all πt
and π are restricted to the probability simplex. The constraints that each πt is
a distribution will be imposed explicitly during the slave problem optimisation.
The stationary policy π will be constrained to be a distribution through the
equality constraints π = πt .
where we have used the notation $p_t(s, a|\pi_{1:t-1}) \equiv p(s_t = s, a_t = a|\pi_{1:t-1})$, and we have included the Lagrange multipliers $\lambda_{1:H}$.
To see that this Lagrangian cannot be solved efficiently for $\pi_{1:H}$ through dynamic programming, consider the optimisation over $\pi_H$, which takes the form

$$\max_{\pi_H} \sum_{s,a}\Big[ R_H(s,a)\,\pi_H(a|s)\,p_H(s|\pi_{1:H-1}) + \lambda_H(s,a)\,\pi_H(a|s)\Big].$$
for variational q and entropy function H(q) (see for example [11]) to decouple the
policies, but we do not pursue this approach here, seeking a simpler alternative.
From this discussion we can see that the naive application of Lagrange dual
decomposition methods does not result in a set of tractable slave problems.
Provided $p_t(s|\pi_{1:t-1}) > 0$, the zeros of the two sets of constraint functions, $g_{1:H}$ and $h_{1:H}$, are equivalent². Adjoining the constraint functions $h_{1:H}$ to the objective function (4) gives the Lagrangian

$$L(\lambda_{1:H}, \pi_{1:H}, \pi) = \sum_{t=1}^{H}\sum_{s,a}\Big\{(R_t(s,a) + \lambda_t(s,a))\,\pi_t(a|s)\,p_t(s|\pi_{1:t-1}) - \lambda_t(s,a)\,\pi(a|s)\,p_t(s|\pi_{1:t-1})\Big\} \qquad (6)$$
We can now eliminate the original primal variables $\pi$ from (6) by directly performing the optimisation over $\pi$, giving the following set of constraints

$$\sum_t \lambda_t(s,a)\, p_t(s|\pi_{1:t-1}) = 0, \qquad \forall (s,a) \in S \times A. \qquad (7)$$
Once the primal variables $\pi$ are eliminated from (6) we obtain the dual objective function

$$L(\lambda_{1:H}, \pi_{1:H}) = \sum_{t=1}^{H}\sum_{s,a} (R_t(s,a) + \lambda_t(s,a))\,\pi_t(a|s)\,p_t(s|\pi_{1:t-1}), \qquad (8)$$
where the domain is restricted to $\pi_{1:H}$, $\lambda_{1:H}$ satisfying (7), with $\pi_{1:H}$ also satisfying the usual distribution constraints.
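For fixed multipliers, maximizing (8) over π1:H is an ordinary unconstrained finite-horizon MDP with modified rewards R_t(s,a) + λ_t(s,a), solvable by backward dynamic programming; a minimal sketch with illustrative names and dense transition matrices:

```python
import numpy as np

def solve_slave(R, lam, P):
    """Backward DP for the unconstrained slave MDP.
    R: S x A rewards; lam: H x S x A multipliers; P: A x S x S with
    P[a, s, s'] = p(s'|s, a). Returns a deterministic policy pi[t, s, a]."""
    H, (S, A) = lam.shape[0], R.shape
    V = np.zeros(S)
    pi = np.zeros((H, S, A))
    for t in reversed(range(H)):
        future = np.stack([P[a] @ V for a in range(A)], axis=1)  # S x A
        Q = R + lam[t] + future
        best = Q.argmax(axis=1)
        pi[t, np.arange(S), best] = 1.0   # all mass on a single action
        V = Q.max(axis=1)
    return pi
```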
$$\lambda_t^{i+1} = \lambda_t^i - \alpha_i\, \partial_{\lambda_t} L(\pi_{1:H}, \lambda_{1:H}) \qquad (11)$$

where $\alpha_i$ is the $i$th step size parameter and $\partial_{\lambda_t} L(\pi_{1:H}, \lambda_{1:H})$ is the subgradient of the dual objective w.r.t. $\lambda_t$. As the subgradient contains the factor $p_t(s|\pi_{1:t-1}^{i})$, which is positive and independent of the action, we may consider the simplified update equation
$$\lambda_t^{i+1} = \lambda_t^i - \alpha_i\, \pi_t^i.$$
Once the Lagrange multipliers have been updated through this subgradient step
they need to be projected down on to the feasible set, which is defined through
the constraint (7). This is achieved through a projected subgradient method, see
e.g. [12, 13]. We enforce (7) through the projection
$$\lambda_t^{i+1}(s,a) \leftarrow \lambda_t^{i+1}(s,a) - \sum_{\tau=1}^{H} \rho_\tau(s)\,\lambda_\tau^{i+1}(s,a), \qquad (12)$$

$$\rho_\tau(s) \equiv \frac{p_\tau(s|\pi_{1:\tau-1})}{\sum_{\tau'=1}^{H} p_{\tau'}(s|\pi_{1:\tau'-1})}. \qquad (13)$$
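A compact sketch of the resulting master step, combining the subgradient update with the projection of Equations 12 and 13 (illustrative names; p_state[t, s] = p_t(s|π1:t−1) comes from a forward pass under the current slave policy):

```python
import numpy as np

def master_step(lam, pi, p_state, alpha):
    """lam, pi: H x S x A arrays; p_state: H x S state marginals."""
    lam = lam - alpha * pi                           # subgradient step
    rho = p_state / p_state.sum(axis=0)              # Equation 13, per state
    lam = lam - (rho[:, :, None] * lam).sum(axis=0)  # Equation 12 projection
    return lam
```

After the projection, $\sum_t \rho_t(s)\lambda_t(s,a) = 0$ for every state-action pair, since $\sum_t \rho_t(s) = 1$.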
while in the second we optimise the dual objective function w.r.t. $\pi$ to obtain the dual primal policy $\pi(a|s) = \delta_{a, a^*(s)}$, where

$$a^*(s) = \arg\min_a \sum_t \lambda_t(s,a)\, p_t(s|\pi_{1:t-1}), \qquad (15)$$
$$\lambda_t^{i+1} = \lambda_t^i - \alpha_i\, \pi_t^i, \qquad \text{i.e.} \qquad \lambda_t^{i+1}(s,a) = \lambda_t^i(s,a) - \alpha_i\, \pi_t^i(a|s),$$

where $\pi_{1:H}^i$ denotes the optimal non-stationary policy of the previous round of slave problems. Noting that the optimal policy is deterministic gives

$$\lambda_t^{i+1}(s,a) = \begin{cases} \lambda_t^i(s,a) - \alpha_i & \text{if } a = \arg\max_{a'} \pi_t^i(a'|s) \\ \lambda_t^i(s,a) & \text{otherwise.} \end{cases}$$
Once the Lagrange multipliers are projected down to the space of feasible parameters through (12) we have

$$\lambda_t^{i+1}(s,a) = \begin{cases} \bar\lambda_t^i(s,a) + \alpha_i\Big(\sum_{\tau \in N_a^i(s)} \rho_\tau - 1\Big) & \text{if } a = \arg\max_{a'} \pi_t^i(a'|s) \\ \bar\lambda_t^i(s,a) + \alpha_i \sum_{\tau \in N_a^i(s)} \rho_\tau & \text{otherwise} \end{cases}$$

where $N_a^i(s) = \{ t \in \{1, ..., H\} \mid \pi_t^i(a|s) = 1 \}$ is the set of time-points for which action $a$ was optimal in state $s$ in the last round of slave problems. We use the notation $\bar\lambda_t^i = \lambda_t^i - \langle \lambda_t^i \rangle_\rho$, where $\langle \cdot \rangle_\rho$ means taking the average w.r.t. the distribution (13).
Noting that $\rho_t$ is a distribution over $t$ means that the terms $\sum_{\tau \in N_a^i(s)} \rho_\tau$ and $\sum_{\tau \in N_a^i(s)} \rho_\tau - 1$ are positive and negative respectively. The projected
sub-gradient step therefore either adds (or subtracts) a positive term to the
projection, λ̄, depending on the optimality of action a (given state s at time t)
in the slave problem from the previous iteration. There are hence two possibilities
for the update of the Lagrange multiplier; either the action was optimal and a
lower non-stationary reward is allocated to this state-action-time triple in the
next slave problem, or conversely it was sub-optimal and a higher reward term
is attached to this triple. The master algorithm therefore tries to readjust the
Lagrange multipliers so that (for each given state) the same action is optimal for
all time-points, i.e. it encourages the non-stationary
policies to take the same
form. Additionally, as |Nai (s)| → H then τ ∈Nai (s) ρτ → 1, which means that
a smaller quantity is added (or subtracted) to the Lagrange multiplier. The
converse happens in the situation |Nai (s)| → 0. This means that as |Nai (s)| → 1
the time-points t ∈ / Nai (s) will have a larger positive term added to the reward
for this state-action pair, making it more likely that this action will be optimal
given this state in the next slave problem. Additionally, those time-points t ∈
Nai (s) will have a smaller term subtracted from their reward, making it more
likely that this action will remain optimal in the next slave problem. The dual
decomposition algorithm therefore automatically weights the rewards according
to a ‘majority vote’. This type of behaviour is typical of dual decomposition
algorithms and is known as resource allocation via pricing [12].
4 Experiments
We ran our dual decomposition algorithm on several benchmark problems, in-
cluding the chain problem [14], the mountain car problem [1] and the puddle
world problem [15]. For comparison we included planning algorithms that can
also handle stationarity constraints on the policy, in particular Expectation Max-
imisation and Policy Gradients.
In the experiments we obtained the primal policy using both the time-
averaged policy (14) and the dual primal policy (15). We found that both
policies obtained a very similar level of performance and so we show only
the results of (14) to make the plots more readable. We declared that the
algorithm had converged when the duality gap was less than 0.01.
(Diagram: states s1–s5 in a chain; action a advances with reward 0 and yields reward 10 in state s5; action b returns to s1 with reward 2.)
Fig. 2. The chain problem state-action transitions with rewards R(st , at ). The ini-
tial state is state 1. There are two actions a, b, with each action being flipped with
probability 0.2.
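For concreteness, a hedged encoding of this chain as input to the slave solver sketched earlier; the reward placement (10 for action a in state 5, 2 for action b everywhere) and the way the flip interacts with rewards are our reading of the figure, not a specification from the paper:

```python
import numpy as np

def chain_mdp(flip=0.2, S=5):
    P = np.zeros((2, S, S))          # action 0 = a, action 1 = b
    R = np.zeros((S, 2))
    for s in range(S):
        right = min(s + 1, S - 1)    # a advances; self-loop at state 5
        P[0, s, right] += 1 - flip
        P[0, s, 0] += flip           # a flipped: behaves like b
        P[1, s, 0] += 1 - flip       # b returns to state 1
        P[1, s, right] += flip       # b flipped: behaves like a
    R[:, 1] = 2.0                    # reward 2 for action b (assumed R(s,a))
    R[S - 1, 0] = 10.0               # reward 10 for action a in state 5
    return P, R
```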
and took the derivative w.r.t. the parameters γ. During the experiments we used a predetermined step size sequence for the gradient steps. We considered two different step size parameters and selected the one that gave the best results for each particular experiment.
Fig. 3. Chain experiment with total expected reward plotted against run time (in
seconds). The plot shows the results of the DD DP algorithm (blue), the EM algorithm
(green), policy gradients with fixed step size (purple), policy gradients with a line search
(black) and the switching EM-PG algorithm (red). The experiment was repeated 100
times and the plot shows the mean and standard deviation of the results.
There are different criteria for switching between the EM algorithm and policy gradients. In the experiments we use the criterion given in [19], where the EM iterates are used to approximate the amount of hidden data, $\hat\lambda$, together with an estimate of the error of $\hat\lambda$. In the experiment we switched from EM to policy gradients when $\hat\lambda > 0.9$ and the estimated error was below 0.01. During policy gradient steps we used fixed step size parameters.
Fig. 4. A graphical illustration of the mountain car problem. The agent (driver) starts the problem at the bottom of a valley in a stationary position. The aim of the agent is to reach the rightmost peak of the valley.
because the gradient of the initial policy is often in the direction of the local
optima around state 1. The EM and EM-PG algorithms perform better than
policy gradients, being less susceptible to local optima. Additionally, they were
able to get close to the optimum in the time considered, although neither of
these algorithms actually reached the optimum.
In the mountain car problem the agent is driving a car and its state is described
by its current position and velocity, denoted by x ∈ [−1, 1] and v ∈ [−0.5, 0.5]
respectively. The agent has three possible actions, a ∈ {−1, 0, 1}, which corre-
spond to reversing, stopping and accelerating respectively. The problem is de-
picted graphically in fig(4) where it can be seen that the agent is positioned in
a valley. The leftmost peak of the valley is given by x = −1, while the rightmost
peak is given by x = 1. The continuous dynamics are nonlinear and are given by
At the start of the problem the agent is in a stationary position, i.e. v = 0, and
its position is x = 0. The aim of the agent is to maneuver itself to the rightmost
peak, so the reward is set to 1 when the agent is in the rightmost position and
0 otherwise. In the experiment we discretised the position and velocity ranges
into bins of width 0.1, resulting in S = 231 states. A planning horizon H = 25
was sufficient to reach the goal state.
As we can see from fig(5) the conclusions from this experiment are similar to
those of the chain problem. Again the DD DP algorithm consistently outper-
formed all of the comparison algorithms, converging to the global optimum in
roughly 7 iterations, while the policy gradients algorithms were again suscep-
tible to local optima. The difference in convergence rates between the DD DP
algorithm and both the EM and EM-PG algorithms is more pronounced here,
which is due to an increase in the amount of hidden data in this problem.
Fig. 5. Mountain car experiment with total expected reward plotted against run time
(in seconds). The plot shows the results for the DD DP algorithm (blue), the EM algo-
rithm (green), policy gradients with fixed step size (purple), policy gradients with a line
search (black) and the switching EM-PG algorithm (red). The experiment was repeated
100 times and the plot shows the mean and standard deviations of the experiments.
Fig. 6. Puddle world experiment with total expected reward plotted against run time
(in seconds). The plot shows the results for the DD DP algorithm (blue), the EM
algorithm (green) and policy gradients with line search (black).
5 Discussion
We considered the problem of optimising finite horizon MDPs with stationary
policies. Our novel approach uses dual decomposition to construct a two stage
iterative solution method with excellent empirical convergence properties, often
converging to the optimum within a few iterations. This compares favourably
against other planning algorithms that can be readily applied to this problem
class, such as Expectation Maximisation and policy gradients. In future work
we would like to consider more general settings, including partially-observable
MDPs and problems with continuous state-action spaces. Whilst both of these
extensions are non-trivial, this work suggests that the application of Lagrange
duality to these cases could be a particularly fruitful area of research.
References
1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (1998)
2. Vlassis, N.: A Concise Introduction to Multiagent Systems and Distributed Artifi-
cial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learn-
ing 1(1), 1–71 (2007)
Lagrange Dual Decomposition for Finite Horizon Markov Decision Processes 501
3. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 2nd edn. Athena
Scientific, Belmont (2000)
4. Shachter, R.D.: Probabilistic Inference and Influence Diagrams. Operations Re-
search 36, 589–604 (1988)
5. Williams, R.: Simple Statistical Gradient Following Algorithms for Connectionist
Reinforcement Learning. Machine Learning 8, 229–256 (1992)
6. Toussaint, M., Storkey, A., Harmeling, S.: Expectation-Maximization Methods for Solving (PO)MDPs and Optimal Control Problems. In: Bayesian Time Series Models. Cambridge University Press, Cambridge (in press, 2011), userpage.fu-berlin.de/~mtoussai
7. Furmston, T., Barber, D.: Efficient Inference in Markov Control Problems. In:
Uncertainty in Artificial Intelligence. North-Holland, Amsterdam (2011)
8. Furmston, T., Barber, D.: An analysis of the Expectation Maximisation algorithm
for Markov Decision Processes. Research Report RN/11/13, Centre for Computa-
tional Statistics and Machine Learning, University College London (2011)
9. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont
(1999)
10. Sontag, D., Globerson, A., Jaakkola, T.: Introduction to Dual Decomposition for
Inference. In: Sra, S., Nowozin, S., Wright, S. (eds.) Optimisation for Machine
Learning, MIT Press, Cambridge (2011)
11. Furmston, T., Barber, D.: Variational Methods for Reinforcement Learning. AIS-
TATS 9(13), 241–248 (2010)
12. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press,
Cambridge (2004)
13. Komodakis, N., Paragios, N., Tziritas, G.: MRF Optimization via Dual Decom-
position: Message-Passing Revisited. In: IEEE 11th International Conference on
Computer Vision, ICCV, pp. 1–8 (2007)
14. Dearden, R., Friedman, N., Russell, S.: Bayesian Q learning. AAAI 15, 761–768
(1998)
15. Sutton, R.: Generalization in Reinforcement Learning: Successful Examples Using
Sparse Coarse Coding. NIPS (8), 1038–1044 (1996)
16. Hoffman, M., Doucet, A., De Freitas, N., Jasra, A.: Bayesian Policy Learning with
Trans-Dimensional MCMC. NIPS (20), 665–672 (2008)
17. Hoffman, M., de Freitas, N., Doucet, A., Peters, J.: An Expectation Maximization
Algorithm for Continuous Markov Decision Processes with Arbitrary Rewards.
AISTATS 5(12), 232–239 (2009)
18. Salakhutdinov, R., Roweis, S., Ghahramani, Z.: Optimization with EM and
Expectation-Conjugate-Gradient. ICML (20), 672–679 (2003)
19. Fraley, C.: On Computing the Largest Fraction of Missing Information for the
EM Algorithm and the Worst Linear Function for Data Augmentation. Research
Report EDI-INF-RR-0934, University of Washington (1999)
20. Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University
Press, Cambridge (2011)
$$E(x) = \sum_s E_s(x) \qquad (18)$$

Then the $x$ that optimises the master problem is equivalent to optimising each slave problem $E_s(x_s)$ under the constraint that the slaves agree, $x_s = x$, $s = 1, \ldots, S$ [9]. This constraint can be imposed by a Lagrangian

$$L(x, \{x_s\}, \lambda) = \sum_s E_s(x_s) + \sum_s \lambda_s (x_s - x) \qquad (19)$$

Finding the stationary point w.r.t. $x$ gives the constraint $\sum_s \lambda_s = 0$, so that we may then consider

$$L(\{x_s\}, \lambda) = \sum_s \big( E_s(x_s) + \lambda_s x_s \big) \qquad (20)$$

which ensures that $\sum_s \lambda_s^{new} = 0$.
Unsupervised Modeling of Partially Observable
Environments

V. Graziano, J. Koutník, and J. Schmidhuber
1 Introduction
Traditional reinforcement learning (RL) is generally intractable on raw high-
dimensional sensory input streams. Often a sensory input processor, or more
simply, a sensory layer is used to build a representation of the world, a simplifier
of the observations of the agent, on which decisions can be based. A good sensory layer produces a code that simplifies the raw observations, usually by lowering the dimensionality or the noise, while maintaining the aspects of the environment needed to learn an optimal policy. Such simplifications make the code provided by the sensory layer amenable to traditional learning methods.
This paper introduces a novel unsupervised method for learning a model of
the environment, Temporal Network Transitions (TNT), that is particularly well-
suited for forming the sensory layer of an RL system. The method is a gener-
alization of the Temporal Hebbian Self-Organizing Map (THSOM), introduced
by Koutník [7]. The THSOM places a recurrent connection on the nodes of the
Self-Organizing Maps (SOM) of Kohonen [5]. The TNT generalizes the recurrent
connections between the nodes in the THSOM map to the space of the agent’s
actions. In addition, TNT brings with it a novel aging method that allows for
variable plasticity of the nodes. These contributions make SOM-based systems
better suited to on-line modeling of RL data.
The quality of the sensory layer learned is dependent on at least (I) the rep-
resentational power of the sensory layer, and (II) the ability of the RL layer to get
the agent the sensory inputs it needs to best improve its perception of the world.
There are still few works where unsupervised learning (UL) methods have
been combined with RL [1,8,4]. Most of these approaches deal with the UL
separately from the RL, by alternating the following two steps: (1) improving the
UL layer on a set of observations, modifying its encoding of observations, and (2)
developing the RL on top of the encoded information. (1) involves running UL,
without being concerned about the RL, and (2) involves running RL, without
being concerned about the UL. It is assumed that the implicit feedback inherent
in the process leads to both usable code and an optimal policy. For example, see
the SODA architecture, introduced by Provost et al. [12,11], which first uses a
SOM-like method for learning the sensory layer and then goes on to create high-
level actions to transition the agent between internal states. Such approaches
mainly ignore aspect (II). Still fewer approaches actually deal with the mutual
dependence of the UL and RL layers, and how to solve the bootstrapping problem
that arises when integrating the two [9].
The TNT system immediately advances the state-of-the-art for SOM-like sys-
tems as far as aspect (I) is concerned. In this paper we show how the TNT
significantly outperforms the SOM and THSOM methods for learning state rep-
resentations in noisy environments. It can be introduced into any pre-existing
system using SOM-like methods without any fuss and will generate, as we shall
see in Section 4, nice representations of challenging environments. Further, the
TNT is a natural candidate for addressing both aspects (I) and (II) simultane-
ously, since a predictive model of the environment grows inside of it. We only
touch on this in the final section of the paper, leaving it for a further study.
In Section 2 we describe the THSOM, some previous refinements, the Topo-
logical THSOM (T2 HSOM) [2], as well as some new refinements. Section 3 intro-
duces the TNT, a generalization of the T2 HSOM architecture that is suited for
on-line modeling of noisy, high-dimensional RL environments. In Section 4 we study
the performance of the TNT on partially observable RL maze environments, in-
cluding an environment with 400 underlying states and large amounts of noise
and stochastic actions. We make direct comparisons to both SOM and T2 HSOM
methods. Section 5 discusses future directions of research.
2.2 Learning
The spatial and temporal components are learned separately. The spatial com-
ponent is trained in the same way as a conventional SOM and the training of
the temporal component is based on a Hebbian update rule. Before each up-
date the node b with the greatest activation at step t is determined. This node
$b(t) = \arg\max_k y_k(t)$ is called the best matching unit (BMU). The neighbors of
a node are those nodes which are nearby on the lattice, and are not necessarily
the nodes with the most similar prototype vector.
The temporal neighborhood function $c_{T,i}$ is computed in the same way as the spatial neighborhood $c_{S,i}$ (Equation 4), except that temporal parameters are used.
That is, the temporal learning has its own parameters, σT , νT , and αT , all of
which, again, are functions of the age of the network.
Fig. 1. T2 HSOM learning rules: The BMU nodes are depicted with filled circles. (a)
excitation of temporal connections from the previous BMU to the current BMU and its
neighbors, (b) inhibition of temporal connections from all nodes, excluding the previous
BMU, to the current BMU and its neighbors, (c) inhibition of temporal connections
from the previous BMU to all nodes outside some neighborhood of the current BMU.
The connections are modified using the values of a neighborhood function (Gaussian
given by σT ), a cut-off (νT ), and a learning rate (αT ). Figure adapted from [2].
The original THSOM [6,7] contained only the first two temporal learning rules,
without the use of neighborhoods. In [2], Ferro et al. introduced the use of
neighborhoods for the training of temporal weights as well as rule (3). They
named the extension the Topological Temporal Hebbian Self-organizing Map
(T2 HSOM).
Spatial Activation. As with SOMs, the spatial matching for both the TNT
and THSOM can be carried-out with metrics other than the Euclidean distance.
For example a dot-product can be used to measure the similarity between an
observation and the prototype vectors. In any case, the activation and learning
rules need to be mutually compatible. See [5] for more details.
Temporal Activation. The key difference between the THSOM and TNT
architectures is the way in which the temporal activation is realized. Rather
than use a single temporal weight matrix M , as is done in the THSOM, to
determine the temporal component, the TNT uses a separate matrix Ma , called
a transition-map, for each action a ∈ A. The ‘temporal’ activation is now given by
$$y_{T,i}(t) = y(t-1) \cdot m_i^a(t), \qquad (9)$$

where $m_i^a$ is row $i$ of the transition-map $M_a$ and $y(t-1)$ is the network activation at the previous time step (see Equation 2).
Combining the Activations. We see two general ways to combine the com-
ponents: additively and multiplicatively. The formulations of the THSOM and
T2 HSOM considered only the additive method, and can be used successfully
with the TNT as well. The two activations are summed using a balancing term
η, precisely as they were in Equation 3.
Since the transition-maps are effectively learning a model of the transition
probabilities between the nodes on the lattice it is reasonable to interpret both
the spatial and the temporal activations as likelihoods and use an element-wise
product to combine them. The spatial activation roughly gives a likelihood that
an observation belongs to a particular node, with better matching units being
considered more likely. In the case of the Euclidean distance, the closer the
observation is to a prototype vector the more likely the correspondence, whereas
with the dot-product the better matched units take on values close to 1.
For example, when using a Euclidean metric with a multiplicative combination, the spatial component can be realized as a Gaussian function of the distance between the observation and the prototype vector, so that better matching units take values close to 1.
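A minimal sketch of such a multiplicative TNT activation, assuming the Gaussian-of-distance spatial likelihood just described (the exact form used by the authors may differ, and all names are illustrative):

```python
import numpy as np

def tnt_activation(x, W, M, a, y_prev, gamma=1.0):
    """W: N x dim prototype vectors; M[a]: N x N transition-map of action a;
    y_prev: network activation y(t-1)."""
    y_spatial = np.exp(-gamma * ((W - x) ** 2).sum(axis=1))  # assumed form
    y_temporal = M[a] @ y_prev       # Equation 9: y_{T,i} = y(t-1) . m_i^a
    y = y_spatial * y_temporal       # element-wise multiplicative combination
    return y, int(np.argmax(y))      # activation and the BMU b(t)
```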
3.2 Learning
The learning step for the TNT is, mutatis mutandis, the same as it was for the
T2 HSOM. Rather than training the temporal weight matrix M at step t, we
train the appropriate transition-map Ma (which is given by the action at ).
P ← (1 − h)P + hQ,
where h is the so-called neighborhood function. E.g., see Equation 5. This neigh-
borhood function, for both the spatial and temporal learning, is simply the
product of the learning rate of the prototype vector, i, and the cut-Gaussian of
the BMU, b,
Fig. 2. Plasticity: (a) when the BMU is young, it strongly affects training of a large
perimeter in the network grid. The nodes within are dragged towards the target (×),
the old nodes are stable due to their low learning rate. (b) when the BMU is old, a small
neighborhood of nodes is weakly trained. In well-known parts of the environment new
nodes tend not to be recruited for representation, while in new parts the young nodes,
and their neighbors are quickly recruited for representation, while the older nodes are
left in place.
Each node $i$ is given a spatial age, $\xi_{S,i}$, and a temporal age $\xi_{T,i}^a$ for each action $a \in A$. After determining the activation of the TNT and the BMU, $b$, the ages of the nodes are simply incremented by the value of their spatial and temporal neighborhood functions, $h_{S,i}$ and $h_{T,i}$ respectively:

$$\xi_{S,i} \leftarrow \xi_{S,i} + h_{S,i}, \qquad \xi_{T,i}^{a} \leftarrow \xi_{T,i}^{a} + h_{T,i}.$$
4 Experiments
The aim of the experiments is to show how the recurrent feedback (based on
previous observations and actions) empowers the TNT to discern underlying-
states from noisy observations.
Fig. 3. 5 × 5 maze experiment: (a) two dimensional maze with randomly placed
walls, (b) noisy observations of a random walk (σ = 1/3 cell width), (c) trained TNT.
Disks depict the location of the prototype vectors, arrows represent the learned tran-
sitions, and dots represent transitions to the same state.
4.1 Setup
The TNT is tested on two-dimensional mazes. The underlying-states of the maze
lie at the centers of the tiles constituting the maze. There are four possible ac-
tions: up, down, left, and right. A wall in the direction of intended movement,
prevents a change in underlying-state. After each action the TNT receives a
noisy observation of the current state. In the experiments a random walk (Fig-
ure 3(b)) is presented to the TNT. The TNT tries to (1) learn the underlying-
states (2-dimensional coordinates) from the noisy observations and (2) model the
transition probabilities between states for each of the actions. The observations
have Gaussian noise added to the underlying-state information. The networks
were trained on a noise level of σ = 1/3 of the cell width, so that 20% of the
observations were placed in a cell of the maze that did not correspond to
the underlying-state.
Mazes of size 5 × 5, 10 × 10, and 20 × 20 were used. The lengths of the training random walks were 10^4, 5 × 10^4, and 10^5, respectively.
A parameter γ determines the reliability, or determinism, of the actions. A
value of 0.9 means that the actions move in a direction other than the one
indicated by their label with a probability of 1 − 0.9 = 0.1.
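As an illustration of this setup, a minimal Python sketch of the data-generating process: a random walk on a grid with slip probability 1 − γ and Gaussian observation noise. Wall placement is simplified to the outer boundary, and the function names are hypothetical.

import numpy as np

ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}   # up, down, left, right

def random_walk(size=5, steps=10_000, gamma=0.9, sigma=1/3, seed=0):
    """Yield (action, noisy observation, true state) triples."""
    rng = np.random.default_rng(seed)
    state = np.array([size // 2, size // 2])
    for _ in range(steps):
        action = int(rng.integers(4))
        # with probability 1 - gamma the executed direction differs from the label
        executed = action if rng.random() < gamma else int(rng.integers(4))
        nxt = state + ACTIONS[executed]
        if (nxt >= 0).all() and (nxt < size).all():        # walls block the move
            state = nxt
        obs = state + rng.normal(0.0, sigma, size=2)       # Gaussian observation noise
        yield action, obs, state.copy()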
The parameters used for training the TNT decay exponentially according to
Equation 12. Initially, the spatial training has a high learning rate at the BMU, 0.9, and includes the 12 nearest nodes on the lattice. This training decays to a final learning rate of 0.001 at the BMU and a neighborhood that includes only the 4 nearest nodes. The temporal training affects only the BMU, with an initial learning rate of 0.25 that decays to 0.001. The complete specification of the learning
parameters is recorded in Table 1.
After a network is trained, disambiguation experiments are performed. The
TNT tries to identify the underlying-state from the noisy observations on newly
generated random walks. The amount of noise and the level of determinism are
Table 1. Learning parameters for the TNT

      αS      σS     νS     αT      σT     νT
Λ◦    0.90    1.75   2.10   0.25    2.33   0.5
Λ∞    0.001   0.75   1.00   0.001   0.75   0.5
Λk    10      10     10     10      10     ∞
4.2 Results
In the 5 × 5 mazes the variable plasticity allows for the prototype vectors (the
spatial component) of all three methods to learn the coordinates of the under-
lying states arbitrarily well. The performance of the SOM reached its theoretical
maximum while being trained on-line on non-i.i.d. data. That is, it misidentified only those observations which fell outside the correct cell. At σ = 1/2 the SOM can only identify around 55% of the observations correctly. The TNT, on the other hand, is able to identify essentially 100% of all such observations in a completely deterministic environment (γ = 1.0), and 85% in the stochastic setting
(γ = 0.9). Both the SOM and THSOM are unaffected by the level of stochas-
ticity, since they do not model the effects of particular actions. See Table 2 for
further comparisons.
After training, in both deterministic and stochastic settings, the transition-
maps accurately reflect the transition tables of the underlying Markov model.
The tables learned with deterministic actions are essentially perfect, while the ones learned in the stochastic case suffer somewhat (see Table 3). This is reflected
in the performance degradation with higher amounts of noise.
The degradation seen as the size of the maze increases is partially a result of using the same learning parameters for environments of different size; the trouble
is that the younger nodes need to recruit enough neighboring nodes to rep-
resent new parts of the environment while stabilizing before some other part
of the maze is explored. The learning parameters need to be sensitive to the
size of the environment. Again, this problem arises since we are training on-line
and the data is not i.i.d. This problem can be addressed somewhat by making
the ratio of nodes-to-states greater than 1. We return to this issue in Section 5.
5 Discussion
Initial experiments show that the TNT is able to handle much larger environ-
ments without a significant degradation of results. The parameters, while they do not require precise tuning, do need to be reasonably matched to
the environment. The drop-off seen in Table 3 can be mostly attributed to not
having used better suited learning parameters, as the same decay functions were
used in all the experiments. Though the aging rules and variable plasticity have
largely addressed the on-line training problem, they have not entirely solved
it. As a result, we plan to explore a constructive TNT, inspired by the “grow
Fig. 4. The trouble with stochastic actions: The “go right” action can transition
to 4 possible states. Noisy observations coupled with stochastic actions can make it
impossible to discern the true state.
when required” [10] and the “growing neural gas” [3] architectures. Nodes will
be added on the fly, making the recruitment of nodes to new parts of the environment more organic in nature; as the agent goes somewhere new it invokes a new node. We expect such an architecture to be more flexible, able to handle a larger range of environments without an alteration of parameters, bringing us closer to a general learning system.
In huge environments where continual-learning plays an increasing role, the
TNT should have two additional, related features. (1) The nodes should be able
to forget, so that resources might be recruited to newly visited parts of the
environment, and (2) a minimal number of nodes should be used to represent
the same underlying state. (1) can be solved by introducing a youthening aspect
to the aging. Simply introduce another term in the aging function which slightly
youthens the nodes, so that nodes near the BMU increase in age overall, while
nodes further away decrease. (2) is addressed by moving from a cut-Gaussian to
a similar “Mexican hat” function for the older nodes. This will push neighboring nodes away from the expert node, making it more distinct.
We have found that the Markov model can be learned when the nodes in
the TNT are not in 1-to-1 correspondence with the underlying states. Simple
clustering algorithms, based on the proximity of the prototype vectors and the
transition-maps, are able to detect likely duplicates. An immediate and impor-
tant follow-up to this work would consider continuous environments. We expect
the topological mixing, inherent to SOM-based architectures, to give dramatic
results.
A core challenge in extending reinforcement learning (RL) to real-world agents
is uncovering how such an agent can select actions to autonomously build an ef-
fective sensory mapping through its interactions with the environment. The use
of artificial curiosity [13] with planning to address this problem has been carried
out in [9], where the sensory layer was built-up using vector quantization (a SOM
without neighborhoods). Clearly, as we established in this paper, a TNT can
learn a better sensory layer than any SOM. The transition-maps effectively model
the internal-state transitions and therefore make planning methods naturally
available to a learning system using a TNT. A promising line of inquiry, therefore,
is to derive a curiosity signal from the learning updates inside the TNT to supply the agent with a principled method to explore the environment so that a better representation of it can be learned.
References
1. Fernández, F., Borrajo, D.: Two steps reinforcement learning. International Journal
of Intelligent Systems 23(2), 213–245 (2008)
2. Ferro, M., Ognibene, D., Pezzulo, G., Pirrelli, V.: Reading as active sensing: a
computational model of gaze planning during word discrimination. Frontiers in
Neurorobotics 4 (2010)
3. Fritzke, B.: A growing neural gas network learns topologies. In: Advances in Neural
Information Processing Systems, vol. 7, pp. 625–632. MIT Press, Cambridge (1995)
4. Gisslén, L., Graziano, V., Luciw, M., Schmidhuber, J.: Sequential Constant Size
Compressors and Reinforcement Learning. In: Proceedings of the Fourth Confer-
ence on Artificial General Intelligence (2011)
5. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2001)
6. Koutnı́k, J.: Inductive modelling of temporal sequences by means of self-
organization. In: Proceedings of the International Workshop on Inductive Modelling
(IWIM 2007), pp. 269–277. CTU in Prague, Ljubljana (2007)
7. Koutnı́k, J., Šnorek, M.: Temporal Hebbian self-organizing map for sequences. In:
ICANN 2006, vol. 1, pp. 632–641. Springer, Heidelberg (2008)
8. Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforce-
ment learning. In: The 2010 International Joint Conference on Neural Networks
(IJCNN), pp. 1–8 (July 2010)
9. Luciw, M., Graziano, V., Ring, M., Schmidhuber, J.: Artificial Curiosity with Plan-
ning for Autonomous Perceptual and Cognitive Development. In: Proceedings of
the International Conference on Development and Learning (2011)
10. Marsland, S., Shapiro, J., Nehmzow, U.: A self-organising network that grows when
required. Neural Netw. 15 (October 2002)
11. Provost, J.: Reinforcement Learning in High-Diameter, Continuous Environments.
Ph.D. thesis, Computer Sciences Department, University of Texas at Austin,
Austin, TX (2007)
12. Provost, J., Kuipers, B.J., Miikkulainen, R.: Developing navigation behavior
through self-organizing distinctive state abstraction. Connection Science 18 (2006)
13. Schmidhuber, J.: Formal Theory of Creativity, Fun, and Intrinsic Motivation
(1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–
247 (2010)
Tracking Concept Change with Incremental Boosting by
Minimization of the Evolving Exponential Loss
1 Introduction
There are many practical applications in which the objective is to learn an accurate
model using a training data set which changes over time. A naive approach is to retrain
the model from scratch each time the data set is modified. Unless the new data set is
substantially different from the old data set, retraining can be computationally waste-
ful. It is therefore of high practical interest to develop algorithms which can perform
incremental learning. We define incremental learning as the process of updating the ex-
isting model when the training data set is changed. Incremental learning is particularly
appealing for Online Learning, Active Learning, Outlier Removal and Learning with
Concept Change.
There are many single-model algorithms capable of efficient incremental learning, such as linear regression, Naïve Bayes and kernel perceptrons. However, it remains an open challenge to develop efficient ensemble algorithms for incremental learning.
In this paper we consider boosting, an algorithm that trains a weighted ensemble of
simple weak classifiers. Boosting is very popular because of its ease of implementation
and very good experimental results. However, it requires sequential training of a large
number of classifiers, which can be very costly. Rebuilding a whole ensemble upon slight changes in the training data can put an overwhelming burden on the computational resources. As a result, there is considerable interest in modifying boosting for incremental learning applications.
In incremental learning with concept change, where properties of a target variable
which we are predicting can unexpectedly change over time, a typical approach is to
use a sliding window and train a model using examples within the window. Upon each
window repositioning the data set changes only slightly and it is reasonable to attempt
to update the existing model instead of training a new one. Many ensemble algorithms
have been proposed for learning with concept change [1–5]. However, in most cases, the
algorithms are based on heuristics and applicable to a limited set of problems. The boosting algorithm proposed in [4] uses each new data batch to train an additional classifier
and to recalculate the weights for the existing classifiers. These weights are recalcu-
lated instead of updated, thus discarding the influence of the previous examples. In [5]
a new data batch is weighted depending on the current ensemble error and used to train
a new classifier. Instead of classifier weights, probability outputs are used for making
ensemble predictions. OnlineBoost [6], which uses a heuristic method for updating the
example weights, was modified for evolving concepts in [8] and [7]. The Online Coor-
dinate Boost (OCB) algorithm proposed in [9] performs online updates of weights of a
fixed set of base classifiers trained offline. The closed form weight update procedure is
derived by minimizing the approximation on AdaBoost’s loss. Because OCB does not
have a mechanism for adding and removing base classifiers and one cannot straightfor-
wardly be derived, the algorithm is not suitable for concept change applications.
In this paper, an extension of the popular AdaBoost algorithm for incremental learn-
ing is proposed and evaluated on concept change applications. It is based on the treat-
ment of AdaBoost as an additive model that iteratively optimizes an exponential cost function [10]. Given this, the task of IBoost can be stated as updating the current
boosting ensemble to minimize the modified cost function upon change of the training
data. The issue of model update consists of updating the existing classifiers and their
weights or adding new classifiers using the updated example weights. We intend to
experimentally show that IBoost, in which the ensemble update always leads towards
minimization of the exponential cost, can significantly outperform heuristically based
modifications of AdaBoost for incremental learning with concept change which do not
consider this.
1.1 Preliminaries
FOR m = 0 TO M − 1
(a) Fit a classifier f_{m+1}(x) to the training data by minimizing

J_{m+1} = \sum_{i=1}^{N} w_i^m I(y_i \neq f_{m+1}(x_i))    (1)
Let {(x_i, y_i), i = 1, ..., N} denote the training data set, where x_i is a feature vector and y_i ∈ {+1, −1} is its class label. The exponential cost function is defined as
E_m = \sum_{i=1}^{N} e^{-y_i \cdot F_m(x_i)},    (6)
where Fm (x) is the current additive model defined as a linear combination of m base
classifiers produced so far,
F_m(x) = \sum_{j=1}^{m} \alpha_j f_j(x),    (7)
where base classifier fj (x) can be any classification model with output values +1 or
−1 and αj are constant multipliers called the confidence parameters. The ensemble
prediction is made as the sign of the weighted committee, sign(Fm (x)).
Given the additive model Fm (x) at iteration m−1 the objective is to find an improved
one, Fm+1 (x) = Fm (x) + αm+1 fm+1 (x), at iteration m. The cost function can be
expressed as
E_{m+1} = \sum_{i=1}^{N} e^{-y_i \cdot (F_m(x_i) + \alpha_{m+1} f_{m+1}(x_i))} = \sum_{i=1}^{N} w_i^m e^{-y_i \alpha_{m+1} f_{m+1}(x_i)},    (8)
where
w_i^m = e^{-y_i F_m(x_i)}    (9)
are called the example weights. By rearranging E_{m+1} we can obtain an expression that leads to the familiar AdaBoost algorithm,

E_{m+1} = (e^{\alpha_{m+1}} - e^{-\alpha_{m+1}}) \sum_{i=1}^{N} w_i^m I(y_i \neq f_{m+1}(x_i)) + e^{-\alpha_{m+1}} \sum_{i=1}^{N} w_i^m.    (10)
For fixed αm+1, classifier fm+1(x) can be trained by minimizing (10). Since αm+1 is fixed, the second term is constant and the multiplication factor in front of the sum in the first term does not affect the location of the minimum, so the base classifier can be found as f_{m+1}(x) = arg min_{f(x)} J_{m+1}, where J_{m+1} is defined as the weighted error function (1). Depending on the actual learning algorithm, the classifier can be trained by directly minimizing the cost function (1) (e.g. Naïve Bayes) or by resampling the training data according to the weight distribution (e.g. decision stumps). Once the training
of the new base classifier fm+1 (x) is finished, αm+1 can be determined by minimizing
(10) assuming fm+1 (x) is fixed. By setting ∂Em+1 /∂αm+1 = 0 the closed form solu-
tion can be derived as (3), where εm+1 is defined as in (2). After we obtain fm+1 (x)
and αm+1, before continuing to round m + 1 of the boosting procedure and training of fm+2, the example weights w_i^m have to be updated. By making use of (9), weights for the next iteration can be calculated as (4), where I(y_i ≠ f_{m+1}(x_i)) is an indicator function which equals 1 if the i-th example is misclassified by fm+1 and 0 otherwise. Thus, weight w_i^{m+1} depends on the performance of all previous base classifiers on the i-th example.
the exponential function is executed in iterations, each time adding a new base clas-
sifier. The resulting learning algorithm is identical to the familiar AdaBoost algorithm
summarized in Algorithm 1.
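For concreteness, here is a compact Python sketch of this procedure, with decision stumps as base classifiers. Since Equations (3), (4) and (5) are only referenced above, their standard AdaBoost forms are assumed: α = ½ ln((1 − ε)/ε), the multiplicative weight update for misclassified examples, and the sign of the weighted committee.

import numpy as np

def train_stump(X, y, w):
    """Fit a single-feature threshold classifier by minimizing Eq. (1)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, lambda X: np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, M):
    w = np.ones(len(y))                  # unnormalized example weights, Eq. (9)
    ensemble = []
    for _ in range(M):
        eps, f = train_stump(X, y, w)    # minimize the weighted error, Eq. (1)
        if eps >= 0.5:                   # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))   # Eq. (3), assumed form
        w = w * np.exp(alpha * (f(X) != y))                 # Eq. (4), assumed form
        ensemble.append((alpha, f))
    return ensemble

def predict(ensemble, X):
    F = sum(a * f(X) for a, f in ensemble)   # additive model, Eq. (7)
    return np.sign(F)                        # weighted committee, Eq. (5) assumed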
Consequences. There is an important aspect of AdaBoost relevant to the development of its incremental variant. Due to the iterative nature of the algorithm, ∂E_{m+1}/∂α_j = 0 will only hold for the most recent classifier, j = m + 1, but not necessarily for the previous ones, j = 1, ..., m. Thus, AdaBoost is not a global optimizer of the confidence parameters αj [12]. As an alternative to the iterative optimization, one could attempt to globally optimize all α parameters after the addition of a base classifier. In spite of being more time consuming, this would lead to better overall performance.
Weak learners (e.g. decision stumps) are typically used as base classifiers because of
their inability to overfit the weighted data, which could produce a very large or infinite
value of αm+1 .
As observed in [10], AdaBoost guarantees exponential progress towards minimiza-
tion of the training error (6) with addition of each new weak classifier, as long as they
classify the weighted training examples better than random guessing (αm+1 > 0). The
convergence rate is determined in [13]. Note that αm+1 can also be negative if fm+1 does worse than 50% on the weighted set. In this case the (m+1)-th classifier automatically changes polarity because it is expected to make more wrong predictions than correct
ones. Alternatively, fm+1 can be removed from the ensemble. A common approach is
to terminate the boosting procedure when weak learners with positive confidence pa-
rameters can no longer be produced.
In the next section, we introduce IBoost, an algorithm which naturally extends Ad-
aBoost to incremental learning where the cost function changes as the data set changes.
When the training data set changes, the cost function (6) changes to

E_m^{new} = \sum_{i \in D_{new}} e^{-y_i \cdot F_m(x_i)}.    (12)
There are several choices one could make regarding reuse of the current ensemble
Fm (x):
1. update αt , t = 1, ..., m, to better fit the new data set;
2. update base classifiers in Fm (x);
3. update both αt , t = 1, ..., m, and the base classifiers in Fm (x);
4. add a new base classifier fm+1 and its αm+1 .
The second and third alternatives are not considered here because they would require po-
tentially costly updates of base classifiers and would also require updates of example
weights and confidence parameters of base classifiers. In the remainder of the paper, it
will be assumed that trained base classifiers are fixed. It will be allowed, however, to
remove the existing classifiers from an ensemble.
The first alternative involves updating confidence parameters αj , j = 1, ..., m, in
such way that they now minimize (12). This can be achieved in two ways.
Batch Update updates each αj using the gradient descent algorithm α_j^{new} = α_j^{old} − η · ∂E_m^{new}/∂α_j, where η is the learning rate. Following this, the resulting update rule is

\alpha_j^{new} = \alpha_j^{old} + \eta \sum_{i \in D_{new}} y_i f_j(x_i) \, e^{-y_i \sum_{k=1}^{m} \alpha_k^{old} f_k(x_i)}.    (13)
One update of the m confidence parameters takes O(N · m) time. If the training set
changed only slightly, only a few updates should be sufficient for the convergence.
The number of batch updates to be performed should be selected depending on the
computational constraints.
Stochastic Update is a faster alternative for updating each αj. It uses stochastic gradient descent instead of the batch version. The update of αj using only example (x_i, y_i) ∈ D_{new} is

\alpha_j^{new} = \alpha_j^{old} + \eta \, y_i f_j(x_i) \, e^{-y_i \sum_{k=1}^{m} \alpha_k^{old} f_k(x_i)}.    (14)
At the extreme, we can run the stochastic gradient using only the new examples,
(xi , yi ) ∈ Din . This kind of updating is especially appropriate for an aggressive it-
erative schedule where data are arriving one example at a time at a very fast rate and it
is infeasible to perform batch update.
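A minimal sketch of the two update modes, assuming the base-classifier predictions on the current window are precomputed in an m × N matrix; the names and the vectorization are illustrative.

import numpy as np

def batch_update(alpha, preds, y, eta, n_iter=1):
    """Eq. (13): gradient step over all window examples; preds is (m, N)."""
    for _ in range(n_iter):
        margin = y * (alpha @ preds)              # y_i * F_m(x_i) for each i
        alpha = alpha + eta * ((preds * y) @ np.exp(-margin))
    return alpha

def stochastic_update(alpha, preds_i, y_i, eta):
    """Eq. (14): update from a single example; preds_i holds f_k(x_i)."""
    margin = y_i * (alpha @ preds_i)
    return alpha + eta * y_i * preds_i * np.exp(-margin)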
The fourth alternative (adding a new base classifier) is attractive because it allows
training a new base classifier on the new data set in a way that optimally utilizes the ex-
isting boosting ensemble. Before training fm+1 we have to determine example weights.
Weight Calculation. There are three scenarios when there is a need to calculate or
update the example weights.
First, if confidence parameters were unchanged since the last iteration, we can keep
the weights of the old examples and only calculate the weights of the new ones using
w_i^m = e^{\sum_{t=1}^{m} \alpha_t I(y_i \neq f_t(x_i))}, \quad i \in D_{in}.    (15)
Second, if confidence parameters were updated, then all example weights have to be
calculated using (9).
Third, if any base classifier fj was removed, the example weights can be updated by
applying
w_i^m = w_i^{m-1} e^{-\alpha_j I(y_i \neq f_j(x_i))},    (16)
which is as fast as (4).
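The three weight-maintenance scenarios can be summarized in a few lines, again with illustrative names and a precomputed prediction matrix:

import numpy as np

def weights_full(alpha, preds, y):
    """Eq. (9): recompute all example weights from scratch."""
    return np.exp(-y * (alpha @ preds))

def weights_new_examples(alpha, preds_new, y_new):
    """Eq. (15): weights of the newly arrived examples only."""
    miss = preds_new != y_new                 # (m, N_in) misclassification flags
    return np.exp(alpha @ miss)

def weights_after_removal(w, alpha_j, preds_j, y):
    """Eq. (16): cancel the influence of a removed classifier f_j."""
    return w * np.exp(-alpha_j * (preds_j != y))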
Adding Base Classifiers. After updating the example weights, we can proceed to train
a new base classifier in the standard boosting fashion. When deciding how exactly to
update the ensemble when the data changes, one should consider a tradeoff between the
accuracy and computational effort. The first question is whether to train a new classifier
and the second whether to update α values, and if the answer is affirmative, which
update mode to use and how many update iterations to run.
In applications where the data set changes very frequently and only slightly, it can become infeasible to train a new base classifier after each change. In that case one can intentionally wait until enough incoming examples are misclassified by the current model Fm and only then decide to add a new base classifier fm+1. In the meantime, the computational resources can be used to update the α parameters.
Removing Base Classifiers. In order to avoid unbounded growth in the number of base classifiers, we propose a strategy that removes a base classifier each time a predetermined budget is exceeded. Similar strategies, in which the oldest base model [5] or the one with the poorest performance on the current data [1, 2, 4] is removed, have been proposed.
Additionally, in case of data with concept change, a classifier fj can become out-
dated and receive negative αj , as a result of (13) or (14), because it is trained on older
examples that were drawn from a different concept.
Following this discussion, our strategy is to remove classifiers if one of the two
scenarios occurs:
• Memory is full and we want to add fm+1 : remove the classifier with the lowest α
• Remove fj if αj becomes negative during the α updates
If classifier fj is removed, it is equivalent to setting αj = 0. To account for this change, the α parameters of the remaining classifiers are updated using (13) or (14) and the example weights are recalculated using (9). If time does not permit any modification of the α parameters, the influence of the removed classifier on the example weights can be canceled using (16).
Convergence. An appealing feature of IBoost is that it retains the AdaBoost convergence properties. At any given time and for any given set of m base classifiers f1, f2, ..., fm, as long as the confidence parameters α1, α2, ..., αm are positive and minimize E_m^{new}, the addition of a new base classifier fm+1 by minimizing (1) and the calculation of αm+1 using (3) will lead towards minimization of E_m^{new}. This ensures the convergence of IBoost.
Figure 1 presents a summary of the IBoost algorithm. We take a slightly wider view and point out all the options a practitioner could select, depending on the particular application and computational constraints.
Initial data and example weights are used to train the first base classifier. After the
data set is updated, the user always has a choice of just updating the confidence param-
eters, training a new classifier, or doing both.
First, the user decides whether to train a new classifier or not. If the choice is not to,
the algorithm just updates the confidence parameters using (13) or (14) and modifies the
example weights (9). Otherwise, it proceeds to check if the budget is full and potentially
removes the base classifier with minimum α. Next, the user can choose whether to
update α parameters. If the choice is to perform the update, α parameters are updated
using (13) or (14) which is followed by recalculation of example weights (9). Otherwise,
before proceeding to training a new base classifier, the algorithm still has to calculate
weights for the new examples Din using (15). Finally, the algorithm proceeds with
training fm by minimizing (1), calculating αm (3) and updating example weights (4).
Fig. 1. Flowchart of the IBoost algorithm.
IBoost (Fig. 1) was designed to provide great flexibility with respect to budget, training and prediction speed, and stream properties. In this paper, we present an IBoost
variant for Concept Change. However, using the flowchart we can also easily design
variants for applications such as Active Learning or Outlier Removal.
Learning under concept change has received a great deal of attention during the last decade, with a number of learning strategies developed (see the overview in [14]). IBoost falls into the category of adaptive ensembles with instance weighting.
When dealing with data streams with concept change it is beneficial to use a sliding
window approach. At each step, one example is being added and one is being removed.
The common strategy is to remove the oldest one; however, other strategies exist. Se-
lection of window size n presents a tradeoff between achieving maximum accuracy on
the current concept and fast recovery from distribution changes.
As we previously discussed, IBoost is highly flexible as it can be customized to meet memory and time constraints. For concept change applications we propose the IBoost variant summarized in Algorithm 2. In this setup, after each window repositioning the data within the window is used to update the α parameters and potentially train a new base classifier fm+1. The Stochastic version performs b updates of each α using the newest example only, while the Batch version performs b iterations of α updates using all the examples in the window. The new classifier is added when the AddCriterion: (k mod p = 0) ∧ (yk ≠ Fm(xk)) is satisfied, where (xk, yk) is the new data point, Fm(xk) is the current ensemble prediction and p is the parameter which controls how often base models are potentially added. This is a common criterion used in ensemble
algorithms which perform base model addition and removal [4, 5].
The free parameters (M, n, p and b) are quite intuitive and should be relatively easy
to select for a specific application. Larger p values can speed up the process with a slight decrease in performance. As the budget M increases, so does the accuracy, at the cost of increased prediction time, model update time and storage requirements. Finally, selection of b is a tradeoff between accuracy, concept change recovery and time.
3 Experiments
In this section, IBoost performance in four different concept change applications is evaluated. Three synthetic data sets and one real-world data set, with different drift types (sudden, gradual and rigorous), were used. The data generation and all the experiments were repeated 10 times. The average test set classification accuracy is reported.
SEA synthetic data [15]. The data consists of three attributes, each one in the range from 0 to 10, and the target variable yi which is set to +1 if xi1 + xi2 ≤ b and −1 otherwise, where b ∈ {7, 8, 9, 9.5}. The data stream used has 50,000 examples. For the first 12,500 examples, the target concept has b = 8. For the second 12,500 examples, b = 9; for the third, b = 7; and for the fourth, b = 9.5. After each window slide, the current ensemble is tested using 2,500 test set examples from the current concept.
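A small sketch of the SEA stream as described, with an assumed random seed:

import numpy as np

def sea_stream(n_total=50_000, seed=0):
    """Yield (x, y) pairs from the SEA concept-change stream."""
    rng = np.random.default_rng(seed)
    thresholds = [8.0, 9.0, 7.0, 9.5]         # concept switches every 12,500 examples
    for i in range(n_total):
        b = thresholds[min(i // 12_500, 3)]
        x = rng.uniform(0.0, 10.0, size=3)
        y = 1 if x[0] + x[1] <= b else -1
        yield x, y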
Algorithm 2. IBoost variant for concept change
(0) Initialize window Dnew = {(xi , yi ), i = 1, ..., n} and window data weights wi0 = 1/n
(a) k = n. Train f1 (1) using Dnew , calculate α1 (3), update weights winew (4), m = 1
(b) Slide the window: k = k + 1, Dnew = Dold + (xk , yk ) − (xk−n , yk−n )
(c) If (k mod p = 0) ∧ (yk ≠ Fm (xk )),
  (c.1) If (m = M )
    (c.1.1) Remove fj with minimum αj , m = m − 1
  (c.2) Update αj , j = 1, ..., m using (13) or (14) b times
  (c.3) Recalculate winew using (9)
  (c.4) Train fm+1 (1), calculate αm+1 (3), update weights winew (4)
  (c.5) m = m + 1
(d) Else
  (d.1) Update αj , j = 1, ..., m using (13) or (14) b times, recalculate winew using (9)
(e) If any αj < 0, j = 1, ..., m
  (e.1) Remove fj , m = m − 1
  (e.2) Update αj , j = 1, ..., m using (13) or (14) b times, recalculate winew using (9)
(f) Jump to (b)
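A Python sketch of Algorithm 2 follows. It reuses the hypothetical helpers from the earlier sketches (train_stump, batch_update, weights_full), recomputes the window weights from scratch at every step for simplicity, and shows only the Batch update mode; it is a rough illustration, not a faithful reimplementation.

import numpy as np

def _alpha(eps):
    return 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))    # Eq. (3), assumed form

def iboost_stream(stream, n=200, M=50, p=1, b=5, eta=0.01):
    window = [next(stream) for _ in range(n)]             # (0) initial window
    X = np.array([x for x, _ in window]); y = np.array([t for _, t in window])
    eps, f1 = train_stump(X, y, np.ones(n) / n)           # (a) first classifier
    fs, alpha = [f1], np.array([_alpha(eps)])
    for k, (xk, yk) in enumerate(stream, start=n + 1):
        window = window[1:] + [(xk, yk)]                  # (b) slide the window
        X = np.array([x for x, _ in window]); y = np.array([t for _, t in window])
        preds = np.array([clf(X) for clf in fs])          # (m, n) prediction matrix
        pred_k = np.sign(alpha @ np.array([clf(xk.reshape(1, -1))[0] for clf in fs]))
        if k % p == 0 and pred_k != yk:                   # (c) AddCriterion
            if len(fs) == M:                              # (c.1) budget exceeded
                j = int(np.argmin(alpha))
                del fs[j]; alpha = np.delete(alpha, j); preds = np.delete(preds, j, 0)
            alpha = batch_update(alpha, preds, y, eta, n_iter=b)    # (c.2)
            w = weights_full(alpha, preds, y)             # (c.3) recompute weights
            eps, f_new = train_stump(X, y, w)             # (c.4) new base classifier
            fs.append(f_new); alpha = np.append(alpha, _alpha(eps))
        else:                                             # (d) update alphas only
            alpha = batch_update(alpha, preds, y, eta, n_iter=b)
        keep = alpha > 0                                  # (e) drop negative alphas
        fs = [clf for clf, kp in zip(fs, keep) if kp]; alpha = alpha[keep]
        if not fs:                                        # degenerate case: restart
            eps, f1 = train_stump(X, y, np.ones(n) / n)
            fs, alpha = [f1], np.array([_alpha(eps)])
        yield fs, alpha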
Santa Fe time series data (collection A) [16] was used to test IBoost performance on a real-world gradual concept change problem. The goal is to predict the measurement gi ∈ R based on 9 previous observations. The original regression problem with a target value gi was converted to classification such that yi = 1 if gi ≤ b, where b ∈ {−0.5, 0, 1}, and yi = −1 otherwise. The data stream contains 9,990 examples. For the first 3,330 examples, b = −0.5; for the second 3,330 examples, b = 0; and b = 1 for the remaining ones. Testing is done using holdout data with 825 examples from the current concept. Gradual drifts were simulated by smooth transition of b over 1,000 examples.
Random RBF synthetic data [3]. This generator can create data which contains a rigorous concept change type. First, a fixed number of centroids are generated in feature space, each assigned a single class label, weight and standard deviation. The examples are then generated by selecting a center at random, taking weights into account, and displacing it in a random direction from the centroid by a random displacement length, drawn from a Gaussian distribution with the centroid's standard deviation. Drift is introduced by moving the centers with constant speed. In order to test IBoost on large binary data sets, we generated 10 centers, which were assigned class labels {−1, +1} and a drift parameter 0.001, and simulated one million RBF data examples. Evaluation was
done using interleaved test-then-train methodology: every example was used for testing
the model before it was used for training the model.
LED data. The goal is to predict the digit displayed on a seven segment LED display,
where each binary attribute has a 10% chance of being inverted. The original 10-class
problem was converted to binary by representing digits {1, 2, 4, 5, 7} (non-round digits)
as +1 and digits {3, 6, 8, 9, 0} (round digits) as −1. Four attributes (out of 7) were se-
lected to have drifts. We simulated one million examples and evaluated the performance
using interleaved test-then-train. The data is available in the UCI repository.
3.2 Algorithms
IBoost was compared to non-incremental AdaBoost, Online Coordinate Boost, Online-
Boost and its two modifications for concept change (NSOnlineBoost and FLC), Fast
and Light Boosting, DWM and AdWin Online Bagging.
OnlineBoost [6] starts with some initial base models fj , j = 1, ..., m, which are assigned weights λ_j^{sc} = 0 and λ_j^{sw} = 0. When a new example (xi , yi ) arrives it is assigned an initial example weight of λd = 1. Then, OnlineBoost uses a Poisson distribution for sampling and updates each model fj k = Poisson(λd ) times using (xi , yi ). Next, if fj (xi ) = yi the example weight is updated as λd = λd /(2(1 − εj )) and λ_j^{sc} = λ_j^{sc} + λd ; otherwise λd = λd /(2εj ) and λ_j^{sw} = λ_j^{sw} + λd , where εj = λ_j^{sw} /(λ_j^{sw} + λ_j^{sc} ), before proceeding to updating the next base model fj+1 . Confidence parameters α for each base classifier are obtained using (3) and the final predictions are made using (5). Since OnlineBoost updates all the base models using each new observation, their performance on the previous examples changes and so should the weighted sums λ_m^{sc} and λ_m^{sw} . Still, the unchanged sums are used to calculate α, and thus the resulting α are not optimized.
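A sketch of one OnlineBoost step as just described; the models are assumed to expose incremental update and predict methods, which is an interface invented here for illustration.

import numpy as np

def online_boost_step(models, lam_sc, lam_sw, x, y, rng):
    """Process one example (x, y), updating the models and the weight sums."""
    lam_d = 1.0
    for j, f in enumerate(models):
        k = rng.poisson(lam_d)
        f.update(x, y, k)                      # train f_j k times on (x, y)
        if f.predict(x) == y:
            lam_sc[j] += lam_d
            eps = lam_sw[j] / (lam_sw[j] + lam_sc[j])
            lam_d = lam_d / (2.0 * (1.0 - eps))
        else:
            lam_sw[j] += lam_d
            eps = lam_sw[j] / (lam_sw[j] + lam_sc[j])
            lam_d = lam_d / (2.0 * eps)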
NSOnlineBoost [7] In the original OnlineBoost algorithm, initial classifiers are in-
crementally learned using all examples in an online manner. Base classifier addition
or removal is not used. This is why poor recovery from concept change is expected.
NSOnlineBoost modification introduces a sliding window and base classifier addition
and removal. The training is conducted in the OnlineBoost manner until the update pe-
riod pns is reached. Then, the ensemble Fm classification error on the examples in the
window is calculated and compared to the ensemble Fm − fj , where m includes all
the base models trained using at least Kns = 100 points. If removing any fj improves
the ensemble performance on the window, it is removed and a new classifier is added
with initial values λ_m^{sc} = 0, λ_m^{sw} = 0 and ε_m = 0.
Fast and Light Classifier (FLC) [8] is a straightforward extension of OnlineBoost
that uses an Adaptive Window (AdWin) change detection technique [17] to heuristically
increase example weights when the change is detected. The base classifiers and their
confidence parameters are initialized and updated in the same way as in OnlineBoost.
When a new example arrives (λd = 1), AdWin checks, for every possible split into "large enough" sub-windows, whether their average classification rates differ by more than a certain threshold d, set to k window standard deviations. If a change is detected, the new example updates all m base classifiers with weights that are calculated using λd = (1 − εj )/εj , where εj = λ_j^{sw} /(λ_j^{sw} + λ_j^{sc} ), j = 1, ..., m. The window then
drops the older sub-window and continues to grow back to its maximum size with the
examples from the new concept. If the change is not detected, the example weights and
base classifiers are updated in the same manner as in OnlineBoost.
AdWin Bagging [3] is the OnlineBagging algorithm proposed in [6] which uses the
AdWin technique [17] as a change detector and to estimate the error rates for each base
model. It starts with initial base models fj , j = 1, ..., m. Then, when a new example
526 M. Grbovic and S. Vucetic
(xi , yi ) arrives, each model fj is updated k = Poisson(1) times using (xi , yi ). Final
prediction is given by simple majority vote. If the change is detected, the base classifier
with the highest error rate εj is removed and a new one is added.
Online Coordinate Boost (OCB) [9] requires initial base models fj , j = 1, ..., m,
trained offline using some initial data. The initial training also provides the starting
confidence parameter values αj , j = 1, ..., m, and the sums of weights of correctly and incorrectly classified examples for each base classifier (λ_j^{sc} and λ_j^{sw} , respectively). When
a new example (xi , yi ) arrives, the goal is to find the appropriate updates Δαj for αj
such that the AdaBoost loss (6) with the addition of the last example is minimized. Be-
cause these updates cannot be found in the closed form, the authors derived closed form
updates that minimize the approximate loss instead of the exact one. Such optimization
requires keeping and updating the sums of weights (λ_{(j,l)}^{sc} and λ_{(j,l)}^{sw} ) which involve two weak hypotheses j and l and the introduction of the order parameter o. To avoid numerical
errors, the algorithm requires initialization with the data of large enough length nocb
and selection of the proper order parameter.
FLB [5]. The algorithm assumes that data arrive in disjoint blocks of size nflb . Given a new block Bj , ensemble example weights are calculated depending on
the ensemble error rate εj , where the weight of a misclassified example xi is set to
wi = (1 − εj )/εj and the weight of a correctly classified sample is left unchanged.
A new base classifier is trained using the weighted block. The process repeats until a
new block arrives. If the number of classifiers reaches the budget M , the oldest one is
removed. The base classifier predictions are combined by averaging the probability pre-
dictions and selecting the class with the highest probability. There is also a change de-
tection algorithm running in the background, which discards the entire ensemble when
a change is detected. It is based on the assumption that the ensemble performance θ on the batch follows a Gaussian distribution. A change is flagged when the distribution of θ changes from one Gaussian to another, which is detected using a threshold τ .
DWM [1] is a heuristically-based ensemble method for handling concept change. It starts with a single classifier f1 trained using the initial data and α1 = 1. Then, each time a new example xk arrives it updates the weights α of the existing classifiers: a classifier that incorrectly labels the current example receives a reduction in its weight by a multiplicative constant β = 0.5. After each pdwm examples, classifiers whose weights fall under a threshold θr = 0.01 are removed and if, in addition, (yk ≠ Fm (xk )), a new classifier fm+1 with αm+1 = 1 is trained using the data in the window. Finally, all classifiers are updated using (xk , yk ) and their weights α are normalized. The global prediction for the current ensemble Fm (xk ) is always made by the weighted majority (5). When the memory for storing base classifiers is full and a new classifier needs to be stored, the classifier with the lowest α is removed.
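A sketch of one DWM step as described, with an assumed classifier interface and a train_new callback that fits a fresh base classifier on the current window:

import numpy as np

def dwm_step(classifiers, x, y, k, train_new, p_dwm=50, beta=0.5, theta_r=0.01):
    for c in classifiers:                         # demote classifiers that err
        if c["model"].predict(x) != y:
            c["alpha"] *= beta
    if k % p_dwm == 0:
        classifiers[:] = [c for c in classifiers if c["alpha"] >= theta_r]
        votes = sum(c["alpha"] * c["model"].predict(x) for c in classifiers)
        if np.sign(votes) != y:                   # ensemble erred: add a classifier
            classifiers.append({"alpha": 1.0, "model": train_new()})
    for c in classifiers:                         # incremental update of all models
        c["model"].update(x, y)
    total = sum(c["alpha"] for c in classifiers)  # normalize the weights
    for c in classifiers:
        c["alpha"] /= total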
In general, the described concept change algorithms can be divided into several
groups based on their characteristics (Table 1).
3.3 Results
We performed an in-depth evaluation of IBoost and competitor algorithms for differ-
ent values of window size n = {100, 200, 500, 1,000, 2,000}, base classifier budget
M = {20, 50, 100, 200, 500} and update frequency p = {1, 10, 50}. We also evaluated
the performance of IBoost for different values of b = {1, 5, 10}. Both IBoost Batch (13)
and IBoost Stochastic (14) were considered.
In the first set of experiments IBoost was compared to the benchmark AdaBoost
algorithm, OCB and FLB on the SEA data set. Simple Decision Stumps (single-level
Decision Trees) were used as base classifiers. Their number was limited to M . Both
AdaBoost and IBoost start with a single example and when the number of examples
reaches n they begin using a window of size n. Each time the window slides and the
AddCriterion with p = 1 is satisfied, AdaBoost retrains M classifiers from scratch,
while IBoost trains a single new classifier (Algorithm 2).
OCB was initialized using the first nocb = n data points and then it updated confi-
dence parameters using the incoming data. Depending on the budget M , the OCB order
parameter o was set to the value that resulted in the best performance. Increasing o re-
sults in improved performance. However, performance deteriorates if it is increased too
much. In FLB, disjoint data batches of size nflb = n were used (equivalent to p = nflb ).
Five base models were trained using each batch while constraining the budget as ex-
plained in [5]. Class probability outputs for Decision Stumps were calculated based on
the distance from the split and the threshold for the change detector was selected such
that the false positive rate is 1%.
Figure 2 compares the algorithms in the M = 200, n = 200 setting. Performances
for different values of M and n are compared in Table 2 based on the test accuracy, concept change recovery (average test accuracy on the first 600 examples after introduction of a new concept) and training times.
Fig. 2. SEA data (M = 200, n = 200): test accuracy over time. Training times: IBoost-DS Batch 1,113 sec; IBoost-DS Stochastic 898 sec; AdaBoost-DS 913 sec; OCB-DS 590 sec; FLB-DS 207 sec.
Both IBoost versions achieved much higher classification accuracy than AdaBoost and were faster to train. This can be explained by the fact that AdaBoost deletes the influence of all previously seen examples outside the current window by discarding the whole ensemble and retraining. The better performance of IBoost compared to OCB can be explained by the difference in updating confidence parameters and the fact that OCB never adds or removes base classifiers. Inferior FLB results show that removing the entire ensemble when a change is detected is not the most effective solution.
IBoost Batch was more accurate than IBoost Stochastic. However, the training time
of IBoost Batch was significantly higher. Considering this, IBoost Stochastic represents
a reasonable tradeoff between performance and time. Fastest recovery for all three con-
cept changes was achieved by IBoost Stochastic. This is because the confidence param-
eters updates of IBoost Stochastic are performed using only the most recent example.
Some general conclusions are that the increase in budget M resulted in larger training
times and accuracy gains for all algorithms. Also, as the window size n grew, the performance of both IBoost and retrained AdaBoost improved at the cost of increased training time and slower concept change recovery. With a larger window the recovery perfor-
mance gap between the two approaches increased, while the test accuracy gap reduced
as the retrained AdaBoost generalization error decreased.
As we discussed in section 3.2, one can select different IBoost b and p parameters
depending on the stream properties. Table 3 shows how the performance on SEA data
changed as they were adjusted. We can conclude that bigger values of b improved the
performance at cost of increasing the training time, while bigger values of p degraded
the performance (some just slightly, e.g. p = 10) coupled with big time savings.
Table 3. SEA data (M = 200, n = 200): test accuracy, recovery and training time for different values of b (with p = 1) and of p (with b = 1)

                                    p = 1                       b = 1
Algorithm                    b = 1   b = 5   b = 10     p = 10   p = 50   p = 100
IBoost Stochastic
  test accuracy (%)          96.7    97.1    97.4       96.5     94.7     93.1
  recovery (%)               93.1    93.5    93.7       92.8     92.7     92.1
  time (s)                   201     372     635        104      45       22
IBoost Batch
  test accuracy (%)          97.6    97.9    98.2       97.1     95.6     93.7
  recovery (%)               92.3    92.5    92.9       92.6     91.6     91.4
  time (s)                   545     898     1.6K       221      133      96

Fig. 3. SEA data: test accuracy over time for IBoost Stochastic and the competitor algorithms.

In the second set of experiments, IBoost Stochastic (p = 1, b = 5) was compared to the algorithms from Table 1 on both SEA and Santa Fe data sets. Naïve Bayes was
chosen to be the base classifier in these experiments because of its ability to be incre-
mentally improved, which is a prerequisite for some of the competitors (Table 1). All
algorithms used a budget of M = 50. The algorithms that use a moving window had
a window of size n = 200. Additional parameters for different algorithms were set as
follows: OCB was initialized offline with nocb = 2K data points and used o = 5, FLB
used batches of size pflb = 200, NSOnlineBoost used pns = 10 and DWM pdwm = 50.
Results for the SEA data set are presented in Fig. 3. As expected, OnlineBoost had
poor concept change recovery because it never removes or adds new models. Its two
non-stationary versions, NSOnlineBoost and FLC, outperformed it. FLC was particu-
larly good during the first three concepts where it had almost the same performance as
IBoost. However, its performance deteriorated after introduction of the fourth concept.
The opposite happened for DWM which was worse than IBoost in all but the fourth
concept. We can conclude that IBoost outperformed all algorithms, while being very
fast (it came in second, after DWM).
Results for the Santa Fe data set are presented in Fig. 4. Similar conclusions to those above can be drawn. In the first concept several algorithms showed almost identical
performance. However, when the concept changed IBoost was the most accurate. Ta-
ble 4 summarizes test accuracies on both data sets.
AdWin Online Bagging showed interesting behavior on both the SEA and Santa Fe data sets. It did not suffer as large an accuracy drop due to concept drift as the other algorithms, and the direction of improvement suggests that it would reach IBoost performance if the duration of each particular concept were longer.
Fig. 4. Santa Fe data: test accuracy over time. Training times: IBoost-NB Stochastic 52.4 sec; OCB-NB 40.3 sec; AdWin OnlineBagg-NB 394 sec; FLC-NB 387 sec; OnlineBoost-NB 361 sec; DWM-NB 12.8 sec; FLB-NB 38.1 sec; NSOnlineBoost-NB 2,315 sec.
Table 4. SEA and Santa Fe performance summary based on the test accuracy (%)

Data Set   IBoost Stochastic   OnlineBoost   NSOBoost   FLC    AdWin Bagg   OCB    DWM    FLB
SEA        98.0                95.6          96.9       97.4   94.5         95.2   96.9   94.9
Santa Fe   94.1                81.8          85.1       83.4   80.0         80.6   88.8   87.6
4 Conclusion
In this paper we addressed the important problem of incremental learning. We proposed an extension of AdaBoost to incremental learning. The idea was to reuse and upgrade the existing ensemble when the training data are modified. The new algorithm was evaluated on concept change applications. The results showed that IBoost is more efficient, accurate, and resistant to concept change than the original AdaBoost, mainly because it retains memory about the examples that are removed from the training sliding window.
It also performed better than the previously proposed OnlineBoost and its non-stationary versions, DWM, Online Coordinate Boosting, FLB and AdWin Online Bagging. Our
future work will include extending IBoost to perform multi-class classification, com-
bining it with the powerful AdWin change detection technique and experimenting with
Hoeffding Trees as base classifiers.
References
[1] Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: A new ensemble method for track-
ing concept drift. In: ICDM, pp. 123–130 (2003)
[2] Scholz, M.: Knowledge-Based Sampling for Subgroup Discovery. In: Local Pattern Detec-
tion, pp. 171–189. Springer, Heidelberg (2005)
[3] Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R.: New ensemble methods for
evolving data streams. In: ACM SIGKDD (2009)
[4] Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble
classifiers. In: Proc. ACM SIGKDD, pp. 226–235 (2003)
[5] Chu, F., Zaniolo, C.: Fast and light boosting for adaptive mining of data streams. In: Proc.
PAKDD, pp. 282–292 (2004)
[6] Oza, N., Russell, S.: Experimental comparisons of online and batch versions of bagging and
boosting. In: ACM SIGKDD (2001)
[7] Pocock, A., Yiapanis, P., Singer, J., Lujan, M., Brown, G.: Online Non-Stationary Boosting.
In: Intl. Workshop on Multiple Classifier Systems (2010)
[8] Attar, V., Sinha, P., Wankhade, K.: A fast and light classifier for data streams. Evolving
Systems 1(4), 199–207 (2010)
[9] Pelossof, R., Jones, M., Vovsha, I., Rudin, C.: Online Coordinate Boosting. In: On-line
Learning for Computer Vision Workshop, ICCV (2009)
[10] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of
boosting. The Annals of Statistics 28, 337–407 (2000)
[11] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Machine Learn-
ing: Proceedings of the Thirteenth International Conference, pp. 148–156 (1996)
[12] Schapire, R.E., Singer, Y.: Improved Boosting Algorithms Using Confidence-rate Predic-
tions. Machine Learning Journal 37, 297–336 (1999)
[13] Schapire, R.E.: The convergence rate of adaboost. In: COLT (2010)
[14] Zliobaite, I.: Learning under Concept Drift: an Overview, Technical Report, Vilnius Uni-
versity, Faculty of Mathematics and Informatics (2009)
[15] Street, W., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification.
In: ACM SIGKDD, pp. 377–382 (2001)
[16] Weigend, A.S., Mangeas, M., Srivastava, A.N.: Nonlinear gated experts for time series:
discovering regimes and avoiding overfitting. International Journal of Neural Systems 6, 373–399 (1995)
[17] Bifet, A., Gavaldà, R.: Learning from time changing data with adaptive windowing. In:
SIAM International Conference on Data Mining, pp. 443–448 (2007)
Fast and Memory-Efficient Discovery of the
Top-k Relevant Subgroups in a Reduced
Candidate Space
1 Introduction
In applications of local pattern discovery tasks, one is typically interested in
obtaining a small yet meaningful set of patterns. The reason is that resources
for post-processing of the patterns are typically limited, whether the patterns are manually reviewed by human experts or used as input to a subsequent data-mining step, following a multi-step approach like LeGo [11].
Reducing the number of raw patterns to a subset of manageable size can be
done using different approaches: one is to use a quality function to assess the
value of the patterns, and to discard all but the k highest-quality patterns. This
is known as the “top-k” approach. It can be further subdivided depending on
the pattern type and the quality function considered. In this paper, we consider
the case where the data has a binary label, the quality function accounts for
the support in the different classes, and the patterns have the form of itemsets.
This setting is known as top-k supervised descriptive rule discovery, correlated
pattern mining or subgroup discovery [5]. In the following, we will stick with the
expression subgroup discovery and with the terminology used in this community.
Restricting to the top-k patterns (or subgroups, in our specific case) is not the
only approach to reduce the size of the output. A different line of research aims at
the identification and removal of patterns which are of little interest compared to other patterns (cf. [6,4]). This idea is formalized using constraints based on the interrelation between patterns. A particularly appealing approach along this line is the theory of relevance [14,7]. The idea of this approach, which applies only to binary labeled data, is to remove all patterns that are dominated (or covered) by another pattern. Here, a pattern is considered as dominating another pattern if
the dominating pattern covers at least all positives (i.e. target-class individuals)
covered by the dominated pattern, but no additional negative.
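This dominance relation is direct to state in code; here subgroups are represented by the sets of positive and negative record ids they cover, which is an illustrative representation.

def dominates(dominating, dominated):
    """True if `dominating` covers all positives of `dominated` and no extra negatives."""
    covers_all_positives = dominated["pos"] <= dominating["pos"]
    no_extra_negatives = dominating["neg"] <= dominated["neg"]
    return covers_all_positives and no_extra_negatives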
The theory of relevance not only allows one to get rid of multiple equivalent descriptions (as does the theory of closed sets), but also of trivial specializations which provide no additional insight over their generalizations. Due to this advantage, relevance has been used as a filtering criterion in several subgroup discovery applications [14,12,2]. In many settings, however, the number of relevant subgroups is still far larger than desired. In this case, a natural solution is to combine it with the top-k approach.
Up to now, however, no satisfying solution has been developed for the task of
top-k relevant subgroup discovery. Most algorithms apply non-admissible prun-
ing heuristics, with the result that high-quality subgroups can be overlooked
(e.g. [14,16]). The source of these problems is that relevance is a property not
defined locally, but with respect to the set of all other subgroups. The only
non-trivial algorithm which provably finds the exact solution to this task is
that of Garriga et al. [7]. This approach is based on the insight that all relevant
subgroups must be closed on the positives; moreover, the relevance of a closed-on-
the-positive can be determined based solely on the information about all closed-
on-the-positives. This gives rise to an algorithmic approach which exhaustively
traverses all closed-on-the-positives, stores them, and relies on this collection to
distinguish the relevant subgroups from the irrelevant closed-on-the-positives.
Obviously, this approach suffers from the drawback that a potentially very large
number of subgroups has to be stored in memory. For complex datasets, the high
memory requirements are a much more severe problem than the runtime. An-
other drawback of this approach is that it does not allow for admissible pruning
techniques based on a dynamically increasing threshold [22,17,8].
In this paper, we make the following contributions:
– We analyze existing top-k relevant subgroup discovery algorithms and show
how all except for the memory-demanding approach of [7] fail to guarantee
an exact solution;
– We present a simple solution to the relevance check which requires an amount
of memory only linear in k and the number of features. An additional ad-
vantage of this approach is that it can easily be combined with admissible
pruning techniques;
The remainder of the paper is structured as follows: after reviewing basic definitions in Section 2, we illustrate the task of relevant subgroup discovery in Section 3. Subsequently, we present our new approach and analyze its properties in Section 4, before we present empirical results in Section 5.
2 Preliminaries
In this section, we will define the task of subgroup discovery, review the theory
of relevance and discuss its connection to closure operators.
We will now illustrate this task using a simple example, before we show how
existing pruning approaches result in incorrect results.
Fig. 1. Subgroup lattice for the example. Relevant subgroups are highlighted
The figure illustrates that high-quality subgroups need not be relevant: for example, the two subgroups Children=yes & HighIncome & University and Children=no & HighIncome & University have high quality but are irrelevant, as they are merely a fragmentation of the relevant subgroup HighIncome & University.
The two above problems can be observed in our example scenario. Issue 1 arises if we search for the top-2 subgroups in Figure 1, and the nodes are visited in the following order: Children=yes, Children=yes & HighIncome & University, Children=no & HighIncome & University and HighIncome & University. When the computation ends, the result will only contain HighIncome & University, but miss the second relevant subgroup, Children=yes. Issue 2 arises if Children=yes & HighIncome & University and Children=no & HighIncome & University are added to the queue before Children=yes is visited: the effect is that the minimum quality threshold is increased to a level that will incorrectly prune Children=yes.
The above issues are the reason why most existing algorithms, like BSD [16],
do not guarantee an exact solution for the task of top-k relevant subgroup dis-
covery. This is problematic both because the outcome will typically be of less
value, and because it is not uniquely determined; in fact, the outcome can differ
among implementations and possibly even among execution traces. This effect is amplified if a beam search is applied (e.g. [14]), where there is no guarantee that the subgroups considered are not dominated by some subgroup outside the beam.
Approaches based on the closed-on-the-positives. The paper of Garriga et al. [7] is the first that proposes a non-trivial approach to correctly solve the relevant subgroup discovery task. The authors investigate the relation between closure
operators (cf. [19]) and relevance, and show that the relevant subgroups are
a subset of the subgroups closed on the positives. While the focus of the pa-
per is on structural properties and not on computational aspects, the authors
also propose the simple two-step algorithm CPosSd described in Section 2.4.
The search space considered by this algorithm — the closed-on-the-positives
— is a subset of the closed subgroups, thus it operates on a smaller candidate
space than all earlier approaches. The downside is that it does not account
for optimistic estimate pruning, and, probably more seriously, that it has very
high memory requirements, as the whole set of closed-on-the-positives has to be
stored.
The above proposition tells us that we can perform the relevance check based
only on the top-k relevant subgroups visited so far: The iterative deepening
traversal ensures that a pattern sd is only visited once all generalizations have
been visited; so if the quality of the newly visited pattern sd exceeds that of
the k-best subgroup visited so far, then the set of the best k relevant subgroups
visited includes all generalizations of sd with higher quality – that is, a superset
of the set G∗ mentioned in Proposition 2; hence, we can check the relevance
of sd. On the other hand, if the quality of sd is lower than that of the k-best subgroup visited, then we do not care about its relevance anyway.
To prove the correctness of Proposition 2, we first present two lemmas:
Lemma 3. If a closed-on-the-positive sd_irr is irrelevant, i.e. if there is a generalization sd ⊂ sd_irr closed on the positives with the same negative support as sd_irr, then there is also at least one relevant generalization sd_rel ⊂ sd_irr with the same negative support.
Proof. Let N be the set of all closed-on-the-positives generalizations of sdirr with
the same negative support as sd. There must be at least one sdrel in N such that
none of the patterns in N is a generalization of sdrel . From Proposition 1, we
can conclude that sdrel must be relevant and dominates sdirr .
Lemma 4. If a pattern sd_rel dominates another pattern sd_irr, then sd_rel has higher quality than sd_irr.
Proof. We have that |DB[sd_rel]| ≥ |DB[sd_irr]|, because sd_rel is a subset of sd_irr and support is antimonotonic. Thus, to show that sd_rel has higher quality, it is sufficient to show |TP(DB, sd_rel)|/|DB[sd_rel]| > |TP(DB, sd_irr)|/|DB[sd_irr]|. From Proposition 1, we can conclude that sd_rel and sd_irr have the same number of false positives; let F denote this number. Using F, we can restate the above inequality as |TP(DB, sd_rel)|/(|TP(DB, sd_rel)| + F) > |TP(DB, sd_irr)|/(|TP(DB, sd_irr)| + F). All that remains to show is thus that |TP(DB, sd_rel)| > |TP(DB, sd_irr)|. By definition of relevance, |TP(DB, sd_rel)| ≥ |TP(DB, sd_irr)|, and because sd_rel and sd_irr are different and closed on the positives, the inequality must be strict, which completes the proof.
Based upon these lemmas, it is straightforward to prove Proposition 2:
Proof. We first show that if sd is irrelevant, then there is a generalization in G∗ with the same negative support. From Lemma 3 we know that if sd is irrelevant, then there is at least one relevant generalization of sd with the same negative support dominating sd. Let sd_gen be such a generalization. Lemma 4 implies that q(DB, sd_gen) ≥ q(DB, sd) ≥ minQ, hence sd_gen is a member of the set G∗.
It remains to show that if sd is relevant, then there is no generalization in G∗ with the same negative support. This follows directly from Proposition 1.
main:
1: var result = queue with maximum capacity k (initially empty)
2: var minQ = 0
3: for limit = 1 to n do
4:     findSubgroupsWithDepthLimit(result, limit)
5: return result
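For illustration, the same driver in Python; the depth-limited traversal and the bounded result queue are assumed to be provided elsewhere (a sketch, not the paper's code).

def top_k_relevant(n, k, find_subgroups_with_depth_limit):
    result = []  # bounded min-heap of (quality, pattern) pairs, capacity k;
                 # its minimum plays the role of minQ
    for limit in range(1, n + 1):
        find_subgroups_with_depth_limit(result, limit, k)
    return sorted(result, reverse=True)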
4.3 Complexity
We now turn to the complexity of our algorithm. Let n denote the number of features in the dataset and m the number of records. The memory complexity is O(n² + kn), given that the maximum recursion depth is n, the maximum size of the result queue is k, and every subgroup description has length O(n).
Let us now consider the runtime complexity. For every node visited we compute the quality, test for relevance and consider at most n augmentations. The quality computation can be done in O(nm), while the relevance check can be done in O(kn). The computation of the successors in Line 5 involves the execution of n closure computations, which amounts to O(n²m). Altogether, the cost per node is thus O(n²m + kn). Finally, the number of nodes considered is obviously bounded by O(|C_p|·n), where C_p is the set of closed-on-the-positives and the factor n is caused by the iterative deepening approach.¹
Table 2 compares the runtime and space complexity of our algorithm with
CPosSd. Moreover, we show the complexity of classical and closed subgroup
discovery algorithms. Although these algorithms solve a different, simpler task,
it is interesting to observe they do not have a lower complexity. The expression
¹ In case the search space has the shape of a tree, the number of nodes visited by an iterative deepening approach is well known to be proportional to the size of the tree. Here, however, a tree shape is not guaranteed, which is why we use the looser bound involving the additional factor n.
S used in the table denotes the set of all subgroup descriptions, while C denotes the set of closed subgroups.
Let us consider the figures in more detail, starting with the memory complexity: except for CPosSd, all approaches can apply depth-first search (possibly iterated) and thus have moderate memory requirements. In contrast, CPosSd has to collect all closed-on-the-positives, each of which has a description of length n. Note that no pruning is applied, meaning that n·|C_p| is not merely a loose upper bound for the number of nodes stored in memory but the precise figure, which is why we use Θ-notation in the table. As the number of closed-on-the-positives can be exponential in n, this approach can quickly become infeasible.
Let us now turn to the runtime complexity. First, we compare the runtime complexity of our approach with that of classic and closed subgroup discovery algorithms. Probably the most important difference is that they operate on different spaces. While the complexity of our approach is otherwise higher by a linear factor (or a quadratic factor, compared to classic subgroup discovery), the space we consider, i.e. the closed-on-the-positives C_p, can be exponentially smaller than the one considered by the other approaches (i.e. C, respectively its superset S).
This is illustrated by the following family of datasets:
Proposition 5. For all n ∈ N⁺, there is a dataset DB_n of size n + 1 over n features such that the ratio of closed to closed-on-the-positives subgroups is O(2ⁿ).
5 Experimental Results
In this section we empirically compare our new relevant subgroup discovery
algorithm with existing algorithms. In particular, we considered the following
two questions:
– How does our algorithm perform compared to CPosSd?
– How does our algorithm perform compared to classical and closed subgroup
discovery algorithms?
We will not investigate and quantify the advantage of the relevant subgroups
over standard or closed subgroups, as the value of the relevance criterion on
similar datasets has been demonstrated elsewhere (cf. [7]).
Fig. 3. Number of nodes considered during relevant subgroup discovery (brackets indicate memory issues): (a) bin test quality, k=10; (b) bin test quality, k=100
Table 3. Total number of nodes visited and total runtime, with percentages relative to StdSd (k=10)

                        q=bt                                    q=wracc
                        StdSd        CloSd      Id-Rsd          StdSd       CloSd    Id-Rsd
total # nodes           346 921 363  1 742 316  590 068         15 873 969  459 434  120 967
percentage (vs. StdSd)  100%         0.5%       0.17%           100%        2.9%     0.76%
total runtime (sec)     2 717        286        118             147         100      45
percentage              100%         10.5%      4.4%            100%        68%      30%
representative of approaches like the algorithms SD and BSD discussed in Section 3.2, which apply some ad hoc and possibly incorrect relevance filtering but otherwise operate on the space of all subgroup descriptions.²
Figure 4 shows the number of nodes considered when k is set to 10 and either the binomial test or the WRACC quality function is used. Again, we use a logarithmic scale. The results for k = 100 are similar and omitted for space reasons. Note that for our algorithm (“ID-Rsd”) all nodes are closed-on-the-positives, while for the closed subgroup discovery approach (“CloSd”) they are closed, and for the classic approach (“StdSd”) they are arbitrary subgroup descriptions.
The results differ strongly depending on the characteristics of the data. For several datasets, our approach reduces the number of nodes considered. The difference to the classical subgroup miner DpSubgroup is particularly apparent, where it often amounts to several orders of magnitude.
There are, however, several datasets where our algorithm traverses more nodes than the classical approaches. Again, the effect is particularly pronounced in comparison with the classical subgroup miner. Besides the overhead caused by the multiple iterations, one reason for this effect is that the quality of the k-th pattern found differs between the algorithms: for the relevant subgroup algorithm, the k-best quality tends to be lower, because this approach suppresses
² Note that BSD becomes faster if its relevance check is disabled [16].
high-quality but irrelevant subgroups. One could argue that it would be fairer to use a larger k-value for the non-relevant algorithms, as their output contains more redundancy.
Overall, our algorithm is competitive with (or somewhat faster than) the other approaches, as the aggregated figures in Table 3 show. Although the cost per node is lower for classical subgroup discovery than for the other approaches, overall this does not compensate for the much larger number of nodes traversed.
6 Conclusions
In this paper, we have presented a new algorithm for the task of top-k relevant
subgroup discovery. The algorithm is the first that finds the top-k relevant sub-
groups based on a traversal of the closed-on-the-positives, while avoiding the
high memory requirements of the approach of Garriga et al. [7]. Moreover, it
allows for the use of optimistic estimate pruning, which reduces the fraction of
closed-on-the-positives effectively considered.
The central idea of our algorithm is the memory-efficient relevance test, which makes do with only the information about the k best patterns visited so far. Note that restricting the candidate space to (a subset of) the closed-on-the-positives not only reduces the size of the search space: it also ensures the correctness of our memory-efficient relevance check. The best k patterns can only be used to determine the relevance of a new high-quality pattern if all patterns visited are closed-on-the-positives subgroups, not if we were to consider arbitrary subgroup descriptions. The restriction to the closed-on-the-positives is thus a prerequisite for the correctness of our approach.
The new algorithm performs quite well compared to existing approaches.
It not only avoids the high memory requirements of the approach of Garriga
et al. [7], but also clearly outperforms this approach. Moreover, it is competitive
with the existing non-relevant top-k subgroup discovery algorithms. This is par-
ticularly remarkable as it produces more valuable patterns than those simpler
approaches.
References
1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
2. Atzmueller, M., Lemmerich, F., Krause, B., Hotho, A.: Towards Understanding
Spammers - Discovering Local Patterns for Concept Characterization and Descrip-
tion. In: Proc. of the LeGo Workshop at ECML-PKDD (2009)
3. Atzmueller, M., Lemmerich, F.: Fast subgroup discovery for continuous target con-
cepts. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds.) ISMIS 2009. LNCS,
vol. 5722, pp. 35–44. Springer, Heidelberg (2009)
1 Introduction
In many applications in machine learning and data mining, one is often confronted with very high-dimensional data. High dimensionality increases the time and space requirements for processing the data. Moreover, in the presence of many irrelevant and/or redundant features, learning methods tend to over-fit and become less interpretable. A common way to resolve this problem is dimensionality reduction, which has attracted much attention in the machine learning community in the past decades. Generally speaking, dimensionality reduction can be achieved by either feature selection [8] or subspace learning [12] [11] [25] (a.k.a. feature transformation). The philosophy behind feature selection is that not all the features are useful for learning; hence it aims to select a subset of the most informative or discriminative features from the original feature set. The basic idea of subspace learning is that a combination of the original features may be more helpful for learning; as a result, it aims at transforming the original features into a new feature space of lower dimensionality.
The Fisher criterion [6] [22] [9] plays an important role in dimensionality reduction. It aims at finding a feature representation by which the within-class distance is minimized and the between-class distance is maximized. Based on the Fisher criterion, two representative methods have been proposed. One is Fisher score [22], which is a feature selection method. The other is Linear Discriminant Analysis (LDA) [6] [22] [9], which is a subspace learning method. Although there are many other feature selection methods [8] [10] [23], Fisher score is still among the state of the art [29]. LDA has achieved great success in face recognition [2], where it is known as Fisherface. In the past decades, both Fisher score and LDA have been studied extensively [10] [20] [26] [24] [4] [17] [5]. However, these works study Fisher score or LDA in isolation, ignoring the close relation between them.
In this paper, we propose to study Fisher score and LDA together. The key motivation is that, although it is based on the Fisher criterion, Fisher score is not able to combine features as LDA does. The features selected by Fisher score are a subset of the original features. However, as mentioned before, transformed features may be more discriminative than the original ones. On the other hand, although LDA admits feature combination, it transforms all the original features rather than only the useful ones, as Fisher score does. Furthermore, since LDA uses all the features, the resulting transformation is often difficult to interpret. Fisher score and LDA are thus complementary to some extent. If we combine Fisher score and LDA in a systematic way, they could mutually enhance each other. One intuitive way is to perform Fisher score before LDA as a two-stage approach. However, since these two stages are conducted individually, the whole process is likely to be suboptimal. This motivates us to integrate Fisher score and LDA in a principled way so that they complement each other.
Based on the above motivation, we propose a unified framework, namely Linear Discriminant Dimensionality Reduction (LDDR), integrating Fisher score and LDA. In detail, we aim at finding a subset of features, based on which the linear transformation learnt via LDA maximizes the Fisher criterion. LDDR performs feature selection and subspace learning simultaneously based on the Fisher criterion. It inherits the advantages of Fisher score and LDA and overcomes their individual disadvantages. Hence it is able to discard the irrelevant features and transform the relevant ones simultaneously. Both Fisher score and LDA can be seen as special cases of LDDR. The resulting optimization problem is a mixed integer program [3], which is difficult to solve. We relax it into an L2,1-norm constrained least squares problem and solve it by an accelerated proximal gradient descent algorithm [18]. It is worth noting that the L2,1-norm has already been successfully applied in Group Lasso [28], multi-task feature learning [1] [14], and joint covariate selection and joint subspace selection [21]. Experiments on benchmark face recognition data sets demonstrate the effectiveness of the proposed approach.
The remainder of this paper is organized as follows. In Section 2, we briefly review Fisher score and LDA. In Section 3, we present a framework for joint feature selection and subspace learning.
1.1 Notations
Given a data set consisting of n data points {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ R^d and y_i ∈ {1, 2, . . . , c} denotes the class label of the i-th data point. The data matrix is denoted by X = [x_1, . . . , x_n] ∈ R^{d×n}, and the linear transformation matrix is denoted by W ∈ R^{d×m}, projecting the input data into an m-dimensional subspace. Given a matrix W ∈ R^{d×m}, we denote the i-th row of W by w^i and the j-th column of W by w_j. The Frobenius norm of W is defined as ||W||_F = (Σ_{i=1}^{d} ||w^i||_2^2)^{1/2}, and the L2,1-norm of W is defined as ||W||_{2,1} = Σ_{i=1}^{d} ||w^i||_2. 1 denotes a vector of all ones.
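For concreteness, both norms written out in NumPy (a small illustrative snippet):

import numpy as np

W = np.random.randn(5, 3)                 # W in R^{d x m}
row_norms = np.linalg.norm(W, axis=1)     # ||w^i||_2 for each row i
fro = np.sqrt(np.sum(row_norms ** 2))     # Frobenius norm ||W||_F
l21 = np.sum(row_norms)                   # L2,1-norm: sum of row norms
assert np.isclose(fro, np.linalg.norm(W, 'fro'))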
Linear discriminant analysis (LDA) [6] [22] [9] is a supervised subspace learning method based on the Fisher criterion. It aims to find a linear transformation W ∈ R^{d×m} that maps x_i from the d-dimensional space to an m-dimensional space in which the between-class scatter is maximized while the within-class scatter is minimized, i.e.,

arg max_W tr((W^T S_w W)^{-1} (W^T S_b W)),   (1)
where S_b and S_w are the between-class scatter matrix and the within-class scatter matrix respectively, which are defined as

S_b = Σ_{k=1}^{c} n_k (μ_k − μ)(μ_k − μ)^T,   S_w = Σ_{k=1}^{c} Σ_{i∈C_k} (x_i − μ_k)(x_i − μ_k)^T,   (2)
where C_k is the index set of the k-th class, μ_k and n_k are the mean vector and size of the k-th class respectively in the input data space, i.e., X, and μ = (1/n) Σ_{k=1}^{c} n_k μ_k is the overall mean vector of the original data. It is easy to show that Eq. (1) is equivalent to

arg max_W tr((W^T S_t W)^{-1} (W^T S_b W)),   (3)
Note that S_t = S_w + S_b.
According to [6], when the total scatter matrix S_t is non-singular, the solution of Eq. (3) consists of the top eigenvectors of the matrix S_t^{-1} S_b corresponding to nonzero eigenvalues. When the total scatter matrix S_t does not have full rank, the solution of Eq. (3) consists of the top eigenvectors of the matrix S_t^† S_b corresponding to nonzero eigenvalues, where S_t^† denotes the pseudo-inverse of S_t [7]. Note that when S_t is nonsingular, S_t^† equals S_t^{-1}.
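A compact NumPy sketch of this eigenvector solution, with the scatter matrices computed as in Eq. (2) (illustrative code, not the authors' implementation):

import numpy as np

def lda_transform(X, y, m):
    # X: d x n data matrix, y: length-n integer labels, m: target dimension
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[:, y == k]
        mk = Xk.mean(axis=1, keepdims=True)
        Sb += Xk.shape[1] * (mk - mu) @ (mk - mu).T
        Sw += (Xk - mk) @ (Xk - mk).T
    St = Sb + Sw
    # top eigenvectors of pinv(St) @ Sb (non-symmetric, so use eig)
    evals, evecs = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    order = np.argsort(-evals.real)[:m]
    return evecs[:, order].real           # W in R^{d x m}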
LDA has been successfully applied to face recognition [2]. Following LDA, many extensions have been proposed, e.g., Uncorrelated LDA and Orthogonal LDA [26], Local LDA [24], Semi-supervised LDA [4] and Sparse LDA [17] [5]. Note that all these methods share the weakness of using all the original features to learn the subspace.
where μ̃_k and n_k are the mean vector and size of the k-th class respectively in the reduced data space, i.e., Z, and μ̃ = (1/n) Σ_{k=1}^{c} n_k μ̃_k is the overall mean vector of the reduced data. Note that there are (d choose m) candidate Z's out of X, hence Fisher score is a combinatorial optimization problem.
We introduce an indicator variable p, where p = (p_1, . . . , p_d)^T and p_i ∈ {0, 1}, i = 1, . . . , d, to represent whether a feature is selected or not. In order to indicate that m features are selected, we constrain p by p^T 1 = m. Then the Fisher score in Eq. (5) can be equivalently formulated as follows,
where diag(p) is a diagonal matrix whose diagonal elements are the p_i's, and S_b and S_t are the between-class scatter matrix and the total scatter matrix, defined as in Eq. (2) and Eq. (4).
As can be seen, like other feature selection approaches [8], Fisher score only performs binary feature selection. It does not admit feature combination as LDA does.
Based on the above discussion, we can see that LDA suffers from a problem that Fisher score does not have, while Fisher score has a limitation that LDA does not have. Hence, if we integrate LDA and Fisher score in a systematic way, they could complement and benefit from each other. This motivates the method proposed in this paper.
which is a mixed integer program [3]. Eq. (8) is called Linear Discriminant Dimensionality Reduction (LDDR) because it is able to perform feature selection and subspace learning simultaneously. It inherits the advantages of Fisher score and LDA: it is able to find a subset of useful original features, based on which it generates new features by feature transformation. Given p = 1, Eq. (8) reduces to LDA as in Eq. (3). Letting W = I, Eq. (8) degenerates to Fisher score as in Eq. (7). Hence, both LDA and Fisher score can be seen as special cases of the proposed method. In addition, the objective functions corresponding to LDA and Fisher score are lower bounds of the objective function of LDDR.
Recent studies [9] [27] established the relationship between LDA and the multivariate linear regression problem, which provides a regression-based solution for LDA. This motivates us to solve the problem in Eq. (8) in a similar manner. In the following, we present a theorem which establishes the equivalence between the problem in Eq. (8) and the problem in Eq. (9).
Theorem 1. The optimal p that maximizes the problem in Eq. (8) is the same
as the optimal p that minimizes the following problem
arg min_{p,W} (1/2) ||X^T diag(p) W − H||_F^2
s.t. p ∈ {0, 1}^d, p^T 1 = m,   (9)
where H ∈ R^{n×c} is defined as

h_{ik} = √(n/n_k) − √(n_k/n), if y_i = k;   h_{ik} = −√(n_k/n), otherwise.   (10)
In addition, the optimal W_1 of Eq. (8) and the optimal W_2 of Eq. (9) have the following relation
W_2 = [W_1, 0] Q^T,   (11)
where Q is an orthogonal matrix, under the mild condition that
rank(S_t) = rank(S_b) + rank(S_w).   (12)
Proof. Due to space limitations, we only sketch the proof. On the one hand, given the optimal W, the optimization problem in Eq. (8) with respect to p is equivalent to the optimization problem in Eq. (9) with respect to p. On the other hand, for any feasible p, the optimal W that maximizes the problem in Eq. (8) and the optimal W that minimizes the problem in Eq. (9) satisfy the relation in Eq. (11), according to Theorem 5.1 in [27]. The detailed proof will be included in the longer version of this paper.
Note that the above theorem holds under the condition that X is centered to have zero mean. Since rank(S_t) = rank(S_b) + rank(S_w) holds in many applications involving high-dimensional, under-sampled data, the theorem is widely applicable in practice.
According to Theorem 1, the difference between W_1 and W_2 is the orthogonal matrix Q. Since the Euclidean distance is invariant to any orthogonal transformation, a classifier based on the Euclidean distance (e.g., K-Nearest-Neighbor or a linear support vector machine [9]) applied to the dimensionality-reduced data obtained by W_1 and W_2 will achieve the same classification result. In our experiments, we use the K-Nearest-Neighbor classifier.
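A small NumPy sketch constructing H as in Eq. (10), following the square-root form of the least-squares LDA target of [27] as reconstructed above (illustrative only):

import numpy as np

def build_H(y):
    y = np.asarray(y)
    n = len(y)
    classes, counts = np.unique(y, return_counts=True)
    H = np.zeros((n, len(classes)))
    for j, (c, nk) in enumerate(zip(classes, counts)):
        H[:, j] = -np.sqrt(nk / n)                        # default entry
        H[y == c, j] = np.sqrt(n / nk) - np.sqrt(nk / n)  # entry if y_i = k
    return H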
Suppose we find the optimal solution of Eq. (9), i.e., W∗ and p∗; then p∗ is a binary vector, and diag(p∗)W∗ is a matrix in which the elements of many rows are all zeros. This motivates us to absorb the indicator variable p into W and use the L2,0-norm on W to achieve feature selection, leading to the following problem

arg min_W (1/2) ||X^T W − H||_F^2,
s.t. ||W||_{2,0} ≤ m.   (13)
Relaxing the L2,0-norm constraint in Eq. (13) to an L2,1-norm constraint yields Eq. (14). Note that Eq. (14) is no longer equivalent to Eq. (8) due to the relaxation. However, the relaxation makes the optimization problem computationally much easier. In this sense, the relaxation can be seen as a tradeoff between strict equivalence and computational tractability.
Eq. (14) is equivalent to the following regularized problem,

arg min_W (1/2) ||X^T W − H||_F^2 + μ ||W||_{2,1},   (15)
In our problem, ∇f(W_t) = XX^T W_t − XH. The idea behind this formulation is that if the optimization problem in Eq. (17) can be solved by exploiting the structure of the L2,1-norm, then the convergence rate of the resulting algorithm is the same as that of the gradient descent method, i.e., O(1/ε), since no approximation of the non-smooth term is employed. It is worth noting that proximal gradient descent can also be understood from the perspective of auxiliary function optimization [15].
By ignoring the terms in G_{η_t}(W, W_t) that are independent of W, the optimization problem in Eq. (17) boils down to

W_{t+1} = arg min_W (1/2) ||W − (W_t − (1/η_t) ∇f(W_t))||_F^2 + (μ/η_t) ||W||_{2,1}.   (19)

For the sake of simplicity, we denote U_t = W_t − (1/η_t) ∇f(W_t); then Eq. (19) takes the following form

W_{t+1} = arg min_W (1/2) ||W − U_t||_F^2 + (μ/η_t) ||W||_{2,1},   (20)
Thus, the proximal gradient descent in Eq. (17) has the same convergence rate of O(1/ε) as gradient descent for smooth problems.
V_{t+1} = W_t + ((α_t − 1)/α_{t+1}) (W_{t+1} − W_t),   (23)

where the sequence {α_t}_{t≥1} is conventionally set to α_{t+1} = (1 + √(1 + 4α_t²))/2. For more detail, please refer to [13]. Here we directly present the final algorithm for optimizing Eq. (15) in Algorithm 1.
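A minimal NumPy sketch of this procedure, assuming a fixed step size η for simplicity (the paper determines η_t via the line-search condition on G_{η_t}); the prox step of Eq. (20) admits the closed-form row-wise shrinkage used below, and the momentum step follows the accelerated update of [13]:

import numpy as np

def prox_l21(U, tau):
    # solves min_W 0.5||W - U||_F^2 + tau ||W||_{2,1} row by row
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12)) * U

def lddr_apg(X, H, mu, eta, iters=200):
    d, c = X.shape[0], H.shape[1]
    W = V = np.zeros((d, c))
    alpha = 1.0
    for _ in range(iters):
        grad = X @ (X.T @ V) - X @ H          # grad f(V) = X X^T V - X H
        W_next = prox_l21(V - grad / eta, mu / eta)
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
        V = W_next + ((alpha - 1.0) / alpha_next) * (W_next - W)
        W, alpha = W_next, alpha_next
    return W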
The convergence of this algorithm is stated in the following theorem.
Theorem 2 ([19]). Let {W_t} be the sequence generated by Algorithm 1; then for any t ≥ 1 we have

F(W_t) − F(W∗) ≤ 2γL ||W_1 − W∗||_F^2 / (t + 1)²,   (24)

where L is the Lipschitz constant of the gradient of f(W) in the objective function and W∗ = arg min_W F(W).
Theorem 2 shows that the convergence rate of the accelerated proximal gradient descent method is O(1/√ε).
4 Related Work
In this section, we discuss some approaches which are closely related to our
method.
In order to pursue sparsity and interpretability in LDA, [17] proposed both exact and greedy algorithms for binary-class sparse LDA as well as its spectral bound. For the multi-class problem, [5] proposed a sparse LDA (SLDA) based on ℓ1-norm regularized Spectral Regression,

arg min_w ||X^T w − y||_2^2 + μ ||w||_1,   (25)
where Σ_{i=1}^{d} ||a^i||_∞ is the ℓ1/ℓ∞ norm of W. The optimization problem is convex and solved by a quasi-Newton method [3]. Although LDFS involves a structured sparse transformation matrix W as in our method, it uses it to select features rather than to perform feature selection and transformation together. Hence it is fundamentally a feature selection method. In comparison, our method uses the structured sparse transformation matrix for both feature selection and combination.
5 Experiments
In this section, we evaluate the proposed method, i.e., LDDR, and compare
it with the state of the art subspace learning methods, e.g. PCA, LDA and
Locality Preserving Projection (LPP) [12], sparse LDA (SLDA) [5]. We also
compare it with the feature selection methods, e.g., Fisher score (FS) and Linear
Discriminant Feature Selection (LDFS) [16]. Moreover, we study Fisher score
followed with LDA (FS+LDA), which is the most intuitive way to conduct Fisher
score and LDA together. We use K-Nearest Neighbor classifier where K = 1 as
the baseline method. All the experiments were performed in Matlab on a Intel
Core2 Duo 2.8GHz Windows 7 machine with 4GB memory.
The experimental results are shown in Table 1 and Table 2. We observe that: (1) in some cases Fisher score is better than LDA, while in more cases LDA outperforms Fisher score, which suggests that feature transformation may be more essential than feature selection; (2) LDFS is worse than Fisher score on the ORL data set, while it is better than Fisher score on the Yale-B data set, which indicates that the performance gain from doing feature selection under a slightly different criterion is limited; (3) SLDA is better than LDA, which implies sparsity is able to improve the classification performance of LDA; (4) FS+LDA improves on both FS and LDA in most cases, and is even better than SLDA in some cases, which indicates the potential performance gain of combining Fisher score and LDA. However, in some cases FS+LDA is not as good as FS or LDA. This is because Fisher score and LDA are conducted individually in FS+LDA, so the features selected by Fisher score are not necessarily useful for LDA; (5) LDDR outperforms FS, LDA, SLDA and FS+LDA consistently and overwhelmingly, which indicates that by performing Fisher score and LDA simultaneously to maximize the Fisher criterion, Fisher score and LDA can greatly enhance each other. The features selected by LDDR should thus be more useful than those selected by Fisher score; we illustrate this point later.
We are also interested in the features selected by LDDR. We plot the top 50 se-
lected features (pixels) of our method and Fisher score on the ORL and Yale-B
data sets in Fig. 3 and Fig. 4 respectively. It is shown that the distribution of se-
lected features (pixels) by Fisher score is highly skewed. Most features distribute
in only one or two regions. Many features even reside on the non-face region.
It implies that the features selected by Fisher score are not discriminative. In
contrast, the features selected by LDDR distribute widely across the face region.
From another perspective, we can see that the features (pixels) selected by LDDR are asymmetric. In other words, if one pixel is selected, its axis-symmetric
Fig. 1. The linear transformation matrix learned by (a) LDA, (b) SLDA (μ = 50) and
(c) LDDR (μ = 0.5) with 3 training samples per person on the ORL database. For
better viewing, please see it in color pdf file.
Fig. 2. The linear transformation matrix learned by (a) LDA, (b) SLDA (μ = 50) and
(c) LDDR (μ = 0.5) with 20 training samples per person on the Yale-B database. For
better viewing, please see it in color pdf file.
Fig. 3. Selected features (marked by blue cross) by (a) Fisher score and (b) LDDR
(μ = 0.5) with 3 training samples per person on the ORL database. For better viewing,
please see it in color pdf file.
one will not be selected. This is because a face image is roughly axis-symmetric, so one pixel of an axis-symmetric pair is redundant given that the other is selected. Moreover, the selected pixels are mostly around the eyebrows, the boundaries of the eyes, the nose and the cheeks, which are discriminative for distinguishing face images of different people. This accords with common sense.
Fig. 4. Selected features (marked by blue cross) by (a) Fisher score and (b) LDDR
(μ = 0.5)with 20 training samples per person on the Yale-B database. For better
viewing, please see it in color pdf file.
6 Conclusion
In this paper, we propose to integrate Fisher score and LDA in a unified frame-
work, namely Linear Discriminant Dimensionality Reduction. We aim at finding
a subset of features, based on which the learnt linear transformation via LDA
maximizes the Fisher criterion. LDDR inherits the advantages of Fisher score
and LDA and is able to do feature selection and subspace learning simultane-
ously. Both Fisher score and LDA can be seen as special cases of the proposed method. The resulting optimization problem is relaxed into an L2,1-norm constrained least squares problem and solved by an accelerated proximal gradient descent algorithm. Experiments on benchmark face recognition data sets illustrate the efficacy of the proposed framework.
References
1. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Ma-
chine Learning 73(3), 243–272 (2008)
2. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces:
Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach.
Intell. 19(7), 711–720 (1997)
3. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press,
Cambridge (2004)
4. Cai, D., He, X., Han, J.: Semi-supervised discriminant analysis. In: ICCV, pp. 1–7
(2007)
5. Cai, D., He, X., Han, J.: Spectral regression: A unified approach for sparse subspace
learning. In: ICDM, pp. 73–82 (2007)
6. Fukunaga, K.: Introduction to statistical pattern recognition, 2nd edn. Academic
Press Professional, Inc., San Diego (1990)
7. Golub, G.H., Loan, C.F.V.: Matrix computations, 3rd edn. Johns Hopkins Univer-
sity Press, Baltimore (1996)
8. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal
of Machine Learning Research 3, 1157–1182 (2003)
564 Q. Gu, Z. Li, and J. Han
9. Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning.
Springer, Heidelberg (2001)
10. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: NIPS (2005)
11. He, X., Cai, D., Yan, S., Zhang, H.: Neighborhood preserving embedding. In: ICCV,
pp. 1208–1213 (2005)
12. He, X., Niyogi, P.: Locality preserving projections. In: NIPS (2003)
13. Ji, S., Ye, J.: An accelerated gradient method for trace norm minimization. In:
ICML, p. 58 (2009)
14. Liu, J., Ji, S., Ye, J.: Multi-task feature learning via efficient l2,1 -norm minimiza-
tion. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial
Intelligence, UAI 2009 (2009)
15. Luo, D., Ding, C.H.Q., Huang, H.: Towards structural sparsity: An explicit l2/l0
approach. In: ICDM, pp. 344–353 (2010)
16. Masaeli, M., Fung, G., Dy, J.G.: From transformation-based dimensionality reduc-
tion to feature selection. In: ICML, pp. 751–758 (2010)
17. Moghaddam, B., Weiss, Y., Avidan, S.: Generalized spectral bounds for sparse lda.
In: ICML, pp. 641–648 (2006)
18. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Pro-
gram. 103(1), 127–152 (2005)
19. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course.
Kluwer Academic Publishers, Dordrecht (2003)
20. Nie, F., Xiang, S., Jia, Y., Zhang, C., Yan, S.: Trace ratio criterion for feature
selection. In: AAAI, pp. 671–676 (2008)
21. Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection and joint subspace
selection for multiple classification problems. Statistics and Computing 20, 231–252
(2010)
22. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience
Publication, Hoboken (2001)
23. Song, L., Smola, A.J., Gretton, A., Borgwardt, K.M., Bedo, J.: Supervised feature
selection via dependence estimation. In: ICML, pp. 823–830 (2007)
24. Sugiyama, M.: Local fisher discriminant analysis for supervised dimensionality re-
duction. In: ICML, pp. 905–912 (2006)
25. Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., Lin, S.: Graph embedding and
extensions: A general framework for dimensionality reduction. IEEE Trans. Pattern
Anal. Mach. Intell. 29(1), 40–51 (2007)
26. Ye, J.: Characterization of a family of algorithms for generalized discriminant anal-
ysis on undersampled problems. Journal of Machine Learning Research 6, 483–502
(2005)
27. Ye, J.: Least squares linear discriminant analysis. In: ICML, pp. 1087–1093 (2007)
28. Yuan, M., Lin, Y.: Model selection and estimation in regression
with grouped variables. Journal of the Royal Statistical Society, Series B 68, 49–67
(2006)
29. Zhao, Z., Wang, L., Liu, H.: Efficient spectral feature selection with minimum
redundancy. In: AAAI (2010)
DB-CSC: A Density-Based Approach for
Subspace Clustering in Graphs with Feature
Vectors
1 Introduction
In the past few years, data sources representing attribute information in com-
bination with network information have become more numerous. Such data
describes single objects via attribute vectors and also relationships between dif-
ferent objects via edges. Examples include social networks, where friendship
relationships are available along with the users’ individual interests (cf. Fig. 1);
systems biology, where interacting genes and their specific expression levels are
recorded; and sensor networks, where connections between the sensors as well
as individual measurements are given. There is a need to extract knowledge
from such complex data sources, e.g. for finding groups of homogeneous objects,
i.e. clusters. Throughout the past decades, a multitude of clustering techniques
were introduced that either solely consider attribute information or network in-
formation. However, simply applying one of these techniques misses the potential
given by such combined data sources. To detect more informative patterns it is
preferable to simultaneously consider relationships together with attribute in-
formation. In this work, we focus on the mining task of clustering to extract
meaningful groups from such complex data.
Fig. 1. Clusters in a social network with attributes age [years], TV consumption [h/week] and web consumption
objects are lost. In Fig. 1(b) an extract of the corresponding network structure
is shown. Since the restrictive quasi-clique property would only assign a low
density to this cluster, the cluster would probably be split by [9].
In this work, we combine dense subgraph mining with subspace clustering based on a more sophisticated cluster definition, thus overcoming the drawbacks of previous approaches. Established for other data types, density-based notions of
clusters have shown their strength in many cases. Thus, we introduce a density-
based clustering principle for the considered combined data sources. Our clusters
correspond to dense regions in the attribute space as well as in the graph. Based
on local neighborhoods taking the attribute similarity in subspaces as well as
the graph information into account, we model the density of single objects. By
merging all objects located in the same dense region, the overall clusters are
obtained. Thus, our model is able to detect the clusters in Fig. 1 correctly.
Besides the sound definition of clusters based on density values, we achieve a
further advantage. In contrast to previous approaches, the clusters in our model
are not limited in their size or shape but can show arbitrary shapes. For example,
the diameter of the clusters detected in [9] is a priori restricted by a parameter-
dependent value [17] leading to a bias towards clusters of small size and little
extent. Such a bias is avoided in our model, where the sizes and shapes of clusters
are automatically detected. Overall, our contributions are:
2 Related Work
γ-quasi-cliques [17], and k-cores [5,12]. However, none of these dense subgraph mining techniques considers attribute data annotated to the vertices.
There are some methods that consider graph data and attribute data together. In [6,15] attribute data is only used in a post-processing step. [10] transforms the network into a distance function and afterwards applies traditional clustering. In [18], contrarily, the attribute information is transformed into a graph. [7] extends the k-center problem by requiring that each group be a connected subgraph. These approaches [10,18,7] perform full-space clustering on the attributes. [19,20] enrich the graph with further nodes based on the vertices' attribute values and connect them to vertices showing these values. The clustered objects are only pairwise similar and no specific relevant dimensions can be identified.
Recently, two approaches [16,9] were introduced that combine subspace clustering and dense subgraph mining. However, both approaches use overly simple cluster definitions. Similar to grid-based subspace clustering [2], a cluster (w.r.t. the attributes) is simply defined by taking all objects located within a given grid cell, i.e. whose attribute values differ by at most a given threshold. The methods are thus biased towards small clusters with little extent. This drawback is even worsened by the notions of dense subgraphs used: e.g. by using quasi-cliques as in [9], the diameter is a priori constrained to a fixed threshold [17]. Very similar objects located just slightly next to a cluster are lost due to such restrictive models. Overall, finding meaningful patterns based on the cluster definitions of [16,9] is questionable. Furthermore, the method in [16] does not eliminate redundancy, which usually occurs in the analysis of subspace projections due to the exponential search space containing highly similar clusters.
Our novel model is the first approach using a density-based notion of clusters for the combination of subspace clustering and dense subgraph mining. This allows arbitrarily shaped and arbitrarily sized clusters, hence leading to an unbiased definition of clusters. We remove redundant clusters induced by similar subspace projections, resulting in meaningful result sizes.
In Fig. 2 the red triangles would not be dense anymore because their simple
combined neighborhoods are empty. However, using just the adjacent vertices
leads to a too restrictive cluster model, as the next example in Fig. 4 shows.
Assuming that each vertex has to contain 3 objects in its neighborhood (in-
cluding the object itself) to be dense, we get two densely connected sets, i.e. two
clusters, in Fig. 4(a). In Fig. 4(b), we have the same vertex set, the same set
of attribute vectors and the same graph density. The example only differs from
the first one by the interchange of the attribute values of the vertices v3 and
v4 , which both belong to the same cluster in the first example. Intuitively, this
set of vertices should also be a valid cluster in our definition. However, it is not, because the neighborhood of v_2 contains just the vertices {v_1, v_2}. The vertex v_4 is not considered since it is similar w.r.t. the attributes but not adjacent.
Fig. 2. Dense region in the attribute space but sparse region in the graph
Fig. 3. Novel challenge by using local density computation and k-neighborhoods
Fig. 4. (a) Successful with k ≥ 1, minPts = 3; (b) successful with k ≥ 2, minPts = 3

The missing tolerance w.r.t. interchanges of the attribute values is one problem induced by using just adjacent vertices. Furthermore, this approach would not be tolerant w.r.t. small errors in the edge set. For example, in social networks,
some friendship links are not present in the current snapshot although the people
are aware of each other. Such errors should not prevent a good cluster detection.
Thus, in our approach we consider all vertices that are reachable over at most k
edges to obtain a more error-tolerant model. Formally, the neighborhood w.r.t.
the graph data is given by:
Definition 1 (Graph k-neighborhood). A vertex u is k-reachable from a vertex v (over a set of vertices V) if
∃ v_1, . . . , v_k ∈ V : v_1 = v ∧ v_k = u ∧ ∀ i ∈ {1, . . . , k − 1} : (v_i, v_{i+1}) ∈ E.
The graph k-neighborhood of a vertex v ∈ V is given by
N_k^V(v) = {u ∈ V | u is x-reachable from v (over V) ∧ x ≤ k} ∪ {v}.
Please note that the object v itself is contained in its neighborhood N_k^V(v) as well. Overall, the combined neighborhood of a vertex v ∈ V, considering graph and attribute data, can be formalized by intersecting v's graph k-neighborhood with its ε-neighborhood.
Definition 2 (Combined local neighborhood). The combined neighborhood of v ∈ V is:
N^V(v) = N_k^V(v) ∩ N_ε^V(v)
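A small Python sketch of this combined neighborhood (illustrative data structures: adj maps a vertex to its neighbors, attr maps a vertex to its attribute vector):

from collections import deque

def combined_neighborhood(v, O, adj, attr, k, eps, dims):
    # graph k-neighborhood N_k^O(v): BFS over at most k edges inside O
    reach, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == k:
            continue
        for w in adj[u]:
            if w in O and w not in reach:
                reach.add(w)
                frontier.append((w, depth + 1))
    # intersect with the eps-neighborhood in subspace dims (maximum norm)
    return {u for u in reach
            if max(abs(attr[u][i] - attr[v][i]) for i in dims) <= eps}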
Using the combined neighborhood N^V(v) and k ≥ 2, we get the same two clusters in Fig. 4(a) and Fig. 4(b). In both examples v_2's neighborhood contains 3 vertices, e.g. N^V(v_2) = {v_1, v_2, v_4} in Fig. 4(b). So to speak, we “jump over” the vertex v_3 to find further vertices that are similar to v_2 in the attribute space.
Please note that the neighborhood calculation N^O(v) is always done w.r.t. the set O and not w.r.t. the whole graph V. Based on this definition and by using minPts = 3, k ≥ 2, we can e.g. detect the three clusters from Fig. 3. By using a too small value for k (as for example k = 1 in Fig. 4(b)), clusters are often split or not detected at all. On the other hand, if we choose k too high, we run the risk that the graph structure is not adequately considered any more. However, for the case depicted in Fig. 3 our model detects the correct clusters even for arbitrarily large k values, as the cluster on the right-hand side is clearly separated from the other clusters by its attribute values. The two clusters on the left-hand side are never merged by our model, as we do not allow jumps over vertices outside the cluster.
In this section we introduced our combined cluster model for the case of a single subspace. As the examples show, the model can detect clusters that are dense in the graph as well as in the attribute space and that often cannot be detected by previous approaches.
In this section we extend our cluster model to a subspace clustering model. Besides the adapted cluster definition, we have to take care of redundancy problems due to the exponentially many subspace projections. As mentioned in the last section, we are using the maximum norm in the attribute space. If we want to analyze subspace projections, we can simply define the maximum norm restricted to a subspace S ⊆ Dim as
dist_S(x, y) = max_{i∈S} |x[i] − y[i]|.
In principle, any L_p norm can be restricted in this way and used within our model. Based on this distance function, we can define a subspace cluster which fulfills the cluster properties in just a subset of the dimensions:
As we can show, our subspace clusters have the anti-monotonicity property: for a subspace cluster C = (O, S) and every S′ ⊆ S there exists a vertex set O′ ⊇ O such that (O′, S′) is a valid cluster. This property is used in our algorithm to find the valid clusters more efficiently.
Proof. For every two subspaces S, S′ with S′ ⊆ S and every pair of vertices u, v it holds that dist_{S′}(l(u), l(v)) ≤ dist_S(l(u), l(v)). Thus for every vertex v ∈ O we get N_{S′}^O(v) ⊇ N_S^O(v). Accordingly, properties (1) and (2) from Def. 3 are fulfilled by (O, S′). If (O, S′) is maximal w.r.t. these properties, then (O, S′) is a valid combined subspace cluster. Otherwise, by definition there exists a vertex set O′ ⊃ O such that (O′, S′) is a valid subspace cluster.
Fig. 5. Finding clusters by (minPts−1)-cores in the enriched graphs (k=2, minPts=3)
Two vertices are adjacent in the enriched subgraph iff their attribute values are similar in S and the vertices are connected by at most k edges in the original graph (using just vertices from O). In Fig. 5(b) the enriched subgraph for the whole set of vertices V is computed, while Fig. 5(c) considers just the subsets O_1 and O_2, respectively. In this graph we concentrate on the detection of (minPts−1)-cores, which are defined as maximal connected subgraphs O_i ⊆ V in which all vertices have at least degree (minPts−1). We show:
Theorem 1 (Equivalence of representations). Let O ⊆ V be a set of vertices and S ⊆ Dim a subspace. C = (O, S) fulfills properties (1) and (2) of Definition 3 if and only if the enriched subgraph G_S^O = (O, E′) contains a single (minPts−1)-core that covers all vertices in O.
The theorem implies that our algorithm only has to analyze vertices that potentially lead to (minPts−1)-cores. The important observation is: if G_S^O is a (minPts−1)-core, then for each graph G_S^{O′} with O′ ⊇ O the set O will also be contained in a (minPts−1)-core. (This holds since N_S^{O′}(u) ⊇ N_S^O(u) and hence G_S^{O′} contains all edges of G_S^O.) Thus, each potential cluster (O, S) in particular has to be contained within a (minPts−1)-core of the graph G_S^V. Overall, we first extract from G_S^V all (minPts−1)-cores, since only these sets can lead to valid clusters. In Fig. 5(b) these sets are highlighted.
However, keep in mind that not all (minPts−1)-cores correspond to valid clusters. Theorem 1 requires that a single (minPts−1)-core covering all vertices is induced by the enriched subgraph. Figure 5(b), for example, contains two cores. As already discussed, the left set O_1 is not a valid cluster but has to be split up.
Thus, if the graph G_S^V contains a single (minPts−1)-core O_1 with O_1 = V, we get a valid cluster and the cluster detection is finished in this subspace. In the other cases, however, we recursively have to repeat the procedure for each (minPts−1)-core {O_1, . . . , O_m} detected in G_S^V, i.e. we determine the smaller graphs G_S^{O_i} and their contained (minPts−1)-cores. Since in each step the (maximal) (minPts−1)-cores are analyzed and refined, we ensure, besides properties (1) and (2), due to Theorem 1, also property (3) of Definition 3. Formally, the set of resulting clusters Clus = {C_1, . . . , C_m} corresponds to a fixpoint of the function
f(Clus) = {(O′, S) | O′ is a (minPts−1)-core in G_S^{O_i} with C_i = (O_i, S) ∈ Clus}.
This fixpoint can be reached by f(f(. . . f({(V, S)}))) = Clus. Overall, this procedure enables us to detect all combined clusters in the subspace S.
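The core extraction underlying this fixpoint iteration can be sketched as iterative peeling plus a component split (illustrative Python; enriched_adj(O) is assumed to return the adjacency of the enriched subgraph G_S^O):

def min_pts_cores(O, enriched_adj, min_pts):
    adj = {u: set(vs) & O for u, vs in enriched_adj(O).items() if u in O}
    changed = True
    while changed:  # peel vertices of degree < minPts - 1 until stable
        low = [u for u in adj if len(adj[u]) < min_pts - 1]
        changed = bool(low)
        for u in low:
            for w in adj[u]:
                if w in adj:
                    adj[w].discard(u)
            del adj[u]
    comps, seen = [], set()  # split survivors into connected components
    for u in adj:
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps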
method: main()
1:  Result = ∅  // current result set
2:  queue = ∅  // priority queue with clusters and subtrees, descendingly sorted by quality
3:  for d ∈ Dim do DFS_traversal({d}, V, ∅)
4:  while queue ≠ ∅ do
5:      remove first (highest-quality) object Obj from queue
6:      if Obj is a cluster then  // check redundancy
7:          for C ∈ Result do if (Obj ≺_red C) goto line 4  // discard redundant cluster
8:          Result = Result ∪ {Obj}  // cluster is non-redundant
9:      else  // Obj is a subtree ST = (S, O, Q_max, Parents, Parents_red)
10:         if Parents_red ∩ Result ≠ ∅ then goto line 4  // discard whole subtree
11:         else DFS_traversal(S, O, Parents)  // subtree is non-redundant, restart traversal
12: return Result

method: DFS_traversal(subspace S, candidate vertices O, parent clusters Parents)
13: foundClusters = ∅, prelimClusters = {O}
14: while prelimClusters ≠ ∅ do
15:     remove first candidate O_x from prelimClusters
16:     generate enriched subgraph G_S^{O_x}
5 Experimental Evaluation
Setup. We compare DB-CSC to GAMer [9] and CoPaM [16], two approaches
that combine subspace clustering and dense subgraph mining. In our experiments
we use real world data sets as well as synthetic data. By default the synthetic
datasets have 20 attribute dimensions and contain 80 combined clusters each
with 15 nodes and 5 relevant dimensions. Additionally we add random nodes
and edges to represent noise in the data. The clustering quality is measured
by the F1 measure [9], which compares the detected clusters to the “hidden”
clusters. The efficiency is measured by the algorithms’ runtime.
Varying characteristics of the data. In the first experiment (Fig. 7(a)) we
vary the database size of our synthetic datasets by varying the number of gener-
ated combined clusters. The runtime of all algorithms increases with increasing
database size (please note the logarithmic scale on both axes). For the datasets
Fig. 7. Quality (top row; F1 value) and runtime (bottom row; runtime [sec]) w.r.t. varying data characteristics for DB-CSC, GAMer and CoPaM: (a) varying database size (# vertices), (b) varying cluster dimensionality (% of full space), (c) varying cluster size (# vertices per cluster)
with more than 7000 vertices, CoPaM is not applicable any more due to heap
overflows (4GB). While the runtimes of the different algorithms are very similar,
in terms of clustering quality DB-CSC obtains significantly better results than
the other approaches. The competing approaches tend to output only subsets of
the hidden clusters due to their restrictive cluster models. In the next experi-
ment (Fig. 7(b)) we vary the dimensionality of the hidden clusters. The runtime
of all algorithms increases for higher dimensional clusters. The clustering quali-
ties of DB-CSC and CoPaM slightly decrease. This can be explained by the fact
that for high dimensional clusters it is likely that additional clusters occur in
subsets of the dimensions. However, DB-CSC still has the best clustering quality
and runtime in this experiment. In Fig. 7(c) the cluster size (i.e. the number of
vertices per cluster) is varied. The runtimes of DB-CSC and GAMer are very similar to each other, whereas the runtime of CoPaM increases dramatically until it is no longer applicable. The clustering quality of DB-CSC remains relatively stable while the qualities of the other approaches decrease steadily. For increasing cluster sizes, the expansion of the clusters in the graph as well as in the attribute space increases; thus the restrictive cluster models of GAMer and CoPaM can only detect subsets of them.
Robustness. In Fig. 8(a) we analyze the robustness of the methods w.r.t. the number of “noise” vertices in the datasets. The clustering quality of all approaches decreases for noisy data; however, the quality of DB-CSC is still reasonably high even for 1000 noise vertices (which is nearly 50% of the overall dataset). In the next experiment (Fig. 8(b)) we vary the clustering parameter ε. For GAMer and CoPaM we vary the allowed width of a cluster in the attribute space instead of ε. As shown in the figure, by choosing ε too small we cannot find all clusters and thus get smaller clustering qualities. However, for ε > 0.05 the clustering quality of DB-CSC remains stable. The competing methods have lower quality. In the last experiment (Fig. 8(c)) we evaluate the robustness of
Fig. 8. Clustering quality (F1 value): (a) quality vs. noise, (b) quality vs. ε, (c) quality vs. minPts
DB-CSC w.r.t. the parameter minPts. For too small values of minPts, many vertex sets are falsely detected as clusters; thus we obtain small clustering qualities. However, for sufficiently high minPts values the quality remains relatively stable, similar to the previous experiment.
Overall, the experiments show that DB-CSC obtains significantly higher clus-
tering qualities. Even though it uses a more sophisticated cluster model than
GAMer and CoPaM, the runtimes of DB-CSC are comparable to (and in some
cases even better than) those of the other approaches.
Real world data. As real world data sets we use gene data¹ and patent data², as also used in [9]. Since for real world data no “hidden” clusters are given against which we could compare our clustering results, we compare the properties of the clusters found by the different methods. For the gene data, DB-CSC detects
9 clusters with a mean size of 6.3 and a mean dimensionality of 13.2. In con-
trast, GAMer detects 30 clusters (mean size: 8.8 vertices, mean dim.: 15.5) and
CoPaM 115581 clusters (mean size: 9.7 vertices, mean dim.: 12.2), which are far
too many to be interpretable. In the patent data, DB-CSC detects 17 clusters
with a mean size of 19.2 vertices and a mean dimensionality of 3. In contrast,
GAMer detects 574 clusters with a mean size of 11.7 vertices and a mean di-
mensionality of 3. CoPaM did not finish on this dataset within two days. The
clusters detected by DB-CSC are more expanded than the clusters of GAMer,
which often simply are subsets of the clusters detected by DB-CSC.
6 Conclusion
solution. The clustering quality and the efficiency of DB-CSC are demonstrated
in the experimental section.
Acknowledgment. This work has been supported by the UMIC Research Cen-
tre, RWTH Aachen University, Germany, and the B-IT Research School.
References
1. Aggarwal, C., Wang, H.: Managing and Mining Graph Data. Springer, New York
(2010)
2. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: SIGMOD, pp.
94–105 (1998)
3. Assent, I., Krieger, R., Müller, E., Seidl, T.: EDSC: Efficient density-based subspace
clustering. In: CIKM, pp. 1093–1102 (2008)
4. Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is ”nearest neigh-
bor” meaningful? In: ICDT, pp. 217–235 (1999)
5. Dorogovtsev, S., Goltsev, A., Mendes, J.: K-core organization of complex networks.
Physical Review Letters 96(4), 40601 (2006)
6. Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale
social networks. In: WebKDD/SNA-KDD, pp. 16–25 (2007)
7. Ester, M., Ge, R., Gao, B.J., Hu, Z., Ben-Moshe, B.: Joint cluster analysis of
attribute data and relationship data: the connected k-center problem. In: SDM
(2006)
8. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
9. Günnemann, S., Färber, I., Boden, B., Seidl, T.: Subspace clustering meets dense
subgraph mining: A synthesis of two paradigms. In: ICDM, pp. 845–850 (2010)
10. Hanisch, D., Zien, A., Zimmer, R., Lengauer, T.: Co-clustering of biological net-
works and gene expression data. Bioinformatics 18, 145–154 (2002)
11. Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia
databases with noise. In: KDD, pp. 58–65 (1998)
12. Janson, S., Luczak, M.: A simple solution to the k-core problem. Random Struc-
tures & Algorithms 30(1-2), 50–62 (2007)
13. Kailing, K., Kriegel, H.P., Kroeger, P.: Density-connected subspace clustering for
high-dimensional data. In: SDM, pp. 246–257 (2004)
14. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A sur-
vey on subspace clustering, pattern-based clustering, and correlation clustering.
TKDD 3(1), 1–58 (2009)
15. Kubica, J., Moore, A.W., Schneider, J.G.: Tractable group detection on large link
data sets. In: ICDM, pp. 573–576 (2003)
16. Moser, F., Colak, R., Rafiey, A., Ester, M.: Mining cohesive patterns from graphs
with feature vectors. In: SDM, pp. 593–604 (2009)
17. Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: KDD, pp.
228–238 (2005)
18. Ulitsky, I., Shamir, R.: Identification of functional modules using network topology
and high-throughput data. BMC Systems Biology 1(1) (2007)
19. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute
similarities. PVLDB 2(1), 718–729 (2009)
20. Zhou, Y., Cheng, H., Yu, J.X.: Clustering large attributed graphs: An efficient
incremental approach. In: ICDM, pp. 689–698 (2010)
Learning the Parameters of Probabilistic Logic
Programs from Interpretations
1 Introduction
Statistical relational learning [12] and probabilistic logic learning [5,7] have con-
tributed various representations and learning schemes. Popular approaches in-
clude BLPs [15], ICL [18], Markov Logic [19], PRISM [22], PRMs [11], and
ProbLog [6,13]. These approaches differ not only in the underlying representa-
tions but also in the learning settings they employ.
For learning knowledge-based model construction (KBMC) approaches, such
as Markov Logic, PRMs, and BLPs, one normally uses relational state descrip-
tions as training examples. This setting is also known as learning from inter-
pretations. For training probabilistic programming languages one typically uses
learning from entailment [7,8]. PRISM and ProbLog, for instance, are probabilis-
tic logic programming languages that are based on Sato's distribution seman-
tics [21]. They use training examples in the form of labeled facts, where the labels
are either the truth values of these facts or target probabilities.
In the learning from entailment setting, one usually starts from observations
for a single target predicate. In the learning from interpretations setting, how-
ever, the observations specify the value for some of the random variables in a
state description. Probabilistic grammars and graphical models are illustrative
examples for each setting. Probabilistic grammars are trained on examples in the
form of sentences. Each training example states that a particular sentence was
derived or not, but it does not explain how it was derived. In contrast, Bayesian
The following ProbLog theory states that there is a burglary with probability
0.1, an earthquake with probability 0.2 and if either of them occurs the alarm
will go off. If the alarm goes off, a person X will be notified and will therefore
call with the probability of al(X), that is, 0.7.
F  = {0.1 :: burglary, 0.2 :: earthquake, 0.7 :: al(X)}
BK = {person(mary)., person(john)., alarm :- burglary; earthquake.,
      calls(X) :- person(X), alarm, al(X).}
The set of atomic choices in this program is {al(mary), al(john), burglary,
earthquake}, and each total choice is a subset of this set. Each total choice L
combined with the background knowledge BK defines a Prolog program. Con-
sequently, the probability distribution at the level of atomic choices also in-
duces a probability distribution over possible definite clause programs of the
form L ∪ BK. Furthermore, each such program has a unique least Herbrand in-
terpretation, which is the set of all the ground facts entailed by the program,
representing a possible world. For the total choice {burglary}, e.g., the least
Herbrand interpretation is {burglary, alarm, person(john), person(mary)}.
Hence, the probability distribution at the level of total choices also induces a
probability distribution at the level of possible worlds. The probability Pw(I) of
this interpretation is 0.1 × (1 − 0.2) × (1 − 0.7)². We define the success
probability of a query q as

$$P_s(q \mid T) \;=\; \sum_{\substack{L \subseteq L_T \\ L \cup BK \models q}} P(L \mid T) \;=\; \sum_{L \subseteq L_T} \delta(q, BK \cup L) \cdot P(L \mid T) \qquad (1)$$
$$P_s(calls(X) \mid T) = P_s(calls(john) \mid T) + P_s(calls(mary) \mid T) = 1.$$
So, the predicates do not encode a probability distribution over their instances.
This differs from probabilistic grammars and their extensions such as stochastic
logic programs [4], where each predicate or non-terminal defines a probability dis-
tribution over its instances, which enables these approaches to sample instances.¹

¹ Throughout the paper, we shall assume that F is finite; see [21] for the infinite case.
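The distribution semantics sketched above can be made concrete in a few lines of code. The following Python sketch (an illustration, not the ProbLog implementation) hard-codes the clauses of the alarm program and computes world and success probabilities by enumerating all 2⁴ total choices:

    from itertools import product

    # Probabilistic facts of the alarm program and their probabilities.
    FACTS = {"burglary": 0.1, "earthquake": 0.2, "al(john)": 0.7, "al(mary)": 0.7}

    def world(choice):
        # Least Herbrand model of L ∪ BK for a total choice L; the alarm
        # program's clauses are hard-coded here for brevity.
        atoms = {f for f, chosen in choice.items() if chosen}
        atoms |= {"person(john)", "person(mary)"}
        if "burglary" in atoms or "earthquake" in atoms:
            atoms.add("alarm")                    # alarm :- burglary; earthquake.
        for x in ("john", "mary"):
            if "alarm" in atoms and f"al({x})" in atoms:
                atoms.add(f"calls({x})")          # calls(X) :- person(X), alarm, al(X).
        return atoms

    def p_of(choice):
        # P(L|T): product over chosen facts (p_n) and unchosen facts (1 - p_n).
        p = 1.0
        for f, chosen in choice.items():
            p *= FACTS[f] if chosen else 1.0 - FACTS[f]
        return p

    def success_probability(query_atom):
        # Eq. (1): sum P(L|T) over all total choices whose world entails the query.
        total = 0.0
        for bits in product([True, False], repeat=len(FACTS)):
            choice = dict(zip(FACTS, bits))
            if query_atom in world(choice):
                total += p_of(choice)
        return total

For instance, the total choice {burglary} alone yields the world {burglary, alarm, person(john), person(mary)} with probability 0.1 × (1 − 0.2) × (1 − 0.7)² = 0.0072, matching the computation above, and success_probability("alarm") returns 1 − 0.9 · 0.8 = 0.28.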
Thus, we are given a ProbLog program and a set of partial interpretations, and
the goal is to find the maximum likelihood parameters. One has to consider
two cases when computing p̂. For complete interpretations where everything is
observable, one can obtain p̂ by counting (cf. Sect. 3.1). In the more complex
case of partial interpretations, one has to use an approach that is capable of
handling partial observability (cf. Sect. 3.2).
We denote by $Z_n = \sum_{m=1}^{M} K_n^m$ the total number of ground instances
of the fact f_n in all training examples. If Z_n is zero, i.e., no ground
instance of f_n is used, p̂_n is undefined and one must not update p_n.
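In the fully observable case, this closed-form update is simple frequency counting. A minimal Python sketch (the data layout is an assumption for illustration, not the authors' implementation):

    from collections import defaultdict

    def ml_by_counting(current_p, interpretations):
        # Fully observable ML update (cf. Sect. 3.1): for each probabilistic fact
        # f_n, the new p_n is the fraction of its ground instances that are true
        # across all training interpretations. Each interpretation is assumed to
        # be a dict mapping (fact_name, ground_args) to True/False.
        true_counts = defaultdict(int)   # sum over m, k of delta^m_{n,k}
        z = defaultdict(int)             # Z_n: total number of ground instances
        for interp in interpretations:
            for (name, _args), truth in interp.items():
                z[name] += 1
                true_counts[name] += int(truth)
        new_p = dict(current_p)
        for name in current_p:
            if z[name] > 0:              # if Z_n = 0, p_n must not be updated
                new_p[name] = true_counts[name] / z[name]
        return new_p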
Before moving on to the partially observable case, let us consider the issue of
determining the possible substitutions θ^m_{n,k} for a fact p_n :: f_n and an
interpretation I_m. To resolve this, we assume that the facts f_n are typed and
that each interpretation I_m contains an explicit definition of the different
types in the form of fully observable unary predicates. In the alarm example,
the predicate person/1 can be regarded as the type of the (first) argument of
al(X) and calls(X). This predicate can differ between interpretations. One
person, for instance, can have john and mary as neighbors, another one ann,
bob and eve.
As in the fully observable case, the domains are assumed to be given. Before
describing the Soft-EM algorithm for finding p̂_n, we illustrate one of its cru-
cial properties using the alarm example. Assume that our partial interpretation
is I⁺ = {person(mary), person(john), alarm} and I⁻ = ∅. It is clear that
for calculating the marginal probability of all probabilistic facts – these are
the expected counts – only the atoms in {burglary, earthquake, al(john),
al(mary)} ∪ I are relevant. This is due to the fact that the remaining atoms
{calls(john), calls(mary)} cannot be used in any proof for the facts observed
in the interpretations. We call the set of atoms that are relevant for the
distribution of a ground atom x the dependency set of x. It is defined as
depT(x) := {f ground fact | a ground SLD-proof in T for x contains f}. Our goal
is to restrict the probability calculation to the dependent atoms only. Hence we
generalize this set to partial interpretations I as
$dep_T(I) := \bigcup_{x \in (I^+ \cup I^-)} dep_T(x)$
and introduce the notion of a restricted ProbLog theory.
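The dependency set can be computed by simple reachability over the ground program: every fact occurring in some ground SLD-proof for x is reachable from x through clause bodies, so reachability gives a safe over-approximation of depT(x). A small Python sketch under hypothetical data structures:

    def dependency_set(x, ground_rules, prob_facts):
        # Reachability-based approximation of dep_T(x): collect all ground
        # probabilistic facts reachable from atom x through clause bodies.
        # `ground_rules` maps each ground head to a list of bodies (each body
        # a list of ground atoms); `prob_facts` is the set of ground
        # probabilistic facts.
        seen, stack, deps = set(), [x], set()
        while stack:
            atom = stack.pop()
            if atom in seen:
                continue
            seen.add(atom)
            if atom in prob_facts:
                deps.add(atom)
            for body in ground_rules.get(atom, []):
                stack.extend(body)
        return deps

    def dep_of_interpretation(pos, neg, ground_rules, prob_facts):
        # dep_T(I): the union of dep_T(x) over all atoms x in I+ ∪ I-.
        deps = set()
        for x in set(pos) | set(neg):
            deps |= dependency_set(x, ground_rules, prob_facts)
        return deps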
For the partial interpretation I = ({burglary, alarm}, ∅), for instance, BKr (I)
is {alarm :- burglary, alarm :- earthquake} and the restricted set of facts
F r (I) is {0.1 :: burglary, 0.2 :: earthquake}.
The restricted theory T^r(I) cannot be larger than T. More importantly, it is
always finite, since we assume the finite support property and the evidence to be
a finite conjunction of ground atoms. In many cases it will be much smaller, which
allows for learning in domains where the original theory does not fit in memory.
Using the independence of probabilistic facts in ProbLog, it can be shown that
the conditional probability of a ground instance of f_n given I calculated in the
theory T is equivalent to the probability calculated in T^r(I), that is,
$$E_T[\delta^m_{n,k} \mid I_m] \;=\; \begin{cases} E_{T^r(I_m)}[\delta^m_{n,k} \mid I_m] & \text{if } f_n \in dep_T(I_m) \\ p_n & \text{otherwise} \end{cases} \qquad (4)$$
We exploit this property in the following section when developing the Soft-EM
algorithm for finding the maximum likelihood parameters p̂ defined in (3).
The algorithm starts by constructing a Binary Decision Diagram (BDD) [2] for
every training example I_m (cf. Sect. 4.1), which is then used to compute the
expected counts E[δ^m_{n,k} | I_m] (cf. Sect. 4.3). A BDD is a compact graphical represen-
tation of a Boolean formula. In our case, the Boolean formula (or, equivalently,
the BDD) represents the conditions under which the partial interpretation will
be generated by the ProbLog program and the variables in the formula are the
ground atoms in depT (Im ). Basically, any truth assignment to these facts that
satisfies the Boolean formula (or the BDD) will result in the partial interpreta-
tion. Given a fixed variable order, a Boolean function f can be represented as
a full Boolean decision tree where each node N on the ith level is labeled with
the ith variable and has two children called low l(N ) and high h(N ). Each path
from the root to a leaf represents a complete variable assignment. If variable
x is assigned 0 (1), the branch to the low (high) child is taken. Each leaf is
labeled with the value of f given the variable assignment represented by the cor-
responding path from the root. We use 1 to denote true and 0 to denote false.
Starting from such a tree, one obtains a BDD by merging isomorphic subgraphs
and deleting redundant nodes until no further reduction is possible. A node is
redundant if and only if the subgraphs rooted at its children are isomorphic. In
Fig. 1, dashed edges indicate 0’s and lead to low children, solid ones indicate 1’s
and lead to high children.
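The merge-and-delete reduction just described takes only a few lines. Here is a self-contained Python sketch, assuming (for illustration only) that the decision tree is given as nested tuples (var, low, high) with integer leaves 0 and 1:

    def reduce_bdd(f, unique=None):
        # Reduce a decision tree to a BDD: merge isomorphic subgraphs via
        # hash-consing and delete redundant nodes (nodes whose two children
        # are identical) -- exactly the two rules described above.
        unique = {} if unique is None else unique
        if f in (0, 1):                       # terminal: nothing to reduce
            return f
        var, low, high = f
        low = reduce_bdd(low, unique)
        high = reduce_bdd(high, unique)
        if low == high:                       # redundant node: skip it entirely
            return low
        # hash-consing: identical (var, low, high) triples share one node
        return unique.setdefault((var, low, high), (var, low, high))

    # e.g. reduce_bdd(('x', ('y', 1, 1), ('y', 1, 1))) == 1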
Fig. 1. The different steps of the LFI-ProbLog algorithm for the training example I + =
{alarm}, I − = {calls(john)}. Normally the alarm node in the BDD is propagated
away in Step 4, but it is kept here for illustrative purposes. The nodes are labeled with
their probability and the up- and downward probabilities.
The LFI-ProbLog algorithm generates the BDD that encodes a partial interpre-
tation I. Due to the usage of Clark’s completion in Step 3 the algorithm requires
a tight ProbLog program as input. Clark’s completion allows one to propagate
values from the head to the bodies of clauses and vice versa. It states that the
head is true if and only if at least one of its bodies is true, which captures the
least Herbrand model semantics of tight definite clause programs. The algorithm
works as follows (cf. Fig. 1):
1. Compute depT (I) . This is the set of ground atoms that may have an influence
on the truth value of the atoms with known truth value in the partial in-
terpretation I. This is realized by applying the definition of depT (I) directly
using a tabled meta-interpreter in Prolog. We use tabling to store subgoals
and avoid recomputation.
2. Use depT (I) to compute BKr (I), the background theory BK restricted to the
interpretation I (cf. Definition 2 and (4)).
3. Compute clark(BKr (I)), which denotes Clark's completion of BKr (I); it
is computed by replacing all clauses with the same head, h :- body1 , ...,
h :- bodyn , by the corresponding formula h ↔ body1 ∨ . . . ∨ bodyn (a small
sketch of this step is given after this list).
4. Simplify clark(BKr (I)) by propagating known values for the atoms in I.
This step eliminates ground atoms with known truth value in I: we simply fill
in their values in the theory clark(BKr (I)) and then simplify the resulting
formulae.
The resulting set of BDDs is used by the algorithm outlined in the next section
to compute the expected counts.
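Step 3 (Clark's completion) is mechanical; a minimal Python sketch under an assumed rule representation (a dict mapping each ground head to its list of bodies, each body a conjunction of atoms):

    def clarks_completion(restricted_rules):
        # Clark's completion of the restricted ground program: all clauses
        # h :- body_1, ..., h :- body_k with the same head h are replaced by
        # the single formula h <-> body_1 v ... v body_k.
        completed = {}
        for head, bodies in restricted_rules.items():
            disjuncts = ["(" + " & ".join(body) + ")" for body in bodies]
            completed[head] = head + " <-> " + " | ".join(disjuncts)
        return completed

    # e.g. clarks_completion({"alarm": [["burglary"], ["earthquake"]]})
    # yields {"alarm": "alarm <-> (burglary) | (earthquake)"},
    # matching the restricted alarm example above.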
One computes the downward probability α from the root to the leaves and the
upward probability β from the leaves to the root. Intermediate results are stored
and reused when nodes are revisited. Both parts are sketched in Algorithm 1.
Fig. 2. Propagation step of the upward probability (left) and for the downward proba-
bility (right). The indicator function πN is 1 if N is a probabilistic node and 0 otherwise.
Algorithm 1. Calculating α and β. The nodes l(n) and h(n) are the low and
high child of the node n, respectively.

function Alpha(BDD node n)
  if n is the 1-terminal then return 1
  if n is the 0-terminal then return 0
  if n is a probabilistic fact then
    return p_n · Alpha(h(n)) + (1 − p_n) · Alpha(l(n))
  return Alpha(h(n)) + Alpha(l(n))

function Beta(BDD node n)
  q := priority queue using the BDD's order
  enqueue(q, n)
  Beta := array of 0's of length size(BDD)
  Beta[root(n)] := 1
  while q not empty do
    n := dequeue(q)
    Beta[h(n)] += Beta[n] · p_n^{π_n}
    Beta[l(n)] += Beta[n] · (1 − p_n)^{π_n}
    enqueue(q, h(n)) if not yet in q
    enqueue(q, l(n)) if not yet in q
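To make Algorithm 1 concrete, here is a minimal Python sketch under an assumed node representation (an illustration, not the SimpleCUDD-based implementation used by the authors). Deterministic nodes carry prob = None, which corresponds to the indicator π_n = 0:

    import heapq

    class Node:
        # BDD node for illustration: terminals carry value 0/1; inner nodes
        # carry prob = p_n for probabilistic facts, prob = None otherwise.
        def __init__(self, low=None, high=None, prob=None, value=None):
            self.low, self.high, self.prob, self.value = low, high, prob, value

    def alpha(n, memo=None):
        # Alpha of Algorithm 1 (left): recursion to the terminals, with
        # memoisation so results are reused when nodes are revisited.
        memo = {} if memo is None else memo
        if n.value is not None:
            return float(n.value)             # 1-terminal -> 1, 0-terminal -> 0
        if id(n) not in memo:
            if n.prob is not None:            # probabilistic fact
                memo[id(n)] = (n.prob * alpha(n.high, memo)
                               + (1 - n.prob) * alpha(n.low, memo))
            else:                             # deterministic node
                memo[id(n)] = alpha(n.high, memo) + alpha(n.low, memo)
        return memo[id(n)]

    def beta(root, level):
        # Beta of Algorithm 1 (right): Beta[root] = 1, each node passes
        # Beta[n] * p_n^{pi_n} to its high child and Beta[n] * (1-p_n)^{pi_n}
        # to its low child. `level[id(node)]` is the node's position in the
        # BDD's variable order (terminals last), so every parent is processed
        # before its children.
        B = {id(root): 1.0}
        heap, queued = [(level[id(root)], id(root), root)], {id(root)}
        while heap:
            _, _, n = heapq.heappop(heap)
            if n.value is not None:
                continue
            w_high = n.prob if n.prob is not None else 1.0        # p_n^{pi_n}
            w_low = (1 - n.prob) if n.prob is not None else 1.0   # (1-p_n)^{pi_n}
            for child, w in ((n.high, w_high), (n.low, w_low)):
                B[id(child)] = B.get(id(child), 0.0) + B[id(n)] * w
                if id(child) not in queued:
                    queued.add(id(child))
                    heapq.heappush(heap, (level[id(child)], id(child), child))
        return B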
5 Experiments
We implemented LFI-ProbLog in YAP Prolog and used SimpleCUDD for the
BDD operations. We used two datasets to evaluate LFI-ProbLog. The WebKB
benchmark serves as a test case to compare with state-of-the-art systems. The
Smokers dataset is used to assess the algorithm in terms of the learned model,
that is, how close the learned parameters are to the original ones. The experiments
were run on an Intel Core 2 Quad machine (2.83 GHz) with 8 GB RAM.
5.1 WebKB
The goal of this experiment is to answer the following questions:
Q1. Is LFI-ProbLog competitive with existing state-of-the-art frameworks?
Q2. Is LFI-ProbLog insensitive to the initial probabilities?
Q3. Is the theory splitting algorithm capable of handling large data sets?
In this experiment, we used the WebKB [3] dataset. It contains four folds, each
describing the link structure of pages from one of the following universities:
Cornell, Texas, Washington, and Wisconsin. WebKB is a collective classification
task, that is, one wants to predict the class of a page depending on the classes
of the pages that link to it and depending on the words being used in the
text.

Fig. 3. Area under the ROC curve against the learning time (left) and test set log
likelihood for each iteration of the EM algorithm (right) for WebKB

To allow for an objective comparison with Markov Logic networks and
the results of Domingos and Lowd [9], we used their slightly altered version of
WebKB. In their setting each page is assigned exactly one of the classes “course”,
“faculty”, “other”, “researchproject”, “staff”, or “student”. Furthermore, the
class “person”, present in the original version, has been removed. We use the
following model that contains one non-ground probabilistic fact for each pair of
Class and Word. To account for the link structure, it contains one non-ground
probabilistic fact for each pair of Class1 and Class2.
P :: pfWoCla(Page, Class, Word).
P :: pfLiCla(Page1, Page2, Class1, Class2).
The probabilities P are unknown and have to be learned by LFI-ProbLog. As
there are 6 classes and 771 words, our model has 6×771+6×6 = 4662 parameters.
In order to combine the probabilistic facts and predict the class of a page we
add the following background knowledge.
cl(Pa, C) :- hasWord(Pa, Word), pfWoCla(Pa, Word, C).
cl(Pa, C) :- linksTo(Pa2, Pa), pfLiCla(Pa2, Pa, C2, C), cl(Pa2, C2).
We performed a 4-fold cross validation, that is, we trained the model on three
universities and then tested it on the fourth one. We repeated this for all four
universities and averaged the results. We measured the area under the precision-
recall curve (AUC-PR), the area under the ROC curve (AUC-ROC), the log
likelihood (LLH), and the accuracy after each iteration of the EM algorithm.
Our model does not express that each page has exactly one class. To account
for this, we normalize the probabilities per page. Figure 3 (left) shows the AUC-
ROC plotted against the average training time. The initialization phase, that is,
running steps 1–4 of LFI-ProbLog, takes ≈ 330 seconds, and each iteration of the
EM algorithm takes ≈ 62 seconds. We initialized the probabilities of the model
randomly with values sampled from the uniform distribution between 0.1 and
0.9, which is shown as the graph for LFI-ProbLog [0.1-0.9]. After 10 iterations
(≈ 800 s) the AUC-ROC is 0.950 ± 0.002, the AUC-PR is 0.828 ± 0.006, and the
accuracy is 0.769 ± 0.010.
We compared LFI-ProbLog with Alchemy [9] and LeProbLog [13]. Alchemy
is an implementation of Markov Logic networks. We use the model suggested by
592 B. Gutmann, I. Thon, and L. De Raedt
Domingos and Lowd [9] that uses the same features as our model, and we train
it according to their setup.² The learning curve for AUC-ROC is shown in Fig-
ure 3 (left). After 943 seconds Alchemy achieves an AUC-ROC of 0.923 ± 0.016,
an AUC-PR of 0.788 ± 0.036, and an accuracy of 0.746 ± 0.032. LeProbLog is a
regression-based parameter learning algorithm for ProbLog. The training data
has to be provided in the form of queries annotated with the target probability.
It is not possible to learn from interpretations. For WebKB, however, one can
map one interpretation to several training examples of the form
P(class(URL, Class)) = P, one per page, where P is 1 if the class of URL is
Class and 0 otherwise. This is possible due to the existence of a target
predicate. We used the standard settings of LeProbLog and limited the runtime
to 24 hours. Within this limit, the algorithm performed 35 iterations of gradient
descent. The final model obtained an AUC-PR of 0.419 ± 0.014, an AUC-ROC of
0.738 ± 0.014, and an accuracy of 0.396 ± 0.020. These results affirmatively
answer Q1.
We tested how sensitive LFI-ProbLog is to the initial fact probabilities by
repeating the experiment with values sampled uniformly between 0.1 and 0.3,
and between 0.0001 and 0.0003, respectively. As the graphs in Figure 3 indicate,
the convergence is initially slower and the initial LLH values differ. This is
due to the fact that the ground truth probabilities are small, and if the initial
fact probabilities are small too, one obtains a better initial LLH. All settings
converge to the same results in terms of AUC and LLH. This suggests that
LFI-ProbLog is insensitive to the starting values (cf. Q2).
The BDDs for the WebKB dataset are too large to fit in memory and the
automatic variable reordering is unable to construct the BDD in a reasonable
amount of time. We used two different approaches to resolve this. In the first
approach, we manually split each training example, that is, the grounded theory
together with the known class for each page, into several training examples. The
results shown in Figure 3 are based on this manual split. In the second approach,
we used the automatic splitting algorithm presented in Section 4.2. The resulting
BDDs are identical to the manual split setting, and the subsequent runs of the
EM algorithm converge to the same results. Hence, when plotting against the it-
eration number, the graphs are identical. The resulting ground theory is much larger and
the initialization phase therefore takes 247 minutes. However, this is mainly due
to the overhead for indexing, database access and garbage collection in the un-
derlying Prolog system. Grounding and Clark’s completion take only 6 seconds
each, the term simplification step takes roughly 246 minutes, and the final split-
ting algorithm runs in 40 seconds. As we did not optimize the implementation
of the term simplification, we see a big potential for improvement, for instance
by tabling intermediate simplification steps. This affirmatively answers Q3.
5.2 Smokers
We set up an experiment on an instance of the Smokers dataset (cf. [9]) to
answer the question
² Daniel Lowd provided us with the original scripts for the experiment setup. We
report on the evaluation based on the rerun of the experiment.
Fig. 4. Results for the KL divergence in the Smokers domain. The plots are, from left
to right, for 3, 4, and 5 smokers. Different graphs correspond to different amounts of
missing data (0–50%).
Q4. Is LFI-ProbLog able to recover the parameters of the original model with a
reasonable amount of data?
Missing or incorrect values are two different types of noise that can occur in
real-world data. While incorrect values can be compensated for by additional data,
missing values cause local maxima in the likelihood function. In turn, these cause
the learning algorithm to yield parameters different from the ones used to gener-
ate the data. LFI-ProbLog computes the maximum likelihood parameters given
some evidence. Hence the algorithm should be capable of recovering the param-
eters used to generate a set of interpretations. We analyze how the amount of
required training data increases as the size of the model increases. Furthermore,
we test for the influence of missing values on the results. We assess the quality
of the learned model, that is, its difference from the original model parameters,
by computing the Kullback-Leibler (KL) divergence. ProbLog allows for an effi-
cient computation of this measure due to the independence of the probabilistic
facts. In this experiment, we use a variant of the “Smokers” model which can be
represented in ProbLog as follows:
Due to space restrictions, we omit the details on how to represent this such that
the program is tight. We set the number of persons to 3, 4, and 5, respectively,
and sampled from the resulting models up to 200 interpretations each. From these
datasets we derived new instances by randomly removing 10–50% of the atoms.
The size of an interpretation grows quadratically with the number of persons.
The model, as described above, has an implicit parameter tying between ground
instances of non-ground facts. Hence the number of model parameters does not
change with the number of persons. To measure the influence of the model
size, we therefore trained grounded versions of the model, where the grounding
depends on the number of persons. For each dataset we ran LFI-ProbLog for
50 iterations of EM. Manual inspection showed that the probabilities stabilized
after a few, typically 10, iterations. Figure 4 shows the KL divergence for 3, 4
and 5 persons respectively. The closer the KL divergence is to 0, the closer the
learned model is to the original parameters. As the graphs show, the learned
parameters approach the parameters of the original model as the number of
training examples grows. Furthermore, the amount of missing values has little
influence on the distance between the true and the learned parameters. Hence
LFI-ProbLog is capable of recovering the original parameters and it is robust
against missing values. This affirmatively answers Q4.
6 Related Work
Most of the existing parameter learning approaches for ProbLog [6], PRISM [22],
and SLPs [17] are based on learning from entailment. For ProbLog, there exists a
learning algorithm based on regression where each training example is a ground
fact together with the target probability [13]. In contrast to LFI-ProbLog, this
approach does not assume an underlying generative process; neither at the level
of predicates nor at the level of interpretations. Sato and Kameya have con-
tributed various interesting and advanced learning algorithms that have been
incorporated in PRISM. Ishihata et al. [14] consider a parameter learning set-
ting based on Binary Decision Diagrams (BDDs) [2]. In contrast to our work,
they assume the BDDs to be given, whereas LFI-ProbLog constructs them in
an intelligent way from evidence and a ProbLog theory. Ishihata et al. suggest
that their approach can be used to perform learning from entailment for PRISM
programs. This approach has been recently adopted for learning CP-Logic pro-
grams (cf. [1]). The BDDs constructed by LFI-ProbLog are a compact represen-
tation of all possible worlds that are consistent with the evidence. LFI-ProbLog
estimates the marginals of the probabilistic facts in a dynamic programming
manner on the BDDs. While this step is inspired by [14], we tailored it to-
wards the specifics of LFI-ProbLog, that is, we allow deterministic nodes to be
present in the BDDs. This extension is crucial, as the removal of deterministic
nodes can result in an exponential growth of the Boolean formulae underlying
the BDD construction. Riguzzi [20] uses a transformation of ground ProbLog
programs to Bayesian networks in order to learn ProbLog programs from inter-
pretations. Such a transformation is also employed in the learning approaches
for CP-logic [24,16]. Thon et al. [23] studied how CPT-L, a sequential variant
of CP-Logic, can be learned from sequences of interpretations. CPT-L is closely
related to LFI-ProbLog. However, CPT-L is targeted towards the sequential as-
pect of the theory, whereas we consider a more general setting with arbitrary
theories. Thon et al. assume full observability, which allows them to split the
sequence into separate transitions. They build one BDD per transition, which
is much easier to construct than one large BDD per sequence. Our splitting al-
gorithm is capable of exploiting arbitrary independence. LFI-ProbLog can also
be related to knowledge-based model construction approaches in statistical rela-
tional learning such as BLPs, PRMs and MLNs [19]. While the setting explored
in this paper is standard for the aforementioned formalisms, our approach has
significant representational and algorithmic differences from the algorithms used
in those formalisms. In BLPs, PRMs and CP-logic, each training example is
typically used to construct a ground Bayesian network on which a standard
learning algorithm is applied. Although the representation generated by Clark’s
completion is quite close to the representation of Markov Logic, there are subtle
differences. While Markov Logic uses weights on clauses, we use probabilities
attached to single facts.
7 Conclusions
We have introduced a novel parameter learning algorithm from interpretations
for the probabilistic logic programming language ProbLog. This has been mo-
tivated by the differences in the learning settings and applications of typical
knowledge-based model construction approaches and probabilistic logic program-
ming approaches. The LFI-ProbLog algorithm tightly couples logical inference
with a probabilistic EM algorithm at the level of BDDs. Possible directions of
future work include using d-DNNF representations instead of BDDs [10] and a
transformation to Boolean formulae that does not require tight programs.
References
1. Bellodi, E., Riguzzi, F.: EM over binary decision diagrams for probabilistic logic
programs. Tech. Rep. CS-2011-01, Università di Ferrara, Italy (2011)
2. Bryant, R.E.: Graph-based algorithms for boolean function manipulation. IEEE
Trans. Computers 35(8), 677–691 (1986)
3. Craven, M., Slattery, S.: Relational learning with statistical predicate invention:
Better models for hypertext. Machine Learning 43(1/2), 97–119 (2001)
4. Cussens, J.: Parameter estimation in stochastic logic programs. Machine Learn-
ing 44(3), 245–271 (2001)
5. De Raedt, L., Frasconi, P., Kersting, K., Muggleton, S. (eds.): Probabilistic Induc-
tive Logic Programming — Theory and Applications. LNCS (LNAI), vol. 4911.
Springer, Heidelberg (2008)
6. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its
application in link discovery. In: Veloso, M. (ed.) IJCAI, pp. 2462–2467 (2007)
7. De Raedt, L.: Logical and Relational Learning. Springer, Heidelberg (2008)
8. De Raedt, L., Kersting, K.: Probabilistic inductive logic programming. In: Ben-
David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp.
19–36. Springer, Heidelberg (2004)
9. Domingos, P., Lowd, D.: Markov Logic: An Interface Layer for Artificial Intelli-
gence. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan
& Claypool Publishers, San Francisco (2009)
10. Fierens, D., Van den Broeck, G., Thon, I., Gutmann, B., De Raedt, L.: Inference
in probabilistic logic programs using weighted CNF's. In: The 27th Conference on
Uncertainty in Artificial Intelligence, UAI 2011 (to appear, 2011)
11. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational
models. In: Džeroski, S., Lavrač, N. (eds.) Relational Data Mining, pp. 307–335.
Springer, Heidelberg (2001)
12. Getoor, L., Taskar, B. (eds.): An Introduction to Statistical Relational Learning.
MIT Press, Cambridge (2007)
13. Gutmann, B., Kimmig, A., De Raedt, L., Kersting, K.: Parameter learning in
probabilistic databases: A least squares approach. In: Daelemans, W., Goethals,
B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp.
473–488. Springer, Heidelberg (2008)
14. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algo-
rithm by BDDs. In: ILP (2008)
15. Kersting, K., De Raedt, L.: Bayesian logic programming: theory and tool. In:
Getoor, L., Taskar, B. (eds.) [12]
16. Meert, W., Struyf, J., Blockeel, H.: Learning ground cp-logic theories by leveraging
bayesian network learning techniques. Fundam. Inform. 89(1), 131–160 (2008)
17. Muggleton, S.: Stochastic logic programs. In: De Raedt, L. (ed.) Advances in In-
ductive Logic Programming. Frontiers in Artificial Intelligence and Applications,
vol. 32. IOS Press, Amsterdam (1996)
18. Poole, D.: The independent choice logic and beyond. In: De Raedt, L., et al. (eds.) [5]
19. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62, 107–
136 (2006)
20. Riguzzi, F.: Learning ground problog programs from interpretations. In: Proceed-
ings of the 6th Workshop on Multi-Relational Data Mining, MRDM 2007 (2007)
21. Sato, T.: A statistical learning method for logic programs with distribution se-
mantics. In: Sterling, L. (ed.) Proceedings of the 12th International Conference on
Logic Programming, pp. 715–729. MIT Press, Cambridge (1995)
22. Sato, T., Kameya, Y.: Parameter learning of logic programs for symbolic-statistical
modeling. Journal of Artificial Intelligence Research 15, 391–454 (2001)
23. Thon, I., Landwehr, N., De Raedt, L.: A simple model for sequences of relational
state descriptions. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD
2008, Part I. LNCS (LNAI), vol. 5211, pp. 506–521. Springer, Heidelberg (2008)
24. Vennekens, J., Denecker, M., Bruynooghe, M.: Representing causal information
about a probabilistic process. In: Fisher, M., van der Hoek, W., Konev, B., Lisitsa,
A. (eds.) JELIA 2006. LNCS (LNAI), vol. 4160, pp. 452–464. Springer, Heidelberg
(2006)
Feature Selection Stability Assessment Based on
the Jensen-Shannon Divergence
1 Introduction
Feature selection techniques play an important role in classification problems
with high dimensional data [6]. Reducing the data dimensionality is a key step
in these applications as the size of the training data set needed to calibrate
a model grows exponentially with the number of dimensions (the curse of
dimensionality) and the process of knowledge discovery from the data is
simplified if the instances are represented with fewer features. (This work has
been partially supported by the Spanish MEC project DPI2009-08424.)
Feature selection techniques measure the importance of the features according
to the value of a given function. These algorithms can be basically divided into
three types [7]: filter, wrapper and embedded approaches. The filter methods
select the features according to a reasonable criterion computed directly from the
data and that is independent of the classification model. The wrapper approaches
make use of the predictive performance of the classification machine in order to
determine the value of a given feature subset and the embedded techniques
are specific for each model since they are intrinsically defined in the inductive
algorithm. Regarding the outcome of the feature selection technique, the output
format may be a full ranked list (or weighting score) or a subset of features.
Obviously, representation changes are possible: a feature subset can be
extracted from a full ranked list by selecting the most important features, and
a partial ranked list can also be derived directly from the full ranking by
removing the least important features.
A problem that arises in many practical applications, in particular when the
available dataset is small and the feature dimensionality is high, is that small
variations in the data lead to different outcomes of the feature selection algo-
rithm. This disparity among research findings has made the study of the stability
(or robustness) of feature selection a topic of recent interest. Fields like
biomedicine, bioinformatics or chemometrics require not only
terest. Fields like biomedicine, bioinformatics or chemometrics require not only
accurate classification models, but a feature ranking or a subset of the most
important features in order to better understand the data and the underlying
process. The fact that under small variations in the available training data, the
top-k feature list (or the ranked feature list) varies, makes this task not straight-
forward and the conclusions derived from it quite unreliable.
The assessment of the robustness of feature selection/ranking methods be-
comes an important issue [11,9,3,1], especially when the aim is to gain insight
into the underlying process by analyzing the most relevant features. Neverthe-
less, this is a topic that has received little attention and it has been only during
the last decade that several works address this analysis. In order to measure
the stability, suitable metrics for each output format of the feature selection
algorithms are required.
The Spearman’s rank correlation coefficient [10,11,19] and Canberra distance
[9] have been proposed to measure the similarity when the outcome represen-
tation is a full ranked list. When the goal is to measure the similarity between
top-k lists (partial lists), a wide variety of measures have been proposed: Jac-
card distance [11,19], an adaptation of the Tanimoto distance [11], Kuncheva’s
stability index [13], Relative Hamming distance [5], Consistency measures [20],
Dice-sorense’s index [15], Ochiai’s index [22] or Percentage of overlapping fea-
tures [8]. An alternative that lies between full ranked lists (all features with
ranking information) and partial lists (a subset with the top-k features, where
all of them are given the same importance) is the use of partial ranked lists,
that is, a list with the top-k features and the relative ranking among them.
This approach has been used in the information retrieval domain [2] to evaluate
queries and it seems more natural when the goal is to analyze a subset of fea-
tures. Providing information of the feature importance is fundamental to carry
out a subsequent analysis of the data, but stability measures have not yet been
proposed for these partial ranked lists.
In our context, the evaluation of the robustness of feature selection techniques,
two ranked lists would be considered much less similar if their differences oc-
curred at the "top" rather than at the "bottom" of the lists. Unlike metrics such
as Kendall's tau and Spearman's rank correlation coefficient, which do not
capture this information, we propose a stability measure based on information
theory that takes this into consideration. Our proposal is based on mapping each
ranked list into a probability distribution and then, measuring the dissimilar-
ity among these distributions using the information-theoretic Jensen-Shannon
divergence. This single metric, SJS (Similarity based on the Jensen-Shannon
divergence) applies to full ranked lists, partial ranked lists as well as top-k lists.
The rest of this paper is organized as follows: Next, Section 2 describes the
stability problem and common approaches to deal with it. The new metric based
on the Jensen-Shannon divergence SJS is presented in Section 3. Experimental
evaluation is shown in Section 4 and finally Section 5 summarizes the main
conclusions.
2 Problem Formulation
In this section, we formulate the problem mathematically and present two com-
mon metrics to evaluate the stability of a feature selection and a feature ranking
algorithm.
to compute pairwise similarities and average the results, which leads to a single
scalar value.
$$S(A) = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} S_M(r_i, r_j) \qquad (6)$$
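As a sketch, Eq. (6) amounts to the following in Python, with any pairwise similarity S_M plugged in (e.g., Spearman's rank correlation):

    from itertools import combinations

    def pairwise_stability(rankings, similarity):
        # Eq. (6): average of a pairwise similarity measure S_M over all
        # N(N-1)/2 pairs of ranking outcomes r_1, ..., r_N.
        pairs = list(combinations(rankings, 2))
        return sum(similarity(ri, rj) for ri, rj in pairs) / len(pairs)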
This measure is always non-negative, taking values from 0 to ∞, with DKL(p‖q) =
0 if p = q. The KL divergence, however, has two important drawbacks: (a)
in general it is asymmetric (DKL(p‖q) ≠ DKL(q‖p)) and (b) it does not gener-
alize to more than two distributions. For this reason, we use the related Jensen-
Shannon divergence [14], which is a symmetric version of the Kullback-Leibler
divergence and is given by
$$D_{JS}(p \,\|\, p') = \frac{1}{2}\big(D_{KL}(p \,\|\, \bar{p}) + D_{KL}(p' \,\|\, \bar{p})\big) \qquad (9)$$

where p̄ is the average of the two distributions.
Given a set of N distributions {p1, p2, . . . , pN}, where each one corresponds
to a run of a given feature ranking algorithm, we can use the Jensen-Shannon di-
vergence to measure the similarity among the distributions produced by different
runs of the feature ranking algorithm, which can be expressed as

$$D_{JS}(p_1, \ldots, p_N) = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(p_i \,\|\, \bar{p}) \qquad (10)$$
or alternatively as

$$D_{JS}(p_1, \ldots, p_N) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{l} p_{ij} \log \frac{p_{ij}}{\bar{p}_i} \qquad (11)$$

with pij being the probability assigned to feature i in the ranking output j and
p̄i the average probability assigned to feature i.
We look for a stability measure based on the Jensen-Shannon divergence
(SJS) that fulfills some constraints:
– It falls in the interval [0, 1].
– It takes the value zero for completely random rankings.
– It takes the value one for stable rankings.
The stability metric SJS (Stability based on the Jensen-Shannon divergence) is
given by

$$S_{JS}(p_1, \ldots, p_N) = 1 - \frac{D_{JS}(p_1, \ldots, p_N)}{D^{*}_{JS}(p_1, \ldots, p_N)} \qquad (12)$$

where DJS is the Jensen-Shannon divergence among the N ranking outcomes
and D*JS is the divergence value for a ranking generation that is completely
random.
In a random setting, p̄i = 1/l, which leads to a constant value D*JS:

$$D^{*}_{JS}(p_1, \ldots, p_N) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{l} p_{ij} \log(p_{ij}\, l) = \frac{1}{N}\, N \sum_{i=1}^{l} p_i \log(p_i\, l) = \sum_{i=1}^{l} p_i \log(p_i\, l) \qquad (13)$$

where pi is the probability assigned to a feature with rank ri. Note that this
maximum value depends exclusively on the number of features and it can be
computed beforehand with the mapping provided by (7).
It is easy to check that:
– For a completely stable ranking algorithm, pij = p̄i in (11). That is, the rank
of feature j is the same in any run i of the feature ranking algorithm. This
leads to DJS = 0 and a stability metric SJS = 1.
– A random ranking will lead to DJS = D*JS and therefore SJS = 0.
– For any ranking neither completely stable nor completely random, the simi-
larity metric SJS ∈ (0, 1). The closer to 1, the more stable the algorithm is.
where k is the length of the sublist and l the total number of features.
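Putting (10)–(13) together, the metric is a few lines of numpy. The sketch below assumes each of the N ranking outcomes has already been mapped to a probability distribution over the l features via the rank-to-probability mapping of (7) (not reproduced in this excerpt), so that every row is a permutation of the same strictly positive profile:

    import numpy as np

    def d_js(P):
        # Eq. (11): D_JS = (1/N) * sum_j sum_i p_ij * log(p_ij / pbar_i).
        # P has shape (N, l): one probability distribution per ranking outcome,
        # with all entries strictly positive.
        P = np.asarray(P, dtype=float)
        pbar = P.mean(axis=0)                 # average distribution
        return float(np.sum(P * np.log(P / pbar)) / P.shape[0])

    def d_js_star(P):
        # Eq. (13): divergence of a completely random ranking generation,
        # sum_i p_i * log(p_i * l). Any row of P works, since every outcome
        # is a permutation of the same probability profile.
        p = np.asarray(P, dtype=float)[0]
        return float(np.sum(p * np.log(p * p.size)))

    def s_js(P):
        # Eq. (12): S_JS = 1 - D_JS / D*_JS; 1 for stable, 0 for random rankings.
        return 1.0 - d_js(P) / d_js_star(P)

For identical rankings in every run, d_js returns 0 and s_js returns 1; for rankings drawn uniformly at random, d_js approaches d_js_star and s_js approaches 0, as required by the constraints above.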
4 Empirical Study
4.1 Illustration on Artificial Outcomes
In this experiment we evaluate the stability metric SJS for the outcomes of
hypothetical feature ranking algorithms. We generate sets of N = 100 rankings
of l = 2000 features. We simulate several feature ranking (FR) algorithms:
– FR-0 with 100 random rankings, that is, a completely random FR algorithm
– FR-1 with one fixed output, and 99 random rankings.
– FR-2 with two identical fixed outputs, and 98 random rankings.
Fig. 1. SJS metric and Spearman rank correlation for Feature Ranking (FR) techniques
that vary from completely random (FR-0 on the left) to completely stable (FR-100 on
the right)
Fig. 1 shows the SJS and the SR for Feature Ranking (FR) techniques that vary
from completely random (FR-0, on the left) to completely stable (FR-100 on
the right). For the FR-0 method, the stability metric based on Jensen-Shannon
divergence SJS takes the value 0, while its value is 1 for the stable FR-100
algorithm. Note that SJS takes similar values to the Spearman rank correlation
coefficient SR .
Assume we have now some Feature Selection (FS) techniques whose stability
needs to be assessed. These FS methods (FS-0, FS-1, ..., FS-100) have been
obtained from the corresponding FR techniques described above, extracting the
top-k features (k = 600). In the same way, they vary smoothly from a completely
random FS algorithm (FS-0) to a completely stable one (FS-100). The
Jensen-Shannon metric SJS together with the Kuncheva Index (KI) are depicted
for top-600 lists in Fig. 2. Note that the SJS metric applied to top-k lists
provides similar values to the KI metric. The Jensen-Shannon based measure SJS
can be applied to full ranked lists and partial lists, while the KI is only suitable
for partial lists and the SR only for full ranked lists.
Generating partial ranked feature lists is an intermediate option between: (a)
generating and comparing full ranked feature lists that are, in general, very long
and (b) extracting sublists with the top-k features, but with no information
about the importance of each feature. The SJS metric based on the Jensen-
Shannon divergence also allows to compare these partial ranked lists (as well as
top-k lists and full ranked lists). Consider we have sets of sublists with the 600
most important features out of 2000 features. We generated several sets of lists:
Fig. 2. SJS metric and the KI for Feature Selection (FS) techniques that vary from
completely random (FS-0 on the left) to completely stable (FS-100 on the right). The
metrics work on top-k lists with k = 600.
some of them show high differences in the lowest ranked features while others
show high differences in the highest ranked features. The same sublist can come
either with the ranking information (partial ranked lists) or with no information
about the feature importance (top-k lists). The overlap among the lists is around
350 features. Fig. 3 shows the values of SJS (partial ranked lists), SJS (top-k
lists) and the Kuncheva index (top-k lists) for these lists.
Even though the lists have the same average overlap (350 features), some of
them show more discrepancy about which are the top features (Fig. 3, on the
right), while other sets show more differences at the bottom of the list. The
KI cannot handle this information since it only works with top-k lists and,
therefore, it assigns the same value to these very different situations. When
SJS works at this level (top-k lists), it also gives the same measure for all the
scenarios. SJS can also handle the information provided in partial ranked
lists, considering the importance of the features and therefore assigning a lower
stability value to those sets of lists with high differences at the top of the
lists, that is, with high discrepancy about the most important features. Likewise,
it assigns a higher stability value to those sets where the differences appear
in the least important features, but there is more agreement about the most
important features. Fig. 3 illustrates this fact: SJS (partial ranked lists)
varies according to the location of the differences in the list, while SJS (top-k
lists) and the KI assign the same value regardless of where the discrepancies
appear.
Consider also the following situation, where the most important 600 features
out of 2000 have been extracted and the overlap among the top-600 lists is 100%.
We have evaluated several scenarios:
Fig. 3. SJS (partial ranked lists), SJS (top-k lists) and the Kuncheva index (top-k
lists) for Feature Selection (FS) techniques that extract the top-600 features out of
2000. The overlap among the lists is around 350 common features. The situations vary
smoothly from sets of partial lists with differences at the bottom of the list (left) to
sets of lists that show high differences at the top of the list (right).
Working with top-k lists (KI), the stability metrics provide a value of 1, which
is somewhat misleading considering the different scenarios that may appear. It
seems natural that, even though all runs agree on the 600 most important features,
the stability metric should be lower than 1 when there is low agreement about
which are the most important ones. The SJS measure allows working with
partial ranked lists and therefore establishes differences between these scenar-
ios. Fig. 4 shows the SJS (partial ranked lists) and the SJS and KI (top-k lists),
highlighting this fact. SJS (partial ranked lists) takes a value slightly higher
than 0.90 in a situation where there is complete agreement about which are the
most important 600 features, but complete discrepancy about their importance.
Its value increases to 1 as the randomness in the feature ranking assignment de-
creases. In contrast, KI would assign a value of 1, which may be misleading
when studying the stability issue.
The new measure has been used to experimentally assess the stability of four
standard feature selectors: three based on a filter approach (χ², Information
Gain Ratio (GR) [4], and Relief) and another based on the parameter values of
an independent classifier (Decision Rule 1R) [21].
Fig. 4. SJS (top-k lists) and SJS (partial ranked lists) for Feature Selection (FS) tech-
niques that extract the top-600 features out of 2000. The overlap among the sublists
with 600 features is complete. The ranking assigned to each feature varies from FS tech-
niques for which it is random (left) to FS techniques for which each feature ranking is
identical in each sublist (right).
We have conducted some experiments on a real data set of omental fat sam-
ples collected from carcasses of suckling lambs [18]. The whole dataset has 134
instances: 66 from lambs being fed with a milk replacer (MR), while the other
68 are reared on ewe milk (EM). Authentication of the type of feeding will be a
key issue in the certification of suckling lamb carcasses, with the rearing system
being responsible for the difference in prices and quality. The use of spectroscopy
for the discrimination of fat samples according to the rearing system provides
several advantages, mainly its speed and versatility. Determining which regions
of the spectrum have more discriminant power is also fundamental for the vet-
erinarian professionals. All FTIR spectra were recorded from 4000 to 750 cm⁻¹
with a resolution of 4 cm⁻¹, which leads to a total of 1687 features. The average
spectra for both classes are shown in Fig. 5.
The dataset was randomly split into ten folds, launching the feature ranking
algorithm with nine out of the ten folds, in a consecutive way. Five runs of this
process resulted in a total of N = 50 rankings. Feature ranking was carried out
with WEKA [21] and the computation of the stability with MATLAB [17].
The SJS (full ranked list) measure gives an overall view of the stability. The
results (Table 1) indicate that, in the case of the spectral data, the most stable
methods seem to be Relief and GR, while 1R appears as the one with the least
global stability.
The metric SJS also enables an analysis focused on the top ranked or selected
features. Fig. 6 depicts the SJS for a given number of the top-k selected features
(continuous line) and the SJS for the top-k ranked features (dashed line).
Fig. 5. Average FT-IR spectrum of omental fat samples for Milk Replacer (MR) and
Ewe Milk (EM)
Table 1. Stability of several feature selectors evaluated with the similarity measure
based on the Jensen-Shannon divergence (SJ S ) on a set of 50 rankings
The differences between SJS for top-k lists and top-k ranked lists are explained
by the fact that in the latter, differences/similarities in the lowest ranks are
attached less importance than differences/similarities in the highest ranks. Thus,
the results show that the four feature selectors share a common trend: SJS (top-k)
assigns a lower value of stability, which may sometimes be substantially different.
For the 1R feature selector, for instance, SJS (ranked top-400) is 0.82, but it
drops to 0.70 when all features are given a uniform weight. This is explained by
the fact that many differences appear at the bottom of the list, and when they
are given the same importance as differences at the top of the list, the stability
measure drops considerably.
When we focus on the top-k (selected/ranked) features and the value of k is
low, the feature selectors are quite stable. For example, for k = 10, SJS takes
the value 0.92 for χ2 , 0.73 for 1R, 0.92 for GR and 0.91 for Relief.
The plots in Fig. 6 show that the stability decreases as the cardinality
of the feature subset increases for the feature selection techniques 1R, χ² and
GR, while Relief shows a profile with high stability regardless of the size of
the sublist. While, looking at the whole picture, GR is as stable as Relief in
general terms, when we focus on lists with the most important features, Relief's
robustness does not decrease as the feature subset size increases.
The proposed metric SJS can be compared with Spearman's rank corre-
lation coefficient (SR) when it comes to measuring the stability of full ranked
lists. Likewise, it can be compared with Kuncheva's stability index (KI) if partial
lists are considered. Note, however, that SJS is suitable for either setting.
Fig. 6. Feature selection methods 1R, χ², GR and Relief applied on the Omental Fat
Spectra Dataset. Stability measure SJS for feature subsets with different cardinality.
Table 2. Stability of several feature selectors evaluated with the similarity measure
based on the Spearman’s rank correlation coefficient (SR ) on a set of 50 rankings.
Measuring the robustness with SR and KI requires the computation of
50(50 − 1)/2 pairwise similarities for each algorithm, to end up averaging these
computations as stated in Eq. (6). According to the SR values recorded in
Table 2, Relief appears as the most stable (0.94) ranking algorithm, whereas 1R
is quite unstable (0.79). When SJS works on the full ranked lists, it gives a
stability indication similar to SR, and the findings derived from them are not
contradictory. When SJS works on the top-k lists, its value is similar to that
provided by KI (see Fig. 7), which allows one to see SJS as a generalized metric
that can work not only with full ranked lists or top-k lists, but also with top-k
ranked lists, while the others are restricted to a particular list format.
Fig. 7. Feature selection methods 1R, χ², GR and Relief applied on the Omental Fat
Spectra Dataset. Stability measures SJS and KI for top-k lists with different cardinality.
5 Conclusions
The robustness of the feature ranking techniques used for knowledge discovery
is an issue of recent interest. In this work, we consider the problem of feature
selection/ranking stability and propose a metric based on the Jensen-Shannon
divergence (SJS ) able to capture the disagreement among the lists generated
in different runs by a feature ranking algorithm from different perspectives: (a)
considering the full ranked feature lists, (b) focusing on the top-k features, that is
to say, lists that contain the k most relevant features giving a uniform importance
to all of them, and (c) considering partial ranked lists that retain the most relevant
features together with the ranking information.
The new metric SJS shows the relative amount of randomness of the rank-
ing/selection algorithm independently of the sublist size and, unlike other met-
rics that evaluate pairwise similarities, SJS directly evaluates the whole set of
lists (of the same size). To our knowledge, no metrics have been proposed
so far to measure the similarity between partial ranked feature lists. Moreover,
the new measure accepts whatever representation of the feature selection output
and its behavior is: (i) close to Spearman's rank correlation coefficient for full
ranked lists and (ii) similar to the Kuncheva’s index for top-k lists. If the ranking
is taken into account, the differences at the top of the list would be considered
more important than differences that appear at the bottom part.
The stability of feature selection algorithms opens a wide area of research
that includes the development of more robust feature selection techniques, their
correlation with classifier performance and different approaches to analyze ro-
bustness. The proposal of visual techniques to ease the stability analysis and
the exploration of committee-based feature selectors is our immediate future
research.
References
1. Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., Saeys, Y.: Robust biomarker
identification for cancer diagnosis with ensemble feature selection methods. Bioin-
formatics 26(3), 392 (2010)
2. Aslam, J., Pavlu, V.: Query Hardness Estimation Using Jensen-Shannon Diver-
gence Among Multiple Scoring Functions. In: Amati, G., Carpineto, C., Romano,
G. (eds.) ECiR 2007. LNCS, vol. 4425, pp. 198–209. Springer, Heidelberg (2007)
3. Boulesteix, A.-L., Slawski, M.: Stability and aggregation of ranked gene lists.
Briefings in Bioinformatics 10(5), 556–568 (2009)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley and Sons,
Chichester (2001)
5. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to instability problems with
sequential wrapper-based approaches to feature selection. Technical Report
TCD-CS-2002-28, Trinity College Dublin (2002)
6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
7. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature Extraction: Foundations
and Applications. Studies in Fuzziness and Soft Computing. Springer-Verlag New
York, Inc., Secaucus (2006)
8. He, Z., Yu, W.: Stable feature selection for biomarker discovery. Technical Report
arXiv:1001.0887 (January 2010)
9. Jurman, G., Merler, S., Barla, A., Paoli, S., Galea, A., Furlanello, C.: Algebraic
stability indicators for ranked lists in molecular profiling. Bioinformatics 24(2), 258
(2008)
10. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms. In:
Fifth IEEE International Conference on Data Mining, p. 8. IEEE, Los Alamitos
(2005)
11. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a
study on high-dimensional spaces. Knowledge and Information Systems 12, 95–116
(2007), doi:10.1007/s10115-006-0040-8
12. Kullback, S., Leibler, R.: On information and sufficiency. The Annals of Mathe-
matical Statistics 22(1), 79–86 (1951)
13. Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of the 25th
IASTED International Multi-Conference: Artificial Intelligence and Applications,
pp. 390–395. ACTA Press (2007)
14. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions
on Information Theory 37(1), 145–151 (1991)
612 R. Guzmán-Martı́nez and R. Alaiz-Rodrı́guez
15. Loscalzo, S., Yu, L., Ding, C.: Consensus group stable feature selection. In: Proceed-
ings of the 15th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD 2009, pp. 567–576 (2009)
16. Lustgarten, J.L., Gopalakrishnan, V., Visweswaran, S.: Measuring Stability of Fea-
ture Selection in Biomedical Datasets. In: AMIA Annual Symposium Proceedings,
vol. 2009, p. 406. American Medical Informatics Association (2009)
17. MATLAB. version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts
(2010)
18. Osorio, M.T., Zumalacárregui, J.M., Alaiz-Rodríguez, R., Guzmán-Martínez, R.,
Engelsen, S.B., Mateo, J.: Differentiation of perirenal and omental fat quality
of suckling lambs according to the rearing system from Fourier transform mid-
infrared spectra using partial least squares and artificial neural networks. Meat
Science 83(1), 140–147 (2009)
19. Saeys, Y., Abeel, T., Van de Peer, Y.: Robust Feature Selection Using Ensemble Feature
Selection Techniques. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML
PKDD 2008, Part II. LNCS (LNAI), vol. 5212, pp. 313–325. Springer, Heidelberg
(2008)
20. Somol, P., Novovicova, J.: Evaluating stability and comparing output of feature
selectors that optimize feature subset cardinality. IEEE Transactions on Pattern
Analysis and Machine Intelligence 32, 1921–1939 (2010)
21. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques with Java Implementations. Morgan Kaufmann, San Francisco (1999)
22. Zucknick, M., Richardson, S., Stronach, E.A.: Comparing the characteristics of gene
expression profiles derived by univariate and multivariate classification methods.
Statistical Applications in Genetics and Molecular Biology 7(1), 7 (2008)
Mining Actionable Partial Orders in
Collections of Sequences
1 Introduction
Mining subsequence patterns (gaps between the symbols are allowed) in a col-
lection of sequences is one of the most important data mining frameworks, with
many applications including the analysis of time-related processes, telecommu-
nications, bioinformatics, business, software engineering, Web click-stream mining,
etc. [1]. The framework was first introduced in [2] as the problem of sequential
pattern mining and defined as follows. Given a collection of itemset-sequences
(sequence database of transactions of variable lengths) and a minimum frequency
(support) threshold, the task is to find all subsequence patterns, occurring across
the itemset-sequences in the collection, whose relative frequency is greater than
the minimum frequency threshold. Although state-of-the-art mining algorithms
can efficiently derive a complete set of frequent sequential patterns under cer-
tain constraints, including mining closed sequential patterns [3] and maximal
sequential patterns [4], the set of discovered sequential patterns is still too large
for practical use [1], since it usually contains a large fraction of non-significant
and redundant patterns [5]. As a solution to this problem, a method for ranking
sequential patterns with respect to significance was presented in [6].
Another line of research that addresses the limitations of sequential pattern
mining is partial order mining [7], [8], [9], where a partial order on a set of items
(poset) is an ordering relation between the items in the set. The relation is called
Fig. 1. An example partial order over the items {a, b, c, d}, with ordering relation
R = {(a, b), (a, c), (d, c)}

Fig. 2. The set of total orders (linear extensions) summarized by the partial order in
Figure 1: [a, b, d, c], [a, d, b, c], [a, d, c, b], [d, a, b, c], [d, a, c, b]
partial order to reflect the fact that not every pair of elements of a poset is
related. Thus, in general the relation can be of three types: (I) empty, meaning
there is no ordering information between the items; (II) partial; and (III) total,
corresponding to a sequence. A partial order can be represented as a Directed Acyclic Graph
(DAG), where the nodes correspond to items and the directed edges represent
the ordering relation between the items. Figure 1 presents an example partial
order for items {a, b, c, d}. The main appeal of partial orders is to provide a
more compact representation, where a single partial order can represent the same
ordering information between co-occurring items in the collection of sequences
as a set of sequential patterns. As an example of this property of partial orders,
consider the partial order in Figure 1 and the corresponding set of total orders
that it summarizes in Figure 2. Now imagine that the set of the total orders
is the input to algorithm PrefixSpan [10] for sequential pattern mining. Setting
the minimum relative support threshold minRelSup = 0.2, we obtain twenty-
three frequent sequential patterns, while only one partial order is required to
summarize the ordering information expressed by the set of sequences in Figure 2.
However, in practice, even a discovered set of frequent closed partial orders (using
algorithm Frecpo [9]) is still too large for effective use. Therefore, we address
this problem by proposing a method for ranking partial orders with respect to
significance that extends our previous work on ranking sequential patterns [6].
Using the ranking framework we can discover a small set of frequent partial
orders that can be actionable (informative) for a domain expert of the analyzed
data, who would like to turn the knowledge of the discovered patterns into an action
in his domain (e.g., to optimize web-based marketing for a marketing expert).
The main thrust of our approach is that the expected relative frequency of a par-
tial order in a given reference model can be computed from an exact formula in
one step. In this paper we are interested only in over-represented partial orders
and our algorithm for ranking partial orders with respect to significance works
as follows: (I) we find the complete set of frequent closed partial orders for a
given minimum support threshold using a variant of algorithm Frecpo [9]; (II)
we compute their expected relative frequencies and variances in the reference
model and (III) we rank the frequent closed partial orders with respect to sig-
nificance by computing the divergence (Z-score) between the observed and the
expected relative frequencies. Given the reference model, a significant divergence
between an observed and the expected frequency indicates that there is a de-
pendency between occurrences of items in the corresponding partial order. Note
that the only reason we use the minimum relative support threshold is that
we are interested in over-represented patterns. In particular, we set a value of
the threshold that is low enough to discover low-support significant patterns and
high enough that the discovered significant patterns are actionable
for marketing specialists. Given the rank values, we discover actionable partial
orders by first pruning non-significant and redundant partial orders and then
ordering the remaining partial orders with respect to significance.
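As a concrete illustration of step (III), the following minimal sketch ranks
frequent posets by the divergence between observed and expected relative
support. The helper names sig_rank and rank_partial_orders are ours, and the
normal approximation of the observed relative support (variance p(1 - p)/n
under the independence reference model) is an assumption for illustration, not
the paper's exact variance formula.

import math

def sig_rank(observed_rsup, expected_rsup, n):
    # Z-score between the observed and the expected relative support of a
    # pattern across n input sequences. Under the independence reference model
    # each sequence contains the pattern independently with probability
    # expected_rsup, so the observed relative support is approximately normal
    # with variance expected_rsup * (1 - expected_rsup) / n.
    var = expected_rsup * (1.0 - expected_rsup) / n
    return (observed_rsup - expected_rsup) / math.sqrt(var)

def rank_partial_orders(frequent, n):
    # frequent: list of (poset, observed_rsup, expected_rsup) triples produced
    # by a closed-partial-order miner plus the reference-model computation.
    scored = [(sig_rank(o, e, n), poset) for poset, o, e in frequent]
    # keep only significantly over-represented posets (e.g., z > 1.96),
    # most significant first
    significant = [t for t in scored if t[0] > 1.96]
    return sorted(significant, key=lambda t: t[0], reverse=True)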
As a reference model for the input collection of sequences we use an indepen-
dence mixture model, which corresponds to a generative process that generates a
collection of sequences as follows: (I) it first generates the size w of a sequence to
be generated from the distribution of sizes in the input collection of sequences
and (II) it generates a sequence of items of size w from the distribution of items
in the input collection of sequences. So for such a reference model the expected
relative frequency refers to the average relative frequency in randomized col-
lections of sequences, where the marginal distribution of symbols and sequence
length are the same as in the input collection of sequences. Note that the refer-
ence model can be easily extended to Markov models in the spirit of [12]. The
reasons we consider the reference model to be the independence model in this
paper are as follows: (I) it has an intuitive interpretation as a method for dis-
covering dependencies; (II) it leads to a polynomial algorithm for computing the
expected relative frequencies of partial orders in the reference model and (III) it
is a reasonable model for our real data set as explained in Section 4.
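A sampler for this reference model can be sketched in a few lines; the function
name and the use of empirical marginals are ours, under the assumption that
Monte Carlo sampling (rather than the paper's exact one-step formula) is
acceptable for illustration.

import random
from collections import Counter

def independence_mixture_sample(collection, num_sequences):
    # Draw a randomized collection whose sequence-size distribution and symbol
    # marginals match the input collection, per steps (I) and (II) above.
    sizes = [len(s) for s in collection]
    counts = Counter(sym for s in collection for sym in s)
    symbols = list(counts)
    weights = [counts[a] for a in symbols]
    sample = []
    for _ in range(num_sequences):
        w = random.choice(sizes)                               # step (I): draw a size
        sample.append(random.choices(symbols, weights, k=w))   # step (II): draw items
    return sample

Averaging a pattern's relative support over many such samples approximates the
expected relative frequency that the exact formula computes in one step.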
Our ranking framework builds on the work [6] by extending it to the case of
arbitrary partial orders, where [6] provided a formula for the expected relative
frequency of a sequential pattern, i.e., a pattern that occurs in a collection of
itemset-sequences of variable lengths. This paper also fills an important
gap between [11] and [13] by providing a formula for the expected frequency
of an arbitrary partial order in a sequence of size w, where [11] analyzed the
expected frequency of a serial pattern and [13] analyzed the expected frequency
of sets of subsequence patterns including the special case of the parallel pattern
(all permutations of a set of symbols).
In finding the formula for the expected frequency of a partial order we were
inspired by the works in [14],[15], [16] that considered the problem of enumer-
ating all linear extensions of a poset, where a linear extension is a total order
satisfying a partial order. For example, given the partial order in Figure 1 the set
of sequences in Figure 2 is the set of all linear extensions of that partial order.
The challenges in analyzing partial orders in comparison to the previous works
are as follows: (I) analytic challenge: arbitrary partial orders can have complex
dependencies between the items that complicate the probabilistic analysis and
(II) algorithmic challenge: an arbitrary poset may correspond to a large set of
linear extensions, and computing the probability of such a set is more computa-
tionally demanding than for a single pattern.
The contributions of this paper are as follows: (I) it provides the first method
for ranking partial orders with respect to significance that leads to an algo-
rithm for mining actionable partial orders and (II) it is the first application of
significant partial orders in web usage mining.
In experiments conducted on a collection of visits to a website of a multina-
tional technology and consulting firm, we show the applicability of our framework
to discovering partial orders of frequently visited webpages that can be actionable
in optimizing the effectiveness of web-based marketing.
The paper is organized as follows. Section 2 reviews theoretical foundations,
Section 3 presents our framework for mining actionable partial orders in collec-
tions of sequences, Section 4 presents experimental results and finally Section 5
presents conclusions.
2 Foundations
Let $A = \{a_1, a_2, \ldots, a_{|A|}\}$ be an alphabet of size $|A|$, and let
$S = \{s^{(1)}, s^{(2)}, \ldots, s^{(n)}\}$ be a collection of sequences of size $n = |S|$, where
$s^{(i)} = [s^{(i)}_1, s^{(i)}_2, \ldots, s^{(i)}_{n^{(i)}}]$ is the $i$-th sequence of length $n^{(i)}$ and $s^{(i)}_t \in A$.
A sequence $s = [s_1, s_2, \ldots, s_m]$ is a subsequence of a sequence
$s' = [s'_1, s'_2, \ldots, s'_{m'}]$, denoted $s \sqsubseteq s'$, if there exist integers
$1 \le i_1 < i_2 < \cdots < i_m \le m'$ such that $s_1 = s'_{i_1}$, $s_2 = s'_{i_2}$, \ldots, $s_m = s'_{i_m}$.
We also say that $s'$ is a supersequence of $s$ and that $s$ is contained in $s'$.
The support (frequency) of a sequence $s$ in $S$, denoted $\mathrm{sup}_S(s)$, is defined
as the number of sequences in $S$ that contain $s$ as a subsequence. The relative
support (relative frequency) $\mathrm{rsup}_S(s) = \frac{\mathrm{sup}_S(s)}{|S|}$ is the fraction of sequences
that contain $s$ as a subsequence. Given a set of symbols $\{s_1, \ldots, s_m\}$, we
distinguish the following two extreme types of occurrence as a subsequence in
another sequence $s'$: (I) the serial pattern, denoted $s = [s_1, \ldots, s_m]$, meaning
that the symbols must occur in that order, and (II) the parallel pattern, denoted
$s = \{s_1, \ldots, s_m\}$, meaning that the symbols may occur in any order. We use
the term sequential pattern for a serial pattern in the sequential pattern mining
framework, where the pattern occurs across a collection of input sequences.
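For concreteness, containment and relative support can be computed as follows
(a short sketch; the greedy left-to-right scan is a standard, correct test for
subsequence containment):

def is_subsequence(pattern, seq):
    # True if pattern occurs in seq as a subsequence (gaps allowed);
    # the scan consumes seq exactly once.
    it = iter(seq)
    return all(sym in it for sym in pattern)

def rsup(pattern, collection):
    # Relative support: the fraction of sequences containing the pattern.
    return sum(is_subsequence(pattern, s) for s in collection) / len(collection)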
The probability that a pattern $s$ occurs in a sequence generated by the reference
model is obtained by conditioning on the sequence size:

$$P^{\exists}(s) = \sum_{w=|s|}^{M} \alpha_w \cdot P^{\exists}(s \mid w), \qquad (1)$$

where $M$ is the maximum sequence size and $\alpha_w$ is the probability of a sequence
of size $w$ in the reference model.
Fig. 3. $G_<(e)$ for the serial pattern $e = [a, b, c]$, with states $0, 1, 2, 3$, where $A$ is a
finite alphabet of size $|A| > 3$ and $\overline{\{a_j\}} = A \setminus \{a_j\}$ is the set complement of
element $a_j \in A$
The DFA for a serial pattern $e = [e_1, e_2, \ldots, e_m]$, called $G_<(e)$, whose example
is presented in Figure 3, has the following components [11]. The initial state is $0$,
the accepting state is $m$, and the states excluding the initial state correspond to
the indexes of the symbols in $e$. The powers $n_i$ for $i = 1, 2, \ldots, m$ denote the
number of times the $i$-th self-loop is used on the path from state $0$ to $m$. Thus,
$P^{\exists}(e \mid w)$ is equal to the sum of the probabilities of all distinct paths of length
$w$ from state $0$ to $m$ in $G_<(e)$. $P^{\exists}(e \mid w)$ for a serial pattern in a 0-order
Markov reference model can be expressed as follows:
$$P^{\exists}(e \mid w) = P(e) \sum_{i=0}^{w-m} \;\; \sum_{\sum_{k=1}^{m} n_k = i} \;\; \prod_{k=1}^{m} \bigl(1 - P(e_k)\bigr)^{n_k}, \qquad (2)$$

where $P(e) = \prod_{i=1}^{m} P(e_i)$ and $P(e_i)$ is the probability of symbol $e_i$ in the
reference model; (2) can be evaluated in $O(w^2)$ using a dynamic programming
algorithm [11].
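A direct dynamic program over the states of $G_<(e)$ yields the same quantity;
the sketch below is ours (it propagates the state distribution step by step rather
than evaluating (2) literally):

def prob_serial_exists(pattern_probs, w):
    # pattern_probs[j] = P(e_{j+1}), the probability of the (j+1)-th pattern
    # symbol under the 0-order reference model.
    # State j of G_<(e) = number of pattern symbols matched so far.
    m = len(pattern_probs)
    state = [0.0] * (m + 1)
    state[0] = 1.0
    for _ in range(w):
        nxt = [0.0] * (m + 1)
        for j in range(m):
            nxt[j] += state[j] * (1.0 - pattern_probs[j])   # self-loop at state j
            nxt[j + 1] += state[j] * pattern_probs[j]       # advance to state j+1
        nxt[m] += state[m]                                  # accepting state absorbs
        state = nxt
    return state[m]

This runs in O(mw) time, consistent with the O(w^2) bound quoted above since
m <= w.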
The DFA for a parallel pattern $e = \{e_1, e_2, \ldots, e_m\}$, called $G_{\|}(e)$, whose
example is presented in Figure 4 for $e = \{a, b, c\}$, has the following compo-
nents [13]. The initial state is $\{\emptyset\}$, the accepting state is $\{1, 2, \ldots, m\}$, and the
states excluding the initial state correspond to the non-empty subsets of the in-
dexes of symbols in $e$. Let $E(e)$ be the set of serial patterns corresponding
to all permutations of the symbols in $e$. Let $X$ be the set of all distinct simple
paths (i.e., without self-loops) from the initial state to the accepting state, and
let $Edges(path)$ be the sequence of edges on a path $path \in X$. Then clearly
$E(e) = \{Edges(path) : path \in X\}$ in $G_{\|}(e)$. $P^{\exists}(e \mid w)$ for a parallel pattern
in a 0-order Markov reference model can be computed in $O(m!\,w^2)$ by evaluating (2)
for each member of $E(e)$, where in place of $1 - P(e_k)$ we use the probability of
the complement of the corresponding self-loop labels of the states on the path in $G_{\|}(e)$.
Fig. 4. $G_{\|}(e)$ for the parallel pattern $e = \{a, b, c\}$: the states are the subsets of
matched symbol indexes, from the initial state $\{\emptyset\}$ to the accepting state $\{1, 2, 3\}$,
where $A$ is a finite alphabet of size $|A| > 3$ and $\overline{\{a_j\}} = A \setminus \{a_j\}$ is the set
complement of element $a_j \in A$
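Instead of enumerating all m! permutations, the state distribution of $G_{\|}(e)$ can
be propagated directly over subsets of matched symbols; this $O(2^m w)$ sketch is
our reformulation (it assumes the pattern symbols are distinct):

def prob_parallel_exists(symbol_probs, w):
    # symbol_probs: dict mapping each (distinct) pattern symbol to its
    # probability under the 0-order reference model.
    # State = frozenset of pattern symbols matched so far.
    dist = {frozenset(): 1.0}
    for _ in range(w):
        nxt = {}
        for seen, p in dist.items():
            remaining = [a for a in symbol_probs if a not in seen]
            stay = 1.0 - sum(symbol_probs[a] for a in remaining)
            nxt[seen] = nxt.get(seen, 0.0) + p * stay        # self-loop
            for a in remaining:                              # match one more symbol
                s = seen | {a}
                nxt[s] = nxt.get(s, 0.0) + p * symbol_probs[a]
        dist = nxt
    return dist.get(frozenset(symbol_probs), 0.0)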
For the partial order $P$ in Figure 1, $A(P) = \{a, b, c, d\}$, $R(P) = \{(a, b), (a, c), (d, c)\}$,
and the set of linear extensions $E(P)$ is presented in Figure 2. We use the terms
poset and partial order interchangeably in the case where all elements of a poset
are part of the relation (e.g., as in Figure 1). Clearly, a serial pattern is a total
order and a parallel pattern is a trivial order. A graph $G$ is said to be transitive
if, for every pair of vertices $u$ and $v$ such that there is a directed path in $G$
from $u$ to $v$, the edge $(u, v)$ belongs to $G$. The transitive closure $G^T$ of $G$ is
the least subset of $V \times V$ which contains $G$ and is transitive. A graph $G^t$ is a
transitive reduction of a directed graph $G$ whenever the following two conditions
are satisfied: (I) there is a directed path from vertex $u$ to vertex $v$ in $G^t$ if and
only if there is a directed path from $u$ to $v$ in $G$, and (II) there is no graph with
fewer arcs than $G^t$ satisfying condition (I).
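The two graph operations can be sketched as follows (our illustration, with the
order given as an adjacency dict of successor sets; for DAGs the transitive
reduction is unique):

def transitive_closure(nodes, succ):
    # Warshall-style reachability: reach[u] = set of vertices reachable from u.
    reach = {u: set(succ.get(u, ())) for u in nodes}
    for k in nodes:
        for u in nodes:
            if k in reach[u]:
                reach[u] |= reach[k]
    return reach

def transitive_reduction(nodes, succ):
    # Drop edge (u, v) if v is reachable from u through another successor w.
    reach = transitive_closure(nodes, succ)
    reduced = {u: set() for u in nodes}
    for u in nodes:
        for v in succ.get(u, ()):
            if not any(v in reach[w] for w in succ[u] if w != v):
                reduced[u].add(v)
    return reduced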
A partial order $s$ is contained in a partial order $s'$, denoted $s \sqsubseteq s'$, if
$s \subseteq G^T(s')$. We also say that $s'$ is a super partial order of $s$. The relative
support of a partial order in $S$ is defined analogously, via the containment
relation. Let $s_{serial}$ be a serial pattern, $s_{partial}$ a partially ordered pattern,
and $s_{parallel}$ the parallel pattern over the same set of symbols $A'$. Then the
following property holds in the independence reference model:
$$P^{\exists}(s_{serial} \mid w) \;\le\; P^{\exists}(s_{partial} \mid w) \;\le\; P^{\exists}(s_{parallel} \mid w). \qquad (3)$$
Thus, (3) follows from the fact that in a random sequence of size $w$ a serial
pattern is the least likely to occur, since it corresponds to only one linear extension,
while the parallel pattern is the most likely to occur, since it corresponds to $m!$
linear extensions. So the probability of existence of a partially ordered pattern,
depending on the size of its ordering relation, is at least $P^{\exists}(s_{serial} \mid w)$ and at
most $P^{\exists}(s_{parallel} \mid w)$.
$P^{\exists}(P \mid w)$ is then the sum of the probabilities of all paths in $L_w$, where $L_w$
is the set of all distinct paths of length $w$, including self-loops, from the initial
state to any of the accepting states.
– The initial state is $[0, \ldots, 0]$, and each of the $n = |E(P)|$ accepting states
corresponds to a member of $E(P)$.
– Each non-initial state is $[i^{(1)}, \ldots, i^{(n)}]$, where $i^{(j)}$ denotes the matched prefix
length of serial pattern $s^{(j)} \in E(P)$ in a prefix tree of the members of $E(P)$.
– A self-loop from state $[i^{(1)}, \ldots, i^{(n)}]$ to itself exists and has label equal to
  • $A$ if $\exists j: i^{(j)} = |A(P)|$, and
  • $A \setminus \bigcup_j \{s^{(j)}_{i^{(j)}+1}\}$ if $\forall j: i^{(j)} < |A(P)|$.
Fig. 6. $G_{\le}(P)$ for the partial order $P = \{a \to b,\; c\}$: the states are vectors of
matched prefix lengths over the linear extensions of $P$, from the initial state $[0, 0, 0]$
to the accepting states $[3, 2, 1]$, $[2, 3, 1]$ and $[2, 1, 3]$, where $A$ is a finite alphabet
of size $|A| > 3$ and $\overline{\{a_j\}} = A \setminus \{a_j\}$ is the set complement of element $a_j \in A$
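Under the independence model, the same existence probability can also be
obtained without building the prefix-tree automaton, by propagating a
distribution over down-sets (order ideals) of the poset. The sketch below is our
alternative formulation, not the paper's $G_{\le}(P)$ construction; matching
greedily is safe because a larger matched down-set never hurts future matching.

def prob_poset_exists(items, preds, probs, w):
    # preds[x]: set of items that must occur before x in the poset;
    # probs[x]: probability of item x under the 0-order reference model.
    # A state is the frozenset of already-matched items (always a down-set).
    dist = {frozenset(): 1.0}
    for _ in range(w):
        nxt = {}
        for done, p in dist.items():
            # items whose predecessors are all matched may be matched next
            ready = [x for x in items if x not in done and preds[x] <= done]
            stay = 1.0 - sum(probs[x] for x in ready)
            nxt[done] = nxt.get(done, 0.0) + p * stay
            for x in ready:
                s = done | {x}
                nxt[s] = nxt.get(s, 0.0) + p * probs[x]
        dist = nxt
    return dist.get(frozenset(items), 0.0)

For P = {a -> b, c} this reproduces the probability of reaching any accepting
state of $G_{\le}(P)$.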
4 Experiments
We conducted the experiments on a collection of Web server access logs to a
website of a multinational technology and consulting firm. The access logs consist
of two fundamental data objects:
1. pageview (the most basic level of data abstraction): an aggregate repre-
sentation of a collection of Web objects (frames, graphics, and scripts) con-
tributing to the display on a user's browser resulting from a single user action
(such as a mouse-click). Viewable pages generally include HTML, text/PDF
files and scripts, while image, sound and video files are not considered viewable.
2. visit (session): an interaction by an individual with a website consisting of
one or more requests for a page (pageviews). If an individual takes no further
action (typically an additional pageview on the site) within a specified time
period, the visit terminates. In our data set the visit timeout value is 30 minutes,
and a visit may not exceed a length of 24 hours. Thus, a visit is a sequence of
pageviews (a click-stream) by a single user; a sessionizer is sketched below.
The purpose of the experiments was to show that the discovered partial orders of
visited pages can be actionable in optimizing the effectiveness of web-based mar-
keting. The data contains access logs to the website of the division of the company
that deals with Business Services (BS) and spans the period from 1-01-2010 to
1-08-2010. The main purpose of web-based marketing for that division is to
advertise business services to potential customers arriving at the website from
a search engine using relevant keywords in their query strings. Therefore, we
considered the following subset of all visits: visits referred from a search engine
considered the following subset of all visits: visits referred from a search engine
(Google, Yahoo, Bing, etc.), having a valid query string and at least four distinct
pages among their pageviews. As a result we obtained a collection of sequences
Fig. 7. Partial order at rank 1, where sigRank = 2.30e+02 and the relative frequency
$\Omega/n$ = 3.09e−02
Fig. 8. Partial order at rank 6 over the pages bs_main, ba_main, ba_people and
ba_work, where sigRank = 4.74e+01 and the relative frequency $\Omega/n$ = 1.56e−02
Fig. 9. Partial order at rank 7 over the pages ba_ideas, ba_main, ba_people, bs_main
and ba_work, where sigRank = 4.51e+01 and the relative frequency $\Omega/n$ = 1.02e−02
Fig. 10. Partial order at rank 12 over the pages bs_main, financial_management,
sales_marketing_services and human_capital_management, where sigRank = 3.52e+01
and the relative frequency $\Omega/n$ = 1.03e−02
including the BA work and ideas pages until they arrive at the BA people page.
This pattern may suggest that by emphasizing the link to the BA people page on
the BA/BS main pages, users would get directly to the BA people page.
Figure 10 presents the partial order at rank 12, which is the first in decreasing
order of rank to contain the financial management and the human capital
management pages. This partial order may suggest that those two pages should
be accessible directly from the BS main page in order to increase their visibility.
Figure 11 presents the partial order at rank 24, which is the first in decreasing
order of rank to contain the strategy planning page. As it turns out, it
significantly co-occurs with the financial management, the human capital man-
agement and the sales marketing services pages. This pattern may suggest that
the three pages co-occurring with the strategy planning page should contain well-
exposed links to the strategy planning page to increase its visibility.
Figure 12 presents a comparison of $P^{\exists}(P)$ against the baselines $P^{\exists}(s_{serial})$
and $P^{\exists}(s_{parallel})$, where $s_{serial}$ is a serial and $s_{parallel}$ is the parallel pattern
over the symbols of $P$; a proper value of $P^{\exists}(s_{partial}) = P^{\exists}(P)$ should
satisfy (3). In order to compute $P^{\exists}(s_{serial})$ and $P^{\exists}(s_{parallel})$ for the discovered
Fig. 11. Partial order at rank 24 over the pages bs_main, strategy_planning,
financial_management, sales_marketing_services and human_capital_management,
where sigRank = 2.73e+01 and the relative frequency $\Omega/n$ = 1.02e−02
Fig. 12. $P^{\exists}(s_{partial}) = P^{\exists}(P)$ for the discovered partial orders in comparison to
their serialized/parallelized versions $P^{\exists}(s_{serial})$ / $P^{\exists}(s_{parallel})$, where, as expected
for a valid value of $P^{\exists}(s_{partial})$, $P^{\exists}(s_{serial}) \le P^{\exists}(s_{partial}) \le P^{\exists}(s_{parallel})$
is satisfied in all cases
Fig. 13. Number of frequent partial orders, number of final partial orders obtained from
our algorithm and effectiveness of pruning methods as a function of minRelSup. The
pruning methods are applied in the following order: significance pruning, bottom-up
pruning and top-down pruning
5 Conclusions
We presented a method for ranking partial orders with respect to significance
and an algorithm for mining actionable partial orders. In experiments conducted
on a collection of visits to a website of a multinational technology and consulting
firm, we showed the applicability of our framework to discovering partially or-
dered sets of frequently visited webpages that can be actionable in optimizing
the effectiveness of web-based marketing.
References
1. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and
future directions. Data Mining and Knowledge Discovery 15(1) (2007)
2. Agrawal, R., Srikant, R.: Mining sequential patterns. In: ICDE, pp. 3–14 (1995)
3. Yan, X., Han, J., Afshar, R.: CloSpan: Mining closed sequential patterns in large
datasets. In: SDM, pp. 166–177 (2003)
4. Guan, E., Chang, X., Wang, Z., Zhou, C.: Mining maximal sequential patterns. In:
2005 International Conference on Neural Networks and Brain, pp. 525–528 (2005)
5. Huang, X., An, A., Cercone, N.: Comparison of interestingness functions for learn-
ing web usage patterns. In: Proceedings of the Eleventh International Conference
on Information and Knowledge Management, CIKM 2002, pp. 617–620. ACM, New
York (2002)
6. Gwadera, R., Crestani, F.: Ranking sequential patterns with respect to signifi-
cance. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010.
LNCS(LNAI), vol. 6118, pp. 286–299. Springer, Heidelberg (2010)
628 R. Gwadera, G. Antonini, and A. Labbi
7. Mannila, H., Meek, C.: Global partial orders from sequential data. In: KDD 2000:
Proceedings of the sixth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 161–168. ACM, New York (2000)
8. Casas-Garriga, G.: Summarizing sequential data with closed partial orders. In:
Proceedings of the Fifth SIAM International Conference on Data Mining, April
2005, pp. 380–390 (2005)
9. Pei, J., Wang, H., Liu, J., Wang, K., Wang, J., Yu, P.S.: Discovering frequent
closed partial orders from strings. IEEE Transactions on Knowledge and Data
Engineering 18, 1467–1481 (2006)
10. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q.: Mining sequential
patterns by pattern-growth: The PrefixSpan approach. TKDE 16 (November 2004)
11. Gwadera, R., Atallah, M., Szpankowski, W.: Reliable detection of episodes in event
sequences. In: Third IEEE International Conference on Data Mining, pp. 67–74
(November 2003)
12. Gwadera, R., Atallah, M., Szpankowski, W.: Markov models for discovering signif-
icant episodes. In: SIAM International Conference on Data Mining, pp. 404–414
(April 2005)
13. Atallah, M., Gwadera, R., Szpankowski, W.: Detection of significant sets of episodes
in event sequences. In: Fourth IEEE International Conference on Data Mining, pp.
67–74 (October 2004)
14. Varol, Y.L., Rotem, D.: An algorithm to generate all topological sorting arrange-
ments. The Computer Journal 24(1), 83–84 (1981)
15. Pruesse, G., Ruskey, F.: Generating linear extensions fast. SIAM J. Comput. 23,
373–386 (1994)
16. Knuth, D.E., Szwarcfiter, J.L.: A structured program to generate all topological
sorting arrangements. Inf. Process. Lett.
A Game Theoretic Framework for Data Privacy
Preservation in Recommender Systems
1 Introduction
The need for expert recommender systems is ever increasing nowadays, due
to the massive amount of information and the abundance of choices available
for virtually any human activity that involves decision making and is connected,
explicitly or implicitly, to the Internet. Recommender systems arise in various
contexts, from providing personalized search results and targeted advertising,
to making social-network-related suggestions, to providing personalized sug-
gestions on various goods and services. Internet users become more and more
of the rating strategy of each user, such that no user can benefit, in terms of
improving its privacy, by unilaterally deviating from that strategy. User strategies
converge to the NEP after an iterative best-response strategy update; (iii) for
a hybrid recommendation system, we find that the NEP strategy for each user,
in terms of privacy preservation, is to declare a false rating for only one item,
the one that is highly ranked in his private profile and least correlated with the
items for which he anticipates recommendations; (iv) we present various modes
of cooperation by which users can mutually benefit. To the best of our knowledge,
this is the first work that applies the framework of game theory to address the
user interaction arising in privacy-preserving recommendation systems.
The rest of the paper is organized as follows. In Section 2 we present the
model and assumptions of our approach. Section 3 elaborates on the case of
a hybrid recommendation system. In Section 4 we obtain valuable insights into
conflict and cooperation in user interaction by analyzing the case of two users.
Section 5 includes numerical results, and Section 6 concludes our study.
Let us denote by q−i = (q1 , . . . , qi−1 , qi+1 , . . . , qN ) the declared rating vector
of all users except user i. Thus, ri = fi (qi , q−i ). Now, let r̃i = fi (pi , q−i ) be
the resulting recommendation vector if user i declared its true profile. Then, the
goal above is quantified by the following constraint for user i:
$$(r_i - \tilde{r}_i)^2 \le D \;\Longleftrightarrow\; \bigl[f_i(q_i, q_{-i}) - f_i(p_i, q_{-i})\bigr]^2 \le D, \qquad (1)$$
where D is an upper bound that denotes the maximum distortion that can
be tolerated in the recommendation by user i. We assume that all users are
characterized by the same such maximum tolerable distortion amount D.
subject to:

$$\bigl[f_i(q_i, q_{-i}) - f_i(p_i, q_{-i})\bigr]^2 \le D \qquad (3)$$
In other words, user i has to select its declared profile vector out of a set of
feasible profile vectors which satisfy (1). Nevertheless, this set of feasible vectors
is determined by declared profiles q−i of other users. Denote by F (q−i ) this
feasible set of vectors.
Notice that the problem stated above involves only the point of view of user
i, who behaves in a selfish but rational manner. That is, he cares only about
his own maximum privacy preservation, subject to keeping the quality of the
recommendation good enough, and he does not take into account the objectives
of other users. In his effort to optimally resolve this tradeoff, and in particular
to ensure that the recommendation vector will be close enough to the one he
would get under full privacy compromise, the strategies of other users matter.
These other users act in the same rational way, since they strive to fulfill
their own privacy preservation objectives while trying to maintain good-quality
recommendations for themselves.
The NEP is the point comprising the strategies of all users from which no user
will benefit by deviating unilaterally. In our problem, at the NEP $(q_1^*, \ldots, q_N^*)$,
no user $i$ can further increase his privacy preservation metric $g(\cdot)$ by altering
his declared profile to some $q_i \neq q_i^*$, provided that all other users keep their
NEP declared profiles.
where $Q^*$ is the NEP. Namely, a cooperation regime is feasible if (i) it belongs
to the set of feasible vectors specified by constraint (1) for all users, and (ii) each
user attains at least as much privacy as he receives at the NEP. The latter
requirement renders cooperation meaningful for the user and provides him with
the incentive to participate in the coordinated effort.
A first goal of cooperation is to jointly find the set of feasible cooperation
regimes, call it $F_c$. If $F_c \neq \emptyset$, there exists at least one joint strategy $Q_0$ such
that all users are privacy-wise better off compared to the NEP, and this strategy
can be found by solving the set of inequalities in (5). Out of the set of feasible
cooperation regimes, a further goal could be to select one that maximizes the
global privacy objective $G(P, Q)$ or one that guarantees certain properties of
the privacy preservation vector $(g(p_1, q_1), \ldots, g(p_N, q_N))$.
the recommendation server applies the following measure to compute the metric
$r_{i\ell}$ for items $\ell \notin S_i$ with $\ell \in S_j$ for some $j \neq i$, so as to rate them and
include them in the recommendation vector that is sent to each user $i$:

$$r_{i\ell} = \frac{1}{N-1} \sum_{j \neq i:\, \ell \in S_j} q_{j\ell} \;\cdot\; \frac{1}{|S_i|} \sum_{k \in S_i} \rho_{k\ell}\, q_{ik}, \qquad (6)$$
where $\rho_{k\ell} \in [0, 1]$ is the correlation between items $k$ and $\ell$. The server computes
the metric above for all $\ell \notin S_i$ and forms the vector $r_i$. We assume that the
$|I| \times |I|$ correlation matrix containing the pairwise correlations between any
two items in the system is computed a priori, is fixed, is preloaded to the server,
and is known by the user agents. For example, if the items are movies, the
correlation between two movies could be directly related to a common theme,
common starring actors, the director, or other attributes.
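A sketch of this computation follows; the function name and data layout (rating
arrays Q, viewed-item sets S, correlation matrix rho) are our assumptions, and
we read Eq. (6) as the product of its collaborative and content-based terms:

def recommend(i, Q, S, rho):
    # Q[j][l]: user j's declared rating of item l; S[j]: set of items viewed
    # by user j; rho[k][l]: correlation between items k and l.
    N = len(Q)
    scores = {}
    candidates = {l for j in range(N) if j != i for l in S[j]} - S[i]
    for l in candidates:
        raters = [j for j in range(N) if j != i and l in S[j]]
        collab = sum(Q[j][l] for j in raters) / (N - 1)               # first term
        content = sum(rho[k][l] * Q[i][k] for k in S[i]) / len(S[i])  # second term
        scores[l] = collab * content                                  # Eq. (6)
    return scores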
The recommendation metric above pertaining to user i can be viewed as an
instance of a hybrid recommendation. Indeed, the first term above implies a
collaborative filtering approach, in which, for each item under tentative rec-
ommendation to user i, the ratings of all other users are aggregated. On the
other hand, the second term can be viewed as representative of a content-based
recommendation approach, since it involves a correlation metric that connects
item (candidate for recommendation) with other items that user i has viewed.
Here, the aggregation function in the first part is taken to be simply the mean
rating of all other users $j \neq i$ who have already viewed the item. Clearly, various
modes of aggregating the ratings of other users can be employed. For example,
different weights may be applied in the aggregation, or only the ratings from a
subset of users may be taken into account, e.g., $K$ users who have viewed common
items with user $i$, where $K$ is a parameter of the recommendation server. These
$K$ users are denoted by the set $U_i$. In this case the first term would be equal to:
$$\frac{1}{K} \sum_{\substack{j \in U_i:\, |U_i| = K \\ S_i \cap S_j \neq \emptyset}} q_{j\ell}$$
The metric above reflects the intuitive fact that privacy preservation increases
as the Euclidean distance $\sqrt{\sum_{k \in S_i} (p_{ik} - q_{ik})^2}$ between the declared and
the private profiles increases. This distance is weighted by the private rating $p_{ik}$
so as to capture the fact that, among items whose private and declared ratings have the
$$\frac{1}{|S_i|} \sum_{k \in S_i} \sum_{\ell \notin S_i} \frac{1}{N-1} \sum_{j \neq i:\, \ell \in S_j} q_{j\ell} \,\cdot\, \rho_{k\ell}\, (q_{ik} - p_{ik})^2 \;\le\; D \qquad (8)$$
subject to constraint (8), which includes $\{q_j^{(t-1)}\}_{j \neq i}$ from the previous itera-
tion, and thus computes his own declared rating vector $q_i^{(t)}$ for the current
iteration $t$.
– STEP 3: Each agent declares its rating $q_i^{(t)}$ to the server.
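The loop around these steps can be sketched as follows; best_response, the
per-user optimization behind STEP 2 (e.g., a linear-program solver for the
problem with constraint (8)), is a hypothetical helper supplied by the caller:

def iterate_to_nep(P, best_response, max_iters=20, tol=1e-6):
    # P: list of private profile vectors; best_response(i, p_i, Q_prev)
    # returns user i's privacy-maximizing declared profile given the
    # declarations of the previous round.
    Q = [list(p) for p in P]                 # start from truthful declarations
    for _ in range(max_iters):
        Q_prev = [list(q) for q in Q]
        for i in range(len(P)):              # STEP 2: each agent best-responds
            Q[i] = best_response(i, P[i], Q_prev)
        change = max(abs(a - b) for q, qp in zip(Q, Q_prev)
                     for a, b in zip(q, qp))
        if change < tol:                     # declarations stabilized: NEP
            break
    return Q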
Fig. 1. Overview of the system architecture and of the data exchange process at each
iteration: each user $i$ declares a rating vector $q_i$ to the recommendation server, which
aggregates the ratings, calculates the recommendation vectors $r_i$, $i = 1, \ldots, N$, and
sends them back to the users
with

$$\beta_{ik} = \frac{1}{|S_i|} \sum_{\ell \notin S_i} \; \sum_{j \neq i:\, \ell \in S_j} q_{j\ell}\, \rho_{k\ell} \qquad (11)$$
where the factor of 2 is due to the $1/|S_1|$ factor on the left-hand side of the
inequality. Similarly, for user 2, the problem is:

$$\max_{x_{2B},\, x_{2C}} \; p_{2B} x_{2B} + p_{2C} x_{2C}, \quad \text{subject to:} \quad \rho_{AB} x_{2B} + \rho_{AC} x_{2C} \le \frac{D}{q_{1A}}. \qquad (17)$$
The NEP is the point $(x_1^*, x_2^*) = (x_{1A}^*, x_{1B}^*, x_{2B}^*, x_{2C}^*)$ that solves the two
problems above. Depending on the private rating vectors $p_1 = (p_{1A}, p_{1B})$,
$p_2 = (p_{2B}, p_{2C})$ and the correlations $\rho_{AC}, \rho_{BC}, \rho_{AB}$, we distinguish four cases:

$$\frac{\rho_{AC}}{p_{1A}} \lessgtr \frac{\rho_{BC}}{p_{1B}}, \quad \text{and} \quad \frac{\rho_{AB}}{p_{2B}} \lessgtr \frac{\rho_{AC}}{p_{2C}} \qquad (18)$$
Observe that user 1 will declare the true private rating for the item for which
the corresponding fraction above is larger, and will declare a different rating for
the item for which the fraction is smaller. Thus, he prefers to declare a different
rating for the item that is higher ranked and least correlated to item C, the
candidate for recommendation to him. For instance, if $\frac{\rho_{AC}}{p_{1A}} < \frac{\rho_{BC}}{p_{1B}}$,
user 1 maximizes his privacy by setting $x_{1B} = 0$, thus declaring $q_{1B} = p_{1B}$,
while $x_{1A} = \frac{2D}{\rho_{AC}\, q_{2C}}$.
For each of the four cases above, we have a respective NEP. For example, if

$$\frac{\rho_{AC}}{p_{1A}} < \frac{\rho_{BC}}{p_{1B}} \quad \text{and} \quad \frac{\rho_{AB}}{p_{2B}} < \frac{\rho_{AC}}{p_{2C}} \qquad (19)$$
then the NEP is:

$$x_1 = \Bigl(\frac{2D}{\rho_{AC}\, p_{2C}},\; 0\Bigr), \quad \text{and} \quad x_2 = \Bigl(\frac{2D}{\rho_{AB}\bigl(p_{1A} \pm \frac{2D}{\rho_{AC}\, p_{2C}}\bigr)},\; 0\Bigr) \qquad (20)$$

or, equivalently,

$$q_1^* = \Bigl(p_{1A} \pm \frac{2D}{\rho_{AC}\, p_{2C}},\; p_{1B}\Bigr), \qquad q_2^* = \Bigl(p_{2B} \pm \frac{2D}{\rho_{AB}\bigl(p_{1A} \pm \frac{2D}{\rho_{AC}\, p_{2C}}\bigr)},\; p_{2C}\Bigr), \qquad (21)$$

and the privacy metrics are $P_1 = p_{1A} x_{1A}$, $P_2 = p_{2B} x_{2B}$.
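As a worked check of Eqs. (20)-(21), the closed-form NEP for this case can be
evaluated directly; the sketch below is ours and, for concreteness, takes the '+'
branch of the ± signs:

def two_user_nep(p1A, p1B, p2B, p2C, rhoAB, rhoAC, rhoBC, D):
    # Closed-form NEP of Eqs. (20)-(21) for the case of Eq. (19).
    assert rhoAC / p1A < rhoBC / p1B and rhoAB / p2B < rhoAC / p2C
    x1A = 2 * D / (rhoAC * p2C)              # user 1 distorts item A only
    x2B = 2 * D / (rhoAB * (p1A + x1A))      # user 2 distorts item B only
    q1 = (p1A + x1A, p1B)                    # declared profile of user 1
    q2 = (p2B + x2B, p2C)                    # declared profile of user 2
    return q1, q2, (p1A * x1A, p2B * x2B)    # profiles and privacy metrics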
5 Numerical Results
Fig. 2. Privacy preservation for different users (user5, user10, user20, user50) as a
function of the maximum distortion tolerance in recommendation (D)
Fig. 3. Convergence of the iterative best-response strategy for different users (user5,
user10, user20)
ratings lie in the value interval [1, 5]. In Figure 2, we depict the privacy preser-
vation metric for different users as a function of the maximum distortion toler-
ance D in recommendation. We observe that as user tolerance to recommenda-
tion distortion increases, the privacy preservation metric also increases. This
confirms the tradeoff between privacy preservation and quality of recommen-
dation: users that are less tolerant to errors in the recommendation quality they
receive from the server have to reveal more information about their preferences.
Our approach is based on the iterative best-response process of data exchange
between users and the recommendation server that was discussed in Section 3.
Fig. 4. Convergence of the iterative best-response strategy for user 10 and different
values of D (D = 2, 4, 6)
Figure 3 depicts and verifies the convergence of the privacy preservation iteration
to the NEP for different users as the rating-vector exchange process progresses.
We then consider a specific user who varies the value of his/her maximum
distortion tolerance in recommendation (D). Figure 4 shows that the privacy
preservation iteration of a user converges to the NEP for different values of D
as well. It is clear that after a small number of iterations, usually no more than
2-3, the system converges and the privacy preservation metric of a user at the
NEP is determined. This fast convergence of the best-response update is a direct
consequence of the linear-programming type of problem that each user solves.
6 Conclusion
In this work, we took a first step towards characterizing the fundamental tradeoff
between privacy preservation and good quality recommendation. We introduced
a game theoretic framework for capturing the interaction and conflicting inter-
ests of users in the context of privacy preservation in recommendation systems.
Viewed abstractly from the perspective of each user, the privacy preservation
problem that arises in the process of deciding about the declared profile reduces
to that of placing the declared rating vector sufficiently far away from the actual,
private vector. The constraint on having recommendation quality close enough
to the one that would be achieved if the true profile was revealed, places a
constraint on the meaningful distance between the actual and declared profiles.
Nevertheless, the key challenge is that the extent to which this constraint is
satisfied depends on the declared profiles of other users as well, who in turn face
a similar profile-vector placement problem. We attempted to capture this inter-
action, we characterized the Nash Equilibrium Points, and we proposed various
modes of cooperation of users.
644 M. Halkidi and I. Koutsopoulos
References
1. Balabanovic, M., Shoham, Y.: Fab: Content-based collaborative recommendation.
Communications of the Association for Computing Machinery 40(3) (1997)
2. Berkovsky, S., Eytani, Y., Kuflik, T., Ricci, F.: Enhancing privacy and preserving
accuracy of a distributed collaborative filtering. In: Proc. of ACM RecSys (2007)
3. Canny, J.: Collaborative filtering with privacy. In: IEEE Symposium on Security
and Privacy (2002)
4. Cotter, P., Smyth, B.: PTV: Intelligent personalized tv guides. In: Proc. of
AAAI/IAAI (2002)
5. Kargupta, H., Das, K., Liu, K.: A game theoretic approach toward multi-party
privacy-preserving distributed data mining. In: Proc. of PKDD (2007)
6. Lathia, N., Hailes, S., Capra, L.: Private distributed collaborative filtering using
estimated concordance measures. In: Proc. of ACM RecSys (2007)
7. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item col-
laborative filtering. IEEE Internet Computing 7(1) (2003)
8. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering
for improved recommendations. In: Proc. of the National Conference on Artificial
Intelligence (2002)
9. Melville, P., Sindhwani, V.: Recommender Systems, Encyclopedia of Machine
Learning. Springer, Heidelberg (2010)
10. Miller, B., Konstan, J.A., Riedl, J.: Pocketlens: Toward a personal recommender
system. ACM Transactions on Information Systems 22(3) (2004)
11. Mooney, R.J., Roy, L.: Content-based book recommending using learning for text
categorization. In: Proc. of ACM Conf. on Digital Libraries (2000)
12. Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V.: Algorithmic Game Theory.
Cambridge University Press, Cambridge (2007)
13. Polat, H., Du, W.: Privacy-preserving collaborative filtering using randomized per-
turbation techniques. In: Proc. of Inter. Conf. on Data Mining, ICDM (2003)
14. Resnick, P., Iacovou, N., Sushak, M., Bergstrom, M., Reidl, J.: Grouplens: An
open architecture for collaborative filtering of netnews. In: Proc. of the Computer
Supported Cooperative Work Conference (1994)
15. Sarwar, B., Karypis, G., Konstan, J., Reidl, J.: Item-based collaborative filtering
recommendation algorithms. In: Proc. of the Inter. Conf. of WWW (2001)
16. Shokri, R., Pedarsani, P., Theodorakopoulos, G., Hubaux, J.P.: Preserving privacy
in collaborative filtering through distributed aggregation of offline profiles. In: Proc.
of ACM RecSys (2009)
17. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Ad-
vances in Artificial Intelligence (2009)