
Lecture Notes in Artificial Intelligence 6911

Subseries of Lecture Notes in Computer Science

LNAI Series Editors


Randy Goebel
University of Alberta, Edmonton, Canada
Yuzuru Tanaka
Hokkaido University, Sapporo, Japan
Wolfgang Wahlster
DFKI and Saarland University, Saarbrücken, Germany

LNAI Founding Series Editor


Jörg Siekmann
DFKI and Saarland University, Saarbrücken, Germany
Dimitrios Gunopulos, Thomas Hofmann,
Donato Malerba, and Michalis Vazirgiannis (Eds.)

Machine Learning and
Knowledge Discovery in Databases

European Conference, ECML PKDD 2011


Athens, Greece, September 5-9, 2011
Proceedings, Part I

Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada


Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors

Dimitrios Gunopulos
University of Athens, Greece
E-mail: dg@di.uoa.gr
Thomas Hofmann
Google Switzerland GmbH, Zurich, Switzerland
E-mail: thofmann@google.com
Donato Malerba
University of Bari “Aldo Moro”, Bari, Italy
E-mail: malerba@di.uniba.it
Michalis Vazirgiannis
Athens University of Economics and Business, Greece
E-mail: mvazirg@aueb.gr

ISSN 0302-9743 e-ISSN 1611-3349


ISBN 978-3-642-23779-9 e-ISBN 978-3-642-23780-5
DOI 10.1007/978-3-642-23780-5
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: Applied for

CR Subject Classification (1998): I.2, H.2.8, H.2, H.3, G.3, J.1, I.7, F.2.2, F.4.1

LNCS Sublibrary: SL 7 – Artificial Intelligence

© Springer-Verlag Berlin Heidelberg 2011


This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Welcome to ECML PKDD 2011

Welcome to the European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases (ECML PKDD 2011), held in
Athens, Greece, during September 5–9, 2011. ECML PKDD is an annual confer-
ence that provides an international forum for the discussion of the latest high-
quality research results in all areas related to machine learning and knowledge
discovery in databases as well as other innovative application domains. Over
the years it has evolved as one of the largest and most selective international
conferences in machine learning and data mining, the only one that provides a
common forum for these two closely related fields.
ECML PKDD 2011 included all the scientific events and activities expected of
a major conference. The scientific program consisted of technical presentations
of accepted
papers, plenary talks by distinguished keynote speakers, workshops and tutorials,
a discovery challenge track, as well as demo and industrial tracks. Moreover, two
co-located workshops were organized on related research topics. We expect that
all those scientific activities provide opportunities for knowledge dissemination,
fruitful discussions and exchange of ideas among people both from academia and
industry. Moreover, we hope that this conference will continue to offer a unique
forum that stimulates and encourages close interaction among researchers work-
ing on machine learning and data mining.
We were very happy to have the conference back in Greece for the first time since
1995, when ECML was successfully organized in Heraklion, Crete. However, this was the
first time that the joint ECML PKDD event was organized in Greece and, more
specifically, in Athens, with the conference venue boasting a superb location
under the Acropolis and in front of the Temple of Zeus. Besides the scientific
activities, the conference offered delegates an attractive range of social activities,
such as a welcome reception on the roof garden of the conference venue directly
facing the Acropolis hill, a poster session at “Technopolis” Gazi industrial park,
the conference banquet, and a farewell party at the new Acropolis Museum,
one of the most impressive archaeological museums worldwide, which included
a guided tour of the museum exhibits.
Several people worked hard together as a superb dedicated team to ensure
the successful organization of this conference. First, we would like to express our
thanks and deep gratitude to the PC Chairs Dimitrios Gunopulos, Thomas Hof-
mann, Donato Malerba and Michalis Vazirgiannis. They efficiently carried out
the enormous task of coordinating the rigorous hierarchical double-blind review
process that resulted in a rich, while at the same time, very selective scientific
program. Their contribution was crucial and essential in all phases and aspects of
the conference organization and it was by no means restricted only to the paper
review process. We would also like to thank the Area Chairs and Program Com-
mittee members for the valuable assistance they offered to the PC Chairs in their

timely completion of the review process under strict deadlines. Special thanks
should also be given to the Workshop Co-chairs, Bart Goethals and Katharina
Morik, the Tutorial Co-chairs, Fosca Giannotti and Maguelonne Teisseire, the
Discovery Challenge Co-chairs, Alexandros Kalousis and Vassilis Plachouras, the
Industrial Session Co-chairs, Alexandros Ntoulas and Michail Vlachos, the Demo
Track Co-chairs, Michelangelo Ceci and Spiros Papadimitriou, and the Best Pa-
per Award Co-chairs, Sunita Sarawagi and Michèle Sebag. We further thank the
keynote speakers, workshop organizers, the tutorial presenters and the organizers
of the discovery challenge.
Furthermore, we are indebted to the Publicity Co-chairs, Annalisa Appice
and Grigorios Tsoumakas, who developed and implemented an effective dissem-
ination plan and supported the Program Chairs in the production of the pro-
ceedings, and also to Margarita Karkali for the development, support and timely
update of the conference website. We further thank the members of the ECML
PKDD Steering Committee for their valuable help and guidance.
The conference was financially supported by the following generous sponsors
who are worthy of special acknowledgment: Google, Pascal2 Network, Xerox, Ya-
hoo Labs, COST-MOVE Action, Rapid-I, FP7-MODAP Project, Athena RIC /
Institute for the Management of Information Systems, Hellenic Artificial Intel-
ligence Society, Marathon Data Systems, and Transinsight. Additional support
was generously provided by Sony, Springer, and the UNESCO Privacy Chair Pro-
gram. This support gave us the opportunity to set low registration rates,
provide video-recording services and support students through travel grants for
attending the conference. The substantial effort of the Sponsorship Co-chairs,
Ina Lauth and Ioannis Kopanakis, was crucial in order to attract these spon-
sorships, and therefore, they deserve our special thanks. Special thanks should
also be given to the five organizing institutions, namely, University of Bari “Aldo
Moro”, Athens University of Economics and Business, University of Athens, Uni-
versity of Ioannina, and University of Piraeus for supporting in multiple ways
our task.
We would like to especially acknowledge the members of the Local Organiza-
tion team, Maria Halkidi, Despoina Kopanaki and Nikos Pelekis, for making all
necessary local arrangements and Triaena Tours & Congress S.A. for efficiently
handling finance and registrations. The essential contribution of the student vol-
unteers also deserves special acknowledgment.
Finally, we are indebted to all researchers who considered this conference as
a forum for presenting and disseminating their research work, as well as to all
conference participants, hoping that this event will stimulate further expansion
of research and industrial activities in machine learning and data mining.

July 2011 Aristidis Likas


Yannis Theodoridis
Preface

The European Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases (ECML PKDD 2011) took place in Athens,
Greece, during September 5–9, 2011. This year we completed the first decade
since the merger of the European Conference on Machine Learning and the
Principles and Practice of Knowledge Discovery in Databases conferences, which
as independent conferences date back to 1986 and 1997, respectively. During
this decade there has been an increasing integration of the two
fields, as reflected by the rising number of submissions of top-quality research re-
sults. In 2008 a single ECML PKDD Steering Committee was established, which
gathered senior members of both communities.
The ECML PKDD conference is a highly selective conference in both areas,
the leading forum where researchers in machine learning and data mining can
meet, present their work, exchange ideas, gain new knowledge and perspectives,
and motivate the development of new interesting research results. Although tra-
ditionally based in Europe, ECML PKDD is also a truly international conference
with rich and diverse participation.
In 2011, as in previous years, ECML PKDD followed a full week sched-
ule, from Monday to Friday. It featured six plenary invited talks by Rakesh
Agrawal, Albert-László Barabási, Christopher Bishop, Andrei Broder, Marco Gori
and Heikki Mannila. Monday and Friday were devoted to workshops selected
by Katharina Morik and Bart Goethals, and tutorials, organized and selected
by Fosca Giannotti and Maguelonne Teisseire. There was also an interesting in-
dustrial session, managed by Alexandros Ntoulas and Michail Vlachos, which
welcomed distinguished speakers from the ML and DM industry: Vasilis Agge-
lis, Radu Jurca, Neel Sundaresan and Olivier Verscheure. The 2011 discovery
challenge was organized by Alexandros Kalousis and Vassilis Plachouras.
The main conference program unfolded from Tuesday to Thursday, where
121 papers selected among 599 full-paper submissions were presented in the
technical parallel sessions and in a poster session open to all accepted papers.
The acceptance rate of 20% supports the traditionally high standards of the
joint conference. The selection process was assisted by 35 Area Chairs, each
supervising the reviews and discussions of about 17 papers, and by 270 members
of the Program Committee, with the help of 197 additional reviewers. While the
selection process was made particularly intense due to the very high number of
submissions, we are grateful and heartily thank all Area Chairs, members of the
Program Committee, and additional reviewers for their commitment and hard
work during the tight reviewing period.
The accepted papers covered a wide spectrum of topics. A significant portion
dealt with core issues such as supervised

and unsupervised learning, with some innovative contributions on fundamental
issues such as the clusterability of a dataset.
Other fundamental issues tackled by accepted papers include dimensionality
reduction, distance and similarity learning, model learning and matrix/tensor
analysis. In addition, there was a significant cluster of papers with valuable con-
tributions on graph mining, graphical models, hidden Markov models, kernel
methods, active and ensemble learning, semi-supervised and transductive learn-
ing, mining sparse representations, model learning, inductive logic programming,
and statistical learning.
A significant part of the program covered novel and timely applications of
data mining and machine learning in industrial domains, including: privacy-
preserving and discrimination-aware mining, spatiotemporal data mining, text
mining, topic modeling, learning from environmental and scientific data, Web
mining and Web search, link analysis, bio/medical data, data streams and sensor
data, ontology-based data, relational data mining, and learning from time-series
data.
In the past three editions of the joint conference, the two Springer journals
"Machine Learning" and "Data Mining and Knowledge Discovery" published the
top papers in two special issues printed in advance of the conference. These
papers were not included in the conference proceedings, so there was no double
publication of the same work. A novelty introduced this year was the post-
conference publication of the special issues in order to guarantee the expected
high-standard reviews for top-quality journals. Therefore, authors of selected
machine learning and data mining papers were invited to submit a significantly
extended version of their paper to the special issues. The selection was made
by the Program Chairs on the basis of exceptional scientific quality and high
impact on the field, as indicated by conference reviewers.
Following an earlier tradition, the Best Paper Chairs Sunita Sarawagi and
Michèle Sebag contributed to the selection of papers deserving the Best Pa-
per Awards and Best Student Paper Awards in Machine Learning and in Data
Mining, sponsored by Springer. As ECML PKDD completes 10 years of joint
organization, the PC Chairs, together with the Steering Committee, initiated a
10-year Award series. The award goes to the author(s) of the paper that appeared
in the ECML PKDD conference 10 years earlier and has had the most impact on
machine learning and data mining research since then. This year's first award
committee consisted of three PC Co-chairs (Dimitrios Gunopulos, Donato
Malerba and Michalis Vazirgiannis) and three Steering Committee members
(Wray Buntine, Bart Goethals and Michèle Sebag).
The conference also featured a demo track, managed by Michelangelo Ceci
and Spiros Papadimitriou; 11 demos out of 21 submitted were selected by a
Demo Track Program Committee, presenting prototype systems that illustrate
the high impact of machine learning and data mining applications in technology.
The demo descriptions are included in the proceedings. We further thank the
members of the Demo Track Program Committee for their efforts in timely
reviewing submitted demos.

Finally, we would like to thank the General Chairs, Aristidis Likas and Yannis
Theodoridis, for their critical role in the success of the conference, the Tutorial,
Workshop, Demo, Industrial Session, Discovery Challenge, Best Paper, and Local
Chairs, the Area Chairs and all reviewers, for their voluntary, highly dedicated
and exceptional work, and the ECML PKDD Steering Committee for their help
and support. Our last and warmest thanks go to all the invited speakers, the
speakers, all the attendees, and especially to the authors who chose to submit
their work to the ECML PKDD conference and thus enabled us to build up this
memorable scientific event.

July 2011 Dimitrios Gunopulos


Thomas Hofmann
Donato Malerba
Michalis Vazirgiannis
Organization

ECML/PKDD 2011 was organized by the Department of Computer Science—
University of Ioannina, Department of Informatics—University of Piraeus, De-
partment of Informatics and Telecommunications—University of Athens,
Google Inc. (Zurich), Dipartimento di Informatica—Università degli Studi di
Bari “Aldo Moro,” and Department of Informatics—Athens University of
Economics & Business.

General Chairs
Aristidis Likas University of Ioannina, Greece
Yannis Theodoridis University of Piraeus, Greece

Program Chairs
Dimitrios Gunopulos University of Athens, Greece
Thomas Hofmann Google Inc., Zurich, Switzerland
Donato Malerba University of Bari “Aldo Moro,” Italy
Michalis Vazirgiannis Athens University of Economics and Business,
Greece

Workshop Chairs
Katharina Morik University of Dortmund, Germany
Bart Goethals University of Antwerp, Belgium

Tutorial Chairs
Fosca Giannotti Knowledge Discovery and Delivery Lab, Italy
Maguelonne Teisseire TETIS Lab, Department of Information Systems,
and LIRMM Lab, Department of Computer Science,
France

Best Paper Award Chairs


Sunita Sarawagi Computer Science and Engineering,
IIT Bombay, India
Michèle Sebag University of Paris-Sud, France

Industrial Session Chairs


Alexandros Ntoulas Microsoft Research, USA
Michail Vlachos IBM Zurich Research Laboratory, Switzerland

Demo Track Chairs


Michelangelo Ceci University of Bari “Aldo Moro,” Italy
Spiros Papadimitriou Google Research

Discovery Challenge Chairs


Alexandros Kalousis University of Geneva, Switzerland
Vassilis Plachouras Athens University of Economics and Business,
Greece

Publicity Chairs
Annalisa Appice University of Bari “Aldo Moro,” Italy
Grigorios Tsoumakas Aristotle University of Thessaloniki, Greece

Sponsorship Chairs
Ina Lauth IAIS Fraunhofer, Germany
Ioannis Kopanakis Technological Educational Institute of Crete,
Greece

Organizing Committee
Maria Halkidi University of Piraeus, Greece
Despina Kopanaki University of Piraeus, Greece
Nikos Pelekis University of Piraeus, Greece

Steering Committee
José Balcázar
Francesco Bonchi
Wray Buntine
Walter Daelemans
Aristides Gionis
Bart Goethals
Dunja Mladenic
Katharina Morik
Michèle Sebag
John Shawe-Taylor

Area Chairs

Elena Baralis, Hendrik Blockeel, Francesco Bonchi, Gautam Das, Janez Demsar,
Amol Deshpande, Carlotta Domeniconi, Tapio Elomaa, Floriana Esposito,
Fazel Famili, Wei Fan, Peter Flach, Johannes Fürnkranz, Aristides Gionis,
George Karypis, Ravi Kumar, James Kwok, Stan Matwin, Michael May,
Taneli Mielikainen, Jian Pei, Yücel Saygin, Arno Siebes, Myra Spiliopoulou,
Jie Tang, Evimaria Terzi, Bhavani M. Thuraisingham, Hannu Toivonen,
Luis Torgo, Ioannis Tsamardinos, Panayiotis Tsaparas, Ioannis P. Vlahavas,
Haixun Wang, Stefan Wrobel, Xindong Wu

Program Committee

Foto Afrati, Aijun An, Aris Anagnostopoulos, Gennady Andrienko,
Ion Androutsopoulos, Annalisa Appice, Marta Arias, Ira Assent,
Vassilis Athitsos, Martin Atzmueller, Jose Luis Balcazar, Daniel Barbara,
Sugato Basu, Roberto Bayardo, Klaus Berberich, Bettina Berendt,
Michele Berlingerio, Michael Berthold, Indrajit Bhattacharya, Marenglen Biba,
Albert Bifet, Enrico Blanzieri, Konstantinos Blekas, Mario Boley,
Zoran Bosnic, Marco Botta, Jean-Francois Boulicaut, Pavel Bradzil,
Ulf Brefeld, Paula Brito, Wray Buntine, Toon Calders, Rui Camacho,
Longbing Cao, Michelangelo Ceci, Tania Cerquitelli, Sharma Chakravarthy,
Keith Chan, Vineet Chaoji, Keke Chen, Ling Chen, Xue-wen Chen,
Weiwei Cheng, Yun Chi, Silvia Chiusano, Vassilis Christophides,
Frans Coenen, James Cussens, Alfredo Cuzzocrea, Maria Damiani,
Atish Das Sarma, Tijl De Bie, Jeroen De Knijf, Colin de la Higuera,
Antonios Deligiannakis, Krzysztof Dembczynski, Anne Denton,
Christian Desrosiers, Wei Ding, Ying Ding, Debora Donato, Kurt Driessens,
Chris Drummond, Pierre Dupont, Saso Dzeroski, Tina Eliassi-Rad,
Roberto Esposito, Nicola Fanizzi, Fabio Fassetti, Ad Feelders,
Hakan Ferhatosmanoglou, Stefano Ferilli, Cesar Ferri, Daan Fierens,
Eibe Frank, Enrique Frias-Martinez, Elisa Fromont, Efstratios Gallopoulos,
Byron Gao, Jing Gao, Paolo Garza, Ricard Gavalda, Floris Geerts,
Pierre Geurts, Aris Gkoulalas-Divanis, Bart Goethals,
Vivekanand Gopalkrishnan, Marco Gori, Henrik Grosskreutz, Maxim Gurevich,
Maria Halkidi, Mohammad Hasan, Jose Hernandez-Orallo, Eyke Hüllermeier,
Vasant Honavar, Andreas Hotho, Xiaohua Hu, Ming Hua, Minlie Huang,
Marcus Hutter, Nathalie Japkowicz, Szymon Jaroszewicz, Daxin Jiang,
Alipio Jorge, Theodore Kalamboukis, Alexandros Kalousis, Panagiotis Karras,
Samuel Kaski, Ioannis Katakis, John Keane, Kristian Kersting, Latifur Khan,
Joost Kok, Christian Konig, Irena Koprinska, Walter Kosters,
Georgia Koutrika, Stefan Kramer, Raghuram Krishnapuram,
Marzena Kryszkiewicz, Nicolas Lachiche, Nada Lavrač, Wang-Chien Lee,
Feifei Li, Jiuyong Li, Juanzi Li, Tao Li, Chih-Jen Lin, Hsuan-Tien Lin,
Jessica Lin, Shou-de Lin, Song Lin, Helger Lipmaa, Bing Liu, Huan Liu,
Yan Liu, Corrado Loglisci, Chang-Tien Lu, Ping Luo, Panagis Magdalinos,
Giuseppe Manco, Yannis Manolopoulos, Simone Marinai, Dimitrios Mavroeidis,
Ernestina Menasalvas, Rosa Meo, Pauli Miettinen, Dunja Mladenic,
Marie-Francine Moens, Katharina Morik, Mirco Nanni, Alexandros Nanopoulos,
Benjamin Nguyen, Frank Nielsen, Siegfried Nijssen, Richard Nock,
Kjetil Norvag, Irene Ntoutsi, Salvatore Orlando, Gerhard Paass,
George Paliouras, Apostolos Papadopoulos, Panagiotis Papapetrou,
Stelios Paparizos, Dimitris Pappas, Ioannis Partalas,
Srinivasan Parthasarathy, Andrea Passerini, Vladimir Pavlovic,
Dino Pedreschi, Nikos Pelekis, Jing Peng, Ruggero Pensa,
Bernhard Pfahringer, Fabio Pinelli, Enric Plaza, George Potamias,
Michalis Potamias, Doina Precup, Kunal Punera, Chedy Raissi, Jan Ramon,
Huzefa Rangwala, Zbigniew Ras, Ann Ratanamahatana, Jan Rauch,
Matthias Renz, Christophe Rigotti, Fabrizio Riguzzi, Celine Robardet,
Marko Robnik-Sikonja, Pedro Rodrigues, Fabrice Rossi, Juho Rousu,
Celine Rouveirol, Ulrich Rückert, Salvatore Ruggieri, Stefan Ruping,
Lorenza Saitta, Ansaf Salleb-Aouissi, Claudio Sartori, Lars Schmidt-Thieme,
Matthias Schubert, Michèle Sebag, Thomas Seidl, Prithviraj Sen,
Andrzej Skowron, Carlos Soares, Yangqiu Song, Alessandro Sperduti,
Jerzy Stefanowski, Jean-Marc Steyaert, Alberto Suarez, Johan Suykens,
Einoshin Suzuki, Panagiotis Symeonidis, Marcin Szczuka, Andrea Tagarelli,
Domenico Talia, Pang-Ning Tan, Letizia Tanca, Lei Tang, Dacheng Tao,
Nikolaj Tatti, Martin Theobald, Dimitrios Thilikos, Jilei Tian, Ivor Tsang,
Grigorios Tsoumakas, Theodoros Tzouramanis, Antti Ukkonen, Takeaki Uno,
Athina Vakali, Giorgio Valentini, Maarten van Someren, Iraklis Varlamis,
Julien Velcin, Celine Vens, Jean-Philippe Vert, Vassilios Verykios,
Herna Viktor, Jilles Vreeken, Willem Waegeman, Jianyong Wang, Wei Wang,
Xuanhui Wang, Hui Xiong, Jieping Ye, Jeffrey Yu, Philip Yu,
Bianca Zadrozny, Gerson Zaverucha, Demetris Zeinalipour, Filip Zelezny,
Changshui Zhang, Kai Zhang, Kun Zhang, Min-Ling Zhang, Nan Zhang,
Shichao Zhang, Zhongfei Zhang, Junping Zhang, Ying Zhao, Bin Zhou,
Zhi-Hua Zhou, Kenny Zhu, Xingquan Zhu, Djamel Zighed, Indre Zliobaite,
Blaz Zupan

Additional Reviewers
Pedro Abreu, Raman Adaikkalavan, Artur Aiguzhinov, Darko Aleksovski,
Tristan Allard, Alessia Amelio, Panayiotis Andreou, Fabrizio Angiulli,
Josephine Antoniou, Andrea Argentini, Krisztian Balog, Teresa M.A. Basile,
Christian Beecks, Antonio Bella, Aurelien Bellet, Dominik Benz,
Thomas Bernecker, Alberto Bertoni, Jerzy Blaszczynski, Brigitte Boden,
Samuel Branders, Janez Brank, Agnès Braud, Giulia Bruno, Luca Cagliero,
Yuanzhe Cai, Ercan Canhas, Eugenio Cesario, George Chatzimilioudis,
Xi Chen, Anna Ciampi, Marek Ciglan, Tyler Clemons, Joseph Cohen,
Carmela Comito, Andreas Constantinides, Michael Corsello, Michele Coscia,
Gianni Costa, Claudia D'Amato, Mayank Daswani, Gerben de Vries,
Nicoletta Del Buono, Elnaz Delpisheh, Engin Demir, Nicola Di Mauro,
Ivica Dimitrovski, Martin Dimkovski, Huyen Do, Lucas Drumond,
Ana Luisa Duboc, Tobias Emrich, Saeed Faisal, Zheng Fang, Wei Feng,
Cèsar Ferri, Alessandro Fiori, Nuno A. Fonseca, Marco Frasca,
Christoph Freudenthaler, Sergej Fries, Natalja Friesen, David Fuhry,
Dimitrios Galanis, Zeno Gantner, Bo Gao, Juan Carlos Gomez, Alberto Grand,
Francesco Gullo, Tias Guns, Marwan Hassani, Zhouzhou He,
Daniel Hernández-Lobato, Tomas Horvath, Dino Ienco, Elena Ikonomovska,
Anca Ivanescu, Francois Jacquenet, Baptiste Jeudy, Dustin Jiang, Lili Jiang,
Maria Jose, Faisal Kamiran, Michael Kamp, Mehdi Kargar, Mehdi Kaytoue,
Jyrki Kivinen, Dragi Kocev, Domen Košir, Iordanis Koutsopoulos,
Hardy Kremer, Anastasia Krithara, Artus Krohn-Grimberghe, Onur Kucuktunc,
Tor Lattimore, Florian Lemmerich, Jun Li, Rong-Hua Li, Yingming Li,
Siyi Liu, Stefano Lodi, Claudio Lucchese, Gjorgji Madjarov,
M. M. Hassan Mahmud, Fernando Martínez-Plumed, Elio Masciari,
Michael Mathioudakis, Ida Mele, Corrado Mencar, Glauber Menezes,
Pasquale Minervini, Ieva Mitasiunaite-Besson, Folke Mitzlaff, Anna Monreale,
Gianluca Moro, Alessandro Moschitti, Yang Mu, Ricardo Ñanculef,
Franco Maria Nardini, Robert Neumayer, Phuong Nguyen, Stefan Novak,
Tim O'Keefe, Riccardo Ortale, Aline Paes, George Pallis, Luca Pappalardo,
Jérôme Paul, Daniel Paurat, Sergios Petridis, Darko Pevec,
Jean-Philippe Peyrache, Anja Pilz, Marc Plantevit, Matija Polajnar,
Giovanni Ponti, Adriana Prado, Xu Pu, ZhongAng Qi, Ariadna Quattoni,
Alessandra Raffaetà, Maria Jose Ramírez-Quintana, Hongda Ren,
Salvatore Rinzivillo, Ettore Ritacco, Matthew Robards,
Christophe Rodrigues, Giulio Rossetti, André Rossi, Cláudio Sá,
Venu Satuluri, Leander Schietgat, Marijn Schraagen, Wen Shao, Wei Shen,
A. Pedro Duarte Silva, Claudio Silvestri, Ivica Slavkov, Henry Soldano,
Yi Song, Miha Stajdohar, Anže Staric, Erik Strumbelj, Ilija Subasic,
Peter Sunehag, Izabela Szczech, Frank Takes, Aditya Telang,
Eleftherios Tiakas, Gabriele Tolomei, Marko Toplak, Daniel Trabold,
Roberto Trasarti, Paolo Trunfio, George Tsatsaronis, Guy Van den Broeck,
Antoine Veillard, Mathias Verbeke, Jan Verwaeren, Alessia Visconti,
Dimitrios Vogiatzis, Dawei Wang, Jun Wang, Louis Wehenkel, Adam Woznica,
Ming Yang, Qingyan Yang, Xintian Yang, Lan Zagar, Jure Žbontar,
Bernard Zenko, Pin Zhao, Yuanyuan Zhu, Albrecht Zimmermann, Andreas Züfle

Sponsoring Institutions
We wish to express our gratitude to the sponsors of ECML PKDD 2011 for their
essential contribution to the conference: Google, Pascal2 Network, Xerox, Yahoo
Labs, COST/MOVE (Knowledge Discovery from Moving Objects), FP7/MODAP
(Mobility, Data Mining, and Privacy), Rapid-I (report the future), Athena/IMIS
(Institute for the Management of Information Systems), EETN (Hellenic Arti-
ficial Intelligence Society), MDS Marathon Data Systems, Transinsight, SONY,
UNESCO Privacy Chair, Springer, Università degli Studi di Bari “Aldo Moro,”
Athens University of Economics and Business, University of Ioannina, National
and Kapodistrian University of Athens, and the University of Piraeus.
Table of Contents – Part I

Invited Talks (Abstracts)


Enriching Education through Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Rakesh Agrawal

Human Dynamics: From Human Mobility to Predictability . . . . . . . . . . . . 3


Albert-László Barabási

Embracing Uncertainty: Applied Machine Learning Comes of Age . . . . . . 4


Christopher Bishop

Highly Dimensional Problems in Computational Advertising . . . . . . . . . . . 5


Andrei Broder

Learning from Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6


Marco Gori

Permutation Structure in 0-1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


Heikki Mannila

Industrial Invited Talks (Abstracts)


Reading Customers Needs and Expectations with Analytics . . . . . . . . . . . 8
Vasilis Aggelis

Algorithms and Challenges on the GeoWeb . . . . . . . . . . . . . . . . . . . . . . . . . . 9


Radu Jurca

Data Science and Machine Learning at Scale . . . . . . . . . . . . . . . . . . . . . . . . 10


Neel Sundaresan

Smart Cities: How Data Mining and Optimization Can Shape Future
Cities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Olivier Verscheure

Regular Papers
Preference-Based Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Riad Akrour, Marc Schoenauer, and Michele Sebag

Constraint Selection for Semi-supervised Topological Clustering . . . . . . . . 28


Kais Allab and Khalid Benabdeslem

Is There a Best Quality Metric for Graph Clusters? . . . . . . . . . . . . . . . . . . 44


Hélio Almeida, Dorgival Guedes, Wagner Meira Jr., and
Mohammed J. Zaki

Adaptive Boosting for Transfer Learning Using Dynamic Updates . . . . . . 60


Samir Al-Stouhi and Chandan K. Reddy

Peer and Authority Pressure in Information-Propagation Models . . . . . . . 76


Aris Anagnostopoulos, George Brova, and Evimaria Terzi

Constrained Logistic Regression for Discriminative Pattern Mining . . . . . 92


Rajul Anand and Chandan K. Reddy

α-Clusterable Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108


Gerasimos S. Antzoulatos and Michael N. Vrahatis

Privacy Preserving Semi-supervised Learning for Labeled Graphs . . . . . . 124


Hiromi Arai and Jun Sakuma

Novel Fusion Methods for Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . 140


Muhammad Awais, Fei Yan, Krystian Mikolajczyk, and Josef Kittler

A Spectral Learning Algorithm for Finite State Transducers . . . . . . . . . . . 156


Borja Balle, Ariadna Quattoni, and Xavier Carreras

An Analysis of Probabilistic Methods for Top-N Recommendation in Collaborative Filtering . . . . . . . . . . . 172
Nicola Barbieri and Giuseppe Manco

Learning Good Edit Similarities with Generalization Guarantees . . . . . . . 188


Aurélien Bellet, Amaury Habrard, and Marc Sebban

Constrained Laplacian Score for Semi-supervised Feature Selection . . . . . 204


Khalid Benabdeslem and Mohammed Hindawi

COSNet: A Cost Sensitive Neural Network for Semi-supervised Learning in Graphs . . . . . . . . . . . 219
Alberto Bertoni, Marco Frasca, and Giorgio Valentini

Regularized Sparse Kernel Slow Feature Analysis . . . . . . . . . . . . . . . . . . . . 235


Wendelin Böhmer, Steffen Grünewälder, Hannes Nickisch, and
Klaus Obermayer

A Selecting-the-Best Method for Budgeted Model Selection . . . . . . . . . . . . 249


Gianluca Bontempi and Olivier Caelen

A Robust Ranking Methodology Based on Diverse Calibration of AdaBoost . . . . . . . . . . . 263
Róbert Busa-Fekete, Balázs Kégl, Tamás Éltető, and György Szarvas

Active Learning of Model Parameters for Influence Maximization . . . . . . 280


Tianyu Cao, Xindong Wu, Tony Xiaohua Hu, and Song Wang

Sampling Table Configurations for the Hierarchical Poisson-Dirichlet Process . . . . . . . . . . . 296
Changyou Chen, Lan Du, and Wray Buntine

Preference-Based Policy Iteration: Leveraging Preference Learning for Reinforcement Learning . . . . . . . . . . . 312
Weiwei Cheng, Johannes Fürnkranz, Eyke Hüllermeier, and
Sang-Hyeun Park

Learning Recommendations in Social Media Systems by Weighting Multiple Relations . . . . . . . . . . . 328
Boris Chidlovskii

Clustering Rankings in the Fourier Domain . . . . . . . . . . . . . . . . . . . . . . . . . . 343


Stéphan Clémençon, Romaric Gaudel, and Jérémie Jakubowicz

PerTurbo: A New Classification Algorithm Based on the Spectrum Perturbations of the Laplace-Beltrami Operator . . . . . . . . . . . 359
Nicolas Courty, Thomas Burger, and Johann Laurent

Datum-Wise Classification: A Sequential Approach to Sparsity . . . . . . . . . 375


Gabriel Dulac-Arnold, Ludovic Denoyer, Philippe Preux, and
Patrick Gallinari

Manifold Coarse Graining for Online Semi-supervised Learning . . . . . . . . 391


Mehrdad Farajtabar, Amirreza Shaban, Hamid Reza Rabiee, and
Mohammad Hossein Rohban

Learning from Partially Annotated Sequences . . . . . . . . . . . . . . . . . . . . . . . . 407


Eraldo R. Fernandes and Ulf Brefeld

The Minimum Transfer Cost Principle for Model-Order Selection . . . . . . . 423


Mario Frank, Morteza Haghir Chehreghani, and
Joachim M. Buhmann

A Geometric Approach to Find Nondominated Policies to Imprecise Reward MDPs . . . . . . . . . . . 439
Valdinei Freire da Silva and Anna Helena Reali Costa

Label Noise-Tolerant Hidden Markov Models for Segmentation: Application to ECGs . . . . . . . . . . . 455
Benoît Frénay, Gaël de Lannoy, and Michel Verleysen

Building Sparse Support Vector Machines for Multi-Instance Classification . . . . . . . . . . . 471
Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang

Lagrange Dual Decomposition for Finite Horizon Markov Decision Processes . . . . . . . . . . . 487
Thomas Furmston and David Barber

Unsupervised Modeling of Partially Observable Environments . . . . . . . . . 503


Vincent Graziano, Jan Koutník, and Jürgen Schmidhuber

Tracking Concept Change with Incremental Boosting by Minimization of the Evolving Exponential Loss . . . . . . . . . . . 516
Mihajlo Grbovic and Slobodan Vucetic

Fast and Memory-Efficient Discovery of the Top-k Relevant Subgroups in a Reduced Candidate Space . . . . . . . . . . . 533
Henrik Grosskreutz and Daniel Paurat

Linear Discriminant Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . 549


Quanquan Gu, Zhenhui Li, and Jiawei Han

DB-CSC: A Density-Based Approach for Subspace Clustering in Graphs with Feature Vectors . . . . . . . . . . . 565
Stephan Günnemann, Brigitte Boden, and Thomas Seidl

Learning the Parameters of Probabilistic Logic Programs from Interpretations . . . . . . . . . . . 581
Bernd Gutmann, Ingo Thon, and Luc De Raedt

Feature Selection Stability Assessment Based on the Jensen-Shannon Divergence . . . . . . . . . . . 597
Roberto Guzmán-Martínez and Rocío Alaiz-Rodríguez

Mining Actionable Partial Orders in Collections of Sequences . . . . . . . . . . 613


Robert Gwadera, Gianluca Antonini, and Abderrahim Labbi

A Game Theoretic Framework for Data Privacy Preservation in Recommender Systems . . . . . . . . . . . 629
Maria Halkidi and Iordanis Koutsopoulos

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645


Table of Contents – Part II

Regular Papers
Common Substructure Learning of Multiple Graphical Gaussian
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Satoshi Hara and Takashi Washio

Mining Research Topic-Related Influence between Academia and Industry . . . . . . . . . . . 17
Dan He

Typology of Mixed-Membership Models: Towards a Design Method . . . . 32


Gregor Heinrich

ShiftTree: An Interpretable Model-Based Approach for Time Series Classification . . . . . . . . . . . 48
Balázs Hidasi and Csaba Gáspár-Papanek

Image Classification for Age-related Macular Degeneration Screening Using Hierarchical Image Decompositions and Graph Mining . . . . . . . . . . . 65
Mohd Hanafi Ahmad Hijazi, Chuntao Jiang, Frans Coenen, and
Yalin Zheng

Online Structure Learning for Markov Logic Networks . . . . . . . . . . . . . . . . 81


Tuyen N. Huynh and Raymond J. Mooney

Fourier-Information Duality in the Identity Management Problem . . . . . . 97


Xiaoye Jiang, Jonathan Huang, and Leonidas Guibas

Eigenvector Sensitive Feature Selection for Spectral Clustering . . . . . . . . . 114


Yi Jiang and Jiangtao Ren

Restricted Deep Belief Networks for Multi-view Learning . . . . . . . . . . . . . . 130


Yoonseop Kang and Seungjin Choi

Motion Segmentation by Model-Based Clustering of Incomplete Trajectories . . . . . . . . . . . 146
Vasileios Karavasilis, Konstantinos Blekas, and Christophoros Nikou

PTMSearch: A Greedy Tree Traversal Algorithm for Finding Protein Post-Translational Modifications in Tandem Mass Spectra . . . . . . . . . . . 162
Attila Kertész-Farkas, Beáta Reiz, Michael P. Myers, and
Sándor Pongor

Efficient Mining of Top Correlated Patterns Based on Null-Invariant Measures . . . . . . . . . . . 177
Sangkyum Kim, Marina Barsky, and Jiawei Han

Smooth Receiver Operating Characteristics (smROC) Curves . . . . . . . . . 193


William Klement, Peter Flach, Nathalie Japkowicz, and Stan Matwin

A Boosting Approach to Multiview Classification with Cooperation . . . . 209


Sokol Koço and Cécile Capponi

ARTEMIS: Assessing the Similarity of Event-Interval Sequences . . . . . . . 229


Orestis Kostakis, Panagiotis Papapetrou, and Jaakko Hollmén

Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms . . . . . . . . . . . 245
Danai Koutra, Tai-You Ke, U. Kang, Duen Horng (Polo) Chau,
Hsing-Kuo Kenneth Pao, and Christos Faloutsos

Online Clustering of High-Dimensional Trajectories under Concept Drift . . . . . . . . . . . 261
Georg Krempl, Zaigham Faraz Siddiqui, and Myra Spiliopoulou

Gaussian Logic for Predictive Classification . . . . . . . . . . . . . . . . . . . . . . . . . 277


Ondřej Kuželka, Andrea Szabóová, Matěj Holec, and Filip Železný

Toward a Fair Review-Management System . . . . . . . . . . . . . . . . . . . . . . . . . 293


Theodoros Lappas and Evimaria Terzi

Focused Multi-task Learning Using Gaussian Processes . . . . . . . . . . . . . . . 310


Gayle Leen, Jaakko Peltonen, and Samuel Kaski

Reinforcement Learning through Global Stochastic Search in N-MDPs . . . . . . . . . . . 326
Matteo Leonetti, Luca Iocchi, and Subramanian Ramamoorthy

Analyzing Word Frequencies in Large Text Corpora Using Inter-arrival Times and Bootstrapping . . . . . . . . . . . 341
Jefrey Lijffijt, Panagiotis Papapetrou, Kai Puolamäki, and
Heikki Mannila

Discovering Temporal Bisociations for Linking Concepts over Time . . . . . 358


Corrado Loglisci and Michelangelo Ceci

Minimum Neighbor Distance Estimators of Intrinsic Dimension . . . . . . . . 374


Gabriele Lombardi, Alessandro Rozza, Claudio Ceruti,
Elena Casiraghi, and Paola Campadelli

Graph Evolution via Social Diffusion Processes . . . . . . . . . . . . . . . . . . . . . . 390


Dijun Luo, Chris Ding, and Heng Huang

Multi-Subspace Representation and Discovery . . . . . . . . . . . . . . . . . . . . . . . 405


Dijun Luo, Feiping Nie, Chris Ding, and Heng Huang

A Novel Stability Based Feature Selection Framework for k-means Clustering . . . . . . . . . . . 421
Dimitrios Mavroeidis and Elena Marchiori

Link Prediction via Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437


Aditya Krishna Menon and Charles Elkan

On Oblique Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453


Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff,
Ullrich Koethe, and Fred A. Hamprecht

An Alternating Direction Method for Dual MAP LP Relaxation . . . . . . . 470


Ofer Meshi and Amir Globerson

Aggregating Independent and Dependent Models to Learn Multi-label Classifiers . . . . . . . . . . . 484
Elena Montañés, José Ramón Quevedo, and Juan José del Coz

Tensor Factorization Using Auxiliary Information . . . . . . . . . . . . . . . . . . . . 501


Atsuhiro Narita, Kohei Hayashi, Ryota Tomioka, and
Hisashi Kashima

Kernels for Link Prediction with Latent Feature Models . . . . . . . . . . . . . . 517


Canh Hao Nguyen and Hiroshi Mamitsuka

Frequency-Aware Truncated Methods for Sparse Online Learning . . . . . . 533


Hidekazu Oiwa, Shin Matsushima, and Hiroshi Nakagawa

A Shapley Value Approach for Influence Attribution . . . . . . . . . . . . . . . . . . 549


Panagiotis Papapetrou, Aristides Gionis, and Heikki Mannila

Fast Approximate Text Document Clustering Using Compressive Sampling . . . . . . . . . . . 565
Laurence A.F. Park

Ancestor Relations in the Presence of Unobserved Variables . . . . . . . . . . . 581


Pekka Parviainen and Mikko Koivisto

ShareBoost: Boosting for Multi-view Learning with Performance Guarantees . . . . . . . . . . . 597
Jing Peng, Costin Barbu, Guna Seetharaman, Wei Fan,
Xian Wu, and Kannappan Palaniappan

Analyzing and Escaping Local Optima in Planning as Inference for Partially Observable Domains . . . . . . . . . . . 613
Pascal Poupart, Tobias Lang, and Marc Toussaint

Abductive Plan Recognition by Extending Bayesian Logic Programs . . . . 629


Sindhu Raghavan and Raymond J. Mooney

Higher Order Contractive Auto-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645


Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller,
Yoshua Bengio, Yann Dauphin, and Xavier Glorot

The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling . . . . . . . . . . . 661
Matteo Riondato, Mert Akdere, Uğur Çetintemel,
Stanley B. Zdonik, and Eli Upfal

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677


Table of Contents – Part III

Regular Papers
Sparse Kernel-SARSA(λ) with an Eligibility Trace . . . . . . . . . . . . . . . . . . . 1
Matthew Robards, Peter Sunehag, Scott Sanner, and
Bhaskara Marthi

Influence and Passivity in Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18


Daniel M. Romero, Wojciech Galuba, Sitaram Asur, and
Bernardo A. Huberman

Preference Elicitation and Inverse Reinforcement Learning . . . . . . . . . . . . 34


Constantin A. Rothkopf and Christos Dimitrakakis

A Novel Framework for Locating Software Faults Using Latent Divergences . . . . . . . . . . . 49
Shounak Roychowdhury and Sarfraz Khurshid

Transfer Learning with Adaptive Regularizers . . . . . . . . . . . . . . . . . . . . . . . 65


Ulrich Rückert and Marius Kloft

Multimodal Nonlinear Filtering Using Gauss-Hermite Quadrature . . . . . . 81


Hannes P. Saal, Nicolas M.O. Heess, and Sethu Vijayakumar

Active Supervised Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97


Avishek Saha, Piyush Rai, Hal Daumé III,
Suresh Venkatasubramanian, and Scott L. DuVall

Efficiently Approximating Markov Tree Bagging for High-Dimensional Density Estimation . . . . . . . . . . . 113
François Schnitzler, Sourour Ammar, Philippe Leray,
Pierre Geurts, and Louis Wehenkel

Resource-Aware On-line RFID Localization Using Proximity Data . . . . . 129


Christoph Scholz, Stephan Doerfel, Martin Atzmueller,
Andreas Hotho, and Gerd Stumme

On the Stratification of Multi-label Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145


Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas

Learning First-Order Definite Theories via Object-Based Queries . . . . . . . 159


Joseph Selman and Alan Fern

Fast Support Vector Machines for Structural Kernels . . . . . . . . . . . . . . . . . 175


Aliaksei Severyn and Alessandro Moschitti

Generalized Agreement Statistics over Fixed Group of Experts . . . . . . . . . 191


Mohak Shah

Compact Coding for Hyperplane Classifiers in Heterogeneous Environment . . . . . . . . . . . 207
Hao Shao, Bin Tong, and Einoshin Suzuki

Multi-label Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223


Chuan Shi, Xiangnan Kong, Philip S. Yu, and Bai Wang

Rule-Based Active Sampling for Learning to Rank . . . . . . . . . . . . . . . . . . . 240


Rodrigo Silva, Marcos A. Gonçalves, and Adriano Veloso

Parallel Structural Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256


Madeleine Seeland, Simon A. Berger, Alexandros Stamatakis, and
Stefan Kramer

Aspects of Semi-supervised and Active Learning in Conditional Random Fields . . . . . . . . . . . 273
Nataliya Sokolovska

Comparing Probabilistic Models for Melodic Sequences . . . . . . . . . . . . . . . 289


Athina Spiliopoulou and Amos Storkey

Fast Projections Onto ℓ1,q-Norm Balls for Grouped Feature Selection . . . . . . . . . . . 305
Suvrit Sra

Generalized Dictionary Learning for Symmetric Positive Definite Matrices with Application to Nearest Neighbor Retrieval . . . . . . . . . . . 318
Suvrit Sra and Anoop Cherian

Network Regression with Predictive Clustering Trees . . . . . . . . . . . . . . . . . 333


Daniela Stojanova, Michelangelo Ceci, Annalisa Appice, and
Sašo Džeroski

Learning from Label Proportions by Optimizing Cluster Model Selection . . . . . . . . . . . 349
Marco Stolpe and Katharina Morik

The Minimum Code Length for Clustering Using the Gray Code . . . . . . . 365
Mahito Sugiyama and Akihiro Yamamoto

Learning to Infer Social Ties in Large Networks . . . . . . . . . . . . . . . . . . . . . . 381


Wenbin Tang, Honglei Zhuang, and Jie Tang

Comparing Apples and Oranges: Measuring Differences between Data Mining Results . . . . . . . . . . . 398
Nikolaj Tatti and Jilles Vreeken

Learning Monotone Nonlinear Models Using the Choquet Integral . . . . . . 414


Ali Fallah Tehrani, Weiwei Cheng, Krzysztof Dembczyński, and
Eyke Hüllermeier

Feature Selection for Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430


Selen Uguroglu and Jaime Carbonell

Multiview Semi-supervised Learning for Ranking Multilingual Documents . . . . . . . . . . . 443
Nicolas Usunier, Massih-Reza Amini, and Cyril Goutte

Non-redundant Subgroup Discovery in Large and Complex Data . . . . . . . 459


Matthijs van Leeuwen and Arno Knobbe

Larger Residuals, Less Work: Active Document Scheduling for Latent Dirichlet Allocation . . . . . . . . . . . 475
Mirwaes Wahabzada and Kristian Kersting

A Community-Based Pseudolikelihood Approach for Relationship Labeling in Social Networks . . . . . . . . . . . 491
Huaiyu Wan, Youfang Lin, Zhihao Wu, and Houkuan Huang

Correcting Bias in Statistical Tests for Network Classifier Evaluation . . . 506


Tao Wang, Jennifer Neville, Brian Gallagher, and Tina Eliassi-Rad

Differentiating Code from Data in x86 Binaries . . . . . . . . . . . . . . . . . . . . . . 522


Richard Wartell, Yan Zhou, Kevin W. Hamlen,
Murat Kantarcioglu, and Bhavani Thuraisingham

Bayesian Matrix Co-Factorization: Variational Algorithm and Cramér-Rao Bound . . . . . . . . . . . 537
Jiho Yoo and Seungjin Choi

Learning from Inconsistent and Unreliable Annotators by a Gaussian Mixture Model and Bayesian Information Criterion . . . . . . . . . . . 553
Ping Zhang and Zoran Obradovic

iDVS: An Interactive Multi-document Visual Summarization System . . . 569


Yi Zhang, Dingding Wang, and Tao Li

Discriminative Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585


Yu Zhang and Dit-Yan Yeung

Active Learning with Evolving Streaming Data . . . . . . . . . . . . . . . . . . . . . . 597


Indrė Žliobaitė, Albert Bifet, Bernhard Pfahringer, and Geoff Holmes

Demo Papers
Celebrity Watch: Browsing News Content by Exploiting Social
Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Omar Ali, Ilias Flaounas, Tijl De Bie, and Nello Cristianini

MOA: A Real-Time Analytics Open Source Framework . . . . . . . . . . . . . . . 617


Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Jesse Read,
Philipp Kranen, Hardy Kremer, Timm Jansen, and Thomas Seidl

L-SME: A System for Mining Loosely Structured Motifs . . . . . . . . . . . . . . 621


Fabio Fassetti, Gianluigi Greco, and Giorgio Terracina

The MASH Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626


François Fleuret, Philip Abbet, Charles Dubout, and Leonidas Lefakis

Activity Recognition With Mobile Phones . . . . . . . . . . . . . . . . . . . . . . . . . . . 630


Jordan Frank, Shie Mannor, and Doina Precup

MIME: A Framework for Interactive Visual Pattern Mining . . . . . . . . . . . 634


Bart Goethals, Sandy Moens, and Jilles Vreeken

InFeRno – An Intelligent Framework for Recognizing Pornographic Web Pages . . . . . . . . . . . 638
Sotiris Karavarsamis, Nikos Ntarmos, and Konstantinos Blekas

MetaData Retrieval: A Software Prototype for the Annotation of Maps with Social Metadata . . . . . . . . . . . 642
Rosa Meo, Elena Roglia, and Enrico Ponassi

TRUMIT: A Tool to Support Large-Scale Mining of Text Association Rules . . . . . . . . . . . 646
Robert Neumayer, George Tsatsaronis, and Kjetil Nørvåg

Traffic Jams Detection Using Flock Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 650


Rebecca Ong, Fabio Pinelli, Roberto Trasarti, Mirco Nanni,
Chiara Renso, Salvatore Rinzivillo, and Fosca Giannotti

Exploring City Structure from Georeferenced Photos Using Graph Centrality Measures . . . . . . . . . . . 654
Katerina Vrotsou, Natalia Andrienko, Gennady Andrienko, and
Piotr Jankowski

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659


Enriching Education through Data Mining

Rakesh Agrawal

Microsoft Search Labs, California, USA


ragrawal@acm.org

Abstract. Education is acknowledged to be the primary vehicle for improving
the economic well-being of people [1,6]. Textbooks have a direct
bearing on the quality of education imparted to the students as they are
the primary conduits for delivering content knowledge [9]. They are also
indispensable for fostering teacher learning and constitute a key compo-
nent of the ongoing professional development of the teachers [5,8]. Many
textbooks, particularly from emerging countries, lack clear and adequate
coverage of important concepts [7]. In this talk, we present our early ex-
plorations into developing a data mining based approach for enhancing
the quality of textbooks. We discuss techniques for algorithmically aug-
menting different sections of a book with links to selective content mined
from the Web. For finding authoritative articles, we first identify the set
of key concept phrases contained in a section. Using these phrases, we
find web (Wikipedia) articles that represent the central concepts pre-
sented in the section and augment the section with links to them [4]. We
also describe a framework for finding images that are most relevant to
a section of the textbook, while respecting global relevancy to the entire
chapter to which the section belongs. We pose this problem of matching
images to sections in a textbook chapter as an optimization problem and
present an efficient algorithm for solving it [2].
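As a concrete illustration of this casting, the sketch below solves a toy version of the matching as a linear assignment problem. The relevance matrix, its values, and the use of SciPy's Hungarian-method solver are illustrative assumptions, not the formulation of [2], which additionally enforces global relevancy to the chapter.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical relevance scores: rows are sections of one chapter,
# columns are candidate Web images (higher = more relevant).
relevance = np.array([[0.9, 0.2, 0.4],
                      [0.1, 0.8, 0.3],
                      [0.3, 0.5, 0.7]])

# Maximize total relevance with at most one image per section:
# a plain linear assignment; SciPy minimizes cost, so negate scores.
sections, images = linear_sum_assignment(-relevance)
for s, i in zip(sections, images):
    print(f"section {s} -> image {i} (relevance {relevance[s, i]:.1f})")
```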
We also present a diagnostic tool for identifying those sections of a
book that are not well-written and hence should be candidates for enrich-
ment. We propose a probabilistic decision model for this purpose, which
is based on syntactic complexity of the writing and the newly introduced
notion of the dispersion of key concepts mentioned in the section. The
model is learned using a tune set which is automatically generated in
a novel way. This procedure maps sampled text book sections to the
closest versions of Wikipedia articles having similar content and uses the
maturity of those versions to assign need-for-enrichment labels. The ma-
turity of a version is computed by considering the revision history of the
corresponding Wikipedia article and convolving the changes in size with
a smoothing filter [3].
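A minimal sketch of this maturity signal, assuming a simple moving-average as the smoothing filter (the actual filter, and how the smoothed signal becomes a maturity score, are defined in [3]):

```python
import numpy as np

def revision_maturity_signal(revision_sizes, window=5):
    """Smooth the change in article size across a Wikipedia revision
    history. Persistently small smoothed changes suggest the article
    has stabilized, i.e., that version of the article is mature."""
    sizes = np.asarray(revision_sizes, dtype=float)
    deltas = np.abs(np.diff(sizes))        # size change per revision
    kernel = np.ones(window) / window      # assumed box smoothing filter
    return np.convolve(deltas, kernel, mode="same")
```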
We also provide the results of applying the proposed techniques to
a corpus of widely-used, high school textbooks published by the Na-
tional Council of Educational Research and Training (NCERT), India.
We consider books from grades IX–XII, covering four broad subject ar-
eas, namely, Sciences, Social Sciences, Commerce, and Mathematics. The
preliminary results are encouraging and indicate that developing tech-
nological approaches to enhancing the quality of textbooks could be a
promising direction for research for our field.


References
1. World Bank: Knowledge for Development. World Development Report 1998/1999
(1998)
2. Agrawal, R., Gollapudi, S., Kannan, A., Kenthapadi, K.: Enriching Textbooks with
Web Images. Working paper (2011)
3. Agrawal, R., Gollapudi, S., Kannan, A., Kenthapadi, K.: Identifying Enrichment
Candidates in Textbooks. In: WWW (2011)
4. Agrawal, R., Gollapudi, S., Kannan, A., Kenthapadi, K., Srivastava, N., Velu, R.:
Enriching Textbooks Through Data Mining. In: First Annual ACM Symposium on
Computing for Development, ACM DEV (2010)
5. Gillies, J., Quijada, J.: Opportunity to Learn: A High Impact Strategy for Improving
Educational Outcomes in Developing Countries. In: USAID Educational Quality
Improvement Program, EQUIP2 (2008)
6. Hanushek, E.A., Woessmann, L.: The Role of Education Quality for Economic
Growth. Policy Research Department Working Paper 4122, World Bank (2007)
7. Mohammad, R., Kumari, R.: Effective Use of Textbooks: A Neglected Aspect of
Education in Pakistan. Journal of Education for International Development 3(1)
(2007)
8. Oakes, J., Saunders, M.: Education’s Most Basic Tools: Access to Textbooks and In-
structional Materials in California’s Public Schools. Teachers College Record 106(10)
(2004)
9. Stein, M., Stuen, C., Carnine, D., Long, R.M.: Textbook Evaluation and Adoption.
Reading & Writing Quarterly 17(1) (2001)
Human Dynamics: From Human Mobility to
Predictability

Albert-László Barabási

Center of Complex Networks Research, Northeastern University and Department of


Medicine, Harvard University
barabasi@gmail.com

Abstract. A range of applications, from predicting the spread of human
and electronic viruses to city planning and resource management
in mobile communications, depend on our ability to understand human
activity patterns. I will discuss recent effort to explore human activity
patterns, using the mobility of individuals as a proxy. As an application,
I will show that by measuring the entropy of each individual's trajectory,
we can explore the underlying predictability of human mobility,
raising fundamental questions on how predictable we really are. I will
also discuss the interplay between human mobilty, social links, and the
predictive power of data mining.

Embracing Uncertainty: Applied Machine
Learning Comes of Age

Christopher Bishop

Microsoft Research Labs, Cambridge, UK


christopher.bishop@microsoft.com

Abstract. Over the last decade the number of deployed applications
of machine learning has grown rapidly, with examples in domains rang-
ing from recommendation systems and web search, to spam filters and
voice recognition. Most recently, the Kinect 3D full-body motion sensor,
which relies crucially on machine learning, has become the fastest-selling
consumer electronics device in history. Developments such as the advent
of widespread internet connectivity, with its centralisation of data stor-
age, as well as new algorithms for computationally efficient probabilistic
inference, will create many new opportunities for machine learning over
the coming years. The talk will be illustrated with tutorial examples, live
demonstrations, and real-world case studies.

Highly Dimensional Problems in Computational
Advertising

Andrei Broder

Yahoo! Research
broder@yahoo-inc.com

Abstract. The central problem of Computational Advertising is to find
the “best match” between a given user in a given context and a suit-
able advertisement. The context could be a user entering a query in a
search engine (“sponsored search”), a user reading a web page (“content
match” and “display ads”), a user interacting with a portable device,
and so on. The information about the user can vary from scarily de-
tailed to practically nil. The number of potential advertisements might
be in the billions. The number of contexts is unbounded. Thus, depend-
ing on the definition of “best match” this problem leads to a variety of
massive optimization and search problems, with complicated constraints.
The solution to these problems provides the scientific and technical un-
derpinnings of the online advertising industry, an industry estimated to
surpass 28 billion dollars in the US alone in 2011.
An essential aspect of this problem is predicting the impact of an ad
on users’ behavior, whether immediate and easily quantifiable (e.g. click-
ing on ad or buying a product on line) or delayed and harder to measure
(e.g. off-line buying or changes in brand perception). To this end, the
three components of the problem – users, contexts, and ads – are rep-
resented as high dimensional objects and terabytes of data documenting
the interactions among them are collected every day. Nevertheless, con-
sidering the representation difficulty, the dimensionality of the problem
and the rarity of the events of interest, the prediction problem remains
a huge challenge.
The goal of this talk is twofold: to present a short introduction to Computational Advertising and to survey several high-dimensional problems at the core of this emerging scientific discipline.

Learning from Constraints

Marco Gori

University of Siena
marco@dii.unisi.it

Abstract. In this talk, I propose a functional framework to understand
the emergence of intelligence in agents exposed to examples and knowl-
edge granules. The theory is based on the abstract notion of constraint,
which provides a representation of knowledge granules gained from the
interaction with the environment. I give a picture of the “agent body”
in terms of representation theorems by extending the classic framework
of kernel machines in such a way to incorporate logic formalisms, like
first-order logic. This is made possible by the unification of continuous
and discrete computational mechanisms in the same functional frame-
work, so that any stimulus, like supervised examples and logic predicates,
is translated into a constraint. The learning, which is based on con-
strained variational calculus, is either guided by a parsimonious match
of the constraints or by unsupervised mechanisms expressed in terms of
the minimization of the entropy.
I show some experiments with different kinds of symbolic and sub-
symbolic constraints, and then I give insights on the adoption of the
proposed framework in computer vision. It is shown that in most in-
teresting tasks the learning from constraints naturally leads to “deep
architectures”, that emerge when following the developmental principle
of focusing attention on “easy constraints”, at each stage. Interestingly,
this suggests that stage-based learning, as discussed in developmental
psychology, might not be primarily the outcome of biology, but it could
be instead the consequence of optimization principles and complexity
issues that hold regardless of the “body.”

Permutation Structure in 0-1 Data

Heikki Mannila

Department of Computer Science, University of Helsinki, Finland


Heikki.Mannila@aalto.fi

Abstract. Multidimensional 0-1 data occurs in many domains. Typi-
cally one assumes that the order of rows and columns has no importance.
However, in some applications, e.g., in ecology, there is structure in the
data that becomes visible only when the rows and columns are permuted
in a certain way. Examples of such structure are different forms of nest-
edness and bandedness. I review some of the applications, intuitions,
results, and open problems in this area.

Reading Customers Needs and Expectations
with Analytics

Vasilis Aggelis

Piraeus Bank
aggelisv@piraeusbank.gr

Abstract. Customers are the greatest asset of every bank. Do we know them fully? Are we ready to fulfill their needs and expectations? The use of analytics is one of the keys to improving our relationship with customers. Moreover, analytics can bring gains both for customers and banks. Customer segmentation, targeted cross- and up-sell campaigns, and data mining utilization are tools that drive great results and contribute to a customer-centric turn.

Algorithms and Challenges on the GeoWeb

Radu Jurca

Google Maps Zurich


radu.jurca@gmail.com

Abstract. A substantial number of queries addressed nowadays to on-
line search engines have a geographical dimension. People look up ad-
dresses on a map, but are also interested in events happening nearby, or
seek information about products, shops or attractions in a particular
area. It is no longer enough to index and display geographical informa-
tion; one should instead geographically organize the world’s information.
This is the mission of Google’s GeoWeb, and several teams inside Google
focus on solving this problem. This talk gives an overview of the main
challenges characterizing this endeavor, and offers a glimpse into some
of the solutions we built.

Data Science and Machine Learning at Scale

Neel Sundaresan

Head of eBay Research Labs


nsundaresan@ebay.com

Abstract. Large Social Commerce Network sites like eBay have to constantly grapple with building scalable machine learning algorithms for search ranking, recommender systems, classification, and others. Large data availability is both a boon and a curse. While it offers far more diverse observations, the same diversity, with sparsity and lack of reliable labeled data at scale, introduces new challenges. Also, the availability of large data helps take advantage of correlational factors while requiring creativity in discarding irrelevant data. In this talk we will discuss all of this and more in the context of eBay's large data problems.

Smart Cities: How Data Mining and
Optimization Can Shape Future Cities

Olivier Verscheure

IBM Dublin Research Lab, Smarter Cities Technology Centre


ov1@us.ibm.com

Abstract. By 2050, an estimated 70% of the world's population will live in cities, up from 13% in 1900. Already, cities consume an estimated 75% of the world's energy, emit more than 80% of greenhouse gases, and lose as much as 20% of their water supply due to infrastructure leaks. As their urban populations continue to grow and these metrics increase, civic leaders face an unprecedented series of challenges to scale and optimize their infrastructures.

Preference-Based Policy Learning

Riad Akrour, Marc Schoenauer, and Michele Sebag

TAO
CNRS − INRIA − Université Paris-Sud
FirstName.Name@inria.fr

Abstract. Many machine learning approaches in robotics, based on re-
inforcement learning, inverse optimal control or direct policy learning,
critically rely on robot simulators. This paper investigates a simulator-
free direct policy learning, called Preference-based Policy Learning (PPL).
PPL iterates a four-step process: the robot demonstrates a candidate pol-
icy; the expert ranks this policy comparatively to other ones according
to her preferences; these preferences are used to learn a policy return
estimate; the robot uses the policy return estimate to build new can-
didate policies, and the process is iterated until the desired behavior is
obtained. PPL requires a good representation of the policy search space
be available, enabling one to learn accurate policy return estimates and
limiting the human ranking effort needed to yield a good policy. Further-
more, this representation cannot use informed features (e.g., how far the
robot is from any target) due to the simulator-free setting. As a second
contribution, this paper proposes a representation based on the agnostic
exploitation of the robotic log.
The convergence of PPL is analytically studied and its experimental
validation on two problems, involving a single robot in a maze and two
interacting robots, is presented.

1 Introduction

Many machine learning approaches in robotics, ranging from direct policy learn-
ing [3] and learning by imitation [7] to reinforcement learning [20] and inverse
optimal control [1, 14], rely on the use of simulators as full motion and
world models of the robot. Naturally, the actual performance of the learned poli-
cies depends on the accuracy and calibration of the simulator (see [22]).
The question investigated in this paper is whether robotic policy learning can
be achieved in a simulator-free setting, while keeping the human labor and com-
putational costs within reasonable limits. This question is motivated by machine
learning application to swarm robotics [24, 29, 18]. Swarm robotics aims at ro-
bust and reliable robotic systems through the interaction of a large number of
small-scale, possibly unreliable, robot entities. Within this framework, the use
of simulators suffers from two limitations. Firstly, the simulation cost increases
more than linearly with the size of the swarm. Secondly, robot entities might dif-
fer among themselves due to manufacturing tolerance, entailing a large variance
among the results of physical experiments and severely limiting the accuracy of
simulated experiments.
The presented approach, inspired from both energy-based learning [21] and
learning-to-rank [12, 4], is called Preference-based policy learning (PPL). PPL
proceeds by iterating a 4-step process:
• In the demonstration phase, the robot demonstrates a candidate policy.
• In the teaching or feedback phase, the expert ranks this policy compared to
archived policies, based on her agenda and prior knowledge.
• In the learning phase, a policy return estimate (energy criterion) is learned
based on the expert ranking, using an embedded learning-to-rank algorithm. A
key issue concerns the choice of the policy representation space (see below).
• In the self-training phase, new policies are generated, using an adaptive trade-
off between the policy return estimate and the policy diversity w.r.t. the archive.
A candidate policy is selected to be demonstrated and the process is iterated.
PPL initialization proceeds by demonstrating two policies, which are ordered by
the expert.
A first contribution of the PPL approach compared to inverse reinforcement
learning [1, 14] is that it does not require the expert to know how to solve the task
and to demonstrate a perfect behavior (see also [27]); the expert is only required
to know whether some behavior is more able to reach the goal than some other
one. Compared to learning by demonstration [7], the demonstrated trajectories
are feasible by construction (whereas the teacher and the robot usually have
different degrees of freedom). Compared to policy gradient approaches [20], the
continued interaction with the expert offers some means to detect and avoid the
convergence to local optima, through the expert’s feedback.
In counterpart, PPL faces one main challenge. The human ranking effort
needed to yield a competent policy must be limited; the sample complexity
must be of the order of a few dozen to a couple hundred. Therefore the policy
return estimate must enable fast progress in the policy space, which requires
the policy representation space to be carefully designed (a general concern, as
noted by [1]). On the other hand, the simulator-free setting precludes the use of
any informed features such as the robot distance to obstacles or other robots.
The second contribution of the paper is an original representation of the pol-
icy space referred to as behavioral representation (BvR), built as follows. Each
policy demonstration generates a robotic log, reporting the sensor and actuator
values observed at each time step. The robotic logs are incrementally processed
using ε-clustering [11]; each cluster is viewed as a sensori-motor state (sms).
A demonstration can thus be represented by a histogram, associating to each
sensori-motor state the fraction of time spent in this sms. The BvR represen-
tation complies with a resource-constrained framework and it is agnostic w.r.t.
the policy return and the environment. Although the number of states exponen-
tially increases with the clustering precision ε and the intrinsic dimension of the
sensori-motor space, it can however be controlled to some extent as ε is set by
the expert. BvR refinement is a perspective for further research (section 5).

PPL is analytically studied on the artificial RiverSwim problem [25] and a
convergence proof is given. PPL is also experimentally validated on two prob-
lems relevant to swarm robotics. The first one concerns the reaching of a target
location in a maze. The difficulty is due to perceptual aliasing (non-Markovian
environment) as different locations are perceived to be similar due to the lim-
ited infra-red sensors of the robot. The second problem involves two interacting
robots. The task is to yield a synchronized behavior of the two robots, controlled
with the same policy. The difficulty is again due to the fact that sensors hardly
enable a robot to discriminate between its relative and an obstacle.
The paper is organized as follows. Section 2 discusses related work with re-
gards to resource-constrained robotics. Section 3 gives an overview of the PPL
approach and presents a convergence study. Section 4 describes the experimen-
tal setting and reports on the results of the approach. The paper concludes with
some perspectives for further research.

2 Related Work
Policy learning is most generally tackled as an optimization problem. The main
issues regard the search space (policy or state/action value function), the defini-
tion of the objective, and the exploration of the search space, aimed at optimizing
the objective function. The infinite horizon case is considered throughout this
section for notational convenience.
Among the most studied approaches is reinforcement learning (RL) [26],
for its guarantees of global optimality. The background notations involve a state
space S, an action space A, a reward function r defined on the state space
r : S → IR, and a transition function p(s, a, s′), expressing the probability of
arriving in state s′ on making action a in state s under the Markov property. A
policy π : (S, A) → IR maps each state in S on some action in A with a given
probability. The policy return is defined as the expectation of cumulative reward
gathered along time, conditionally to the initial state s0 . Denoting ah ∼ π(sh , a)
the action selected by π in state sh , sh+1 ∼ p(sh , ah , s) the state of the robot
at step h + 1 conditionally to being in state sh and selecting action ah at step
h, and rh+1 the reward collected in sh+1 , then the policy return is
$$J(\pi; s_0) = \mathbb{E}_{\pi, s_0}\!\left[\sum_{h=0}^{\infty} \gamma^h\, r_h\right]$$

where γ < 1 is a discount factor enforcing the boundedness of the return, and
favoring the reaping of rewards as early as possibly in the robot lifetime. The
so-called Q-value function Qπ (s, a) estimates the expectation of the cumulative
reward gathered by policy π after selecting action a in state s. As estimation
errors cumulate along time, the RL scalability is limited with respect to the
size of the state and action spaces. Moreover in order to enforce the Markov
property, the description of state s must provide any information about the
robot past trajectory relevant to its further decisions − thereby increasing the
size of the state space.
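
To make the policy return concrete, here is a minimal Monte Carlo sketch of estimating J(π; s0); the `env_step` and `policy` callables are hypothetical placeholders for the MDP components above, and the infinite sum is truncated at a finite horizon:

```python
import random

def policy_return(env_step, policy, s0, gamma=0.95, horizon=200, n_rollouts=100):
    """Monte Carlo estimate of J(pi; s0) = E[sum_h gamma^h r_h]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):      # truncation of the infinite-horizon sum
            a = policy(s)
            s, r = env_step(s, a)
            ret += discount * r
            discount *= gamma         # gamma < 1 keeps the return bounded
        total += ret
    return total / n_rollouts

# Toy usage: a 2-state chain where action 1 leads to the rewarding state
step = lambda s, a: (a, 1.0 if a == 1 else 0.0)
print(policy_return(step, lambda s: random.choice([0, 1]), s0=0))
```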

As RL suffers from both the scalability issue and the difficulty of defining a
priori a reward function conducive to the task at hand [19], a way to “seed”
the search with a good solution was sought, referred to as inverse optimal
control. One possibility, pioneered by [3] under the name of behavioral cloning
and further developed by [7] under the name of programming by demonstration
relies on the exploitation of the expert’s traces. These traces can be used to turn
policy learning into a supervised learning problem [7]. Another possibility is to
use the expert’s traces to learn a reward function, along the so-called inverse
reinforcement learning approach [19]. The sought reward function should admit
the expert policy π ∗ (as evidenced from his traces) as solution, and further en-
force some return margin between the actual expert actions and other actions
in the same state, in order to avoid indeterminacy [1] (since a constant reward
function would have any policy as an optimal solution). The search space is care-
fully designed in both cases. In the former approach, an “imitation metric” is
used to account for the differences between the expert and robot motor spaces.
In the latter one, a few informed features (e.g. the average speed of the vehi-
cle, the number of times the car deviates from the road) are used together with
their desired sign (the speed should be maximized, the number of deviations
should be minimized) and the policy return is sought as a weighted sum of the
feature expectations. A min-max relaxation thereof is proposed by [27], maxi-
mizing the minimum policy return over all weight vectors. Yet another approach,
inverse optimal heuristic control [14] proposes a hybrid approach between be-
havioral cloning and inverse optimal control, addressing the low-dimensionality
restrictions on optimal inverse control. Formally, an energy-based action selec-
tion model is proposed using a Gibbs model which combines the reward function
(as in inverse optimal control) and a logistic regression function (learned using
behavioral cloning from the available trajectories).
The main limitation of inverse optimal control lies in the way it is seeded:
the expert’s traces can hardly be considered to be optimal in the general case,
even more so if the expert and the robot live in different sensori-motor spaces.
While [27] gets rid of the expert’s influence, its minmax approach yields very
conservative policies, as the relative importance of the features is unknown.
A third main approach aims at direct policy learning, using policy gradient
methods [20] or global optimization methods [28]. Direct policy learning most
usually assumes a parametric policy space Θ; policy learning aims at finding the
optimal θ∗ parameter in the sense of a policy return function J:
Find θ∗ = arg max {J(θ), θ ∈ Θ}
Depending on J, three cases are distinguished. The first case is when J is ana-
lytically known on a continuous policy space Θ ⊂ IRD . It then comes naturally
to use a gradient-based optimization approach, gradually moving the current
policy θt along the gradient ∇J [20] (θt+1 = θt + αt ∇θ J(θt )). The main issues
concern the adjustment of αt , the possible use of the inverse Hessian, and the
rate of convergence toward a (local) optimum. A second case is when J is only
known as a computable function, e.g. when learning a Tetris policy [28]. In such
cases, a wide scope optimization method such as Cross-Entropy Method [8] or
population-based stochastic optimization [23] must be used. In the third case,
e.g. [7, 14], J is to be learned from the available evidence.
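
As a minimal illustration of the first case only, here is a plain gradient-ascent sketch; the tuning of αt and the possible use of the inverse Hessian discussed above are deliberately ignored, and `grad_J` is a hypothetical callable returning ∇θ J(θ):

```python
import numpy as np

def gradient_ascent(grad_J, theta0, alpha=0.01, n_steps=1000):
    """theta_{t+1} = theta_t + alpha * grad J(theta_t), fixed step size."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta + alpha * grad_J(theta)
    return theta

# Toy usage: maximize J(theta) = -||theta - 1||^2, gradient 2(1 - theta)
print(gradient_ascent(lambda th: 2.0 * (1.0 - th), np.zeros(3)))  # -> ~[1, 1, 1]
```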
This brief and by no means exhaustive review of related work suggests that
the search for an optimal policy relies on three main components. The first one
clearly is the expert’s support, usually provided through a reward function di-
rectly defined on the state space, or inferred from the expert’s demonstrations.
The second one is a simulator or forward model, enabling the robot to infer the
reward function in the IOC case, and more generally to consider the future conse-
quences of its exploratory actions; the lack of any planning component seemingly
is the main cause for the limitations of behavioral cloning, precluding the gener-
alization of the provided demonstrations. The third one is a low-dimensionality
representation of the state or policy search spaces.
With regards to swarm robotics however, a good quality simulator or forward
model is unlikely to be available anytime soon, for the reasons discussed in the
introduction. The expert’s support is also bound to be limited, for she can hardly
impersonate the swarm robot on the one hand, and because the desired behav-
ior is usually unknown on the other hand¹. The proposed approach will thus be
based on a less demanding interaction between the policy learning process and
the expert, only assuming that the former can demonstrate policies and that the
latter can emit preferences, telling a more appropriate behavior from a lesser one.

3 Preference-Based Policy Learning


Only deterministic parameterized policies with finite time horizon H will be
considered in this section, where H is the number of time steps each candidate
policy is demonstrated.

3.1 PPL Overview and Notations


Policy πθ , defined from its parameter θ ∈ Θ ⊆ IRD , stands for a deterministic
mapping from the state space S onto the action space A. A behavioral repre-
sentation is defined from the demonstrations (section 3.2) and used to learn the
policy return estimate. PPL proceeds by maintaining a policy and constraint
archive, respectively denoted Π t and C t along a four-step process. At step t,
• A new policy πt is demonstrated by the robot and added to the archive
(Π t = Π t−1 ∪ {πt });
• πt is ranked by the expert w.r.t. the other policies in Π t , and the set Ct is
accordingly updated from Ct−1 ;
• The policy return estimate Jt is built from all constraints in C t (section 3.3);
• New policies are generated; candidate policy πt+1 is selected using an adap-
tive trade-off between Jt and an empirical diversity term (section 3.4), and the
process is iterated.
¹ Swarm policy learning can be viewed as an inverse problem; one is left with the problem of, e.g., designing the ant behavior in such a way that the ant colony achieves a given task.

PPL is initialized from two policies with specific properties (section 3.5), which
are ordered by the expert; the policy and constraint archives are initialized ac-
cordingly. After detailing all PPL components, an analytical convergence study
of PPL is presented in section 3.6.

3.2 The Behavioral Representation


In many parametric settings, e.g. when policy π is characterized as the weight
vector θ ∈ Θ of a neural net with fixed architecture, the Euclidean metric on Θ is
poorly informative of the policy behavior. As the sought policy return estimate
is meant to assess policy behavior, it thus appears inappropriate to learn Jt as
a mapping on Θ. The proposed behavioral representation (BvR) fits the robot
bounded memory and computational resources by mapping each policy onto a
real valued vector (μ : π → μπ ∈ IRd ) as follows. A robot equipped with a
policy π can be thought of as a data stream generator; the sensori-motor data
stream SMS(π) consists of the robot sensor and actuator values recorded at each
time step. It thus comes naturally to describe policy π through compressing
SMS(π), using an embedded online clustering algorithm. Let SMS(π) be given
as {x1 , . . . xH } with xh ∈ IRb , where b is the number of sensors and actuators
of the robot and xh denotes the concatenation of the sensory and motor data
of the robot at time step h. An ε-clustering algorithm (ε > 0) [11] is used to
incrementally define a set S of sensori-motor states si from all SMS(π) data
streams. For each considered policy π, in each time step, xh is compared to all
elements of S, and it is added to S if min {||xh − si||∞, si ∈ S} > ε. Let nt
denote the number of states at time t. BvR is defined as follows:
Given SMS(π) = {x1, . . . , xH} and S = {s1, . . . , s_{n_t}},

$$\mu : \pi \mapsto \mu_\pi \in [0,1]^{n_t}, \qquad \mu_\pi[i] = \frac{\left|\{x_h \text{ s.t. } \|x_h - s_i\|_\infty < \epsilon\}\right|}{H}$$

BvR thus differs from the feature counts used in [1, 14] as features are learned
from policy trajectories as opposed to being defined from prior knowledge. Com-
pared to the discriminant policy representation built from the set of policy
demonstrations [17], the main two differences are that BvR is independent of
the policy return estimate, and that it is built online as new policies explore new
regions of the sensori-motor space. As the number of states nt and thus BvR di-
mension increases along time, BvR consistency follows from the fact that μπ ∈ [0, 1]^{n_t}
is naturally mapped onto (μπ, 0) in [0, 1]^{n_t+1} (the i-th coordinate of μπ is 0 if si
was not discovered at the time π was demonstrated).
The price to pay for the representation consistency is that the number of
states exponentially increases with precision parameter ε; on the other hand, it
increases like ε^{−b}, where b here denotes the intrinsic dimension of the sensori-motor data.
A second limitation of BvR is to be subject to the initialization noise, meant
as the initial location of the robot when the policy is launched. The impact of
the initial conditions however can be mitigated by discarding the states visited
during the burn-in period of the policy.
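
A minimal sketch of the BvR construction under the definition above follows; NumPy and the in-place list `states` are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def bvr(sms_stream, states, eps=0.45):
    """Incrementally epsilon-cluster the sensori-motor stream {x_1..x_H}
    under the L-infinity norm, then return the histogram mu giving the
    fraction of time steps falling within eps of each state s_i."""
    H = len(sms_stream)
    for x in sms_stream:
        # create a new sensori-motor state if x is far from all known ones
        if all(np.max(np.abs(x - s)) > eps for s in states):
            states.append(x.copy())
    mu = np.zeros(len(states))
    for x in sms_stream:
        for i, s in enumerate(states):
            if np.max(np.abs(x - s)) < eps:
                mu[i] += 1.0 / H   # |{x_h : ||x_h - s_i||_inf < eps}| / H
    return mu, states
```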

3.3 Policy Return Estimate Learning

For the sake of notational simplicity, μi will be used instead of μπi when there is
no risk of confusion. Let the policy archive Π^t be given as {μ1, . . . , μt} at step t,
assuming with no loss of generality that the μi's are ordered after the expert prefer-
ences. Constraint archive C^t thus includes the set of t(t−1)/2 ordering constraints
μi ≻ μj for i > j.
Using a standard constrained convex optimization formulation [4, 13], the
policy return estimate Jt is sought as a linear mapping Jt(μ) = ⟨wt, μ⟩ with
wt ∈ IR^{n_t} solution of (P):

$$(P)\qquad \text{Minimize } \tfrac{1}{2}\|w\|^2 + C \sum_{i,j=1,\; i>j}^{t} \xi_{i,j} \quad \text{subject to } \langle w, \mu_i \rangle - \langle w, \mu_j \rangle \ge 1 - \xi_{i,j} \text{ and } \xi_{i,j} \ge 0 \text{ for all } i > j$$

The policy return estimate Jt features some good properties. Firstly, it is consis-
tent despite the fact that the dimension nt of the state space might increase with
t; after the same arguments as above, the wt coordinate related to a state which
has not yet been discovered is set to 0. By construction, it is independent of the
policy parameterization and can be transferred among different policy spaces;
likewise, it does not involve any information but the information available to the
robot itself; therefore it can be computed on-board and provides the robot with
a self-assessment.
Finally, although Jt is defined at the global policy level, wt provides some
valuation of the sensori-motor states. If a high positive weight wt [i] is associated
to the i-th sensori-motor state si , this state is considered to significantly and
positively contribute to the quality of a policy, comparatively to the policies
considered so far. Learning Jt thus significantly differs from learning the optimal
RL value function V ∗ . By definition, V ∗ (si ) estimates the maximum expected
cumulative reward ahead of state si , which can only increase as better policies are
discovered. In contrast, wt [i] reflects the fact that visiting state si is conducive
to discovering better policies, comparatively to policies viewed so far by the
expert. In particular, wt [i] can increase or decrease along t; some states might
be considered as highly beneficial in the early stages of the robot training, and
discarded later on.
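
A minimal sketch of learning the return estimate follows. The paper solves (P) with a ranking SVM [13]; here we use the standard, equivalent reduction of pairwise ranking to classifying difference vectors, and scikit-learn's LinearSVC is our own assumption rather than the paper's solver:

```python
import numpy as np
from sklearn.svm import LinearSVC

def learn_return_estimate(mus):
    """Learn w_t from behavioral descriptions mus[0], ..., mus[t-1],
    assumed ordered by increasing expert preference; returns J_t."""
    X, y = [], []
    for i in range(len(mus)):
        for j in range(i):
            X.append(mus[i] - mus[j]); y.append(+1)  # mu_i preferred to mu_j
            X.append(mus[j] - mus[i]); y.append(-1)  # symmetric negative pair
    clf = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(X), np.array(y))
    w = clf.coef_.ravel()
    return lambda mu: float(w @ mu)                  # J_t(mu) = <w_t, mu>
```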

3.4 New Policy Generation

The policy generation step can be thought of in terms of self-training. The
generation of new policies relies on black-box optimization; expected-global im-
provement methods [6] and gradient methods [20] are not applicable since the
policy return estimate is not defined on the parametric representation space Θ.
New policies are thus generated using a stochastic optimization method, more
precisely a (1 + λ)-ES algorithm [2]. Formally, a Gaussian distribution N (θc , Σ)
on Θ is maintained, updating the center θc of the distribution and the covariance
matrix Σ after the parameter θ of the current best policy. For ng iterations, λ
policies π are drawn after N (θc , Σ), they are executed in order to compute their
behavioral description μπ, and the best one after an optimisation criterion F
(see below) is used to update N(θc, Σ). After ng iterations of the policy genera-
tion step, the best policy after F denoted πt+1 is selected and demonstrated to
the expert, and the whole PPL process is iterated.
The criterion F used to select the best policy out of the current λ policies is
meant to enforce some trade-off between the exploration of the policy space and
the exploitation of the current policy return estimate Jt (note that the discovery
of new sensori-motor states is worthless after Jt since their weight after wt is 0).
Taking inspiration from [2], F is defined as the sum of the policy return estimate
Jt and a weighted exploration term Et , measuring the diversity Δ of the policy
w.r.t the policy archive Π t :

$$\text{Maximize } F(\mu) = J_t(\mu) + \alpha_t E_t(\mu), \qquad \alpha_t > 0$$

$$\text{with } E_t(\mu) = \min \{\Delta(\mu, \mu_u),\ \mu_u \in \Pi^t\}$$

$$\alpha_t = \begin{cases} c \cdot \alpha_{t-1} & \text{if } \pi_t \succ \pi_{t-1} \quad (c > 1) \\ c^{-1/p} \cdot \alpha_{t-1} & \text{otherwise} \end{cases}$$
Parameter αt controls the exploration vs exploitation tradeoff, accounting
for the fact that both the policy distribution and the objective function Jt are
non stationary. Accordingly, αt is increased or decreased by comparing the em-
pirical success rate (whether πt improves on all policies in the archive) with the
expected success rate of a reference function (usually the sphere function [2]).
When the empirical success rate is above (respectively below) the reference one,
αt is increased (resp. decreased). Parameter p, empirically adjusted, is used to
guarantee that p failures cancel out one success and bring αt back to its original
value.
The diversity function Δ(μ, μ′) is defined as follows. Let a, b, c respectively
denote the number of states visited by μ only, by μ′ only, and by both μ and
μ′. Then Δ(μ, μ′) is set to

$$\Delta(\mu, \mu') = \frac{a + b - c}{\sqrt{(a+c)(b+c)}}\,;$$

it computes and normalizes the symmetric difference minus the intersection of
the two sets of states visited by μ and μ′.
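
A much-simplified sketch of this generation step follows, assuming NumPy; the covariance update and the αt adaptation of [2] are collapsed into a fixed-σ loop, and `demonstrate` is a hypothetical callable that runs a policy (an ndarray θ) and returns its BvR:

```python
import numpy as np

def delta(mu, mu_u):
    """Diversity Delta = (a + b - c) / sqrt((a+c)(b+c)), with a, b, c the
    numbers of states visited by mu only, by mu_u only, and by both."""
    va, vb = mu > 0, mu_u > 0
    c = int(np.sum(va & vb))
    a, b = int(np.sum(va)) - c, int(np.sum(vb)) - c
    return (a + b - c) / max(np.sqrt((a + c) * (b + c)), 1e-9)

def generate_policy(theta_c, sigma, J, demonstrate, archive_mus, alpha,
                    lam=7, n_g=10):
    """Sample lam policies around theta_c for n_g iterations, score them
    with F = J_t(mu) + alpha * E_t(mu), and keep the best candidate."""
    best = None
    for _ in range(n_g):
        for _ in range(lam):
            theta = theta_c + sigma * np.random.randn(*theta_c.shape)
            mu = demonstrate(theta)
            e_t = min(delta(mu, mu_u) for mu_u in archive_mus)  # E_t(mu)
            f = J(mu) + alpha * e_t
            if best is None or f > best[0]:
                best = (f, theta, mu)
        theta_c = best[1]        # move the distribution center to the best
    return best[1], best[2]
```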

3.5 Initialization
The PPL initialization is challenging, since policy behaviors corresponding to uni-
formly drawn θ are usually quite uninteresting and therefore hard to rank. The
first two policies are thus generated as follows. Given a set P0 of randomly gener-
ated policies, π1 is selected as the policy in P0 with maximal information quantity
($J_0(\mu) = -\sum_i \mu[i] \log \mu[i]$). Policy π2 is the one in P0 with maximum diversity
w.r.t. π1 (π2 = argmax {Δ(μ, π1), μ ∈ P0}).
is that π1 should visit as many distinct sensorimotor states as possible;
and π2 should visit as many sensorimotor states different from those of
π1 as possible, to facilitate the expert ordering and yield an informative policy
return estimate J2 . Admittedly, the use of the information quantity and more
generally BvR only make sense if the robot faces a sufficiently rich environment.
In an empty environment, i.e. when there is nothing to see, the robot can only
visit a single sensori-motor state.
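
A minimal sketch of this initialization, reusing the `delta` diversity function from the sketch in Section 3.4 (the entropy computation mirrors the information quantity J0 above; names are illustrative):

```python
import numpy as np

def init_policies(candidate_mus):
    """Pick pi_1 with maximal entropy of its BvR histogram, then pi_2
    with maximal diversity Delta w.r.t. pi_1; returns their indices."""
    def entropy(mu):
        p = mu[mu > 0]
        return float(-np.sum(p * np.log(p)))
    i1 = max(range(len(candidate_mus)), key=lambda i: entropy(candidate_mus[i]))
    i2 = max(range(len(candidate_mus)),
             key=lambda i: delta(candidate_mus[i], candidate_mus[i1]))
    return i1, i2
```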

3.6 Convergence Study


PPL convergence is analytically studied and proved for the artificial RiverSwim
problem [25]. This problem involves N states and two actions, going left or
right; starting from the leftmost state, the goal is to reach the rightmost state
(Fig. 1). While the policy parametric representation space is {left, right}N , the

[Fig. 1: (a) State and Action Space; (b) Parametric and Behavioral representation]

Fig. 1. The RiverSwim problem. The parametric representation of a policy π specifies
the action (go left or right) selected in each state. The behavioral representation μπ
is (1, . . . , 1, 0, . . . , 0), with the rightmost 1 at position ℓ(π), defined as the leftmost
state associated to the left move.

policy π behavior is entirely characterized from the leftmost state i where it
goes left, denoted ℓ(π). In the following, we only consider the boolean BvR,
with μπ = (1, 1, . . . , 1, 0, . . . , 0) and ℓ(π) the index of the rightmost 1. For notational
simplicity, let μπ and μπ′ be noted μ and μ′ in the following. By definition, the
expert prefers the one out of μ and μ′ which goes farther to the right (μ ≻ μ′
iff ℓ(μ) > ℓ(μ′)).
In the RiverSwim setting, problem (P) can be solved analytically (section 3.3).
Let archive Π^t be given as {μ1, . . . , μt} and assume wlog that the μi's are ordered
by increasing value of ℓ(μi). It comes immediately that constraints related to
adjacent policies (⟨w, μi+1⟩ − ⟨w, μi⟩ ≥ 1 − ξ_{i,i+1} for 1 ≤ i < t) entail all
other constraints, and can all be satisfied by setting $\sum_{k=\ell(\mu_i)+1}^{\ell(\mu_{i+1})} w[k] = 1$; slack
variables ξ_{i,j} thus are 0 at the optimum.
Problem (P) can thus be decomposed into t − 1 independent problems involv-
ing disjoint subsets of indices ]ℓ(μi), ℓ(μi+1)]; its solution w is given as:

$$w[k] = \begin{cases} 0 & \text{if } k \le \ell(\mu_1) \text{ or } k > \ell(\mu_t) \\ \dfrac{1}{\ell(\mu_{i+1}) - \ell(\mu_i)} & \text{if } \ell(\mu_i) < k \le \ell(\mu_{i+1}) \end{cases}$$

Proposition 1: In the RiverSwim setting, $J_t(\mu) = \langle w_t, \mu \rangle$ where $w_t$ satisfies:
$w_t[k] = 0$ if $k \le \ell(\mu_1)$ or $k > \ell(\mu_t)$; $w_t[k] > \frac{1}{N}$ if $\ell(\mu_1) < k \le \ell(\mu_t)$;
$\sum_{k=1}^{N} w_t[k] = t$. □

Policy generation (section 3.4) considers both the current Jt and the diversity Et
respective to the policy archive Πt, given as Et(μ) = min{Δ(μ, μu), μu ∈ Πt}.
In the RiverSwim setting, it comes:

$$\Delta(\mu, \mu_u) = \frac{|\ell(\mu) - \ell(\mu_u)| - \min(\ell(\mu), \ell(\mu_u))}{\sqrt{\ell(\mu)\,\ell(\mu_u)}}$$
For $i < j$, $\frac{|i-j| - \min(i,j)}{\sqrt{ij}} = \sqrt{\frac{j}{i}} - 2\sqrt{\frac{i}{j}}$. It follows that the function
$i \mapsto \frac{|\ell(\mu) - i| - \min(\ell(\mu), i)}{\sqrt{\ell(\mu)\, i}}$ is strictly decreasing on $[1, \ell(\mu)]$ and strictly increasing on
$[\ell(\mu), +\infty)$. It reaches its minimum value of −1 for $i = \ell(\mu)$. As a consequence,
for any policy μ in the archive, $E_t(\mu) = -1$.
Let μt and μ∗ respectively denote the best policy in the archive Πt and the
policy that goes exactly one step further right. We will now prove the following
result:
Proposition 2: The probability that μ∗ is generated at step t is bounded from
below by $\frac{1}{eN}$. Furthermore, after μ∗ has been generated, it will be selected accord-
ing to the selection criterion Ft (section 3.4), and demonstrated to the expert.
Proof
Generation step: Considering the boolean parametric representation space {0, 1}^N,
the generation of new policies proceeds using the standard bitflip mutation, flip-
ping each bit of the current μt with probability 1/N. The probability to generate
μ∗ from μt is lower-bounded by $\frac{1}{N}(1 - \frac{1}{N})^{\ell(\mu_t)}$ (flip bit ℓ(μt) + 1 and do not flip
any bit before that). Furthermore, $(1 - \frac{1}{N})^{\ell(\mu_t)} > (1 - \frac{1}{N})^{N-1} > \frac{1}{e}$, and hence
the probability to generate μ∗ from μt is lower-bounded by $\frac{1}{eN}$.
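
A quick numeric sanity check of this bound (our own addition, not part of the proof):

```python
# Check (1/N) * (1 - 1/N)**l >= 1/(e*N) at the worst case l = N - 1
import math

for N in (4, 16, 64, 256):
    worst = (1 / N) * (1 - 1 / N) ** (N - 1)
    assert worst > 1 / (math.e * N)
    print(N, worst, 1 / (math.e * N))
```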

Selection step. As shown above, Et(μ∗) > −1 and Et(μ) = −1 for all μ ∈ Πt. As
the candidate policy selection is based on the maximization of Ft(μ) = ⟨wt, μ⟩ +
αt Et(μ), and using ⟨wt, μt⟩ = ⟨wt, μ∗⟩ = t, it follows that Ft(μt) < Ft(μ∗),
and more generally, that Ft(μ) < Ft(μ∗) for all μ ∈ Πt. Consider now a policy
μ that is not in the archive though ℓ(μ) < ℓ(μt) (the archive does not need to
contain all possible policies). From Proposition 1, it follows that ⟨wt, μ⟩ < t − 1/N.
Because Et(μ) is bounded, there exists a sufficiently small αt such that Ft(μ) =
⟨wt, μ⟩ + αt Et(μ) < Ft(μ∗). □
Furthermore, thanks to the monotonicity of Et(μ) w.r.t. ℓ(μ), one has
Ft(μt+i) < Ft(μt+j) for all i < j where ℓ(μt+i) = ℓ(μt) + i. Better RiverSwim
policies will thus be selected along the policy generation and selection step.

4 Experimental Validation
This section reports on the experimental validation of the PPL approach. For
the sake of reproducibility, all reported results have been obtained using the
publicly available simulator Roborobo [5].

4.1 Experiment Goals and Experimental Setting


The experiments are meant to answer three main questions. The first one con-
cerns the feasibility of PPL compared to baseline approaches. The use as surro-
gate optimization objective of the policy return estimate learned from the expert
preferences, is compared to the direct use of the expert preferences. The perfor-
mance indicator is the speed-up in terms of sample complexity (number of calls
to the expert), needed to reach a reasonable level of performance. The second
question regards the merits of the behavioral representation comparatively to
the parametric one. More specifically, the point is to know whether BvR can en-
force an effective trade-off between the number of sensori-motor states and the
perceptual aliasing, enabling an accurate policy return estimate to be built from
few orderings. A third question regards the merits of the exploration term used
in the policy generation (self-training phase, section 3.4), and the entropy-based
initialization process (section 3.5), and how they contribute to the overall per-
formance of PPL. Three experiments are considered to answer these questions.
The first one is an artificial problem which can be viewed as a 2D version of the
RiverSwim [25]. The second experiment, inspired from [16], involves a robot in
a maze and the goal is to reach a given location. The third experiment involves
two robots and the goal is to enforce a coordinated exploration of the arena by
the two robots. In all experiments the robotic architecture is a Cortex-M3 with 8
infra-red (IR) sensors and two motors respectively controlling the rotational and
translation speed. The IR range is 6 cm; the surface of the arena is 6.4 square
meters.
Every policy is implemented as a 1-hidden-layer feed-forward neural net, us-
ing the 8 IR sensors (and a bias) as input, with 10 neurons on the hidden layer,
and the two motor commands as output (Θ = IR^121). A random policy is built
by uniformly selecting each weight in [−1, 1]. BvR is built using the norm L∞ and
ε = 0.45; the number of sensori-motor states, thus the BvR dimension, is below
1,000. The policy return estimate is learned using the SVMrank library [13] with
linear kernel and default parameter C = 1 (section 3.3). The optimization algo-
rithm used in the policy generation (section 3.4) and initialization (section 3.5)
is a (1+λ)-ES [23] with λ = 7; the variance of the Gaussian distribution is initial-
ized to .3, increased by a factor of 1.5 upon any improvement and decreased by a
factor of 1.5^{−1/4} otherwise. The (1+λ)-ES is used for ng iterations in each policy
generation step, with ng = 10 in the maze (respectively ng = 5 in the two-robot)
experiment; it is used for 20 iterations in the initialization step. All presented
results are averaged over 41 independent runs. Five algorithms are compared:
PPLBvR with entropy-based initialization and exploration term, where the pol-
icy return estimate relies on the behavioral representation; PPLparam, which
only differs from PPLBvR as the policy return estimate is learned using the
parametric representation; Expert-Only, where the policy return estimate is re-
placed by the true expert preferences; Novelty-Search [16], which can be viewed
as an exploration-only approach²; and PPLw/o-init, which differs from PPLBvR as
it uses a random uniform initialization.
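
For illustration, a minimal sketch of the controller described at the start of this subsection; the flattening of θ, the tanh activations, and the bias conventions are our own assumptions, so the parameter count may differ from the paper's Θ = IR^121:

```python
import numpy as np

def policy_forward(theta, ir_sensors, n_hidden=10):
    """1-hidden-layer controller: 8 IR inputs plus a bias, 10 hidden
    units, 2 outputs (rotational and translational motor commands)."""
    x = np.append(ir_sensors, 1.0)                 # 8 sensors + bias input
    n_in = x.size
    W1 = theta[: n_in * n_hidden].reshape(n_hidden, n_in)
    W2 = theta[n_in * n_hidden:].reshape(2, n_hidden)
    h = np.tanh(W1 @ x)
    return np.tanh(W2 @ h)                         # motor commands in [-1, 1]

theta = np.random.uniform(-1, 1, size=9 * 10 + 2 * 10)   # a random policy
print(policy_forward(theta, np.random.rand(8)))
```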

4.2 2D RiverSwim
In this 4 × 4 grid world, the robot starts in the lower left corner of the grid; the
sought behavior is to reach the upper right corner and to stay there. The expert
² In each time step the candidate policy with highest average diversity compared to its k-nearest neighbors in the policy archive is selected, with k = 1 for the 2D RiverSwim, and k = 5 for both other experiments.
preference goes toward the policy going closer to the goal state, or reaching
earlier the goal state. The action space includes the four move directions with
deterministic transitions (the robot stays in the same position upon marching to
the wall). For a time horizon H = 16, the BvR includes only 6334 distinguishable
policies (to be compared with the actual 416 distinct policies). The small size of
this grid world allows one to enumeratively optimize the policy return estimate,
with and without the exploration term. Fig. 2 (left) displays the true policy
return vs the number of calls to the expert. In this artificial problem, Novelty-
Search outperforms PPL and discovers an optimal policy in the 20-th step. PPL
discovers an optimal policy at circa the 30-th step. The exploration term plays an
important role in PPL performance; Fig. 2 (right) displays the average weight αt
of the exploration term. In this small-size problem, a mere exploration strategy
is sufficient and typically Novelty-Search discovers two optimal policies in the
first 100 steps on average; meanwhile, PPL discovers circa 25 optimal policies in
the first 100 steps on average.

[Fig. 2: left, True Policy Return vs. number of calls to the expert for PPL, PPL with no exploration, and Novelty Search; right, exploration weight αt vs. number of calls to the expert]

Fig. 2. 2D RiverSwim: Performance of PPL and Novelty-Search, averaged over 41 runs
with std. dev. (left); average weight of the exploration term αt in PPL (right)

4.3 Reaching the End of the Maze


In this experiment inspired from [15], the goal is to traverse the maze and reach
the green zone when starting in the opposite initial position (Fig. 3-left). The
time horizon H is 2,000; an oracle robot needs circa 1,000 moves to reach the
target location. As in the RiverSwim problem, the expert preference goes toward
the policy going closer to the goal state, or reaching earlier the goal state; the
comparison leads to a tie if the difference is below 50 distance units or 50 time
steps. The “true” policy return is the remaining time when it first reaches the
green zone, if it reaches it, minus the min distance of the trajectory to the green
zone, otherwise. The main difficulty in this maze problem is the severe perceptual
aliasing; all locations situated far from the walls look alike due to the IR sensor
limitations.
This experiment supports the feasibility of the PPL scheme. A speed-up fac-
tor of circa 3 is observed compared to Expert-Only; on average PPL reaches the
green zone after demonstrating 39 policies to the expert, whereas Expert-Only
[Fig. 3: right panel, True Policy Return vs. number of calls to the expert for PPL-random init, PPL, Expert only, PPL-parametric, and Novelty search]

Fig. 3. The maze experiment: the arena (left) and best performances averaged out of
41 runs

needs 140 demonstrations (Fig. 4, left). This speed-up is explained as PPL fil-
ters out unpromising policies. Novelty-Search performs poorly on this problem
comparatively to PPL and even comparatively to Expert-Only, which suggests
that Novelty-Search might not scale up well with the size of the policy space.
This experiment also establishes the merits of the behavioral representation.
Fig. 4 (right) displays the average performance of PPLBvR and PPLparam when
no exploration term is used, to only assess the accuracy of the policy return
estimate. As can be seen, learning is very limited when done in the parametric
representation by PPLparam, resulting in no speedup compared to Expert-Only
(Fig. 3, right). A third lesson is that the entropy-based initialization does not
seem to bring any significant improvement, as PPLBvR and PPLw/o-init get the same
results after the very first steps (Fig. 3).

[Fig. 4: two panels of True Policy Return vs. number of calls to the expert; left, PPL vs. Expert only; right, PPL vs. PPL-parametric]

Fig. 4. The maze experiment: PPL vs Expert-Only (left); accuracy of parametric and
behavioral policy return estimate (right)

4.4 Synchronized Exploration


This experiment investigates the feasibility of two robots exploring an arena (Fig.
5, left) in a synchronous way, using the same policy. The expert preferences are
emulated by associating to a policy the number of distinct 25x25 tiles explored
by any of the two robots conditionally to the fact that their distance is below 100
unit distances at the moment the tile is explored. Both robots start in (distinct)
random positions. The difficulty of the task is that most policies wander in the
arena, making it unlikely for the two robots to be anywhere close to one another
at any point in time.
The experimental results (Fig. 5, right) mainly confirm the lessons drawn from
the previous experiment. PPL improves on Expert-Only, with a speed-up of circa
10, while Novelty-Search is slightly but significantly outperformed by Expert-
Only. In the meanwhile, the entropy-based initialization does not make much of
a difference after the first steps, and might even become counter-productive in
the end.

[Fig. 5: right panel, True Policy Return vs. number of calls to the expert for PPL-random init, PPL, Expert only, and Novelty Search]

Fig. 5. Synchronous exploration: the arena (left) and the best average performance out
of 41 runs (right)

5 Conclusion and Perspectives


The presented preference-based policy learning demonstrates the feasibility of
learning in a simulator-free setting. PPL can be seen as a variant of Inverse Op-
timal Control, with the difference that the demonstrations are done by the robot
whereas the expert only provides some feedback (it’s better; it’s worse). PPL,
without any assumptions on the level of expertise of the teacher, incrementally
learns a policy return estimate. This learning is made possible through the origi-
nal unsupervised behavioral representation BvR, based on the compression of the
robot trajectories. The policy return estimate can be thought of as an embed-
ded “system of values”, computable on-board. Further research will investigate
the use of the policy return estimate within lifelong learning, e.g. to adjust the
policy and compensate for the fatigue of the actuators. While the convergence
of the approach can be established in the artificial RiverSwim framework [25],
the experimental validation demonstrates that PPL can be used effectively to
learn elementary behaviors involving one or two robots.
The main limitation of the approach comes from the exponential increase of
the behavioral representation w.r.t. the granularity of the sensori-motor states.
Further work will investigate how to refine BvR, specifically using quad-trees to
describe and retrieve the sensori-motor states while adaptively adjusting their
granularity. A second perspective is to adjust the length of the self-training
phases, taking inspiration from online learning on a budget [9]. A third research
avenue is to reconsider the expert preferences in a Multiple-Instance perspective
[10]; clearly, what the expert likes/dislikes might be a fragment of the policy
trajectory, more than the entire trajectory.

Acknowledgments. The first author is funded by FP7 European Project Symbrion,
FET IP 216342, http://symbrion.eu/. This work is also partly funded
by ANR Franco-Japanese project Sydinmalas ANR-08-BLAN-0178-01.

References

[1] Abbeel, P., Ng, A.Y.: Apprenticeship Learning via Inverse Reinforcement Learn-
ing. In: Brodley, C.E. (ed.) Proc. 21st Intl. Conf. on Machine Learning (ICML
2004). ACM Intl. Conf. Proc. Series, vol. 69, p. 1. ACM, New York (2004)
[2] Auger, A.: Convergence Results for the (1,λ)-SA-ES using the Theory of ϕ-
irreducible Markov Chains. Theoretical Computer Science 334(1-3), 35–69 (2005)
[3] Bain, M., Sammut, C.: A Framework for Behavioural Cloning. In: Furukawa, K.,
Michie, D., Muggleton, S. (eds.) Machine Intelligence, vol. 15, pp. 103–129. Oxford
University Press, Oxford (1995)
[4] Bakir, G., Hofmann, T., Scholkopf, B., Smola, A.J., Taskar, B., Vishwanathan,
S.V.N.: Machine Learning with Structured Outputs. MIT Press, Cambridge (2006)
[5] Bredeche, N.: http://www.lri.fr/~bredeche/roborobo/
[6] Brochu, E., de Freitas, N., Ghosh, A.: Active Preference Learning with Discrete
Choice Data. In: Proc. NIPS 20, pp. 409–416 (2008)
[7] Calinon, S., Guenter, F., Billard, A.: On Learning, Representing and Generalizing
a Task in a Humanoid Robot. IEEE Trans. on Systems, Man and Cybernetics, Spe-
cial Issue on Robot Learning by Observation, Demonstration and Imitation 37(2),
286–298 (2007)
[8] de Boer, P.-T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A Tutorial on the
Cross-Entropy Method. Annals OR 134(1), 19–67 (2005)
[9] Dekel, O., Shalev-Shwartz, S., Singer, Y.: The Forgetron: A Kernel-Based Percep-
tron on a Budget. SIAM J. Comput. 37, 1342–1372 (2008)
[10] Dietterich, T.G., Lathrop, R., Lozano-Perez, T.: Solving the Multiple-Instance
Problem with Axis-Parallel Rectangles. Artif. Intelligence 89(1-2), 31–71 (1997)
[11] Duda, R.O., Hart, P.E.: Pattern Classification and scene analysis. John Wiley and
sons, Menlo Park, CA (1973)
[12] Joachims, T.: A Support Vector Method for Multivariate Performance Measures.
In: De Raedt, L., Wrobel, S. (eds.) Proc. 22nd ICML. ACM Intl. Conf. Proc.
Series, vol. 119, pp. 377–384. ACM, New York (2005)
[13] Joachims, T.: Training Linear SVMs in Linear Time. In: Eliassi-Rad, T., et al.
(eds.) Proc. 12th Intl. Conf. KDDM, pp. 217–226. ACM, New York (2006)
[14] Zico Kolter, J., Abbeel, P., Ng, A.Y.: Hierarchical Apprenticeship Learning with
Application to Quadruped Locomotion. In: Proc. NIPS 20. MIT Press, Cambridge
(2007)
[15] Lehman, J., Stanley, K.O.: Exploiting Open-Endedness to Solve Problems
Through the Search for Novelty. In: Proc. Artificial Life XI, pp. 329–336 (2008)
[16] Lehman, J., Stanley, K.O.: Exploiting Open-Endedness to Solve Problems through
the Search for Novelty. In: Proc. ALife 2008, MIT Press, Cambridge (2008)
[17] Levine, S., Popovic, Z., Koltun, V.: Feature Construction for Inverse Reinforce-
ment Learning. In: Proc. NIPS 23, pp. 1342–1350 (2010)
[18] Liu, W., Winfield, A.F.T.: Modeling and Optimization of Adaptive Foraging in
Swarm Robotic Systems. Intl. J. Robotic Research 29(14), 1743–1760 (2010)
[19] Ng, A.Y., Russell, S.: Algorithms for Inverse Reinforcement Learning. In: Langley,
P. (ed.) Proc. 17th ICML, pp. 663–670. Morgan Kaufmann, San Francisco (2000)
[20] Peters, J., Schaal, S.: Reinforcement Learning of Motor Skills with Policy Gradi-
ents. Neural Networks 21(4), 682–697 (2008)
[21] Ranzato, M.-A., Poultney, C.S., Chopra, S., LeCun, Y.: Efficient Learning of
Sparse Representations with an Energy-Based Model. In: Schölkopf, B., Platt,
J.C., Hoffman, T. (eds.) Proc. NIPS 19, pp. 1137–1144. MIT Press, Cambridge
(2006)
[22] Saxena, A., Driemeyer, J., Ng, A.Y.: Robotic Grasping of Novel Objects using
Vision. Intl. J. Robotics Research (2008)
[23] Schwefel, H.-P.: Numerical Optimization of Computer Models. John Wiley & Sons,
New York (1981) 2nd edn. (1995)
[24] Stirling, T.S., Wischmann, S., Floreano, D.: Energy-efficient Indoor Search by
Swarms of Simulated Flying Robots without Global Information. Swarm Intelli-
gence 4(2), 117–143 (2010)
[25] Strehl, A.L., Li, L., Wiewiora, E., Langford, J., Littman, M.L.: PAC Model-free
Reinforcement Learning. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg,
A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503, pp. 881–888.
Springer, Heidelberg (2007)
[26] Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cam-
bridge (1998)
[27] Syed, U., Schapire, R.: A Game-Theoretic Approach to Apprenticeship Learning.
In: Proc. NIPS 21, pp. 1449–1456. MIT Press, Cambridge (2008)
[28] Thiery, C., Scherrer, B.: Improvements on Learning Tetris with Cross Entropy.
ICGA Journal 32(1), 23–33 (2009)
[29] Trianni, V., Nolfi, S., Dorigo, M.: Cooperative Hole Avoidance in a Swarm-bot.
Robotics and Autonomous Systems 54(2), 97–103 (2006)
Constraint Selection for Semi-supervised
Topological Clustering

Kais Allab and Khalid Benabdeslem

University of Lyon1, GAMA Laboratory,
43, Bd du 11 Novembre 1918, Villeurbanne 69622, France
kais.allab@etu.univ-lyon1.fr,
kbenabde@univ-lyon1.fr

Abstract. In this paper, we propose to adapt the batch version of the self-organizing map (SOM) to background information in the clustering task. It deals with constrained clustering with SOM in a deterministic paradigm. In this context we adapt the appropriate topological clustering to pairwise instance-level constraints, studying their informativeness and coherence properties for measuring their utility for the semi-supervised learning process. These measures will provide guidance in selecting the most useful constraint sets for the proposed algorithm. Experiments are given over several databases for validating our approach in comparison with other constrained clustering approaches.

Keywords: Constraint selection, semi-supervised clustering, SOM.

1 Introduction

Traditional clustering algorithms only access the variables which describe each
data item; they do not deal with any other kind of given information. Neverthe-
less, taking a priori knowledge into account in such algorithms, if it exists,
is an important problem and a real challenge in nowadays clustering research.
It concerns a recent area in machine learning and data mining research, which
is constrained clustering [1]. This semi-supervised approach has led to improved
performance for several data sets [2,16] as well as for real-world applications,
such as person identification from surveillance camera clips [3], noun phrase
coreference resolution and GPS-based map refinement [4], and landscape detec-
tion from hyperspectral data [5].
Furthermore, the last ten years have seen extensive work on incorporating
instance-level constraints into clustering methods. The first work in this area
proposed a modified version of COBWEB that strictly enforced pairwise con-
straints (COP-COBWEB) [6]. It was followed by an enhanced version of the widely
used k-means algorithm that could also accommodate constraints, called COP-
Kmeans [4]. Moreover, in [7], an exploration of the use of instance and cluster-
level constraints was performed with agglomerative hierarchical clustering. In [8]
the authors proposed a new graph based constrained clustering algorithm called
COP-Bcoloring where they have shown improvements in quality and computa-


tional complexity of clustering by using constraints in a graph b-coloring clus-
tering algorithm. Theoretically, it was proven that clustering with constraints
raised an intractable feasibility problem [9,10] for simply finding any clustering
that satisfies all constraints via a reduction from graph coloring [8,11,12].
Constraints can be generated from background knowledge about the data set [13] or from a subset of the data with known labels [14]. In practice, it has been shown that constraints can improve the results of a variety of clustering algorithms. However, there can be a large variation in this improvement, even for a fixed number of constraints on a given data set [4,10]. In fact, it has been observed that constraints can have ill effects even when they are generated directly from the data labels that are used to evaluate accuracy, so this behavior is not caused by noise or errors in the constraints. Instead, it is a result of the interaction between a given set of constraints and the algorithm being used [13].
In this paper we propose to adopt constraints in self-organizing based clustering, which represents a prominent tool for high-dimensional data analysis, since it provides a substantial data reduction that can be used for visualizing and exploring properties of data. Our first contribution is based on a deterministic model, where constraints are integrated and identified in the neurons of the corresponding topographic neural network. In this sense, we extend the work of Kohonen [15] by adapting the batch version of the Self-Organizing Map (SOM) to hard constraints. Our proposal differs from that of [16], which also addresses the problem of semi-supervised learning in SOM. In that paper, however, the authors propose a label propagation based strategy over the training database, as is done for the classification problem [17,18,19]. In our case, we treat the problem in a different paradigm, where “Semi-supervised learning = Unsupervised learning + Constraints”. In this context, we study two important measures, informativeness and coherence [13], that capture relevant properties of constraint sets. These measures provide insight into the effect a given constraint set has on a specific constrained clustering algorithm. They will be used for selecting the relevant constraints in the clustering process.

2 Integrating Constraints in SOM


SOM is a very popular tool for visualizing high-dimensional data spaces. It can be considered as doing vector quantization and/or clustering while preserving the spatial ordering of the input data, reflected by an ordering of the codebook vectors (also called prototype vectors, cluster centroids or reference vectors) in a one- or two-dimensional output space. The SOM consists of neurons organized on a regular low-dimensional grid, called the map.
More formally, the map is described by a graph (M, E). M is a set of K interconnected neurons having a discrete topology defined by E. For each pair of neurons (c, r) on the map, the distance δ(c, r) is defined as the shortest path between c and r on the graph. This distance imposes a neighborhood relation between neurons (Figure 1). Each neuron c is represented by a D-dimensional reference vector $w_c = (w_c^1, \ldots, w_c^D)$, where D is equal to the dimension of the input vectors $x_i \in X$ (the data set).

Fig. 1. Two-dimensional topological map with the 1-neighborhood of a neuron c: rectangular (red) with 8 neighbors and diamond (blue) with 4 neighbors.

The SOM training algorithm resembles K-means. The important distinction is that, in addition to the best matching reference vector, its neighbors on the map are updated. The end result is that neighboring neurons on the grid correspond to neighboring regions in the input space.

2.1 Constraints
The operating assumption behind all constrained clustering methods is that the constraints provide information about the true (desired) partition, and that more information will increase the agreement between the output partition and the true partition. Constraints provide guidance about the desired partition and make it possible for clustering algorithms to increase their performance [10].
In this paper, we propose to adapt a topology-based clustering to hard constraints. A Must-Link constraint (ML) involving x_i and x_j specifies that they must be placed in the same cluster. A Cannot-Link constraint (CL) involving x_i and x_j specifies that they must be placed in different clusters.

2.2 The Topological Constrained Algorithm


The SOM algorithm comes in two versions: stochastic (on-line) and batch (off-line). In this paper, we consider the second one, which is deterministic, fast and better suited to our proposal of optimizing the objective function subject to the different considered constraints.
The batch version of SOM is an iterative algorithm in which the whole data set is presented to the map before any adjustments are made. In each training step, the data set is partitioned according to the Voronoi regions of the map reference vectors. More formally, we define an assignment function f from $\mathbb{R}^D$ (the input space) to C (the output space) that associates each element $x_i$ of $\mathbb{R}^D$ with the neuron whose reference vector is “closest” to $x_i$. This function induces a partition $P = \{P_c\,;\ c = 1 \ldots K\}$ of the set of observations, where each part $P_c$ is defined by $P_c = \{x_i \in X;\ f(x_i) = c\}$.

The quality of the partition $(P_c)_{c\in C}$ and its associated prototype vectors $(w_c)_{c\in C}$ is given by the following energy function [20]:

$$E^T\big((P_c)_{c\in C},(w_c)_{c\in C}\big) = \sum_{x_i\in X}\sum_{c\in C} h^T\big(\delta(f(x_i),c)\big)\,\lVert w_c - x_i\rVert^2 \qquad (1)$$

where f represents the assignment function ($f(x_i) = c$ if $x_i \in P_c$) and $h^T(\cdot)$ is the neighborhood kernel around the winner unit c. In practice, we often use $h^T(x) = e^{-x^2/T^2}$, where T represents the neighborhood radius in the map. It can be fixed or decreased from an initial value $T_{max}$ to a final value $T_{min}$.
In general, $f(x_i) = r$ where $r \in C$, so (1) can be rewritten as follows:

$$E^T\big((P_c)_{c\in C},(w_c)_{c\in C}\big) = \sum_{r\in C}\sum_{x_i\in P_r}\sum_{c\in C} h^T(\delta(r,c))\,\lVert w_c - x_i\rVert^2 \qquad (2)$$

where $P_r$ represents the set of elements which belong to neuron r. Since $h^T(0) = 1$ (when r = c), $E^T$ can be decomposed into two terms:

$$E_1^T = \sum_{r\in C}\sum_{x_i\in P_r} \lVert w_r - x_i\rVert^2 \qquad (3)$$

$$E_2^T = \sum_{r\in C}\sum_{c\neq r}\sum_{x_i\in P_c} h^T(\delta(r,c))\,\lVert w_r - x_i\rVert^2 \qquad (4)$$

$E_1^T$ corresponds to the distortion used in partitioning-based clustering algorithms like K-means. $E_2^T$ is specific to the SOM algorithm. Note that if the neighborhood relationship is not considered ($\delta(\cdot) = 0$), optimizing the objective function subject to constraints could be solved by the known COP-Kmeans [4].
Our first contribution here consists in adapting SOM to ML and CL constraints by minimizing equation (1) subject to these constraints. For this optimization problem we can consider several versions proposed by [21,20,15]. All these versions proceed in two steps: an assignment step for computing f and an adaptation step for computing $w_c$.
In this work, we use the version proposed by Heskes and Kappen [21]. The assignment step consists in minimizing $E^T$ with $w_c$ fixed, and the adaptation step minimizes the same objective function but with the prototypes fixed. Although the two optimizations are performed exactly, we cannot guarantee that the energy is minimized in general by this algorithm. However, if we fix the neighborhood structure (T is fixed), the algorithm converges towards a stable state after a finite number of steps [20].
Since the energy is a sum of independent terms, we can replace the two optimization problems by a set of equivalent simple problems. The formulation of (1) shows that the energy is constructed as the sum over all observations of a measure of adequacy from $\mathbb{R}^D \times C$ into $\mathbb{R}^+$ defined by:

$$\gamma^T(x, r) = \sum_{c\in C} h^T(\delta(r,c))\,\lVert w_c - x\rVert^2 \qquad (5)$$

which gives:

$$E^T\big((P_c)_{c\in C},(w_c)_{c\in C}\big) = \sum_{x_i\in X} \gamma^T(x_i, f(x_i)) \qquad (6)$$

For optimizing $E^T$ subject to ML and CL constraints, considering the prototypes $(w_c)$ as fixed, it suffices to minimize each sum independently with the incorporation of a new procedure $V(\cdot)$ for controlling the violation of the constraints in the assignment process:

$$f(x_i) = \begin{cases} c^* = \arg\min_{r\in C} \gamma^T(x_i, r) & \text{such that } V(x_i, c^*, \Omega_{ML}, \Omega_{CL}) \text{ is False} \\ \emptyset & \text{otherwise} \end{cases} \qquad (7)$$

with:

$$V(x_i, c^*, \Omega_{ML}, \Omega_{CL}) = \begin{cases} \text{True} & \text{if } \exists x_j \in X \mid ((x_i, x_j) \in \Omega_{ML}) \wedge (f(x_j) \neq c^*) \\ & \text{or } \exists x_j \in X \mid ((x_i, x_j) \in \Omega_{CL}) \wedge \big((f(x_j) = c^*) \vee (f(x_j) \in Neigh(c^*))\big) \\ \text{False} & \text{otherwise} \end{cases} \qquad (8)$$

where $Neigh(c^*)$ represents the set of neighbors of $c^*$ in the map, $\Omega_{ML}$ is the must-link constraint set and $\Omega_{CL}$ is the cannot-link constraint set.
Similarly, when the classes $(P_c)$ are fixed, the optimization of $E^T$ can be carried out by minimizing the energy associated with each neuron:

$$E_c^T(w) = \sum_{x_i\in X} h^T(\delta(f(x_i), c))\,\lVert w - x_i\rVert^2 \qquad (9)$$

subject to ML and CL constraints.


This problem can be easily solved, giving a unique solution as a weighted mean of the observations with a newly defined control function $g_c(\cdot)$:

$$w_c = \frac{\sum_{x_i\in X} h^T(\delta(f(x_i), c))\, x_i\, g_c(x_i)}{\sum_{x_i\in X} h^T(\delta(f(x_i), c))\, g_c(x_i)} \qquad (10)$$

where

$$g_c(x_i) = \begin{cases} 0 & \text{if } \exists x_j \in X \mid (f(x_j) = c) \wedge (x_i, x_j) \in \Omega_{CL} \\ 1 & \text{otherwise} \end{cases} \qquad (11)$$
Equations (7) and (10) represent the modified batch version of the SOM algorithm (which we call S3OM, for Semi-Supervised Self-Organizing Map), with our changes in both the assignment and adaptation steps. The algorithm takes a data set (X), a must-link constraint set ($\Omega_{ML}$), and a cannot-link constraint set ($\Omega_{CL}$). It returns a partition of the observations in X that satisfies all specified constraints.
The major modification is that, when updating cluster assignments, we ensure that none of the specified constraints are violated. We attempt to assign each point $x_i$ to its closest neuron c in the map. This will succeed unless a constraint would be violated. If there is another point $x_j$ that must be assigned to the same neuron as $x_i$ but is already in some other neuron, or another point $x_k$ that cannot be grouped with $x_i$ but is already in c or in the neighborhood of c, then $x_i$ can be placed neither in c nor in its neighborhood.
We continue down the sorted list of neurons until we find one that can legally host $x_i$. Constraints are never broken; if a legal neuron cannot be found for $x_i$, the empty partition ($\emptyset$) is returned. Note that with the use of equation (11), each reference vector $w_c$ is updated only by the observations $x_i \in X$ that have no CL constraint with any element $x_j$ belonging to c.
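To make the modified assignment step concrete, the following minimal Python sketch (our illustration with hypothetical names, not the authors' implementation) applies the violation test of equation (8) inside the constrained winner search of equation (7), assuming the activation costs γ^T(x_i, r) have already been computed:

def violates(i, c, assign, ml, cl, neigh):
    # True if placing point i in neuron c breaks an ML or CL constraint (Eq. 8).
    for j in ml.get(i, []):          # must-link partners of i
        if assign[j] is not None and assign[j] != c:
            return True
    for j in cl.get(i, []):          # cannot-link partners of i
        if assign[j] is not None and (assign[j] == c or assign[j] in neigh[c]):
            return True
    return False

def assign_point(i, gamma, assign, ml, cl, neigh):
    # Walk the neurons sorted by gamma^T(x_i, r) and return the closest legal
    # one (Eq. 7), or None when no neuron can host the point.
    for c in sorted(range(len(gamma[i])), key=lambda r: gamma[i][r]):
        if not violates(i, c, assign, ml, cl, neigh):
            return c
    return None

Here ml and cl map each point to its constrained partners, and neigh[c] is the set of neurons in the map neighborhood of c.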

3 Constraint Selection
Considering the results obtained in [3,4,14,22,23], we observe that integrating constraints generally improves clustering performance. But sometimes they can have ill effects, even when they are generated from the data labels that are used to evaluate accuracy. It is therefore important to know why some constraint sets increase clustering accuracy while others have no effect or even decrease accuracy. To that end, the authors in [13] defined two important measures, informativeness and coherence, that capture relevant properties of constraint sets.

Fig. 2. Two properties of constraints (red lines for ML and green lines for CL): (a) Informativeness: m and c are informative. (b) Coherence: the projected overlap between m and c ($over_c m$) is not null, so the coherence of the subset {m, c} is null.

3.1 Informativeness
This measure represents the amount of conflict between the constraints and the underlying objective function and search bias of an algorithm. It is based on measuring the number of constraints that the clustering algorithm cannot predict using its default bias. Given a possibly incomplete set of constraints $\Omega$ and an algorithm A, we generate the partition $P_A$ by running A on the data set without any constraints (Figure 2(a)). We then calculate the fraction of constraints in $\Omega$ that are unsatisfied by $P_A$:

$$I_A(\Omega) = \frac{1}{|\Omega|} \sum_{\alpha\in\Omega} unsat(\alpha, P_A) \qquad (12)$$

where $unsat(\alpha, P_A)$ is 1 if $P_A$ does not satisfy $\alpha$ and 0 otherwise. This approach effectively uses the constraints as a hold-out set to test how accurately the algorithm predicts them.
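For illustration, once the unconstrained partition $P_A$ is available as an array of cluster labels, informativeness (Eq. 12) reduces to a few lines of Python (a minimal sketch with hypothetical names):

def informativeness(labels, ml, cl):
    # Fraction of constraints unsatisfied by the unconstrained partition (Eq. 12).
    # `ml` and `cl` are lists of index pairs (i, j).
    unsat = sum(1 for i, j in ml if labels[i] != labels[j])
    unsat += sum(1 for i, j in cl if labels[i] == labels[j])
    total = len(ml) + len(cl)
    return unsat / total if total else 0.0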

3.2 Coherence
This measure represents the amount of agreement between the constraints themselves, given a metric d that specifies the distance between points. It does not require knowledge of the optimal partition $P^*$ and can be computed directly. The coherence of a constraint set is independent of the algorithm used to perform constrained clustering. One view of an ML(x, y) (or CL(x, y)) constraint is that it imposes an attractive (or repulsive) force within the feature space along the direction of the line formed by (x, y), within the vicinity of x and y. Two constraints, one an ML constraint (m) and the other a CL constraint (c), are incoherent if they exert contradictory forces in the same vicinity. Two constraints are perfectly coherent if they are orthogonal to each other. To determine the coherence of two constraints, m and c, we compute the projected overlap of each constraint on the other as follows.
Let $\vec{m}$ and $\vec{c}$ be vectors connecting the points constrained by m and c respectively. The coherence of a given constraint set $\Omega$ is defined as the fraction of constraint pairs that have zero projected overlap (Figure 2(b)):

$$Coh_d(\Omega) = \frac{\sum_{m\in\Omega_{ML},\, c\in\Omega_{CL}} \delta(over_c m = 0 \wedge over_m c = 0)}{|\Omega_{ML}|\,|\Omega_{CL}|} \qquad (13)$$

where $over_c m$ represents the distance between the two projected points linked by m over c, and $\delta(\cdot)$ indicates whether the projections overlap. Please see [13] for more details.
From equation (13), we can easily define a specific measure for each constraint as follows:

$$Coh_d(m) = \frac{\sum_{c\in\Omega_{CL}} \delta(over_c m = 0)}{|\Omega_{CL}|} \qquad (14)$$

$$Coh_d(c) = \frac{\sum_{m\in\Omega_{ML}} \delta(over_m c = 0)}{|\Omega_{ML}|} \qquad (15)$$
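Under this geometric reading, the projected overlap and the set coherence of equation (13) can be sketched as follows (an illustrative implementation of one interpretation of $over_c m$; see [13] for the precise definition):

import numpy as np

def projected_overlap(m, c, X):
    # Length of the overlap between c's segment and m's segment projected
    # onto the line supporting c; zero means the pair does not conflict.
    a, b = X[m[0]], X[m[1]]
    p, q = X[c[0]], X[c[1]]
    length = np.linalg.norm(q - p)
    u = (q - p) / length                         # unit direction of c
    ta, tb = np.dot(a - p, u), np.dot(b - p, u)  # m's endpoints on that line
    lo, hi = min(ta, tb), max(ta, tb)
    return max(0.0, min(hi, length) - max(lo, 0.0))

def coherence(ml, cl, X):
    # Eq. 13: fraction of (ML, CL) pairs with zero overlap in both directions.
    ok = sum(1 for m in ml for c in cl
             if projected_overlap(m, c, X) == 0 and projected_overlap(c, m, X) == 0)
    return ok / (len(ml) * len(cl))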

3.3 Hard Selection


In this section, we show how to select the relevant constraints according to their informativeness and coherence. To be selected, a constraint $\alpha_i$ must be (1) informative, i.e., it must not be satisfied by classical SOM (without constraints), and (2) fully coherent, i.e., it must not overlap with any other constraint $\alpha_j$ ($\alpha_j \in \Omega_{CL}$ if $\alpha_i \in \Omega_{ML}$ and vice versa). This hard way of selecting constraints is described in Algorithm 1.

3.4 Soft Selection


Hard selection is very strict and can considerably reduce the number of relevant constraints $\Omega_s$, thus penalizing constraints with high coherence values. In fact, in a given constraint set $\Omega$, there could be a great difference

Algorithm 1. Hard selection
Input: constraint set $\Omega = \{\alpha_i\}$
Output: selected constraint set $\Omega_s$
Initialize $\Omega_s = \emptyset$
for i = 1 to $|\Omega|$ do
  if $unsat(\alpha_i, P_{SOM}) = 1$ then
    if $Coh_d(\alpha_i) = 1$ then
      $\Omega_s = \Omega_s \cup \{\alpha_i\}$
    end if
  end if
end for

between the number of ML constraints and the number of CL ones. So the coherence could favor the type of constraint (ML or CL) having the larger set. For example, let $\Omega$ be a set of 4 ML constraints and 10 CL constraints. With hard selection, if an ML constraint overlaps with just one CL constraint (even though its coherence is then 0.9), it would not be selected. We therefore propose a soft version of selection, described in Algorithm 2. The aim is to determine whether clustering quality is better with a large number of softly selected constraints than with a small number of hard-selected ones.

Algorithm 2. Soft selection
Input: constraint set $\Omega = \{\alpha_i\}$
Output: selected constraint set $\Omega_s$
Initialize $\Omega'_s = \emptyset$, $\Omega_s = \emptyset$
for i = 1 to $|\Omega|$ do
  if $unsat(\alpha_i, P_{SOM}) = 1$ then
    $\Omega'_s = \Omega'_s \cup \{\alpha_i\}$
  end if
end for
for i = 1 to $|\Omega'_s|$ do
  if $Coh_d(\alpha_i) \geq Coh_d(\Omega'_s)$ then
    $\Omega_s = \Omega_s \cup \{\alpha_i\}$
  end if
end for
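Both procedures can be sketched compactly in Python (hypothetical names; constraints are triples (i, j, kind) with kind in {"ML", "CL"}, and coh holds the per-constraint coherence of Eqs. 14-15):

def unsat(a, labels):
    # 1 if the unconstrained SOM partition `labels` violates constraint a.
    i, j, kind = a
    same = labels[i] == labels[j]
    return int((kind == "ML" and not same) or (kind == "CL" and same))

def hard_select(constraints, labels_som, coh):
    # Algorithm 1: keep constraints that are informative and fully coherent.
    return [a for a in constraints
            if unsat(a, labels_som) == 1 and coh[a] == 1.0]

def soft_select(constraints, labels_som, coh, set_coherence):
    # Algorithm 2: among informative constraints, keep those whose individual
    # coherence reaches the coherence of the whole informative set (Eq. 13),
    # passed here as `set_coherence`.
    informative = [a for a in constraints if unsat(a, labels_som) == 1]
    return [a for a in informative if coh[a] >= set_coherence]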

4 Experimental Results

Extensive experiments were carried out on the data sets in Table 1. These data sets were deliberately chosen to evaluate the clustering performance of S3OM and compare it with other state-of-the-art techniques: COP-KMeans (CKM) [4], PC-KMeans (PKM), M-KMeans (MKM), MPC-KMeans (MPKM) [14], COP-Bcoloring (CBC) [8], lpSM [16] and Belkin-Niyogi's approach [17].

Table 1. Data sets used

Data sets    | N    | D    | #classes | Map dimensions | Reference
Glass        | 214  | 9    | 6        | 11 × 7         | [2]
Ionosphere   | 351  | 34   | 2        | 13 × 7         | [2]
Iris         | 150  | 4    | 3        | 16 × 4         | [2]
Leukemia     | 72   | 1762 | 2        | 9 × 5          | [24]
Chainlink    | 1000 | 3    | 2        | 18 × 9         | [25]
Atom         | 800  | 3    | 2        | 14 × 10        | [25]
Hepta        | 212  | 3    | 7        | 9 × 8          | [25]
Lsun         | 400  | 2    | 3        | 13 × 8         | [25]
Target       | 770  | 2    | 6        | 13 × 11        | [25]
Tetra        | 400  | 3    | 4        | 11 × 9         | [25]
TwoDiamonds  | 800  | 2    | 2        | 20 × 7         | [25]
Wingnut      | 1070 | 2    | 2        | 16 × 10        | [25]
EngyTime     | 4096 | 2    | 2        | 21 × 15        | [25]

4.1 Evaluation of the Proposed Approach


To generate constraints from each data set, we randomly considered 20% of the data, equally distributed over the labels. From these labelled data, we generated the set of all must-link constraints ($\Omega_{ML}$) and all cannot-link ones ($\Omega_{CL}$). Next, we applied a SOM-based clustering algorithm to the whole data set without any constraints. The obtained partition ($P_{SOM}$) is then used by Algorithm 1 (or Algorithm 2) in the informativeness study of $\Omega$ ($\Omega_{ML} \cup \Omega_{CL}$) to select the most relevant constraints. The selected constraint set ($\Omega_s$) is then exploited by the S3OM algorithm.
For the construction of the S3OM maps, we used a Principal Component Analysis (PCA) based heuristic proposed by [15] to automatically provide the initial number of neurons and the dimensions of the maps (Table 1). The reference vectors are initialized linearly along the greatest eigenvectors of the associated data set. Then, each S3OM map is clustered by Ascendant Hierarchical Clustering (AHC) to optimize the number of clusters (by grouping neurons) [26]. In general, an internal index, like Davies-Bouldin or Generalized Dunn [27], is used for cutting the dendrogram. Here, we propose to cut it according to the violation of constraints, i.e., we select the smallest number of classes that does not violate the constraints, especially the CL ones. To evaluate the accuracy of the S3OM algorithm, we use the Rand index [28].
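For reference, the Rand index between a clustering and the ground-truth labels can be computed as follows (a minimal sketch):

from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which the two partitions agree [28].
    n = len(labels_a)
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)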
Table 2 compares the results (averaged over 1000 trials) for each algorithm in terms of its unconstrained and constrained performance when provided with 25 randomly selected constraints. We added our results to those reported in [4,13,8]. The best result for each algorithm/data set combination is in bold.
On the one hand, S3OM shows that integrating constraints in the SOM model clearly improves clustering accuracy. On the other hand, the results obtained by S3OM are similar to, and sometimes better than, those of other known constrained clustering methods.

Table 2. Average Rand index of S3OM vs. five constrained clustering algorithms (over 1000 trials), first without constraints (Unc) and then with 25 randomly selected constraints (Con)

Data  | CKM Unc/Con | MKM Unc/Con | PKM Unc/Con | MPKM Unc/Con | CBC Unc/Con | S3OM Unc/Con
Glass | 69.0 / 69.4 | 39.5 / 56.6 | 43.4 / 68.8 | 39.5 / 67.8  | 66.0 / 67.4 | 64.3 / 68.4 (±2.1)
Iono. | 58.6 / 58.7 | 58.8 / 58.9 | 58.8 / 58.9 | 58.9 / 58.9  | 56.0 / 58.8 | 50.5 / 61.3 (±8.0)
Iris  | 84.7 / 87.8 | 88.0 / 93.6 | 84.3 / 88.3 | 88.0 / 91.8  | 82.5 / 87.2 | 91.2 / 92.2 (±3.8)

The most important remark is that, with our constrained algorithm, clustering performance often increases significantly with a small number of constraints compared to the other methods. For example, in Table 2, for Glass, S3OM (68.4%) is not better than CKM (69.4%), but S3OM yields an improvement of 4.1% over its baseline while CKM achieves just a 0.4% increase in accuracy. This is thanks to the topological property and neighborhood relation, which allow us to separate two points related by a CL constraint by putting them in two distant clusters (neurons), unlike the other constrained algorithms, in which the assignment is made to the closest cluster that does not violate the constraints.

4.2 Results of Constraint Selection


In Table 2, we have seen that constraints generally improve clustering performance. However, these constraints were generated randomly, so some of them are not informative and some are not coherent.

Fig. 3. Results of S3OM according to coherence rate, with fully informative constraints

To understand how these constraint set properties affect our constrained algorithm, we performed the following experiment. We measured the accuracy obtained with random informative constraint sets with different rates of coherence (calculated by (13)). All databases exhibited important increases in accuracy as coherence increased in a set of randomly selected constraints (Figure 3). In fact, for all data sets, the accuracy increases steadily and quickly when "good" background information is integrated.

Fig. 4. Performance of S3OM according to both informativeness and coherence

Using randomly selected constraints can decrease clustering performance (red curves in Figure 4). Thus, for each data set, we selected the informative constraints (using Eq. 12). The resulting blue curves are better than the red ones, but they still show some decreases. We therefore divided the informative constraint sets into two subsets: coherent and incoherent ones. In Figure 4, the constraints are either fully coherent or fully incoherent in each considered subset. The blue curves (mixed constraints) are clearly situated between the green ones (coherent constraints) and the pink ones (incoherent constraints) for all databases. For example, for Glass, S3OM achieves an accuracy of 64.3% without any constraints. Overall accuracy increases with the incorporation of constraints, reaching 72.4% after 100 random constraints. However, the performance is better when using just the 49 informative constraints (75.3%); it is even better when using just the 25 informative and coherent constraints (77.2%) and weak when using the 24 other, incoherent ones (63.3%). Note that for Iris, the accuracy reaches 97.4% with 37 informative constraints (including some incoherent ones), but when dealing with the coherent ones only, it increases quickly and reaches the same accuracy with fewer constraints (16). This means that uninformative and incoherent constraints can sometimes be harmful to clustering performance.

4.3 Results of Selection on FCPS Data Sets

In this section, we applied S3OM to the FCPS (Fundamental Clustering Problem Suite) data sets [25] using 100 softly selected constraints, and compared the obtained results (averaged over 100 trials) to those reported in [16].
Table 3 shows that S3OM finds the correct partition of the Atom, Chainlink, Hepta, Lsun and Tetra data sets, matching the lpSM approach (accuracy = 100%) [16].

Table 3. Average Rand index of S3OM vs. two semi-supervised learning algorithms (over 100 trials) on the FCPS data sets, first without constraints (Unc) and then with 100 softly selected constraints (Con)

Data sets   | Belkin and Niyogi | lpSM | S3OM Unc | S3OM Con
Atom        | 93  | 100 | 63.87 | 100
Chainlink   | 78  | 100 | 57.72 | 100
Hepta       | 100 | 100 | 100   | 100
Lsun        | 83  | 100 | 90.87 | 100
Target      | 53  | 87  | 81.89 | 90.6 (±3.4)
Tetra       | 100 | 100 | 96.56 | 100
TwoDiamonds | 75  | 95  | 90.04 | 97.4 (±2.5)
Wingnut     | 70  | 90  | 95.02 | 99.2 (±0.8)
EngyTime    | 57  | 96  | 61.21 | 96.8 (±3.1)

We can also see that S3OM performs better than the other methods, which confirms the contribution of the selected constraints to the improvement of the clustering model. For example, for EngyTime, S3OM achieves an accuracy of 61.21% without any constraints; with the 100 softly selected constraints, S3OM reaches an average accuracy of 96.8%.

4.4 Results of Selection on Leukemia Data Set


The microarray "Leukemia" data set consists of 72 samples, corresponding to two types of leukemia: 47 ALL (Acute Lymphocytic Leukemia) and 25 AML (Acute Myelogenous Leukemia). The data set contains expressions for 7129 genes. In [24] it was suggested to remove the Affymetrix control genes and all genes with an expression below 20 (too low to be interpreted with confidence); 1762 genes are then kept. We tested both hard and soft selection on the resulting (72 × 1762) data set. Having 72 labeled samples, we generated the maximum number of constraints (2556: 1381 ML and 1175 CL). Using informativeness, we reduced this number to 1215 (540 ML and 675 CL). With hard selection we found only 3 coherent constraints, achieving an accuracy of 54.3%, an improvement of just 2% over the baseline. We therefore wanted to know whether the accuracy could be significantly improved using soft selection.
First, we measured accuracy over 1000 trials with randomly selected constraints and different sets of ML and CL constraints, with sizes varying between 5 and 37. Table 4 shows the results when the constraints are chosen with hard selection and with soft selection. From this table, we can see that, in general, hard selection is more efficient than soft selection, which is natural since it uses only constraints with perfect coherence. However, the 3rd and 5th lines of the table show the opposite, especially when the gap between the numbers of CL and ML constraints is large.

Table 4. Performance of S3OM on the "Leukemia" data set with hard-selected constraints vs. soft-selected ones

         Hard          |          Soft
#ML | #CL | Acc        | #ML | #CL | Acc
1   | 4   | 62.0       | 2   | 7   | 55.9
5   | 5   | 70.9       | 6   | 9   | 66.5
11  | 4   | 75.7       | 13  | 7   | 77.0
11  | 9   | 86.7       | 14  | 17  | 84.7
8   | 17  | 89.6       | 11  | 21  | 91.1
14  | 16  | 97.2       | 16  | 21  | 88.6

Fig. 5. Ranking constraints according to coherence using soft selection

Second, we sorted the set of 1215 informative constraints according to their coherence. The global coherence of this set was equal to 0.215, which allowed us to softly select 453 constraints (176 ML and 277 CL), i.e., 37% of the informative constraint set. This corresponds to the first cut in Figure 5.
Finally, we studied the behavior of S3OM quality according to different percentages of this constraint set (between 5% and 37%), in order to find the optimal number of coherent constraints yielding the best accuracy (Figure 6). The numbers in this figure represent the numbers of coherent constraints corresponding to the different percentages.
We thus obtained a perfect accuracy with an optimal number of coherent constraints (111 constraints), which corresponds to 9.1% of the informative constraint set. Note that the coherence of the 111th coherent constraint is 0.498, which corresponds to the second cut in Figure 5. Thus, in this case, a constraint is selected if its coherence is greater than or equal to 0.498.

4.5 Visualization
In this section, we present some visual inspections of the "Chainlink" data set, which poses a challenging structural problem: it describes two chains interlocked in 3D space. The aim is to see whether our proposals are able to disentangle the two clusters and represent the real structure of the data.

Fig. 6. Performance of S3OM according to soft selected constraints

Fig. 7. Behavior of the Chainlink maps produced by S3OM. Orange neurons represent the first chain (Majorelle blue neurons the second one). Green lines denote ML constraints (red lines CL).

Figure 7 shows the topological projection of the map's neurons into the "Chainlink" data space, first without constraints and then with 20, 40 and 60 constraints. Note that this projection is done in 3D according to the 3 dimensions of the data. We can see that we always obtain a partition with 2 classes, which corresponds to the correct number of the real partition. However, without any constraints the algorithm (standard SOM) is not able to completely recover the structure of the data, and the more constraints we add, the better the algorithm disentangles the two chains. The best structure is obtained with 60 informative and coherent constraints, and the map is well organized.

5 Conclusion

In this work, the contributions are two-fold. First, a new algorithm, S3OM, was developed for semi-supervised clustering by considering instance-level constraints in the objective function optimization of the batch version of SOM. The deterministic aspect of this algorithm allowed us to perform hard constraint satisfaction in a topological clustering. Second, we studied two constraint set properties, informativeness and coherence, that provide a quantitative basis for explaining why a given constraint set increases or decreases performance. These measures were used for selecting the most useful constraints for clustering. The constraint selection was done in both hard and soft fashions.

References
1. Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, Theory and Applications. Chapman and Hall/CRC Data Mining and Knowledge Discovery Series (2008)
2. Frank, A., Asuncion, A.: UCI machine learning repository. Technical report, University of California (2010)
3. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a Mahalanobis metric from equivalence constraints. Journal of Machine Learning Research 6, 937–965 (2005)
4. Wagstaff, K., Cardie, C., Rogers, S., Schroedl, S.: Constrained k-means clustering with background knowledge. In: Proc. of the 18th International Conference on Machine Learning, pp. 577–584 (2001)
5. Lu, Z., Leen, T.K.: Semi-supervised learning with penalized probabilistic clustering. In: Advances in Neural Information Processing Systems 17 (2005)
6. Wagstaff, K., Cardie, C.: Clustering with instance-level constraints. In: Proc. of the 17th International Conference on Machine Learning, pp. 1103–1110 (2000)
7. Davidson, I., Ravi, S.S.: Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Proc. of ECML/PKDD, pp. 59–70 (2005)
8. Elghazel, H., Benabdeslem, K., Dussauchoy, A.: Constrained graph b-coloring based clustering approach. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 262–271. Springer, Heidelberg (2007)
9. Davidson, I., Ravi, S.S.: The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Mining and Knowledge Discovery 14(1), 25–61 (2007)
10. Davidson, I., Ravi, S.S.: Clustering with constraints: feasibility issues and the k-means algorithm. In: Proc. of the SIAM International Conference on Data Mining, pp. 138–149 (2005)
11. Kulis, B., Basu, S., Dhillon, I., Mooney, R.: Semi-supervised graph clustering: a kernel approach. In: Proc. of the 22nd International Conference on Machine Learning, pp. 577–584 (2005)
12. Davidson, I., Ester, M., Ravi, S.S.: Efficient incremental clustering with constraints. In: Proc. of the 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2007)
13. Davidson, I., Wagstaff, K., Basu, S.: Measuring constraint-set utility for partitional clustering algorithms. In: Proc. of ECML/PKDD (2006)
14. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-supervised clustering. In: Proc. of the 21st International Conference on Machine Learning, pp. 11–18 (2004)
15. Kohonen, T.: Self-Organizing Maps. Springer, Berlin (2001)
16. Herrmann, L., Ultsch, A.: Label propagation for semi-supervised learning in self-organizing maps. In: Proc. of the 6th WSOM (2007)
17. Belkin, M., Niyogi, P.: Using manifold structure for partially labelled classification. In: Proc. of Advances in Neural Information Processing Systems (2003)
18. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proc. of COLT: Workshop on Computational Learning Theory, pp. 92–100 (1998)
19. Chapelle, O., Scholkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, Cambridge (2006)
20. Cheng, Y.: Convergence and ordering of Kohonen's batch map. Neural Computation 9(8), 1667–1676 (1997)
21. Heskes, T., Kappen, B.: Error potentials for self-organization. In: Proc. of the IEEE International Conference on Neural Networks, pp. 1219–1223 (1993)
22. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance metric learning, with application to clustering with side-information. Advances in Neural Information Processing Systems 15, 505–512 (2003)
23. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In: Proc. of the 19th International Conference on Machine Learning, pp. 307–313 (2002)
24. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
25. Ultsch, A.: Fundamental clustering problems suite (FCPS). Technical report, University of Marburg (2005)
26. Vesanto, J., Alhoniemi, E.: Clustering of the self-organizing map. IEEE Transactions on Neural Networks 11(3), 586–600 (2000)
27. Kalyani, M., Sushmita, M.: Clustering and its validation in a symbolic framework. Pattern Recognition Letters 24(14), 2367–2376 (2003)
28. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
Is There a Best Quality Metric for Graph Clusters?

Hélio Almeida¹, Dorgival Guedes¹, Wagner Meira Jr.¹, and Mohammed J. Zaki²

¹ Universidade Federal de Minas Gerais, MG, Brazil
{helio,dorgival,meira}@dcc.ufmg.br
² Rensselaer Polytechnic Institute, NY, USA
zaki@cs.rpi.edu

Abstract. Graph clustering, the process of discovering groups of similar vertices in a graph, is a very interesting area of study, with applications in many different scenarios. One of the most important aspects of graph clustering is the evaluation of cluster quality, which is important not only to measure the effectiveness of clustering algorithms, but also to give insights on the dynamics of relationships in a given network. Many quality evaluation metrics for graph clustering have been proposed in the literature, but there is no consensus on how they compare to each other and how well they perform on different kinds of graphs. In this work we study five major graph clustering quality metrics in terms of their formal biases and their behavior when applied to clusters found by four implementations of classic graph clustering algorithms on five large, real-world graphs. Our results show that those popular quality metrics have strong biases toward incorrectly awarding good scores to some kinds of clusters, especially in larger networks. They also indicate that currently used clustering algorithms and quality metrics do not behave as expected when cluster structures differ from the more traditional, clique-like ones.

1 Introduction

The problem of graph clustering consists of discovering natural groups (or clusters) in a graph [17]. Graph clustering has become very popular recently, given the large number of applications it has in areas like social network analysis (finding groups of related people), e-commerce (making recommendations based on relations in a group) and bioinformatics (classifying gene expression data, studying the spread of a disease in a population).
What constitutes the basic structure of a group is open to discussion, but the most classical and widely adopted view is the one based on the concept of homophily: similar elements have a greater tendency to group with each other than with other elements [16]. When working with graphs, homophily is usually viewed in terms of edge densities, with clusters having more edges linking their elements among themselves (high internal density) than linking them to the rest of the graph
Is There a Best Quality Metric for Graph Clusters? 45

(sparser external connections). However, discovering those edge-dense clusters


in graphs is a complex task since, by this definition, a cluster can be anything
between a connected subgraph and a maximal clique.
When a graph has only a small number of vertices, the result of its clustering can be evaluated manually. However, as the size of the graph grows, manual evaluation becomes unfeasible. For those cases, evaluation metrics such as modularity [14] and conductance [8], which try to encapsulate the most important characteristics expected of a good cluster, may be used as indicators of cluster quality. The quality of a clustering algorithm is then estimated in terms of the values its output gets for that metric. That allows researchers to compare different proposed clusterings in order to find the best one among them.
Still, as much as those metrics try to identify good clusters by quantifying
the quality of a grouping, they are not perfect. It is actually very difficult to
determine whether a given quality metric gives the expected answer for a given
clustering of a graph, since typically there is no ground truth for comparison.
That is particularly true for larger datasets derived from the relationships of real
groups.
In this work we compare some of the most popular quality metrics for graph clustering. We evaluate whether those metrics really represent the classical view of what a cluster should be and how they behave in larger, less predictable cases. Our results show that the currently used quality indexes for graph clustering have a series of structural anomalies that cause them to be biased and unreliable on bigger, real-world graphs. Also, we observed that the type of network (social, technological, etc.) can cause the structure of its clusters to differ from what is typically expected by the clustering algorithms and quality metrics studied.

2 Related Work

Many papers have focused on the problem of comparing the effectiveness of


clustering algorithms. To do so, they use different cluster quality evaluation
metrics. The problem is that, in general, authors just assume those metrics are
good enough to represent good clusters, without concerning themselves with the
evaluation of such claims. Most of the related work presented here falls into this
category.
Brandes et al. [1] compared three algorithms for graph clustering: Markov
clustering, iterative conductance cut and geometric MST clustering. To evaluate
their results, they used three validation metrics, namely, conductance, coverage
and performance. Their experiments used synthetic clusters with 1000 nodes.
However, the goal of that work was to compare the results from the three algo-
rithms, assuming that the validation metrics used represented the ground truth
for the “real” clustering of the graphs. Because of that, they do not compare
metrics against each other. Our work differs from theirs in that we want to know
whether those validation metrics in fact identify a good clustering correctly.

Gustafson and Lombardi [7] compare K-means and hierarchical clustering using modularity and the silhouette index as quality metrics. They use Zachary's karate club, the American college football network, the gene network of yeast and synthetic graphs as the datasets for their work. Once again, the focus of their work is on the clustering algorithms, while ours is on the quality of the validation metrics.
Danon et al. [2] compare many different algorithms for graph clustering, in-
cluding agglomerative, divisive and modularity maximization techniques. How-
ever, the only quality metric used to compare those algorithms is modularity.
They use very small (128 vertices) synthetic datasets for evaluation.
One paper by Good et al. [6] discusses the effectiveness of modularity maxi-
mization methods for graph clustering. This is interesting because, in a way, they
are evaluating how good modularity is as a quality index. However, they use a
different formula to calculate modularity than the one we use in this paper, one
that clearly generates unbounded scores and is, therefore, inadequate as a valida-
tion index. Unbounded scores can be used to compare clusterings, but are of no
use to evaluate the quality of a single cluster in terms of its own structure, since
there are no upper or lower bounds of metric values to compare its result to.
Another work, by Leskovec et al. [10], uses external conductance to evaluate the quality of clusters in terms of their size, to determine whether there is a maximum or expected size for well-formed clusters. The problem with this approach is that, like other works discussed, it assumes that conductance is the best quality index to evaluate said clusters. In a follow-up work, Leskovec et al. [11] use other quality metrics to evaluate the same problem, but those new metrics also focus only on the external sparsity of the clusters.
Tan et al. [20] present a very interesting comparison between many metrics used to determine the "interestingness" of association rules, like lift, confidence and support, widely used in data mining. They show that there is no single metric that is consistently better than the others in different scenarios, so the metrics should be chosen case by case to fit the expectations of the domain experts. Our work does a similar comparison for graph clustering validation metrics.

3 Quality Metrics

The most accepted notion of a cluster is based on the concept of assortative


mixing: elements have a greater tendency to form bonds with other elements with
whom they share common traits than with others [15]. Applying this concept
to graphs, a cluster’s elements will display stronger than expected similarity
among themselves, while also having sparser than expected connections to the
rest of the graph. Element similarity can be derived from many graph or vertex
characteristics, such as edge density [5], vertex distance [21] or labels [24].
In this section we present some of the most popular cluster quality metrics in the literature. We also discuss whether those metrics behave consistently with what is expected of good clusterings, that is, high internal edge density and sparse connections with other clusters. The metrics studied in this paper use only a graph's topological information, like vertex distance or edge density, to evaluate the quality of a given cluster.

3.1 Graph Definitions

A graph G = (V, E) is composed of a set V of vertices and a set $E = \{(u, v) \mid u, v \in V\}$ of edges. Unless stated otherwise, assume that the graphs discussed are undirected, so that (u, v) = (v, u). The number of edges of a graph G is |E(G)| = m, and the number of edges linked to a given vertex v is represented as deg(v). Edges may have an associated weight w(u, v); in unweighted cases, we assume that w(u, v) = 1 for all (u, v) ∈ E. Also, consider $E(C_i, C_j)$, $i \neq j$, as the set of edges linking clusters $C_i$ and $C_j$, and $E(C_i)$ as the set of edges $(u, v) \mid u, v \in C_i$. Then E(C) is the set of all internal edges for all clusters in C, and $\bar{E}(C)$ is the set of all inter-cluster edges in the graph ($(u, v) \mid u \in C_i, v \in C_j, i \neq j$).
A clustering C is the set of all clusters of a graph, so that $C = \{C_1, C_2, \ldots, C_k\}$, and the number k of clusters may be a parameter of some clustering algorithms. Also, unless stated otherwise, $C_i \cap C_j = \emptyset$ for all $i \neq j$. A cluster $C_i$ composed of only one vertex is called a singleton. The weight of all internal edges of a single cluster is given by $w(C_i)$, a shortcut for $\sum_{e \in E(C_i)} w(e)$. By the same logic, $\bar{w}(C)$ is the sum of the weights of all inter-cluster edges.
A graph cut $K = (S, \bar{S})$, where $\bar{S} = V \setminus S$, divides a set of vertices V into two disjoint groups ($S \cap \bar{S} = \emptyset$). The cost of a cut is given by the sum of the weights of the edges crossing it. Another important concept is that of an induced graph, which is a graph formed by a subset of the vertices and edges of a graph so that $G[C_i] = (C_i, E(C_i))$.

3.2 Modularity

One of the most popular validation metrics for topological clustering, modularity states that a good cluster should have a larger than expected number of internal edges and a smaller than expected number of inter-cluster edges when compared to a random graph with similar characteristics [14]. The modularity score Q for a clustering is given by Equation 1, where e is a symmetric matrix whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in communities i and j, Tr(e) is the trace of matrix e (i.e., the sum of the elements on its main diagonal), and $\lVert x \rVert$ denotes the sum of the elements of matrix x.

$$Q = \mathrm{Tr}(e) - \lVert e^2 \rVert \qquad (1)$$

The modularity index Q usually takes values between 0 and 1, with 1 representing a clustering with very strong community characteristics. However, some limit cases may even present negative values. One example of such cases is the presence of clusters with only one vertex. Those clusters have 0 internal edges and, therefore, contribute nothing to the trace. A sufficiently large number of singleton clusters in a given clustering might lower the trace enough to overshadow other, possibly better formed, clusters and lead to very low modularity values regardless.
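As an illustration, Q can be computed from an adjacency matrix and a label vector as follows (a minimal sketch for unweighted graphs, using the convention that an inter-cluster edge contributes half its fraction to each of e_ij and e_ji):

import numpy as np

def modularity(adj, labels, k):
    # Build the k x k matrix e of edge fractions, then Q = Tr(e) - ||e^2|| (Eq. 1).
    e = np.zeros((k, k))
    rows, cols = np.nonzero(np.triu(adj))   # each undirected edge counted once
    for u, v in zip(rows, cols):
        if labels[u] == labels[v]:
            e[labels[u], labels[u]] += 1.0
        else:
            e[labels[u], labels[v]] += 0.5
            e[labels[v], labels[u]] += 0.5
    e /= len(rows)                           # turn counts into fractions of all edges
    return np.trace(e) - np.sum(e @ e)       # ||e^2|| sums the entries of e^2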

3.3 Silhouette Index

This metric uses concepts of cohesion and separation to evaluate clusters, using the distance between nodes to measure their similarity [21]. The silhouette index for a given cluster $C_i$ is given by Equation 2:

$$S(C_i) = \frac{\sum_{v\in C_i} S_v}{|C_i|}, \quad \text{where } S_v = \frac{b_v - a_v}{\max(a_v, b_v)} \qquad (2)$$

where $a_v$ is the average distance between vertex v and all the other vertices in the same cluster, and $b_v$ is the average distance between v and all the vertices in the nearest cluster that is not v's. The silhouette index for a given cluster is thus the average silhouette value over all its member vertices. The silhouette index can assume values between −1 and 1, with a negative value being undesirable, as it means that the average internal distance of the cluster is greater than the external one.
The silhouette index presents some limitations, though. First of all, it is a very expensive metric to calculate, requiring an all-pairs shortest path computation. The other is how it behaves in the presence of singleton clusters. Since a singleton possesses no internal edges, its internal distance will be 0, causing its silhouette to wrongly score a perfect 1. This way, clusterings with many singletons will always have high silhouette scores, no matter the quality of the other clusters.
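A direct (if costly) implementation follows Equation 2, given a precomputed all-pairs shortest-path distance matrix (a minimal sketch; labels is a NumPy array of cluster ids):

import numpy as np

def cluster_silhouette(dist, labels, cluster_id):
    # Average S_v over the members of one cluster (Eq. 2).
    members = np.where(labels == cluster_id)[0]
    scores = []
    for v in members:
        others = members[members != v]
        a = dist[v, others].mean() if len(others) else 0.0   # singletons get a = 0
        b = min(dist[v, labels == c].mean()
                for c in set(labels.tolist()) if c != cluster_id)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

Note how a singleton cluster yields a = 0 and hence a perfect score of 1, reproducing the bias discussed above.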

3.4 Conductance

The conductance [8] of a cut is a metric that compares the size of a cut (i.e., the number of edges cut) and the weight of the edges in either of the two subgraphs induced by that cut. The conductance φ(G) of a graph is the minimum conductance value over all its clusters.
Consider a cut that divides G into k non-overlapping clusters $C_1, C_2, \ldots, C_k$. The conductance of any given cluster, $\phi(C_i)$, can be obtained as shown in Equation 3, where $a(C_i) = \sum_{u\in C_i}\sum_{v\in V} w(u, v)$ is the sum of the weights of all edges with at least one endpoint in $C_i$. This $\phi(C_i)$ value represents the cost of one cut that bisects G into two vertex sets $C_i$ and $V \setminus C_i$. Since we want to find a number k of clusters, we will need k − 1 cuts to achieve that number. In this paper we take the conductance of the whole clustering to be the average value of those k − 1 φ cuts, as formalized in Equation 4.

$$\phi(C_i) = \frac{\sum_{u\in C_i}\sum_{v\notin C_i} w(\{u, v\})}{\min(a(C_i), a(\bar{C_i}))} \qquad (3)$$

$$\phi(G) = \mathrm{avg}(\phi(C_i)),\quad C_i \subseteq V \qquad (4)$$

Based on this information, it is possible to define the concepts of intra-cluster conductance α(C) (Eq. 5) and inter-cluster conductance σ(C) (Eq. 6) for a given clustering $C = \{C_1, C_2, \ldots, C_k\}$:

$$\alpha(C) = \min_{i\in\{1,\ldots,k\}} \phi(G[C_i]) \qquad (5)$$

$$\sigma(C) = 1 - \max_{i\in\{1,\ldots,k\}} \phi(C_i) \qquad (6)$$

The intra-cluster conductance is the minimum conductance value of the graphs induced by each cluster $C_i$, with a low value meaning that at least one of the clusters may be too coarse to be good. The inter-cluster conductance is the complement of the maximum conductance value of the clustering, so that lower values might show that at least one of the clusters has strong connections outside of it, i.e., the clustering might be too fine. So, a good clustering should have high values of both intra- and inter-cluster conductance.

Fig. 1. Two possible clusterings of the same graph: (a) two clusters; (b) three clusters.

Although the use of both internal and external conductance gives a better, well-rounded view of both the internal density and external sparsity of a cluster, many works use only the external conductance when evaluating cluster quality [10,11]. So, in this paper we will likewise use only the external conductance, referred to from now on simply as conductance, to evaluate whether it is a good enough quality metric by itself. One negative characteristic of conductance is that it might tend to give better scores to clusterings with fewer clusters, as more clusters will probably produce more cut edges. Also, the lack of internal edge density information in this kind of conductance may cause problems, as can be seen in Figure 1, where both clusterings presented would have the same conductance score, even though the one in Figure 1b is obviously better.
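Computed on an adjacency matrix, the external conductance of a single cluster (Eq. 3) is straightforward (a minimal sketch for unweighted graphs; labels is a NumPy array of cluster ids):

import numpy as np

def conductance(adj, labels, c):
    # phi(C_c): cut edges leaving the cluster over the smaller side's volume.
    inside = labels == c
    cut = adj[inside][:, ~inside].sum()   # edges with exactly one endpoint in C_c
    vol_in = adj[inside].sum()            # a(C_c): edge ends touching C_c
    vol_out = adj[~inside].sum()
    return cut / min(vol_in, vol_out)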

3.5 Coverage
The coverage of a clustering C (where $C = \{C_1, C_2, \ldots, C_k\}$) is given as the fraction of the weight of all intra-cluster edges with respect to the total weight of all edges in the whole graph G [1], as shown in Equation 7:

$$coverage(C) = \frac{w(C)}{w(G)}, \quad \text{where } w(C) = \sum_{i=1}^{k} \sum_{v_x, v_y \in C_i} w((v_x, v_y)) \qquad (7)$$

Coverage values usually range from 0 to 1. Higher values of coverage mean that there are more edges inside the clusters than edges linking different clusters, which translates to a better clustering. From its formulation, we can observe that the main clustering characteristic needed for a high value of coverage is inter-cluster sparsity. Internal cluster density is in no way taken into account by this metric, which probably causes a strong bias toward clusterings with fewer clusters. This can be seen in the example in Figure 1, where the clustering with two clusters would receive a better score than the clearly better clustering with three clusters.

3.6 Performance
This metric counts the number of internal edges in a cluster along with the edges that do not exist between the cluster's nodes and other nodes in the graph [22], as can be seen in Equation 8:

$$perf(C) = \frac{f(C) + g(C)}{\frac{1}{2}\, n(n-1)}, \quad \text{where} \qquad (8)$$

$$f(C) = \sum_{i=1}^{k} |E(C_i)|$$

$$g(C) = \sum_{i=1}^{k} \sum_{j>i} \big|\{\{u, v\} \notin E \mid u \in C_i, v \in C_j\}\big|$$

This formulation assumes an unweighted graph, but there are also variants for weighted graphs [1]. Values range from 0 to 1, and higher values indicate that a cluster is both internally dense and externally sparse and, therefore, better. However, if we consider that complex networks tend to be sparse in nature, when performance is applied to larger graphs there is a great possibility that g(C) becomes so high that it dominates all other factors in the formula, awarding high scores indiscriminately.
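Both coverage (Eq. 7) and performance (Eq. 8) are simple to compute on an adjacency matrix (a minimal sketch for unweighted graphs; labels is a NumPy array of cluster ids):

import numpy as np

def coverage(adj, labels):
    # Fraction of edge weight that falls inside clusters (Eq. 7).
    same = labels[:, None] == labels[None, :]
    return adj[same].sum() / adj.sum()    # both sums count each edge twice

def performance(adj, labels):
    # Internal edges plus absent inter-cluster edges over all pairs (Eq. 8).
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    pairs = np.triu(np.ones((n, n), dtype=bool), k=1)   # each pair once
    f = np.logical_and(adj == 1, same)[pairs].sum()     # internal edges
    g = np.logical_and(adj == 0, ~same)[pairs].sum()    # missing external edges
    return (f + g) / (n * (n - 1) / 2)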

4 Clustering Algorithms
To be able to compare different clusterings with the available validation metrics, we selected representatives from four different, representative categories of clustering algorithms: Markov clustering, bisecting K-means, spectral clustering and normalized cut.

4.1 Markov Clustering


The Markov clustering algorithm (MCL) [22,4] is based on the simulation of stochastic flows in a graph. The basic idea behind MCL is that the distances between vertices are what identify a cluster, with small distances between vertices indicating that they should belong to the same cluster and large distances meaning the opposite. By that logic, a random walker has a greater probability of staying inside a cluster than of wandering to neighboring ones, and the algorithm exploits that to identify clusters.
The clustering process of MCL consists of two iterative steps: expansion and inflation. The expansion step is done by taking a power of the normalized adjacency matrix representing the graph, using traditional matrix multiplication. The inflation step consists in taking the Hadamard (elementwise) power of the expanded matrix, followed by a scaling step to make the matrix stochastic again, with the elements of each column corresponding to probability values. MCL does not need a pre-defined number of clusters as input, its only parameter being the inflation value, which affects the coarsening of the graph (the lower the value, the coarser the clustering).
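The expansion/inflation loop is compact enough to sketch directly (a bare-bones illustration, not Van Dongen's optimized implementation; the self-loops and the attractor-based cluster extraction follow common practice):

import numpy as np

def mcl(adj, inflation=2.0, n_iter=50, tol=1e-6):
    M = adj.astype(float) + np.eye(len(adj))   # self-loops stabilize the flow
    M /= M.sum(axis=0)                         # make columns stochastic
    for _ in range(n_iter):
        M = M @ M                              # expansion: matrix power
        M = M ** inflation                     # inflation: Hadamard power
        M /= M.sum(axis=0)                     # rescale to stochastic
    # rows that kept mass act as attractors; their supports are the clusters
    return [np.where(row > tol)[0] for row in M if row.sum() > tol]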

4.2 Bisecting K-means


In the traditional K-means algorithm, k elements are chosen as the centroids of the k clusters to be found, and every other element is added to the cluster of the centroid closest to it. With this basic clustering in hand, a new centroid is calculated for each cluster, reflecting its new "center", and the process is repeated until the centroids no longer change.
Bisecting K-means [19] differs from the traditional algorithm in the following way: the whole graph is considered to be one cluster, which we bisect using traditional K-means, with the topological distance between nodes acting as the vertex similarity function. One of the new clusters is chosen to be bisected again, and the process repeats until the desired number of clusters is found.
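In Python, the bisecting strategy is a thin loop around a 2-means routine (a sketch; the rule for picking which cluster to split varies, here we assume the largest one):

def bisecting_kmeans(points, k, bisect):
    # `bisect(points, idx)` runs 2-means on the index set `idx` and
    # returns the two resulting index sets.
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        target = max(clusters, key=len)    # assumed splitting policy
        clusters.remove(target)
        clusters.extend(bisect(points, target))
    return clusters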

4.3 Spectral Clustering


Spectral clustering [8,17] is a technique that uses the eigenvectors (spectrum) and eigenvalues of a matrix to define cluster membership. It is based on the fact that if a graph is formed by k disjoint cliques, then its normalized Laplacian will be a block-diagonal matrix with eigenvalue zero of multiplicity k, whose eigenvectors function as indicators of cluster membership. Moreover, small perturbations like adding a few edges linking clusters or removing edges from inside the clusters will make the eigenvalues become slightly higher than zero and change the eigenvectors, but not enough for the underlying structure to be lost. This clustering technique requires the number of desired clusters as an input.
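The embedding step can be sketched with a dense eigendecomposition (an illustration assuming no isolated vertices; kmeans(X, k) stands for any k-means routine applied to the rows of the embedding):

import numpy as np

def spectral_clusters(adj, k, kmeans):
    d = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    _, vecs = np.linalg.eigh(L)            # eigenpairs, eigenvalues ascending
    embedding = vecs[:, :k]                # near-indicator coordinates
    return kmeans(embedding, k)            # cluster rows of the embedding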

4.4 Normalized Cut


This method, proposed by Shi and Malik [18], tries to find the best possible clustering through the optimization of an objective function, in this case a cut. Consider the cost of cut(A, B), which divides the vertices V of a graph G = (V, E) into two sets A, B with $A \cup B = V$, $A \cap B = \emptyset$, as the sum of the weights of all edges linking vertices in A to vertices in B. We want to find the cut that minimizes the cost function given by Equation 9, where the volume of a set is the sum of the weights of all edges with at least one endpoint inside it:

$$Ncut(A, B) = cut(A, B)\left(\frac{1}{Vol(A)} + \frac{1}{Vol(B)}\right) \qquad (9)$$

This cost function is designed to penalize cuts that generate subsets of highly different sizes. So, by minimizing the normalized cut of a graph, we are separating sets of vertices with low similarity between them that potentially have high internal similarity. This technique also requires the desired number of clusters to be given as an input.

5 Experiments
This section presents the experiments used to help evaluate the quality metrics studied. We briefly describe our methodology and the graphs used, followed by a discussion of the obtained results.

5.1 Methodology
We implemented the five quality metrics discussed in Section 3. To evaluate their
behavior, we applied them to clusters obtained through the execution of the four
classical graph clustering algorithms discussed in Section 4 on five large, real
world graphs that will be briefly discussed in the next subsection. This variety
of clustering algorithms and graphs is necessary to minimize the pollution of
the results by possible correlations between metrics algorithms and/or graph
structures.
We used freely available implementations for all clustering algorithms: the
MCL implementation by Van Dongen, which is shipped with many Linux dis-
tributions; the bisecting K-means implementation available in the Cluto1 suite
of clustering algorithms; the spectral clustering implementation available
in SCPS, by Nepusz [13]; and the normalized-cut clustering implementation
GRACLUS, by Dhillon [3].
Three different inflation indexes were chosen for the MCL algorithm, based
on the values suggested by the algorithm’s documentation: 1.5, 2, and 3. The
number of clusters found by each MCL configuration was used as the input for
the other algorithms, so that we could compare clusterings with roughly the
same number of clusters.

Graphs. We used 7 different datasets derived from real complex networks. Two
of them are smaller but have known expected partitions that can be used for
comparison; the other five are bigger, with unknown expected partitions.
All graphs used are undirected and unweighted.

1 http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

The first small dataset is the Karate club network. It was first presented
by Zachary [23] and depicts the relationships between the students in a karate
dojo. During Zachary's study, a conflict between two instructors split the
dojo in two, with the students closer to one instructor moving to his
new dojo. Even though this dataset is small (34 vertices), it is interesting to
consider because it comes with the real social partition of the
graph, providing a ground truth for the clustering.
The other small dataset used was the American College football teams'
matches [5]. It represents a graph where the vertices are football teams and an
edge links two teams if they have played against each other. Since the teams
play mostly against other teams in the same league, with the exception
of some military-school teams, which belong to no league and can play against
anyone, there is also a known expected clustering for this graph. It is
composed of 115 vertices and 616 edges.
The five remaining networks were obtained from the Stanford Large Network
Dataset Collection2 . Two of them represent the network of collaborations in
papers submitted to the arXiv e-prints in two different areas of study, namely
Astrophysics and High Energy Physics. In those networks, researchers are the
vertices, and they are linked by edges if they collaborated in at least one paper.
The Astrophysics network is composed of 18,772 vertices and 396,160 edges,
while the High Energy Physics has 12,008 vertices and 237,010 edges. Another
network based on the papers submitted to the arXiv e-prints was used, but
covering the citation network of authors in the High Energy Physics category.
In this case, an edge links two authors if one cites the other. This network has
34,546 vertices and 421,578 edges.
The last two networks are snapshots of a Gnutella P2P file-sharing network,
taken on two different dates. Here the vertices are the Gnutella clients and the
edges are the overlay-network connections between them. The first snapshot was
collected on August 4, 2002 and comprises 10,876 vertices and 39,994 edges. The
second one was collected on August 30, 2002 and has 36,682 vertices and 88,328
edges.

5.2 Results
We first used the smaller datasets, the karate club and the college football
network, in order to check how the algorithms and quality metrics behave in
small networks where the expected result is already known. The results for the
Karate club dataset can be seen in Table 1. The College Football dataset gave
similar results, which were omitted for brevity. The results shown represent the
case with two clusters, which is the expected number for this dataset. It can be
observed that the scores obtained were fairly high. Also, the resulting clusters
were very similar to the expected ones, with variations of 2 or 3 wrongly
clustered vertices. However, those two study cases were very small and classical,
so good results

2 http://snap.stanford.edu/data/

Table 1. Karate Club dataset and its quality indexes for two clusters

Algorithm SI Mod Cov Perf Cond


MCL 0.13 ± 0.02 0.29 0.71 0.55 0.55 ± 0.15
B. k-means 0.081 ± 0.001 0.37 0.87 0.62 0.26 ± 0.13
Spectral 0.13 ± 0.02 0.36 0.87 0.61 0.30 ± 0.15
Norm. Cut 0.14 ± 0.017 0.18 0.68 0.56 0.65 ± 0.32

here were to be expected, as most of the quality-metric biases we pointed
out in Section 3 are connected to bigger networks with many clusters.
We now turn to the larger datasets. The quality-metric values for the Astrophysics
Collaboration network are shown in Table 2. It is already possible to observe
some trends in the quality metrics' behavior, no matter which clustering algo-
rithm is used. For example, modularity, coverage, and conductance always give
better results for smaller numbers of clusters. Also, we can see that, as expected
from our observations in Section 3, performance values have no discriminating
power to compare any of our results. The silhouette index presents a somewhat
erratic behavior in this case, without a clear tendency toward better or worse
results for more or fewer clusters.

Table 2. Astrophysics collaboration network clusters and their quality indexes

Algorithm # Clusters SI Mod. Cover. Perf. Cond.


MCL 1036 -0.22 ± 0.038 0.35 0.42 0.99 0.55 ± 0.02
MCL 2231 -0.23 ± 0.026 0.28 0.31 0.99 0.70 ± 0.006
MCL 4093 0.06 ± 0.015 0.19 0.27 0.99 0.82 ± 0.003
B. k-means 1037 -0.73 ± 0.017 0.25 0.28 0.99 0.70 ± 0.002
B. k-means 2232 -0.48 ± 0.005 0.21 0.24 0.99 0.70 ± 0.002
B. k-means 4094 -0.21 ± 0.01 0.17 0.19 0.99 0.76 ± 0.001
Spectral 1034 -0.15 ± 0.036 0.34 0.38 0.99 0.53 ± 0.015
Spectral 2131 -0.26 ± 0.027 0.25 0.28 0.99 0.66 ± 0.007
Spectral 3335 0.04 ± 0.017 0.19 0.21 0.99 0.78 ± 0.004
Norm. Cut 1037 -0.69 ± 0.021 0.23 0.25 0.99 0.66 ± 0.006
Norm. Cut 2232 -0.51 ± 0.019 0.17 0.19 0.99 0.73 ± 0.015
Norm. Cut 4094 -0.31 ± 0.006 0.13 0.15 0.99 0.81 ± 0.0004

For the High Energy Physics Collaboration network, as we can see in Table 3,
the tendencies observed in the previous network still hold. Also, the silhouette
index shows a more pronounced bias toward larger numbers of clusters. If we
look at the cumulative distribution function (CDF) of cluster sizes (shown in
Figure 2 for just two instances of our experiments, but consistent with the rest
of the obtained results), we can see that bigger clusterings tend to have a larger
number of smaller clusters. So, this bias of the silhouette index is expected from
our observations in Section 3. The same tendencies occur in the High Energy
Physics Citation network, as seen in Table 4.

[Figure 2 shows cumulative distribution functions P(X ≤ x) of cluster size for the bisecting k-means clusterings: (a) the High Energy Physics Citation network, with 814, 3898, and 12911 clusters; (b) the Gnutella snapshot of 08/04/2002, with 2189, 4724, and 6089 clusters.]

Fig. 2. Cumulative distribution functions of cluster sizes (bisecting k-means)

The quality-metric scores for one of the Gnutella snapshot networks can be
seen in Table 5. The scores for the other one were very similar, so we
suppress them for brevity. It is possible to notice that the results for those
graphs still present the same tendencies shown in the other cases, but with a
key difference: while the silhouette and performance results show no big difference
from the other datasets, as they are easily fooled by high numbers of singleton
clusters and by network size, respectively, modularity, coverage, and conductance
give abysmally low quality results. This happens because of the structure of a
Gnutella network, in which common peers connect only to "superpeers", and the
superpeers also connect with each other. This structure leads to a very low
occurrence probability of 3-cliques (0.5% for the Gnutella networks against 31.8%
for the Astrophysics Collaboration network, for example). Also, the Gnutella
networks presented here are far sparser than the other studied networks, with
only 6.76% of all possible edges present in the graph for the 08/04/2002 snapshot
against 32.88% for the High Energy Physics citation one, for example.

Discussion. For all the generated cases, coverage, modularity, and conductance
have better values for smaller numbers of clusters. This behavior is expected
from the formulation of coverage, since it observes the number of inter-cluster
edges, which tends to be smaller when there are fewer clusters to link to. The same
happens to conductance, as more inter-cluster edges mean more expensive
cuts. Without balancing the external conductance with the internal conductance,
the results give only a partial and biased picture.
Concerning modularity, we already know that singleton clusters have a very
bad impact on the modularity score, and the more clusters there are, the bigger
the chance for singletons to occur. It is interesting to notice that giving low
scores to singleton clusters is not wrong per se, but since those scores influence
the overall score, they can obfuscate the existence of well-scored clusters in the
final tally.
The silhouette index generally gives better results for more clusters, which can
also be attributed to the larger occurrence of singletons, clusters that wrongly
receive optimal results for SI.

Table 3. High energy physics collaboration network clusters and their quality indexes

Algorithm # Clusters SI Mod Cov Perf Cond


MCL 1002 -0.17 ± 0.037 0.35 0.52 0.99 0.51 ± 0.016
MCL 1742 -0.17 ± 0.028 0.33 0.42 0.99 0.62 ± 0.009
MCL 2650 0.005 ± 0.019 0.22 0.27 0.99 0.73 ± 0.005
B. k-means 1005 -0.54 ± 0.012 0.33 0.41 0.99 0.61 ± 0.007
B. k-means 1744 -0.30 ± 0.004 0.30 0.37 0.99 0.61 ± 0.006
B. k-means 2652 -0.14 ± 0.016 0.25 0.31 0.99 0.68 ± 0.003
Spectral 1005 -0.16 ± 0.037 0.34 0.44 0.99 0.53 ± 0.015
Spectral 1710 -0.04 ± 0.025 0.29 0.35 0.99 0.64 ± 0.009
Spectral 2525 0.019 ± 0.019 0.25 0.29 0.99 0.71 ± 0.006
Norm. Cut 1005 -0.59 ± 0.025 0.26 0.33 0.99 0.64 ± 0.02
Norm. Cut 1744 -0.37 ± 0.01 0.18 0.21 0.99 0.70 ± 0.01
Norm. Cut 2652 -0.25 ± 0.014 0.18 0.23 0.99 0.76 ± 0.015

Table 4. High energy physics citation network clusters and their quality indexes

Algorithm # Clusters SI Mod Cov Perf Cond


MCL 814 -0.07 ± 0.037 0.41 0.43 0.98 0.58 ± 0.015
MCL 3898 -0.039 ± 0.017 0.26 0.26 0.99 0.81 ± 0.003
MCL 12911 0.41 ± 0.005 0.12 0.12 0.99 0.93 ± 0.0006
B. k-means 814 -0.71 ± 0.014 0.25 0.25 0.99 0.71 ± 0.005
B. k-means 3898 -0.64 ± 0.008 0.14 0.14 0.99 0.80 ± 0.004
B. k-means 12911 -0.077 ± 0.01 0.06 0.056 0.99 0.90 ± 0.0008
Spectral 812 -0.236 ± 0.04 0.34 0.35 0.99 0.59 ± 0.014
Spectral 3490 0.043 ± 0.016 0.20 0.21 0.99 0.81 ± 0.003
Norm. Cut 814 -0.74 ± 0.006 0.25 0.25 0.99 0.65 ± 0.003
Norm. Cut 3898 -0.70 ± 0.005 0.10 0.10 0.99 0.82 ± 0.002
Norm. Cut 12845 -0.004 ± 0.006 0.06 0.06 0.99 0.92 ± 0.0006

For performance, as we already expected from our observations on the formula
itself, the sheer size of the networks we worked with eclipsed any meaningful
results we could gather from the clusterings themselves. The results here serve
as a confirmation that the expected behavior really happens on real networks.
Another important point raised by our experiments is that networks of
different origins might have clusters with very different characteristics. Clusters
obtained from technological networks (in our case, the Gnutella snapshots) got
markedly poor quality-metric results, especially when compared to the results
from social networks (all the other networks used). It could be argued that those
technological networks in particular might not have clusters, but we know that
there should be community-like structures in a Gnutella network: a superpeer
and its neighboring peers form a fairly cohesive subset, even though it is a sparse
one. It seems that the network structure in this case, with its non-clique-like
communities, very negatively affects the ability of both clustering algorithms and

Table 5. Gnutella peers network (08/04/2002) clusters and their quality indexes

Algorithm # Clusters SI Mod Cov Perf Cond


MCL 2189 -0.81 ± 0.039 0.0004 0.001 0.99 0.99 ± 0.0
MCL 4724 -0.037 ± 0.015 0.0003 0.0007 0.99 0.99 ± 0.0
MCL 6089 0.10 ± 0.011 0.00003 0.0003 0.99 1.00 ± 0.0
B. k-means 2189 -0.88 ± 0.0001 0.0004 0.001 0.99 0.99 ± 0.00034
B. k-means 4724 -0.52 ± 0.02 0.00007 0.0004 0.99 0.99 ± 0.0
B. k-means 6089 -0.18 ± 0.01 -0.00006 0.0002 0.99 1.00 ± 0.0
Spectral 2158 -0.90 ± 0.0006 0.0004 0.001 0.99 0.99 ± 0.0
Spectral 4079 -0.94 ± 0.0005 0.0001 0.0005 0.99 0.99 ± 0.0
Spectral 6089 -0.30 ± 0.02 -0.00007 0.0002 0.99 1.00 ± 0.0
Norm. Cut 2189 -0.90 ± 0.002 0.0003 0.001 0.99 0.99 ± 0.0
Norm. Cut 4616 -0.2 ± 0.012 0.00025 0.0006 0.99 0.99 ± 0.0
Norm. Cut 5690 0.1 ± 0.012 0.0002 0.0005 0.99 0.99 ± 0.0

quality metrics to identify said clusters. The observation that different kinds
of cluster structures exist and that the usual clustering methods do not work
with them was already made by Nepusz [12], who argued that, in a bipartite
graph, each side of the bipartition should be considered a cluster. Kumar et
al. [9] also cite the existence of this kind of cluster structure, pointing out that
there are many on-line communities that behave as bipartite subgraphs, giving
the websites of cellphone carriers as an example: they represent the same category
of service, but will not have direct links to each other.
It is interesting to notice that even the simplest instances of a bipartite
graph would score poorly on the quality metrics studied in this paper, as their
internal density is nonexistent and all their edges connect to other clusters. For
example, consider a small, 10-vertex bipartite graph with two 5-vertex partitions
connected by 11 edges. This simple case gives scores such as −0.29 for
silhouette index, −5 for modularity, 0 for coverage, 0.31 for performance, and 2
for conductance, results that are indeed very poor.
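As a quick sanity check on one of these numbers: taking coverage as the fraction of edge weight that falls inside clusters (its usual definition, assumed here since Section 3 is not reproduced above), the bipartite example indeed scores 0, because all 11 edges cross between the two clusters:

```python
# Two 5-vertex sides, each side one cluster, 11 edges all crossing the cut.
edges = [(0, 5), (0, 6), (1, 6), (1, 7), (2, 7), (2, 8),
         (3, 8), (3, 9), (4, 9), (4, 5), (0, 7)]   # any 11 crossing pairs
cluster = [0] * 5 + [1] * 5
intra = sum(cluster[u] == cluster[v] for u, v in edges)
print(intra / len(edges))   # coverage = 0/11 = 0.0, matching the text
```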

6 Conclusion
In this paper we presented a study of some of the most popular quality met-
rics for graph clustering, namely, the Silhouette Index, Modularity, Coverage,
Performance and Conductance. To evaluate those metrics, we compared their
results for clusters generated by four different clustering algorithms: Markovian,
Bisecting K-means, Spectral and Normalized Cut. We used seven different real
datasets in our experiments, with two of them having an already known opti-
mal clustering based on the semantics of the relationships between the elements
represented by their graphs.
Based on our experiments, we could identify some interesting behaviors for
those cluster quality assessing metrics. For example, Modularity, Conductance

and Coverage have a bias toward giving better results for smaller numbers of
clusters, while the other studied metrics have a completely opposite bias. This
indicates that all those metrics do not share a common view of what a true
clustering should look like.
Our results suggest that there is no such thing as a "best" quality metric
for graph clustering. Even more, the currently used quality metrics have strong
biases that do not always point in the direction of what is assumed to be a
well-formed cluster. Those biases can get even more pronounced in large
graphs, which are the ones that depend on these metrics the most, as manually
evaluating their results is hardest.
Another point observed was that the structure of clusters can be different for
graphs with different origins. In our case, we saw clear differences in the results of
technological and social networks. Current clustering and evaluation techniques
seem to be inadequate to tackle those different kinds of complex networks.
As future work, we intend to study how particular aspects of a graph topology
can affect the structure of a cluster, so that we can evaluate clusters with different
characteristics, and not only the clique-like ones. We will also consider adding
other dimensions to the graph, such as weights and labels.

Acknowledgments. This research was partially funded by FAPEMIG, CNPq,


CAPES, Finep and the Brazilian National Institute for Science and Technology
of the Web — InWeb (MCT/CNPq 573871/2008-6). MJZ was supported in part
by NSF grant EMT-0829835, and NIH grant 1R01EB0080161-01A1.

References
1. Brandes, U., Gaertler, M., Wagner, D.: Engineering graph clustering: Models and
experimental evaluation. J. Exp. Algorithmics 12, 1–26 (2008)
2. Danon, L., Dı́az-Guilera, A., Duch, J., Arenas, A.: Comparing community structure
identification. Journal of Statistical Mechanics: Theory and Experiment 2005(09),
P09008 (2005)
3. Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors a
multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 1944–1957
(2007)
4. Van Dongen, S.: Graph clustering via a discrete uncoupling process. SIAM Journal
on Matrix Analysis and Applications 30(1), 121–141 (2008)
5. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proceedings of the National Academy of Sciences of the United States of
America 99(12), 7821–7826 (2002)
6. Good, B.H., de Montjoye, Y.A., Clauset, A.: Performance of modularity maximiza-
tion in practical contexts. Physical Review E 81(4), 046106+ (2010)
7. Gustafson, M., Lombardi, A.: Comparison and validation of community structures
in complex networks. Physica A: Statistical Mechanics and its Application 367,
559–576 (2006)
8. Kannan, R., Vempala, S., Vetta, A.: On clusterings: Good, bad and spectral. J.
ACM 51(3), 497–515 (2004)
9. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the web for
emerging cyber-communities. Comput. Netw. 31, 1481–1493 (1999)

10. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of
community structure in large social and information networks. In: WWW 2008:
Proceeding of the 17th International Conference on World Wide Web, pp. 695–704.
ACM, New York (2008)
11. Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for
network community detection. In: Proceedings of the 19th International Conference
on World Wide Web, WWW 2010, pp. 631–640. ACM, New York (2010)
12. Nepusz, T., Bazso, F.: Likelihood-based clustering of directed graphs, pp. 189–194
(March 2007)
13. Nepusz, T., Sasidharan, R., Paccanaro, A.: Scps: a fast implementation of a spectral
method for detecting protein families on a genome-wide scale. BMC Bioinformat-
ics 11(1), 120 (2010)
14. Newman, M.E., Girvan, M.: Finding and evaluating community structure in net-
works. Physical Review E 69(2) (February 2004)
15. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67(2), 26126 (2003)
16. Newman, M.E.J., Girvan, M.: Mixing Patterns and Community Structure in Net-
works. In: Pastor-Satorras, R., Rubi, M., Diaz-Guilera, A. (eds.) Statistical Me-
chanics of Complex Networks. Lecture Notes in Physics, vol. 625, pp. 66–87.
Springer, Berlin (2003)
17. Schaeffer, S.E.: Graph clustering. Computer Science Review 1(1), 27–64 (2007)
18. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern
Anal. Mach. Intell. 22, 888–905 (2000)
19. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering tech-
niques. In: Grobelnik, M., Mladenic, D., Milic-Frayling, N. (eds.) KDD-2000 Work-
shop on Text Mining, Boston, MA, August 20, pp. 109–111 (2000)
20. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure
for association patterns. In: KDD 2002: Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 32–41.
ACM, New York (2002)
21. Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-
Wesley Longman Publishing Co., Inc., Boston (2005)
22. van Dongen, S.M.: Graph Clustering by Flow Simulation. PhD thesis, University
of Utrecht, The Netherlands (2000)
23. Zachary, W.W.: An information flow model for conflict and fission in small groups.
Journal of Anthropological Research 33, 452–473 (1977)
24. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute
similarities. Proc. VLDB Endow. 2(1), 718–729 (2009)
Adaptive Boosting for Transfer Learning Using
Dynamic Updates

Samir Al-Stouhi1 and Chandan K. Reddy2


1
Department of Computer Engineering
2
Department of Computer Science
Wayne State University, Detroit, MI, USA
s.alstouhi@wayne.edu, reddy@cs.wayne.edu

Abstract. Instance-based transfer learning methods utilize labeled ex-


amples from one domain to improve learning performance in another
domain via knowledge transfer. Boosting-based transfer learning algo-
rithms are a subset of such methods and have been applied successfully
within the transfer learning community. In this paper, we address some of
the weaknesses of such algorithms and extend the most popular transfer
boosting algorithm, TrAdaBoost. We incorporate a dynamic factor into
TrAdaBoost so that it meets its intended design of combining the ad-
vantages of both AdaBoost and the "Weighted Majority Algorithm". We
theoretically and empirically analyze the effect of this important factor
on the boosting performance of TrAdaBoost and we apply it as a “cor-
rection factor” that significantly improves the classification performance.
Our experimental results on several real-world datasets demonstrate the
effectiveness of our framework in obtaining better classification results.

Keywords: Transfer learning, AdaBoost, TrAdaBoost, Weighted Ma-


jority Algorithm.

1 Introduction
Transfer learning methods have recently gained a great deal of attention in the
machine learning community and are used to improve classification of one dataset
(referred to as target set) via training on a similar and possibly larger auxiliary
dataset (referred to as the source set). Such knowledge transfer can be gained by
integrating relevant source samples into the training model or by mapping the
source set training models to the target models. The knowledge assembled can be
transferred across domain tasks and domain distributions with the assumption
that they are mutually relevant, related, and similar. One of the challenges of
transfer learning is that it does not guarantee an improvement in classification
since an improper source domain can induce negative transfer and degradation in
the classifier's performance. Pan and Yang [13] presented a comprehensive
survey of transfer learning methods and discussed the relationship between trans-
fer learning and other related machine learning techniques. Methods for trans-
fer learning include an adaptation of Gaussian processes to the transfer learning
scheme via similarity estimation between source and target tasks [2]. A SVM


framework was proposed by Wu and Dietterich [18] where scarcity of target data is
offset by abundant low-quality source data. Pan, Kwok, and Yang [11] used learn-
ing a low-dimensional space to reduce the distribution difference between source
and target domains by exploiting Borgwardt’s Maximum Mean Discrepancy Em-
bedding (MMDE) method [1], which was originally designed for dimensionality
reduction. Pan et al. [12] proposed a more efficient feature-extraction algorithm,
known as Transfer Component Analysis (TCA), to overcome the computationally
expensive cost of MMDE. Several boosting-based algorithms have been modified
for transfer learning and will be more rigorously analyzed in this paper. The rest of
the paper is organized as follows: In Section 2, we discuss boosting-based transfer
learning methods and highlight their main weaknesses. In Section 3, we describe
our algorithm and provide its theoretical analysis. In Section 4, we provide an
empirical analysis of our theorems. Our experimental results along with related
discussion are given in Section 5. Section 6 concludes our work.

2 Boosting-Based Transfer Learning


Consider a domain (D) comprising a feature space (X). We can specify a
mapping function from the feature space to the label space, "X → Y", where
Y ∈ {−1, 1}. Let us denote the domain with auxiliary data as the source domain
set (X_src), and let (X_tar) be the target domain set that needs to be mapped
to the label space (Y_tar).
Boosting-based transfer learning methods apply ensemble methods to both
source and target instances with an update mechanism that incorporates only the
source instances that are useful for target instance classification. These methods
perform this form of mapping by giving more weight to source instances that
improve target training and vice-versa.

Table 1. Summary of the Notations

Notation Description
X feature space, X ∈ R^d
Y label space = {−1, 1}
d number of features
F mapping function X → Y
D domain
src source (auxiliary) instances
tar target instances
ε^t classifier error at boosting iteration "t"
w weight vector
N number of iterations
n number of source instances
m number of target instances
t index for boosting iteration
f̈^t weak classifier at boosting iteration "t"
1I indicator function

TrAdaBoost [5] is the first and most popular transfer learning method that
uses boosting as a best-fit inductive transfer learner. As outlined in Algorithm
11, TrAdaBoost trains the base classifier on the weighted source and target sets
in an iterative manner. After every boosting iteration, the weights of misclassi-
fied target instances are increased and the weights of correctly classified target
instances are decreased. This target update mechanism is based solely on the
training error calculated on the normalized weights of the target set and uses a
strategy adapted from the classical AdaBoost [8] algorithm. The Weighted Ma-
jority Algorithm (WMA) [10] is used to adjust the weights of the source set by
iteratively decreasing the weight of misclassified source instances by a constant
factor, set according to [10], and preserving the current weights of correctly clas-
sified source instances. The basic idea is that source instances that are consis-
tently misclassified would have their weights converge to zero by iteration N/2
and would not be used in the final classifier's output, since that classifier only
uses boosting iterations N/2 → N.

Algorithm 1. TrAdaBoost
Require: Source and target instances: D = {(x_srci, y_srci) ∪ (x_tari, y_tari)},
maximum number of iterations (N), base learning algorithm (f̈)
Ensure: Weak classifiers for boosting iterations N/2 → N
Procedure:
1: for t = 1 to N do
2:    Find the candidate weak learner f̈^t : X → Y that minimizes error for D
3:    Update source weights via the WMA to decrease the weights of misclassified instances
4:    Update target weights via AdaBoost using the target error rate (ε^t_tar)
5:    Normalize weights for D
6: end for

The main weaknesses of TrAdaBoost are highlighted in the list below:


1. Weight Mismatch: As outlined in [14], when the size of source instances is
much larger than that of target instances, many iterations might be required
for the total weight of the target instances to approach that of the source
instances. This problem can be alleviated if more initial weight is given to
target instances.
2. Disregarding First Half of Ensembles: Eaton and desJardins [6] list
the choice to discard the first half of the ensembles as one of TrAdaBoost’s
weaknesses since it is these classifiers that fit the majority of the data, with
later classifiers focusing on “harder” instances. Their experimental analyses
along with the analyses reported by Pardoe and Stone [14] and our own
investigation show mixed results. This is the outcome of a final classifier
that makes use of all ensembles and thus incurs negative transfer introduced
from non-relevant source instances whose weights had yet to converge to
zero.
1 Detailed algorithm can be found in the referenced paper.

3. Introducing Imbalance: In [7], it was noted that TrAdaBoost sometimes


yields a final classifier that always predicts one label for all instances as it
substantially unbalances the weights between the different classes. Dai et
al. [5] re-sampled the data at each step to balance the classes.
4. Rapid Convergence of Source Weights: This seems to be the most seri-
ous problem with TrAdaBoost. Various researchers observed that even source
instances that are representative of the target concept tend to have their
weights reduced quickly and erratically. This quick convergence is examined
by Eaton and desJardins [6] as they observe that in TrAdaBoost’s reweighing
scheme, the difference between the weights of the source and target instances
only increases and that there is no mechanism in place to recover the weight
of source instances in later boosting iterations when they become beneficial.
This problem is exacerbated since TrAdaBoost, unlike AdaBoost, uses the
second half of ensembles when the weights of these source instances have
already decreased substantially from early iterations. These weights may be
so small that they become irrelevant and will no longer influence the output
of the combined boosting classifier. This rapid convergence also led Pardoe
and Stone [14] to the use of an adjusted error scheme based on experimental
approximation.
TrAdaBoost has been extended to many transfer learning problems including
regression transfer [14] and multi-source learning [19]. Some of the less popu-
lar methods use AdaBoost’s update mechanism for target and source instances.
TransferBoost [6] is one such method and is used for boosting when multiple
source tasks are available. It boosts all source weights for instances that belong
to tasks exhibiting positive transferability to the target task. TransferBoost cal-
culates an aggregate transfer term for every source task as the difference in error
between the target only task and the target plus each additional source task. Ad-
aBoost was also extended in [17] for concept drift, where a fixed cost is incorpo-
rated, via AdaCost [16], to the source weight update. This cost is pre-calculated
using probability estimates as a measure of relevance between source and target
distributions. Since such methods update the source weights via AdaBoost’s up-
date mechanism, they create a conflict within this update mechanism. A source
task that is unrelated to the target task will exhibit negative transferability and
its instances’ weights would be diminished by a fixed [17] or dynamic rate [6]
within AdaBoost’s update mechanism. This update mechanism will be simulta-
neously increasing these same weights since AdaBoost increases the weights of
misclassified instances. A source update strategy based on the WMA would be
more appropriate.

3 Proposed Algorithm
We will theoretically and empirically demonstrate the cause of early convergence
in TrAdaBoost and highlight the factors that cause it. We will incorporate an
adaptive “Correction Factor” in our proposed algorithm, Dynamic-TrAdaBoost,
to overcome some of the problems discussed in the previous section.

Algorithm 2. Dynamic-TrAdaBoost
Require:
• Source domain instances D_src = {(x_srci, y_srci)}
• Target domain instances D_tar = {(x_tari, y_tari)}
• Maximum number of iterations: N
• Base learner: f̈
Ensure: Target classifier output f̈ : X → Y:

    f̈(x) = sign( ∏_{t=N/2}^{N} (β^t_tar)^{−f̈^t(x)} − ∏_{t=N/2}^{N} (β^t_tar)^{−1/2} )

Procedure:
1: Initialize the weight vector w = {w_src ∪ w_tar} for D = {D_src ∪ D_tar}, where
   w_src = (w^1_src, . . . , w^n_src) and w_tar = (w^1_tar, . . . , w^m_tar)
2: Set β_src = 1 / (1 + √(2 ln(n)/N))
3: for t = 1 to N do
4:    Normalize weights: w = w / ( Σ_{i=1}^{n} w^i_src + Σ_{j=1}^{m} w^j_tar )
5:    Find the candidate weak learner f̈^t : X → Y that minimizes error for D
      weighted according to w
6:    Calculate the error of f̈^t on D_tar:
         ε^t_tar = Σ_{j=1}^{m} w^j_tar · 1I[y_tarj ≠ f̈^t_j] / Σ_{i=1}^{m} w^i_tar
7:    Set β_tar = ε^t_tar / (1 − ε^t_tar)
8:    C^t = 2 (1 − ε^t_tar)
9:    w^{t+1}_srci = C^t · w^t_srci · β_src^{1I[y_srci ≠ f̈^t_i]},   where i ∈ D_src
10:   w^{t+1}_tari = w^t_tari · (β^t_tar)^{−1I[y_tari ≠ f̈^t_i]},   where i ∈ D_tar
11: end for

3.1 Algorithm Description

Algorithm 2, Dynamic-TrAdaBoost, uses TrAdaBoost's concept of ensemble
learning, training on the combined set of source and target instances. The
weak classifier is applied to the combined set, where the features of the source
and target distributions are the same even though the distributions themselves
may differ. The weight update of the source instances uses the WMA update
mechanism on line 9. This update mechanism converges at a rate that is set by
the WMA rate (β_src) and the cost term (C^t). The target instances' weights are
updated on line 10 using AdaBoost's update mechanism, relying only on the
target error rate (ε^t_tar), which is calculated on line 7. As per the transfer
learning paradigm, the source distribution is relevant, and target instances can
benefit from incorporating relevant source instances.
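For readers who prefer code, the following is a minimal runnable sketch of Algorithm 2 in Python. The base learner (a shallow scikit-learn decision tree) and the error clipping are our own choices, not prescribed by the algorithm, and the final prediction uses the sign of a weighted vote over iterations N/2 → N, which is equivalent to the product form above for labels in {−1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dynamic_tradaboost(Xs, ys, Xt, yt, N=30, depth=3):
    """Sketch of Algorithm 2 with labels in {-1, +1}.

    Any classifier accepting sample weights and keeping the target error
    below 0.5 could replace the decision tree used here.
    """
    n, m = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    y = np.concatenate([ys, yt])
    w = np.ones(n + m) / (n + m)                            # line 1: initialize
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / N))   # line 2
    learners, betas = [], []
    for t in range(N):
        w /= w.sum()                                        # line 4: normalize
        clf = DecisionTreeClassifier(max_depth=depth)
        clf.fit(X, y, sample_weight=w)                      # line 5: weak learner
        wrong = (clf.predict(X) != y).astype(float)
        eps = np.dot(w[n:], wrong[n:]) / w[n:].sum()        # line 6: target error
        eps = min(max(eps, 1e-10), 0.4999)                  # keep beta_tar defined
        beta_tar = eps / (1.0 - eps)                        # line 7
        C = 2.0 * (1.0 - eps)                               # line 8: correction
        w[:n] *= C * beta_src ** wrong[:n]                  # line 9: WMA update
        w[n:] *= beta_tar ** (-wrong[n:])                   # line 10: AdaBoost update
        learners.append(clf)
        betas.append(beta_tar)

    def predict(Xq):
        # Weighted vote over iterations N/2 .. N; equivalent to the product
        # form in the Ensure clause when predictions are in {-1, +1}.
        score = sum(-np.log(b) * c.predict(Xq)
                    for c, b in zip(learners[N // 2:], betas[N // 2:]))
        return np.sign(score)
    return predict
```

Per the experimental setup later in the paper, the base learner must be strong enough to keep ε^t_tar below 0.5 for many iterations; the clipping above is only a numerical safeguard.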

3.2 Theoretical Analysis of the Algorithm

We refer to the cost (C^t) on line 8 as the "Correction Factor" and prove that it
addresses the source instances' rapid weight convergence, which we term
"Weight Drift".

Axiom 1: All source instances are correctly classified by the weak classifiers
(y_srci = f̈^t_i, ∀i ∈ {1, . . . , n}) and thus, according to the Weighted Majority
Algorithm:

    Σ_{i=1}^{n} w^{t+1}_srci = Σ_{i=1}^{n} w^t_srci = n w^t_src

This assumption will not statistically hold true for real datasets. It allows us
to ignore the stochastic difference in the classifiers' error rates at individual
boosting iterations, so that we can calculate a "Correction Factor" value
for any boosting iteration. It will be demonstrated later (Theorem 5 and the
subsequent analysis) that there is an inverse correlation between Axiom 1 and
the impact of the "Correction Factor": the impact of the "Correction Factor"
approaches unity (no correction needed) as the source error (ε^t_src) increases
and the assumption in Axiom 1 starts to break down.

Theorem 1: In TrAdaBoost, unlike in the Weighted Majority Algorithm, source
weights converge even when they are correctly classified.

Proof. To analyze the source convergence rate of TrAdaBoost, we examine the
update of the source instances' weights as per the Weighted Majority Algorithm.
In the WMA, the weights are updated as:

    w^{t+1}_src = w^t_src / ( Σ_{y_i = f̈_i} w^t_src + β_src Σ_{y_i ≠ f̈_i} w^t_src ),         if y_src = f̈^t_src
    w^{t+1}_src = β_src w^t_src / ( Σ_{y_i = f̈_i} w^t_src + β_src Σ_{y_i ≠ f̈_i} w^t_src ),   if y_src ≠ f̈^t_src

With all source instances classified correctly, the source weights would not change:

    w^{t+1}_src = w^t_src / Σ_{i=1}^{n} w^t_srci = w^t_src

TrAdaBoost, on the other hand, updates the same source weights as:

    w^{t+1}_src = w^t_src / ( Σ_{i=1}^{n} w^t_srci + Σ_{j=1}^{m} w^t_tarj ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]} )

This indicates that in TrAdaBoost all source weights converge by a factor
in direct correlation to the value of:

    Σ_{j=1}^{m} w^t_tarj ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]}

We refer to this weight convergence as "Weight Drift", since it causes
weight entropy to drift from source to target instances. We skip the proof that
the weights of target instances increase, since we have already shown that
source instance weights decrease and (n w^t_src + m w^t_tar) = 1. □

Now that the cause of the quick convergence of source instances has been examined,
we analyze the factors that bound this convergence and make it appear stochastic.
It is important to investigate these bounds, as they have a significant impact
when trying to understand what controls the rate of convergence. These factors
reveal how different datasets and classifiers influence that rate.

Theorem 2: For n source instances, TrAdaBoost's rate of convergence at iter-
ation t is bounded by:
1. the number of target training samples (m);
2. the target error rate at every iteration (ε^t_tar).

Proof. The fastest convergence rate is obtained by minimizing the weight at each
subsequent boosting iteration, min_{m,n,ε^t_tar} {w^{t+1}_src}. This is minimized as:

    min_{m,n,ε^t_tar} {w^{t+1}_src} = w^t_src / max_{m,n,ε^t_tar} ( Σ_{i=1}^{n} w^t_srci + Σ_{j=1}^{m} w^t_tarj ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]} )

This equation shows that the rate of convergence is maximized as:
1. ε^t_tar → 0;
2. m/n → ∞.

It should be noted that the absolute value of m also indirectly bounds ε^t_tar,
as 1/m ≤ ε^t_tar < 0.5. □


Theorem 2 illustrates that a fixed cost cannot control the convergence rate, since
the cumulative effect of m, n, and ε^t_tar changes at every iteration. A new term
has to be calculated at every boosting iteration to compensate for "Weight Drift".
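The bounds of Theorem 2 are easy to see numerically. The following sketch (our own illustration) performs one TrAdaBoost update under the assumption of Axiom 1 (all source instances correct) and reports how much weight a correct source instance retains for various m and ε^t_tar:

```python
def source_retention(n, m, eps):
    """Ratio w^{t+1}/w^t of a correct source instance after one TrAdaBoost
    step under Axiom 1: all source instances correct, while a fraction eps
    of the target weight is misclassified and multiplied by (1 - eps)/eps."""
    w = 1.0 / (n + m)                                   # uniform, normalized
    target_mass = m * w * ((1 - eps) + eps * (1 - eps) / eps)
    return 1.0 / (n * w + target_mass)

for eps in (0.1, 0.3, 0.49):
    print(eps, [round(source_retention(1000, m, eps), 4) for m in (10, 20, 50)])
# Retention falls as eps -> 0 and as m/n grows, matching the bounds above.
```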

Theorem 3: A correction factor of 2(1 − ε^t_tar) can be applied to the source
weights to prevent their "Weight Drift" and make the weights converge as out-
lined by the Weighted Majority Algorithm.

Proof. Un-wrapping the TrAdaBoost source update mechanism yields:

    w^{t+1}_src = w^t_src / ( Σ_{i=1}^{n} w^t_srci + Σ_{j=1}^{m} w^t_tarj ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]} )
                = w^t_src / ( n w^t_src + A + B )

where A and B are defined as:

A = sum of correctly classified target weights at boosting iteration "t + 1"
  = m w^t_tar (1 − ε^t_tar) ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]}
  = m w^t_tar (1 − ε^t_tar)          (since 1I[y_tarj ≠ f̈^t_j] = 0)

B = sum of misclassified target weights at boosting iteration "t + 1"
  = m w^t_tar (ε^t_tar) ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]}
  = m w^t_tar (1 − ε^t_tar)          (since 1I[y_tarj ≠ f̈^t_j] = 1)

Substituting for A and B simplifies the source update of TrAdaBoost to:

    w^{t+1}_src = w^t_src / ( n w^t_src + 2 m w^t_tar (1 − ε^t_tar) )

We introduce and solve for a correction factor C^t that equates w^{t+1}_src = w^t_src,
as per the WMA:

    w^t_src = C^t w^t_src / ( C^t n w^t_src + 2 m w^t_tar (1 − ε^t_tar) )

    C^t = 2 m w^t_tar (1 − ε^t_tar) / (1 − n w^t_src)
        = 2 m w^t_tar (1 − ε^t_tar) / (m w^t_tar)
        = 2 (1 − ε^t_tar)   □

The correction factor equates the behavior of Dynamic-TrAdaBoost to that of the
Weighted Majority Algorithm. Theorem 4 examines the effect of this "Correction
Factor" on the target instances' weight updates.
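A quick numeric check of Theorem 3 (our own illustration, under Axiom 1 with arbitrary n, m, and ε^t_tar): without the correction, the total source mass shrinks after one update, while with C^t it is exactly preserved:

```python
n, m, eps = 1000, 50, 0.2                       # arbitrary illustrative values
w = 1.0 / (n + m)                               # uniform initial weights
src_mass, tar_mass = n * w, m * w               # n w_src + m w_tar = 1
new_tar = 2 * tar_mass * (1 - eps)              # A + B from the proof above
C = 2 * (1 - eps)                               # the correction factor

plain = src_mass / (src_mass + new_tar)         # TrAdaBoost: mass drifts away
fixed = C * src_mass / (C * src_mass + new_tar) # Dynamic-TrAdaBoost
print(round(src_mass, 4), round(plain, 4), round(fixed, 4))
# 0.9524 0.9259 0.9524 -> the corrected source mass is exactly preserved.
```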

Theorem 4: Applying a correction factor of 2(1 − ε^t_tar) to the source weights
causes the target weights to converge as outlined by AdaBoost.

Proof. In AdaBoost, without any source instances (n = 0), target weights for
correctly classified instances would be updated as:

    w^{t+1}_tar = w^t_tar / ( Σ_{j=1}^{m} w^t_tarj ((1 − ε^t_tar)/ε^t_tar)^{1I[y_tarj ≠ f̈^t_j]} )
                = w^t_tar / (A + B)
                = w^t_tar / ( 2 m w^t_tar (1 − ε^t_tar) )
                = w^t_tar / ( 2 (1) (1 − ε^t_tar) )

Applying the "Correction Factor" to the source instances' weight update equates
the target instances' weight update mechanism of Dynamic-TrAdaBoost to that
of AdaBoost, since:

    w^{t+1}_tar = w^t_tar / ( C^t n w^t_src + 2 m w^t_tar (1 − ε^t_tar) )
                = w^t_tar / ( 2 (1 − ε^t_tar) n w^t_src + 2 m w^t_tar (1 − ε^t_tar) )
                = w^t_tar / ( 2 (1 − ε^t_tar) (n w^t_src + m w^t_tar) )
                = w^t_tar / ( 2 (1 − ε^t_tar) (1) )   □


Theorem 5: The assumptions in Axiom 1 can be approximated,
Σ_{i=1}^{n} w^{t+1}_srci ≈ Σ_{i=1}^{n} w^t_srci, regardless of ε^t_src, by increasing
the number of boosting iterations (N).

Proof. In the Weighted Majority Algorithm, the source weights at iteration t
would be updated as:

P = sum of correctly classified source weights at boosting iteration "t + 1"
  = n w^t_src (1 − ε^t_src) β_src^{1I[y_srci ≠ f̈^t_i]}
  = n w^t_src (1 − ε^t_src)          (since 1I[y_srci ≠ f̈^t_i] = 0)

Q = sum of misclassified source weights at boosting iteration "t + 1"
  = n w^t_src (ε^t_src) β_src^{1I[y_srci ≠ f̈^t_i]}
  = n w^t_src (ε^t_src) β_src        (since 1I[y_srci ≠ f̈^t_i] = 1)

The sum of source weights at boosting iteration "t + 1" is S = P + Q, which can
now be calculated as:

S = n w^t_src (1 − ε^t_src) + n w^t_src (ε^t_src) β_src
  = n w^t_src ( 1 − ε^t_src / (1 + √(N / (2 ln(n)))) )     (since β_src = 1 / (1 + √(2 ln(n)/N)))

As the number of boosting iterations (N) increases, the assumptions in Axiom 1
can be approximated as:

    lim_{N→∞} S = lim_{N→∞} n w^t_src ( 1 − ε^t_src / (1 + √(N / (2 ln(n)))) ) = n w^t_src   □

Theorem 5 proves that the assumption of Axiom 1 can be approximated by
increasing the number of boosting iterations, N → ∞. It will be empirically
demonstrated later that a reasonable value of N suffices once the other variables
that contribute to the total weight (n, ε^t_src) are analyzed.
It was proven that a dynamic cost can be incorporated into TrAdaBoost to
correct for weights drifting from source to target instances. This factor
ultimately separates the source instance updates, which rely on the WMA and
β_src, from the target instance updates, which rely on AdaBoost and ε^t_tar.
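As a small numeric illustration of Theorem 5, the per-iteration retention ratio S / (n w^t_src) = 1 − ε^t_src / (1 + √(N/(2 ln n))) can be tabulated for a few values of N (the parameter values below are arbitrary):

```python
import math

def retention(n, N, eps_src):
    """Per-iteration ratio S / (n w_src^t) from Theorem 5."""
    return 1.0 - eps_src / (1.0 + math.sqrt(N / (2.0 * math.log(n))))

for N in (20, 40, 80, 1000):
    print(N, round(retention(1000, N, 0.2), 4))
# 20 -> 0.9092, 40 -> 0.926, 80 -> 0.9413, 1000 -> 0.979: toward 1 as N grows.
```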

4 Empirical Analysis

4.1 "Weight Drift" and "Correction Factor" (Theorems 1, 2, 3, 5)

A simulation is used to demonstrate the effect of "Weight Drift" on source and
target weights. In Figure 1(a), the number of instances was held constant
(n = 10000, m = 200) and the source error rate was set to zero as per Axiom 1.
According to the WMA, the weights should not change (w^{t+1}_src = w^t_src),
since ε^t_src = 0. The ratio of the weights of TrAdaBoost to those of the WMA
was plotted at different boosting iterations and with different target error rates
ε^t_tar ∈ {0.1, 0.2, 0.3, 0.4}. The simulation validates the following theorems:

1. In TrAdaBoost, source weights converge even when correctly classified.
2. Dynamic-TrAdaBoost matches the behavior of the WMA.
3. If correction is not applied, strong classifiers cause faster convergence than
weak ones, as proven in Theorem 2.

The figure also demonstrates that for N = 30 and a weak learner with ε^t_tar ≈ 0.1,
TrAdaBoost would not be able to benefit from all 10,000 source instances even
though they were never misclassified. The final classifier uses boosting
iterations N/2 → N, or 15 → 30, where the source instances' weights would
have already converged to zero. Dynamic-TrAdaBoost conserves these instances'
weights in order to utilize them for classifying the output label.

4.2 Rate of Convergence (Theorem 2)

In Figure 1(b), the number of source instances was set to n = 1000, while
the number of target instances was varied, m/n ∈ {1%, 2%, 5%}, and plotted for
ε^t_tar ∈ {0.1, . . . , 0.5}. It can be observed that after a single boosting iteration,
the weights of correctly classified source instances start converging at a rate
bounded by m/n and the error rate ε^t_tar (which is itself also bounded by m).
It should be noted that for both plots in Figure 1, the weight lost by the source
instances drifts to the target instances. The plots for the target weights would
look inversely proportional to the plots in Figure 1, since
Σ_{i=1}^{n} w^t_srci + Σ_{j=1}^{m} w^t_tarj = 1.

[Figure 1 plots the ratio of a correctly classified source weight under TrAdaBoost to that under the WMA (y-axis: correctly classified source weight): (a) over 20 boosting iterations, for target error rates of 10%, 20%, 30%, and 40%, plus the correction-applied curve; (b) after a single iteration, against the target classifier error, for m/n of 1%, 2%, and 5%, plus the correction-applied curve.]

Fig. 1. The ratio of a correctly classified source weight for TrAdaBoost/WMA. (a) For
20 iterations with different target error rates. (b) After a single iteration with different
numbers of target instances and error rates.

4.3 Sum of Source Weights (Theorem 5, Axiom 1)

In Theorem 5, we proved that minimizing ε^t_src / (1 + √(N / (2 ln(n)))) would
relax the assumption made in Axiom 1. The following steps can be applied to
minimize this term:
tion made in Axiom 1. The following steps can be applied to minimize this term:

1. Minimize ε^t_src: The weak classifier can be strengthened. This is addi-
tionally required for boosting, as will be explained in more detail in our
experimental setup.
2. Minimize the number of source instances (n): Evidently, not desired.
3. Maximize the number of boosting iterations (N): This can be easily
controlled and increases linearly.

The first experiment analyzed the effects of N and n on the sum of source
weights. The source error rate (ε^t_src) was set to 0.2, while the number of source
instances (n) varied from 100 to 10,100 and N ∈ {20, 40, 60, 80}. The plot in
Figure 2(a) demonstrates that the number of source instances (n) has little im-
pact on the total sum of source weights, while N is more significant. This is
expected, since ln(n) is already small and grows only logarithmically with the
number of source instances.
The second experiment considered the effects of N and ε^t_src on the sum of
source weights. The number of source instances (n) was set to 1000, with ε^t_src ∈
{0.05, . . . , 0.5} and N ∈ {20, 40, 60, 80}. It can be observed in Figure 2(b) that
the error rate does have a significant effect on decreasing the total weight at
t + 1. This effect can be only partially offset by increasing N, and a large value
of N would be required for a reasonable adjustment. However, this problem is negated

[Figure 2 plots the sum of source weights for N ∈ {20, 40, 60, 80}, with a no-error reference curve: (a) against the number of source training instances (n); (b) against the source error rate.]

Fig. 2. The ratio of the total source weight at "t + 1" to that at "t". (a) For different
numbers of source instances and numbers of boosting iterations (N). (b) For different
source error rates (ε^t_src) and numbers of boosting iterations (N).

by the fact that the correction factor, C = 2(1 − ε^t_tar), is inversely proportional
to ε^t_tar, so its impact decreases as the target error rate increases. Since the
source data comprise the majority of the training data, we can generally expect
ε^t_src ≤ ε^t_tar or ε^t_src ≈ ε^t_tar. Here is a summary of this experiment's
findings:

1. The number of source instances (n) has a negligible impact on the sum of
source weights as it increases logarithmically.
2. The number of boosting iterations (N ) has significant impact on the sum of
source weights and can be used to strengthen the assumption in Axiom 1.
3. High source error rates, ε^t_src → 0.5, weaken the assumption in Axiom 1,
but this is negated by the fact that the impact of the correction factor is
reduced at high error rates, since it reaches unity (no correction) as:

    lim_{ε^t_tar→0.5} C = lim_{ε^t_tar→0.5} 2 (1 − ε^t_tar) ≈ lim_{ε^t_src→0.5} 2 (1 − ε^t_src) = 1

5 Experimental Results on Real-World Datasets

5.1 Experiment Setup


We tested several popular transfer learning datasets and compared AdaBoost [8]
(using target instances), TrAdaBoost [5], TrAdaBoost with fixed costs of (1.1,
1.2, 1.3) and Dynamic-TrAdaBoost. Instances were balanced to have an equal
number of positive and negative labels. We ran 30 iterations of boosting.
Base Learner (f̈): We did not use decision stumps as weak learners, since the
majority of the training data belongs to the source and we need to guarantee an
error rate of less than 0.5 on the target to avoid early termination of boosting
(as mandated by AdaBoost). For example, applying decision stumps to data
with 95% source and 5% target instances is not guaranteed (and will certainly
not work for many boosting iterations) to achieve an error rate of less than 0.5
on target instances that comprise a small subset of the training data. We used a
strong classifier, classification trees, and applied a top-down approach where we
trimmed the tree at the first node that achieved a target error rate of less than 0.5.
Cross Validation: We did not use standard cross-validation methods, since the
target datasets were generally too large and did not need transfer learning to get
good classification rates. We generated target datasets by using a small fraction
for training and leaving the remainder for testing. A 2% ratio means that we had
two target instances, picked randomly, for each 100 source instances, and we used
the remaining target instances for validation. We also used all the minority labels
and randomly picked an equal number of instances from the majority labels,
trying to introduce variation in the datasets whenever possible. We repeated
each experiment 10 times and report the average accuracy to reduce bias.

5.2 Real-World Datasets

20 Newsgroups2 : The 20 Newsgroups dataset [9] is a text collection of approxi-


mately 20,000 newsgroup documents, partitioned across 20 different newsgroups.
We generated 3 cross-domain learning tasks with a two-level hierarchy so that
each learning task would involve a top category classification problem where the
training and test data are drawn from different sub categories with around 2300
source instances (Rec vs Talk, Rec vs Sci, Sci vs Talk) as outlined in further de-
tail in [4]. We used the threshold of Document Frequency with the value of 188
to maintain around 500 attributes. We used a 0.5% target ratio in our tabulated
results and displayed results of up to 10% target ratio in our plots.
Abalone3: This dataset's features are the seven physical measurements of
abalone sea snails; males form the source set and females the target set. The
goal is to use these physical measurements to determine the age of the abalone
instead of enduring the time-consuming task of cutting the shell through the
cone, staining it, and counting the number of rings through a microscope. We
used 160 source instances, with 11 target instances for training and 77 for testing.
Wine3: The task is to determine the quality of white wine samples by using
red wine samples as the source set. The features are the wine's 11 physical and
chemical characteristics, and the output labels are given by experts' grades of 5
and 6. We used 3655 source instances, with 14 target instances for training and
1306 for testing.
2 http://people.csail.mit.edu/jrennie/20Newsgroups/
3 http://archive.ics.uci.edu/ml/

Table 2. Classification accuracy of AdaBoost (Target), TrAdaBoost, Fixed-Cost (best
result reported for TrAdaBoost with costs fixed at 1.1, 1.2, 1.3), and Dynamic
(Dynamic-TrAdaBoost)

Dataset AdaBoost TrAdaBoost Fixed-Cost (1.1,1.2,1.3) Dynamic


Sci vs Talk 0.552 0.577 0.581 0.618
Rec vs Sci 0.546 0.572 0.588 0.631
Rec vs Talk 0.585 0.660 0.670 0.709
Wine Quality 0.586 0.604 0.605 0.638
Abalone Age 0.649 0.689 0.682 0.740

5.3 Experimental Results


The comparison of classification accuracy is presented in Table 2. The results show
that Dynamic-TrAdaBoost significantly improved classification on real-world
datasets. We performed the following tests to show the significance of our results (a sketch of both tests is given after the list):
1. Tested the null hypothesis that transfer learning is not significantly better
than standard AdaBoost. We applied the Friedman Test with p < 0.01. Only
Dynamic-TrAdaBoost was able to reject the hypothesis.
2. We performed paired t-tests with α = 0.01 to test the null hypothesis
that classification performance was not improved over TrAdaBoost. For all
datasets, Dynamic-TrAdaBoost rejected the hypothesis while “Fixed-Cost
TrAdaBoost” did not.
3. Paired t-tests with α = 0.01 also rejected the null hypothesis that Dynamic-
TrAdaBoost did not improve classification over “Fixed-Cost TrAdaBoost”
for all datasets.
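Both tests are available in scipy.stats; the sketch below illustrates the calls using the per-dataset means from Table 2 (the paper's actual tests were presumably run over the 10 repetitions per dataset, which are not reproduced here):

```python
import numpy as np
from scipy import stats

# Per-dataset mean accuracies, in the row order of Table 2.
ada     = np.array([0.552, 0.546, 0.585, 0.586, 0.649])
trada   = np.array([0.577, 0.572, 0.660, 0.604, 0.689])
dynamic = np.array([0.618, 0.631, 0.709, 0.638, 0.740])

# Friedman test over the three methods' per-dataset rankings.
print(stats.friedmanchisquare(ada, trada, dynamic))

# Paired t-test: Dynamic-TrAdaBoost vs. TrAdaBoost on matched datasets.
print(stats.ttest_rel(dynamic, trada))
```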
In Figure 3, the accuracy on the "20 Newsgroups" dataset is plotted at different
target/source ratios. The plots demonstrate that incorporating a dynamic cost
into Dynamic-TrAdaBoost improved classification at the different ratios compared
to TrAdaBoost or a fixed correction cost.

5.4 Discussion and Extensions

The "Dynamic Factor" introduced in Dynamic-TrAdaBoost can be easily ex-
tended and modified to improve classification, because it allows for strict control
of the convergence rate of the source weights. In analyses of TrAdaBoost, it was
noted by researchers [7] that it introduces imbalance to the classifier, and sampling
had to be applied to remedy this problem [5]. SMOTEBoost [3] can be used to
generate synthetic data within boosting, while AdaCost [16] could integrate cost
into the target update scheme. The "Correction Factor" can likewise be extended
to integrate cost into the source instances' weight update. Balancing can be done
manually by limiting the maximum value of C for a given label, or dynamically as:
    C^t_label = 2 (1 − ε^t_tar) (1 − ε^t_tar,label)^{cost_label},    label ∈ {majority, minority}, cost_label ∈ ℝ
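A tiny sketch of this label-dependent factor (purely illustrative; the numeric inputs are arbitrary) makes the behavior concrete: the factor shrinks as the label's error rate ε^t_tar,label grows, and the cost exponent controls how sharply:

```python
def corrected_factor(eps_tar, eps_tar_label, cost):
    """Label-dependent correction factor from the extension above:
    C^t_label = 2 (1 - eps_tar) (1 - eps_tar_label)^cost."""
    return 2.0 * (1.0 - eps_tar) * (1.0 - eps_tar_label) ** cost

print(corrected_factor(0.3, 0.10, 1.0))   # 1.26: low-error label, larger factor
print(corrected_factor(0.3, 0.45, 1.0))   # 0.77: high-error label, smaller factor
```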

[Figure 3 plots classification accuracy (y-axis) against the ratio of target/source instances from 0% to 10% (x-axis) for Dynamic-TrAdaBoost, Fixed-TrAdaBoost, and TrAdaBoost in each panel.]

Fig. 3. Accuracy of TrAdaBoost, best of Fixed-Cost-TrAdaBoost (1.1, 1.2, 1.3), and
Dynamic-TrAdaBoost on the "20 Newsgroups" dataset at different target/source ratios.
(a) REC vs TALK. (b) SCI vs TALK. (c) REC vs SCI.

This extension can dynamically speed up the weight convergence of the labels
that exhibit a low error rate and slow it for labels that exhibit high error
rates. The value cost ∈ ℝ is included to allow the user to set the emphasis on
balancing the labels' error rates; it controls the steepness of the convergence
rate for a given label, as mandated by that label's error rate (ε^t_tar,label).

6 Conclusion
We investigated boosting-based transfer learning methods and analyzed their
main weaknesses. We proposed an algorithm with an integrated dynamic cost
to resolve a major issue in the most popular boosting-based transfer algorithm,
TrAdaBoost. This issue causes source instances to converge before they can be
used for transfer learning. We theoretically and empirically demonstrated the
cause and effect of this rapid convergence and validated that the addition of our
dynamic cost improves classification of several popular transfer learning datasets.
In the future, we will explore the possibility of using multi-resolution boosted
models [15] in the context of transfer learning.

References
1. Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schlkopf, B., Smola,
A.J.: Integrating structured biological data by kernel maximum mean discrepancy.
Bioinformatics 22(14), e49–e57 (2006)
2. Cao, B., Pan, S.J., Zhang, Y., Yeung, D., Yang, Q.: Adaptive transfer learning. In:
Proceedings of the AAAI Conference on Artificial Intelligence, pp. 407–412 (2010)
3. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: SMOTEBoost: Improving
prediction of the minority class in boosting. In: Lavrač, N., Gamberger, D., Todor-
ovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119.
Springer, Heidelberg (2003)

4. Dai, W., Xue, G.R., Yang, Q., Yu, Y.: Co-clustering based classification for out-
of-domain documents. In: Proceedings of the 13th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 210–219 (2007)
5. Dai, W., Yang, Q., Xue, G.R., Yu, Y.: Boosting for transfer learning. In: Proceed-
ings of the International Conference on Machine Learning, pp. 193–200 (2007)
6. Eaton, E., desJardins, M.: Set-based boosting for instance-level transfer. In: Pro-
ceedings of the 2009 IEEE International Conference on Data Mining Workshops,
pp. 422–428 (2009)
7. Eaton, E.: Selective Knowledge Transfer for Machine Learning. Ph.D. thesis, Uni-
versity of Maryland Baltimore County (2009)
8. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. In: Proceedings of the Second European Conference
on Computational Learning Theory, pp. 23–37 (1995)
9. Lang, K.: Newsweeder: Learning to filter netnews. In: Proceedings of the 12th
International Machine Learning Conference, pp. 331–339 (1995)
10. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. In: Proceedings
of the 30th Annual Symposium on Foundations of Computer Science, pp. 256–261
(1989)
11. Pan, S.J., Kwok, J.T., Yang, Q.: Transfer learning via dimensionality reduction.
In: Proceedings of the National Conference on Artificial Intelligence, pp. 677–682
(2008)
12. Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence, pp. 1187–1192 (2009)
13. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowl-
edge and Data Engineering 22(10), 1345–1359 (2010)
14. Pardoe, D., Stone, P.: Boosting for regression transfer. In: Proceedings of the 27th
International Conference on Machine Learning, pp. 863–870 (2010)
15. Reddy, C.K., Park, J.H.: Multi-resolution boosting for classification and regression
problems. Knowledge and Information Systems (2011)
16. Sun, Y., Kamel, M.S., Wong, A.K., Wang, Y.: Cost-sensitive boosting for classifi-
cation of imbalanced data. Pattern Recognition 40(12), 3358–3378 (2007)
17. Venkatesan, A., Krishnan, N., Panchanathan, S.: Cost-sensitive boosting for con-
cept drift. In: Proceedings of the 2010 International Workshop on Handling Con-
cept Drift in Adaptive Information Systems (2010)
18. Wu, P., Dietterich, T.G.: Improving SVM accuracy by training on auxiliary data sources. In: Proceedings of the Twenty-First International Conference on Machine Learning, pp. 871–878 (2004)
19. Yao, Y., Doretto, G.: Boosting for transfer learning with multiple sources. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 1855–1862 (2010)
Peer and Authority Pressure in
Information-Propagation Models

Aris Anagnostopoulos1, George Brova2, and Evimaria Terzi2


1 Department of Computer and System Sciences, Sapienza University of Rome
aris@dis.uniroma1.it
2 Computer Science Department, Boston University
gbrova@bu.edu, evimaria@cs.bu.edu

Abstract. Existing models of information diffusion assume that peer influence is the main reason for the observed propagation patterns. In this paper, we examine the role of authority pressure on the observed information cascades. We model this intuition by characterizing some nodes in the network as “authority” nodes. These are nodes that can influence a large number of peers, while they themselves cannot be influenced by peers.
We propose a model that associates with every item two parameters
that quantify the impact of the peer and the authority pressure on the
item’s propagation. Given a network and the observed diffusion patterns
of the item, we learn these parameters from the data and characterize
the item as peer- or authority-propagated. We also develop a random-
ization test that evaluates the statistical significance of our findings and
makes our item characterization robust to noise. Our experiments with
real data from online media and scientific-collaboration networks indicate
that there is a strong signal of authority pressure in these networks.

1 Introduction

Most of the existing models of information propagation in social networks focus on understanding the role of peer influence on the observed propagation patterns [2, 3, 6, 8, 13]. We use the term peer models to collectively refer to all such information-propagation models. In peer models, as more neighbors (or peers) of
a node adopt an information item, it becomes more probable that the node itself
adopts the same item. For example, users adopt a particular instant-messenger
software because their friends use the same software; using the same platform
makes communication among friends more convenient.
Figure 1(a) depicts a small network of peers and their connections. A directed link from node u to v denotes that v can be influenced by u. The key characteristic of peer models is that all nodes are treated on an equal footing. That is, each node can equally well influence its neighbors or be influenced by them. However, the strength of each agent’s influence on others is not the same in reality. For example, mass media can strongly affect the opinions of individuals, whereas the influence of any one individual on the mass media is most likely infinitesimal. This distinction is often modeled with the use of edge weights.

* The research leading to these results has received funding from the EU FP7 Project N. 255403 – SNAPS, from the NSF award #1017529, and from a gift from Microsoft.


Fig. 1. Influence graph of a network consisting of peer nodes and peer and authority nodes. (a) Peer-only information propagation. (b) Peer and authority information propagation.

In this paper we focus on this distinction and we apply a simple way to model
it: we posit that some agents are authorities (such as the mass media). These
nodes have high visibility in the network and they typically influence a large
number of non-authority nodes. We call these latter nodes the peer nodes. Peers
freely exchange information among themselves and therefore there are influence
links between them. Peers have no influence on authorities. That is, an influence
link that joins an authority to a peer is unidirectional from the authority to the
peer. In our model, we also ignore the influence that one authority node might
have on another. At a global level, the network of authorities and peers looks as
in Figure 1(b). That is, peers and authorities are clustered among themselves,
and there are only directed one-way links from authorities to peers.
The existence of authority nodes allows us to incorporate the authority influ-
ence (or pressure) into the classic information-propagation models. Given a net-
work of peers and authorities and the observed propagation patterns of different
information items (e.g., products, trends, or fads) our goal is to develop a frame-
work that allows us to categorize the items as authority- or peer-propagated.
To do so, we define a model that associates every propagated item with two
parameters that quantify the effect that authority and peer pressure has played
on the item’s propagation. Given data about the adoption of the item by the
nodes of a network we develop a maximum-likelihood framework for learning
the parameters of the item and use them to characterize the nature of its prop-
agation. Furthermore, we develop a randomization test, which we call the time-
shuffle test. This test allows us to evaluate the statistical significance of our
findings and increase our confidence that our findings are not a result of noise in
the input data. Our extensive experiments on real data from online media and
collaboration networks reveal the following interesting finding. In online social-
media networks, where the propagated items are news memes, there is evidence
of authority-based propagation. On the other hand, in a collaboration network
of scientists, where the items that propagate are themes, there is evidence that
peer influence governs the observed propagation patterns to a stronger degree
than authority pressure.

The main contribution of this paper lies in the introduction of authority pressure as a part of the information-propagation process. Quantifying the effect that peer and authority pressure play in the diffusion of information items will give us a better understanding of the underpinnings of viral markets. At the same time, our proposed methodology will allow for the development of new types of recommendation systems, advertisement strategies, and election campaigns. For example, authority nodes are better advertisement targets for authority-propagated products. On the other hand, election campaign slogans might gain popularity due to the network effect and therefore be advertised accordingly.
Roadmap: The rest of the paper is organized as follows: In Section 2 we give
a brief overview of the related work. Sections 3 and 4 give an overview of our
methodology for incorporating authority pressure into the peer models. We show
an extensive experimental evaluation of our framework in Section 5 and we
conclude the paper in Section 6.

2 Related Work

Despite the large amount of work on peer models and on identification of au-
thority nodes, to our knowledge, our work is the first attempt to combine peer
and authority pressure into a single information-propagation model. Also, con-
trary to the goal of identifying authority nodes, our goal is to classify propagated
trends as being peer- or authority-propagated.
One of the first models to capture peer influence was by Bass [4], who de-
fined a simple model for product adoption. While the model does not take into
account the network structure, it manages to capture some commonly-observed
phenomena, such as the existence of a “tipping point.” More recent models such
as the linear-threshold model [9, 10] or the cascade model [10] introduce the de-
pendence of influence on the set of peers, and since then there has been a large
number of generalizations.
In a series of papers based on the analysis of medical data and offline social
networks Christakis, Fowler, and colleagues showed the existence of peer influ-
ence on social behavior and emotions, such as obesity, alcoholism, happiness,
depression, loneliness [7, 5, 14]. An important characteristic in these analyses is
the performance of statistical tests through modifying the social graph to provide
evidence for peer influence. It was found that in general influence can extend up
to three degrees of separation. Around the same time, Anagnostopoulos et al. [2]
and Aral et al. [3] provided evidence that a lot of the correlated behavior among
peers can be attributed to other factors such as homophily, the tendency of indi-
viduals to associate and form ties with similar others. The time-shuffle test that
we apply later is a randomization test used in [2] to rule out influence effects
from peers. Although clearly related, the above work is only complementary to
ours: none of the above papers considers authorities as a factor that determines
the propagation of information.
Recently, there have been many studies related to the spreading of ideas, news
and opinions in the blogosphere. Authors refer to all these propagated items as

memes. Gomez-Rodriguez et al. [8] try to infer who influences whom based on the
time information over a large set of different memes. Contrary to our work where
the underlying network is part of the input, Gomez-Rodriguez et al. assume that
the network structure is unknown. In fact, their goal is to discover this hidden
network and the key assumption of the method is that a node only gets influenced
by its neighbors. Therefore, they do not account for authority influence.
More recently, Yang and Leskovec [16] applied a nonparametric modeling
approach to learn the direct or indirect influence of a set of nodes (e.g. news
sites) to other blogs or tweets. Although one can consider the discovered set of
nodes as authority nodes, the model of Yang and Leskovec does not take into
account the network of peers. Our work is mostly focused on the interaction and
the separation of peer and authority influence within a social-network ecosystem.
Recent work by Wu et al. [15] focuses on classifying twitter users as “elite”
and “ordinary”; elite users (e.g., celebrities, media sources, or organizations) are
those with large influence on the rest of the users. Exploiting the twitter-data
characteristics the authors discover that a very small fraction of the popula-
tion (0.05%) is responsible for the generation of half of the content in twitter.
Although related, the focus of our paper is rather different: our goal is not to
identify the authorities and the peers of the network. Rather, we want to clas-
sify the trends as those that are being authority-propagated versus those being
peer-propagated.
Related in spirit is also the work of Amatriain et al. [1]; their setting and
their techniques, however, are entirely different from ours: they consider the
problem of collaborative filtering and they compare the information obtained by
consulting “experts” as opposed to “nearest neighbors” (i.e., nodes similar to the
node under consideration). The motivation for that work is the fact that data
on nearest neighbors is often sparse and noisy, as opposed to the more global
information of experts.

3 Peer and Authority Models


An information-propagation network can be represented by a directed graph.
The graph consists of a set of n nodes, denoted by V . We refer to these nodes
as peers (or agents). These nodes are organized in a directed graph G = (V, E).
The edges of the graph represent the ability of a node to influence another node.
That is, a directed link from node u to node v, (u → v) denotes that node u can
influence node v. We call the graph G the peer influence graph. Given a node
u we refer to all the nodes that can influence u, that is, the nodes that have
directed links to u as the peers or neighbors of u.
In addition to the n peer nodes, our model assumes the existence of N globally accepted authorities, represented by the set A. Every authority a ∈ A has the potential to influence all the nodes in V. Intuitively, this means that there are directed influence edges from every authority a ∈ A to every peer v ∈ V; we use EA to represent the set of directed edges from authorities to peers. For simplicity we assume that there are no edges amongst authorities. We refer to the graph H = (V ∪ A, E ∪ EA) as the extended influence graph.

Fashion trends, news items or research ideas propagate amongst peers and
authorities. We collectively refer to all the propagated trends as information
items (or simply items). We call the nodes (peers or authorities) that have
adopted a particular item active and the nodes that have not adopted it inactive.
We assume that the propagation of every item happens in discrete time steps and that we have a limited observation period from timestamp 1 to timestamp T. At every point in time t ∈ {1, . . . , T}, each inactive node u decides
whether to become active. The probability that an inactive node u becomes ac-
tive is a function P (x, y) of the number x of peers that can influence u that are
already active and the number y of active authorities. In principle, function P
can be any function that is increasing in both x and y. As we will see in the next
section, we will focus on a simple function that fits our purposes.
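To make the propagation process concrete, the following sketch simulates one item’s spread on an extended influence graph. The helper name and the toy authority-activation rule are our own choices; any function P(x, y) that is increasing in both arguments can be plugged in, e.g., the logistic form introduced in the next section.

import numpy as np

def simulate_item(peer_in_nbrs, n_authorities, P, T, seed=0):
    """Simulate one item's propagation for T discrete time steps.

    peer_in_nbrs  : list over peers; entry u holds the peers that have a
                    directed influence edge into u
    n_authorities : number of authorities (each can influence every peer)
    P             : adoption probability P(x, y)
    """
    rng = np.random.default_rng(seed)
    n = len(peer_in_nbrs)
    peer_active = np.zeros(n, dtype=bool)
    auth_active = np.zeros(n_authorities, dtype=bool)
    for t in range(T):
        # Toy rule: each inactive authority activates with probability 0.1.
        auth_active |= rng.random(n_authorities) < 0.1
        y = int(auth_active.sum())
        snapshot = peer_active.copy()        # decisions use the state at time t
        for u in range(n):
            if not snapshot[u]:
                x = int(snapshot[peer_in_nbrs[u]].sum())
                peer_active[u] = rng.random() < P(x, y)
    return peer_active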

4 Methodology
In this section, we present our methodology for measuring peer and authority
pressure in information propagation. Based on that we offer a characterization
of trends as peer- or authority-propagated trends. Peer-propagated trends are
those whose observed propagation patterns can be largely explained due to peer
pressure. Authority-propagated trends are those that have been spread mostly
due to authority influence.
We start in Section 4.1 by explaining how logistic regression can be used to
quantify the extent of peer and authority pressure. In Section 4.2 we define a
randomization test that we use in order to quantify the statistical significance
of the logistic regression results.

4.1 Measuring Social Influence


The discussion below focuses on a single propagated item. Assume that at some
point in time, there are y active authorities. At this point in time, a node with x
active peers becomes active with probability P(x, y). As is usually the case [2], we use the logistic function to model the dependence of the probability P(x, y) on the independent variables x and y. That is,

P(x, y) = e^{α ln(x+1) + β ln(y+1) + γ} / (1 + e^{α ln(x+1) + β ln(y+1) + γ}),   (1)
where α, β and γ are the coefficients of the logistic function. The values of α
and β capture respectively the strength of peer and authority pressure in the
propagation of item i. More specifically α, β take values in R. Large values of α
provide evidence for peer influence in the propagation of item i. Large values of
β provide evidence for authority influence in the propagation of i. For every item
i, we call α the peer coefficient and β the authority coefficient of i. Parameter
γ models the impact of factors other than peer and authority pressure in the
propagation of the item. For example, the effect of random chance is encoded
in the value of the parameter γ. We call γ the externality coefficient since it


quantifies the effect of external parameters.
The logit function of the probability P(x, y) (Equation (1)) gives

ln( P(x, y) / (1 − P(x, y)) ) = α ln(x+1) + β ln(y+1) + γ.   (2)
We estimate α, β, and γ using maximum-likelihood logistic regression. More specifically, for each t = 1, 2, . . . , T, let N(x, y, t) be the number of users who at the beginning of time t had x active neighbors and themselves became active at time t when y authorities were active. Similarly, let N̄(x, y, t) be the number of users who at the beginning of time t had x active neighbors, but did not become active themselves at time t when y authorities were also active. Finally, let N(x, y) = Σ_t N(x, y, t) and N̄(x, y) = Σ_t N̄(x, y, t). Then, the maximum-likelihood estimates of the parameters α, β are those that maximize the likelihood of the data, namely,

∏_{x,y} P(x, y)^{N(x,y)} (1 − P(x, y))^{N̄(x,y)}.   (3)

While in general there is no closed form solution for the above maximum likeli-
hood estimation problem, there are many software packages that can solve such
a problem quite efficiently. For our experiments, we have used Matlab’s statistics
toolbox.
We apply this analysis to every propagated item and thus obtain the maximum-
likelihood estimates of the peer and authority coefficients for each one of them.
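To make the estimation step concrete, the sketch below maximizes the likelihood in Equation (3) with a general-purpose optimizer. The authors used Matlab’s statistics toolbox; this SciPy version is only an equivalent re-implementation under the same model, and the count-table input format is our assumption.

import numpy as np
from scipy.optimize import minimize

def fit_coefficients(counts):
    """Maximum-likelihood estimates of (alpha, beta, gamma).

    counts : list of tuples (x, y, n_active, n_inactive), aggregated over
             time: n_active users with x active peers and y active
             authorities became active (N(x, y)) and n_inactive did not
             (N-bar(x, y)).
    """
    counts = list(counts)
    X = np.array([[np.log(x + 1.0), np.log(y + 1.0), 1.0]
                  for x, y, _, _ in counts])
    n1 = np.array([c[2] for c in counts], dtype=float)   # N(x, y)
    n0 = np.array([c[3] for c in counts], dtype=float)   # N-bar(x, y)

    def neg_log_lik(theta):
        z = X @ theta
        log1pez = np.logaddexp(0.0, z)       # log(1 + e^z), numerically stable
        # log P = z - log(1 + e^z) and log(1 - P) = -log(1 + e^z)
        return -(n1 @ (z - log1pez) - n0 @ log1pez)

    res = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
    return tuple(res.x)                      # (alpha, beta, gamma)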

4.2 Randomization Test


One way of inferring whether item i’s propagation is better explained by peer or by authority influence is to obtain the maximum-likelihood estimates of the peer and authority coefficients (α, β) and conclude the following: if α > β, then i is a peer-propagated item; otherwise, if β > α, then i is an authority-propagated item. Although this might be a reasonable approach towards the categorization of the item, the question remains of how much larger the value of α (resp. β) should be in order to characterize i as a peer- (resp. authority-) propagated item. Even if α ≫ β (or vice versa), we still need to verify that this result is due to strong evidence in the data.
In order to reach conclusions based on strong evidence in the data, we devise a randomization test which we call the time-shuffle test. Let H be the input influence graph and let D be the dataset that associates every node in V ∪ A that becomes active with its activation time. The time-shuffle test permutes the activation times of the nodes in D. In this way, a randomized version D′ of D is obtained. Note that D′ contains the same nodes as D (those that eventually become active); however, the activation times are permuted.
Assume that the maximum-likelihood estimation method for input H, D estimates the peer and authority coefficients (α, β). Also, denote by (α(D′), β(D′)) the peer and authority coefficients computed by running maximum-likelihood
estimation on input H, D′. Let 𝒟 be the set of all possible randomized versions that can be created from the input dataset D via the time-shuffle test. Then we define the strength of peer influence Sα to be the fraction of randomized datasets D′ ∈ 𝒟 for which α > α(D′), namely,

Sα = Pr_{D′} (α > α(D′)).   (4)

Note that the probability is taken over all possible randomized versions D′ ∈ 𝒟 of the original dataset D.
Similarly, we define the strength of authority influence Sβ to be the fraction of randomized datasets D′ for which β > β(D′), namely,

Sβ = Pr_{D′} (β > β(D′)).   (5)

Both the peer and the authority strengths take values in [0, 1]; the larger the value of the peer (resp. authority) strength, the stronger the evidence of peer influence (resp. authority influence) in the data.
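A direct Monte Carlo implementation of the two strengths is sketched below; it assumes an `estimate` callback in the spirit of the fitting sketch above, and the input format and names are ours.

import numpy as np

def influence_strengths(times, estimate, n_perm=100, seed=0):
    """Estimate S_alpha and S_beta by sampling from the time-shuffle null.

    times    : activation times, one per eventually-active node of H
    estimate : maps an assignment of activation times to the fitted
               coefficients (alpha, beta) on the fixed graph H
    """
    rng = np.random.default_rng(seed)
    alpha, beta = estimate(times)
    wins_a = wins_b = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(times)    # same nodes, permuted times
        a_rand, b_rand = estimate(shuffled)
        wins_a += alpha > a_rand
        wins_b += beta > b_rand
    # Fractions of randomized datasets that the real coefficients beat
    return wins_a / n_perm, wins_b / n_perm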

5 Experimental Results
In this section, we present our experimental evaluation both on real and synthetic
data. Our results on real data coming from online social media and computer-
science collaboration networks reveal the following interesting findings: In on-
line social-media networks the fit of our model indicates a stronger presence of
authority pressure, as opposed to the scientific collaboration network that we
examine. Our results on synthetically-generated data show that our methods
recover the authority and peer coefficients accurately and efficiently.

5.1 Datasets and Implementation


We experiment with the following real-world datasets:
The MemeTracker dataset [12] (available at http://snap.stanford.edu/data/memetracker9.html). The original dataset tracks commonly used
memes across online (mostly political) blogs and news sources. The dataset con-
tains information about the time a particular meme appeared on a given webpage
as well as links from and to each listed webpage.
In order to analyze the data with our approach, we assume that each meme
is an information item that propagates through the network of peers. The influ-
ence graph G = (V, E) consists of directed relationships between the blog sites
in the dataset. That is, the peer nodes (the set V ) are the blog sites in the
dataset. In our version of the data we only consider blogs from wordpress.com
and blogspot.com since their URL structure makes it easy to identify the same
blogger across posts. There is a directed influence link from blog b to blog b′ if there exists at least one hyperlink from b′ to b. That is, b′ refers to b and therefore b can influence b′. Such directed links constitute the edges in E.
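A minimal sketch of this edge construction, assuming the raw data is available as (referring blog, referenced blog) hyperlink pairs; that schema is our assumption, not the dataset’s actual format.

def build_influence_edges(hyperlinks):
    """Turn hyperlink pairs into directed influence edges.

    hyperlinks : iterable of (b_ref, b) pairs meaning blog b_ref contains
                 a hyperlink to blog b, i.e., b_ref refers to b.
    Returns the edge set E: b_ref refers to b, so b can influence b_ref.
    """
    return {(b, b_ref) for (b_ref, b) in hyperlinks}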
The authority nodes A in our dataset are the news-media sites available in
the original dataset. In total, the processed dataset consists of 123,400 blog
sites, 13,095 news sites and 124,694 directed links between the blogs (edges).
Although the dataset contains 71,568 memes, their occurrence follows a power-
law distribution and many memes occur very infrequently. In our experiments, we only consider the set of the 100 most frequently appearing memes. We
denote this set of memes by MF . For every meme m ∈ MF we construct a
different extended influence graph. The set of authorities for this meme Am is
the subset of the top-50 authorities in A that have most frequently used this
particular meme.
The Bibsonomy dataset [11]. BibSonomy is an online system with which in-
dividuals can bookmark and tag publications for easy sharing and retrieval. In
this dataset, the influence graph G = (V, E) consists of peers that are scientists.
There is a link between two scientists if they have co-authored at least three
papers together. The influence links in this case are bidirectional since any of
the co-authors can influence each other. The items that propagate in the net-
work are tags associated with publications. A node is active with respect to a
particular tag if at least one of the node’s publications has been associated with
the tag. For a given tag t, the set of authorities associated with this tag, At ,
are the top-20 authors with the largest number of papers tagged with t. These
authors are part of the extended influence graph of tag t, but not part of the
original influence graph.
For our experiments, we have selected papers from conferences. There are a
total of 62,932 authors, 9,486 links and 229 tags. Again, we experiment with the
top-100 most frequent tags.
Implementation. Our implementation consists of two parts. For each item, we
first count the number of users who were active and inactive at each time period;
that is, we evaluate the matrices N(x, y) and N̄(x, y) in Equation (3). Then, we run the maximum-likelihood regression to find the best estimates for α, β, and γ in Equation (1). We ran all experiments on an AMD Opteron running at 2.4GHz. Our unoptimized MATLAB code processes one meme from the MemeTracker dataset in about 204 seconds. On average, the counting step requires 96% of this total running time; the remaining 4% is the time required to run the regression step.
For the Bibsonomy dataset, the average total time spent on a tag is 38 seconds.
Again, 95% of this time is spent on counting and 5% on regression.
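A sketch of the counting step that dominates the running time, assuming activation times are stored per node; the variable names and the exact handling of ties at time t are our own choices.

from collections import defaultdict
import numpy as np

def count_exposures(in_nbrs, t_peer, t_auth, T):
    """Build the count tables N(x, y) and N-bar(x, y) for one item.

    in_nbrs : list over peers; entry u holds u's influencing peers
    t_peer  : activation time per peer (np.inf if never active)
    t_auth  : activation times of the authorities
    """
    N, Nbar = defaultdict(int), defaultdict(int)
    t_peer = np.asarray(t_peer, dtype=float)
    t_auth = np.asarray(t_auth, dtype=float)
    for t in range(1, T + 1):
        y = int((t_auth < t).sum())          # authorities active before t
        for u, nbrs in enumerate(in_nbrs):
            if t_peer[u] >= t:               # u inactive at the start of t
                x = int((t_peer[nbrs] < t).sum())
                if t_peer[u] == t:
                    N[(x, y)] += 1           # became active at time t
                else:
                    Nbar[(x, y)] += 1        # stayed inactive at time t
    return N, Nbar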

5.2 Gain of Authority Integration

The goal of our first experiment is to demonstrate that the integration of au-
thority influence in the information-propagation models can help to explain ob-
served phenomena that peer models had left unexplained. For this we use the
MemeTracker and the Bibsonomy datasets to learn the parameters α, β and γ
for each of the propagated items. At the same time, we use the peer-only version
of our model by setting β = 0 and learn the parameters α′ and γ′ for each one
of the propagated items. This way, the peer-only model does not attempt to
distinguish authority influence, and is similar to the models currently found in
the literature.

Fig. 2. MemeTracker dataset. Histogram of the values of the peer and externality coefficients α and γ recovered for the pure peer (β = 0) and the integrated peer and authority model. (a) α coefficients. (b) γ coefficients.

The results for the MemeTracker dataset are shown in Figure 2. More specif-
ically, Figure 2(a) shows the histogram of the recovered values of α and α′ we
obtained. The two histograms show that the distribution of the values of the
peer coefficient we obtain using the two models are very similar. On the other
hand, the histogram of the values of the externality coefficients obtained for the
two models (shown in Figure 2(b)) are rather distinct. In this latter pair of his-
tograms, we can see that the values of the externality coefficient obtained in the
peer-only model are larger than the corresponding values we obtain using our
integrated peer and authority model. This indicates that the peer-only model
could only explain a certain portion of the observed propagation patterns asso-
ciating the unexplained patterns to random effects. The addition of an authority
parameter explains a larger portion of the observed data, attributing much less
of the observations to random factors. The results for the Bibsonomy dataset
(Figure 3) indicate the same trend.

Fig. 3. Bibsonomy dataset. Histogram of the values of the peer and externality coefficients α and γ recovered for the pure peer (β = 0) and the integrated peer and authority model. (a) α coefficients. (b) γ coefficients.

5.3 Analyzing the MemeTracker Dataset


In this section, we show the results of our analysis for the MemeTracker dataset.
The results show that the majority of the memes we consider are authority-
propagated. This means that bloggers adopt memes from authoritative online news-media sites more than from their fellow bloggers.

Fig. 4. MemeTracker dataset. Frequency distribution of the recovered strength of peer and authority influence. (a) Peer influence. (b) Authority influence.

The above result is illustrated in Figure 4. These histograms show the number
of memes that have a particular strength of peer (Figure 4(a)) and authority
influence (Figure 4(b)). We obtain these results by estimating the strength of
peer and authority influence using 100 random instances of the influence graph
generated by the time-shuffle test (see Section 4.2). The two histograms shown
in Figures 4(a) and 4(b) indicate that for most of the memes in the dataset,
authority pressure is a stronger factor affecting their propagation compared to
peer influence. More specifically, the percentage of memes with peer strength
greater than 0.8 is only 18% while the percentage of memes with authority
strength greater than 0.8 is 45%. Also, 46% of memes have peer strength below
0.2, while only 22% of memes have authority strength below 0.2.
We demonstrate some anecdotal examples of peer- and authority-propagated
memes in Figure 5. The plot is a two-dimensional scatterplot of the peer and
authority strength of each one of the top-100 most frequent memes. The size of
the marker associated with each meme is proportional to the meme’s frequency.
A lot of the memes in the lower left, that is, memes with both low peer and
authority strength, are commonly seen phrases that are arguably not subject to
social influence. The memes in this category tend to be short and generic. For
example, “are you kidding me” and “of course not” were placed in this category.
The meme “life liberty and the pursuit of happiness” is also placed in the same
category. Although this last quote is not as generic as the others, it still does
not allude to any specific event or controversial political topic. The low social
correlation attributed to these memes indicates that they sporadically appear
in the graph, without relation to each other. Using an equi-depth histogram we
extract the top-5 most frequent memes with low peer and low authority
strength and show them in Group 1 of Table 1.
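The equi-depth bucketization simply cuts each strength at its empirical quantiles; below is a minimal sketch with three equally populated buckets (the bucket count is our illustrative choice). Memes in the lowest peer bucket and the lowest authority bucket, ranked by frequency, yield Group 1 of Table 1.

import numpy as np

def equi_depth_bucket(values, n_buckets=3):
    """Assign each value to one of n_buckets equally populated buckets."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, n_buckets + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")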

Fig. 5. MemeTracker dataset. Peer strength (x-axis) and authority strength (y-axis) of the top-100 most frequent memes; selected memes are annotated in the plot. The size of each circle is proportional to the frequency of the meme.

Diagonally opposite, in the upper right part of the plot, are the memes with
high peer and high authority strength. These are particularly widely-spread
quotes that were pertinent to the 2008 U.S. Presidential Election, and that
frequently appeared in both online news media sites and blog posts. Examples
of memes in this category include “joe the plumber” and President Obama’s
slogan, “yes we can.” Finally, the meme “this is from the widows the orphans
and those who were killed in iraq” is also in this category. This is a reference to
the much-discussed incident where an Iraqi journalist threw a shoe at President
Bush. The top-5 most frequent memes with high peer and authority strengths
are also shown in Group 2 of Table 1. Comparing the memes in Groups 1 and
2 in Table 1, one can verify that, on average, the quotes with high peer and
high authority strength are much longer and more specific than those with low
peer and low authority strengths. As observed before, exceptions to this trend
are the presidential campaign memes “joe the plumber” and “yes we can”.
Memes with low peer and high authority strength (left upper part of the
scatterplot in Figure 5) tend to contain quotes of public figures, or refer to
events that were covered by the news media and were then referenced in blogs.
One example is “I barack hussein obama do solemnly swear,” the first line of
the inaugural oath. The inauguration was covered by the media, so the quotes
originated in news sites and the bloggers began to discuss it immediately after.
Typically, memes in this group all occur within a short period of time. In con-
trast, memes with both high peer and high authority influence are more likely to
gradually gain momentum. The top-5 most frequent memes with low peer and
high authority strength are also shown in Group 3 of Table 1.
We expect that memes with high peer and low authority strength (right lower part of the scatterplot in Figure 5) are mostly phrases that are not present in the mainstream media, but are very popular within the world of bloggers. An example of such a meme, as extracted by our analysis, is “mark my words it will not be six months before the world tests barack obama like they did john kennedy.”

Table 1. MemeTracker dataset. Examples of memes with different peer and authority
strengths. Bucketization was done using equi-depth histograms.

Group 1: Top-5 frequent memes with low peer and low authority strength.
1. life liberty and the pursuit of happiness
2. hi how are you doing today
3. so who are you voting for
4. are you kidding me
5. of course not
Group 2: Top-5 frequent memes with high peer and high authority strength.
1. joe the plumber
2. this is from the widows the orphans and those who were killed in iraq
3. our opponent is someone who sees america it seems as being so imperfect
imperfect enough that he’s palling around with terrorists who would target their
own country
4. yes we can yes we can
5. i guess a small-town mayor is sort of like a community organizer
except that you have actual responsibilities
Group 3: Top-5 frequent memes with low peer and high authority strength.
1. i need to see what’s on the other side
i know there’s something better down the road
2. i don’t know what to do
3. oh my god oh my god
4. how will you fix the current war on drugs in
america and will there be any chance of decriminalizing marijuana
5. i barack hussein obama do solemnly swear
Group 4: Top-5 frequent memes with high peer and low authority strength.
1. we’re in this moment and if we fail to do the right thing heaven help us
2. if you know what i mean
3. what what are you talking about
4. i think we should all be fair and balanced don’t you
5. our national leaders are sending u s soldiers on a task that is from god

This is a quote by Joe Biden that generates many more high-ranked
blog results than news sites on a Google search. Another example is “i think we
should all be fair and balanced don’t you,” attributed to Senator Schumer in an
interview on Fox News, which was not covered by mainstream media but was
an active topic of discussion for bloggers. The top-5 most frequent memes with
high peer and low authority strength are also shown in Group 4 of Table 1.

5.4 Analyzing the Bibsonomy Dataset


In this section, we show the results of our analysis for the Bibsonomy dataset. The
results show that the majority of the items we consider here are peer-propagated.
Recall that in the case of the Bibsonomy dataset the propagated items are
tags associated with papers written by the scientists forming the collaboration

network. One should interpret tags as research topics or themes. Our findings
indicate that when choosing a research direction, scientists are more likely to be
influenced by people they have collaborated with than by experts in their field.
The above result is illustrated in Figure 6. These histograms show the num-
ber of tags that have a particular strength of peer (Figure 6(a)) and authority
influence (Figure 6(b)). We obtain these results by estimating the strength of
peer and authority influence using 100 random dataset instances generated by
the time-shuffle test (see Section 4.2).

Fig. 6. Bibsonomy dataset. Frequency distribution of the recovered strength of peer and authority influence. (a) Peer influence. (b) Authority influence.

Overall, we observe stronger peer influence than authority influence in the


Bibsonomy dataset, as illustrated in Figures 6(a) and 6(b). The percentage
of tags with peer strength greater than 0.8 is 67% while the percentage of tags
with authority strength greater than 0.8 is only 15%. Also, 41% of the tags have
authority strength below 0.2, while only 11% of the tags have peer strength
below 0.2.

5.5 Experiments on Synthetic Data


We conclude our experimental evaluation by showing the performance of our
model using synthetically generated data. The sole purpose of the experiments
reported here is to demonstrate that in such data the maximum-likelihood pa-
rameter estimation and the time-shuffle randomization test lead to conclusions
that are consistent with the data-generation process.

Accuracy of Recovery: To verify that the recovery of α and β is accurate, we randomly generate a synthetic power-law graph and simulate the propagation of an item over this graph using the logistic function with predetermined values for α, β, and γ. In particular, we used the Barabási model to generate graphs with 10,000 peers and 50 authorities. We used α, β ∈ {1, 2, 3, 4, 5, 6}, and γ = −10. Higher values for α and β cause all the nodes to become active almost immediately, so it becomes very difficult to observe how the items propagate. To quantify the accuracy with which a parameter x is recovered, we define the relative recovery error.
Fig. 7. Synthetic data. Relative error for the peer coefficient α and the authority coefficient β. (a) Peer coefficient α. (b) Authority coefficient β.

If x is the value of the coefficient used in the generation process and x̂ is the recovered value, then the relative error is given by

RelErr(x, x̂) = |x − x̂| / x.
The relative error takes values in the range [0, ∞), where a smaller value indicates
better accuracy of the maximum-likelihood estimation method.
Figure 7 shows the relative recovery errors for different sets of values for α and
β in this simulated graph, where darker colors represent smaller relative errors.
In most cases, for both the peer and the authority coefficients, the relative error
is below 0.2 indicating that the recovered values are very close to the ones used
in the data-generation process.
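The whole recovery experiment then reduces to sweeping the generation grid. The sketch below assumes a run_trial helper that composes a simulator and an estimator such as the sketches given earlier; all names are ours.

import numpy as np

def rel_err(x, x_hat):
    """Relative recovery error; smaller values indicate better accuracy."""
    return abs(x - x_hat) / x

def recovery_grid(run_trial, gamma=-10.0, grid=tuple(range(1, 7))):
    """Sweep alpha, beta over the grid and record the relative errors.

    run_trial : assumed helper; simulates one propagation with the given
                (alpha, beta, gamma) and returns the ML estimates.
    """
    k = len(grid)
    errs_a, errs_b = np.zeros((k, k)), np.zeros((k, k))
    for i, a in enumerate(grid):
        for j, b in enumerate(grid):
            a_hat, b_hat, _ = run_trial(a, b, gamma)
            errs_a[i, j] = rel_err(a, a_hat)
            errs_b[i, j] = rel_err(b, b_hat)
    return errs_a, errs_b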

Fig. 8. Synthetic data. Recovering the peer coefficient α and the authority coefficient β for the real data and the data after time-shuffle randomization. (a) Recovered α for fixed β = 2. (b) Recovered β for fixed α = 2.

Time-Shuffle Test on Synthetic Data: Figures 8(a) and 8(b) show the
recovered value of peer and authority coefficient respectively, as a function of the
value of the same parameter used in the data-generation process. One can observe
that in both cases the estimated value of the parameter is very close to the input
parameter. Visually this is represented by the proximity of the recovered curve
to the y = x curve; curve y = x represents ideal, errorless recovery. Second,


the recovered values follow the trend of the values used in the data-generation
process. That is, as the values of α and β used in the data generation increase,
the recovered values of α and β also increase. This indicates that even if the
maximum-likelihood estimation does not recover the data-generation parameters
exactly, it correctly identifies the types of influence (peer or authority).
In addition to the recovered values of peer and authority coefficient we also
report the average of the corresponding parameters obtained in 100 randomized
instances of the originally generated dataset; the randomized instances are all
generated using the time-shuffle test. These averages are reported as the dashed
line in both Figures 8(a) and 8(b). We observe that the values of peer and
authority coefficients obtained for these randomized datasets are consistently
smaller than the corresponding recovered and actual values. This means that
the time-shuffle test is consistently effective in identifying both peer and au-
thority influence. Observe that as the values of α and β parameters used in the
data-generation process increase, the difference between the average randomized
value of the parameter and the recovered value increases. This suggests that
as the dependence of the propagation on a particular type of influence (peer
or authority) becomes larger, it becomes easier to identify the strength of the
corresponding influence type using the time-shuffle test.

6 Conclusions
Given the adoption patterns of network nodes with respect to a particular item,
we have proposed a model for deciding whether peer or authority pressure played
a central role in its propagation. For this, we have considered an information-
propagation model where the probability of a node adopting an item depends
on two parameters: (a) the number of the node’s neighbors that have already
adopted the item and (b) the number of authority nodes that appear to have
the item. In other words, our model extends traditional peer-propagation mod-
els with the concept of authorities that can globally influence the network. We
developed a maximum-likelihood framework for quantifying the effect of peer
and authority influence in the propagation of a particular item and we used this
framework for the analysis of real-life networks. We find that accounting for au-
thority influence helps to explain more of the signal which many previous models
classified as noise. Our experimental results indicate that different types of net-
works demonstrate different propagation patterns. The propagation of memes in
online media seems to be largely affected by authority nodes (e.g., news-media
sites). On the other hand, there is no evidence for authority pressure in the
propagation of research trends within scientific collaboration networks.
There is a set of open research questions that arise from our study. First,
various generalizations could fit in our framework: peers or authorities could
influence authorities, nodes or edges could have different weights indicating
stronger/weaker influence pressures, and so on. More importantly, while our
methods compare peer and authority influence, it would be interesting to ac-
count for selection effects [2, 3] that might affect the values of the coefficients
of our model. Such a study can give a stronger signal about the exact source
of influence in the observed data. Furthermore, in this paper we have consid-
ered that the set of authority nodes are predefined. It would be interesting to see
whether the maximum-likelihood framework we have developed can be extended
to automatically identify the authority nodes, or whether some other approach
(e.g., one based on the HITS algorithm).

References
1. Amatriain, X., Lathia, N., Pujol, J.M., Kwak, H., Oliver, N.: The wisdom of the
few: A collaborative filtering approach based on expert opinions from the web. In:
SIGIR (2009)
2. Anagnostopoulos, A., Kumar, R., Mahdian, M.: Influence and correlation in social
networks. In: KDD (2008)
3. Aral, S., Muchnik, L., Sundararajan, A.: Distinguishing influence-based contagion
from homophily-driven diffusion in dynamic networks. Proceedings of the National
Academy of Sciences, PNAS 106(51) (2009)
4. Bass, F.M.: A new product growth model for consumer durables. Management
Science 15, 215–227 (1969)
5. Cacioppo, J.T., Fowler, J.H., Christakis, N.A.: Alone in the crowd: The structure
and spread of loneliness in a large social network. Journal of Personality and Social
Psychology 97(6), 977–991 (2009)
6. Christakis, N., Fowler, J.: Connected: The surprising power of our social networks
and how they shape our lives. Back Bay Books (2010)
7. Fowler, J.H., Christakis, N.A.: The dynamic spread of happiness in a large social
network: Longitudinal analysis over 20 years in the Framingham Heart Study. British
Medical Journal 337 (2008)
8. Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and
influence. In: KDD (2010)
9. Granovetter, M.: Threshold models of collective behavior. The American Journal
of Sociology 83, 1420–1443 (1978)
10. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: KDD (2003)
11. Knowledge and Data Engineering Group, University of Kassel: Benchmark Folksonomy Data from BibSonomy, Version of June 30th (2007)
12. Leskovec, J., Backstrom, L., Kleinberg, J.M.: Meme-tracking and the dynamics of
the news cycle. In: KDD (2009)
13. Onnela, J.-P., Reed-Tsochas, F.: Spontaneous emergence of social influence in on-
line systems. Proceedings of the National Academy of Sciences, PNAS (2010)
14. Rosenquist, J.N., Fowler, J.H., Christakis, N.A.: Social network determinants of
depression. Molecular Psychiatry 16(3), 273–281 (2010)
15. Wu, S., Hofman, J.M., Mason, W.A., Watts, D.J.: Who says what to whom on
twitter. In: WWW, pp. 705–714 (2011)
16. Yang, J., Leskovec, J.: Modeling information diffusion in implicit networks. In:
ICDM (2010)
Constrained Logistic Regression for
Discriminative Pattern Mining

Rajul Anand and Chandan K. Reddy

Department of Computer Science, Wayne State University, Detroit, MI, USA

Abstract. Analyzing differences in multivariate datasets is a challenging problem. This topic was earlier studied by finding distribution differences either in the form of patterns representing conjunctions of attribute-value pairs or through univariate statistical analysis of each attribute in order to highlight the differences. All such methods focus only on changes in the attributes in some form and do not explicitly consider the class labels associated with the data. In this paper,
we pose the difference in distribution in a supervised scenario where the change
in the data distribution is measured in terms of the change in the corresponding
classification boundary. We propose a new constrained logistic regression model
to measure such a difference between multivariate data distributions based on
the predictive models induced on them. Using our constrained models, we mea-
sure the difference in the data distributions using the changes in the classification
boundary of these models. We demonstrate the advantages of the proposed work
over other methods available in the literature using both synthetic and real-world
datasets.

Keywords: Logistic regression, constrained learning, discriminative pattern mining, change detection.

1 Introduction
In many real-world applications, it is often crucial to quantitatively characterize the
differences across multiple subgroups of complex data. Consider the following moti-
vating example from the biomedical domain: Healthcare experts analyze cancer data
containing various attributes describing the patients and their treatment. These experts
are interested in understanding the difference in survival behavior of the patients be-
longing to different racial groups (Caucasian-American and African-American) and in
measuring this difference across various geographical locations. Such survival behav-
ior distributions of these two racial groups of cancer/non-cancer patients are similar in
one location but are completely different in other locations. The experts would like
to simultaneously (i) model the cancer patients in each location and (ii) quantify the
differences in the racial groups across various locations. The problem goes one step
further: the eventual goal is to rank the locations based on the differences in the can-
cer cases of the two racial groups. In other words, the experts want to find the locations
where the difference in the predictive (cancer) models for the two racial groups is higher
and the locations where such difference is negligible. Depending on such information,


more health care initiatives will be organized in certain locations to reduce the racial
discriminations in cancer patients [22].
In this problem, the main objective is not only to classify the cancer and non-cancer
patients, but also to identify the discriminations (distribution difference) in the cancer
patients across multiple subpopulations (or subgroups) in the data. The traditional so-
lutions for this research problem partially address the dissimilarity issue, but fail to
provide any comprehensive technique in terms of the prediction models. It is vital to de-
velop an integrated framework that can model the discriminations and simultaneously
develop a predictive model.
To handle such problems, the methods for modeling the data should go beyond opti-
mizing a standard prediction metric and should simultaneously identify and model the
differences between two multivariate data distributions. Standard predictive models in-
duced on the datasets capture the characteristics of the underlying data distribution to a
certain extent. However, the main objective of such models is to accurately predict on
the future data (from the same distribution) and will not capture the differences between
two multivariate data distributions.

1.1 Existing Methods


To find the changes between multivariate data distributions, we need to understand (i)
the kind of changes and (ii) how to detect and model such changes. Prior work emphasized measuring the change in a dataset using the
– difference in probability distribution between individual attributes [15,18]
– difference in the support level of patterns (attribute-value combinations) [8,4]
We term these works as ‘unsupervised distribution change detection’. These methods
do not consider the underlying class distribution and how the class distribution changes
between the datasets. Existing methods like contrast set mining and emerging pattern mining, discussed in Section 2, provide rules with different support criteria (with statistical significance) within two classes. Contrast sets might provide class-wise analysis in terms of patterns which differ in support, but they cannot quantitatively determine whether the overall class distribution between two datasets is different or to what extent it differs. The requirement of discrete data for contrast sets, among many open issues identified in [27], needs to be addressed as well.
In the case of univariate data, methods such as the Kolmogorov–Smirnov (KS) test [18] provide information about whether two samples come from the same distribution or not. In the multivariate case, an approach that takes the maximum KS test statistic among all possible orderings can provide some vital information, but it is again a univariate analysis extended to multivariate data. Also, the number of possible orderings increases exponentially with higher dimensions, making the test statistic computationally expensive. The popular KL-divergence [15], also known as relative entropy, does measure a change in distribution, although it is non-symmetric in nature (KL(A‖B) ≠ KL(B‖A)) and a purely data-oriented approach. Thus, all these methods provide some kind of information about
patterns or a test statistic. However, in this work, we are interested in finding whether the available data, with respect to the class distribution, require a different classification model or not.
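For reference, both univariate tools mentioned above are available off the shelf. The short sketch below contrasts them on two Gaussian samples and also exhibits the asymmetry of the KL-divergence; the histogram-based discretization used for the KL estimate is our own choice.

import numpy as np
from scipy.stats import ks_2samp, entropy

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 2000)
b = rng.normal(0.5, 1.0, 2000)

# Two-sample Kolmogorov-Smirnov test of the same-distribution hypothesis.
stat, p_value = ks_2samp(a, b)

# KL divergence estimated on a shared histogram; note KL(a||b) != KL(b||a).
bins = np.histogram_bin_edges(np.concatenate([a, b]), bins=30)
pa, _ = np.histogram(a, bins=bins, density=True)
pb, _ = np.histogram(b, bins=bins, density=True)
eps = 1e-12                                  # avoid zero-probability bins
kl_ab = entropy(pa + eps, pb + eps)
kl_ba = entropy(pb + eps, pa + eps)
print(f"KS={stat:.3f} (p={p_value:.3g}), "
      f"KL(a||b)={kl_ab:.3f}, KL(b||a)={kl_ba:.3f}")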

Our approach is to consider the change between datasets as the change in underly-
ing class distributions. Our Supervised Distribution Difference (SDD) measure defined
in Sec. 3.2 aims to detect the change in the classification criteria. The kind of distribution change we are trying to find can be illustrated using an example. Figure 1(a) visualizes two binary datasets. Performing any univariate or
multivariate distribution difference analysis will give us the conclusion that these two
datasets are “different” or provide us with some rules which differ in support (con-
trast set mining). We agree with such analysis, but only to the extent of considering
these two datasets without their class labels. When we consider these two datasets as having two classes which need to be separated as much as possible using some classification method, we conclude that these two datasets are not different in terms of
their classification criteria. A nearly similar Logistic Regression (LR) classifier (or
any other linear classification models) can be used to separate classes in both these
datasets. Thus, our criteria of finding distribution change is in terms of change in the
classification model.

1.2 Need for Constrained Models


The above discussion clearly states that measuring the differences in multivariate data distributions based on a “model” is different from previous “data”-based approaches. As such, inducing models on the datasets and finding the difference between them can provide us
some information about the similarity of the datasets. However, there is one pertinent
question related to this discussion. Which model can accurately represent the data?
From Fig. 1(b), we can observe that there are many options for classification models
within the boundaries indicated by bold lines, representing the maximum possible width
of the classification margin, whereas the dotted line represents the optimized LR classifier
model. Any classifier between these ‘bold’ boundaries will have the same accuracy
(100% in this case). Similarly, it is shown in Fig. 1(c) for the second dataset.
Based on the parameter values, the LR model can be located between any of the class
boundaries and yet represent the data accurately. Fig. 1(d) shows the LR model (bold
line) obtained using D1 and D2 combined together. A constrained LR model obtained
for each dataset will be nearly same, since the combined model itself is a reasonable
representation of both datasets individually. Thus, the supervised distribution difference
will be reported as zero (or close to zero). Whereas, using LR model obtained separately
on each dataset (dotted lines) will report significant difference between the two datasets
despite each individual LR model being close to the maximum margin classifier in this
case. Thereby, using classification models directly to obtain the SDD will yield varying results when comparing two datasets.
In the case of high-dimensional datasets coupled with non-linearly separable classes, the number of potential classifier models required to represent a dataset increases rapidly. Thus, the model representing the data for comparison has to be chosen carefully. The basic idea of building constrained models is to provide a baseline which fairly represents both of the datasets being compared. Then, this baseline model can be altered to generate a specific model for each dataset. By doing this, we reduce the number of models available for representing the datasets.

Fig. 1. (a) Two binary datasets with similar classification criteria. (b) Dataset 1 with linear class separators. (c) Dataset 2 with linear class separators. (d) Base LR model with Dataset 1 and Dataset 2 models.

By placing an accuracy threshold on the selected models, we further reduce the number of such models and simultaneously ensure that the new models are still able to classify each dataset accurately.
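One simple way to realize this idea is sketched below: fit a base LR model on the pooled data, then refit each dataset’s model with a penalty tying its coefficients to the base model, and report the distance between the two constrained models. The quadratic tie penalty and the norm-based distance are our illustrative stand-ins, not necessarily the exact formulation developed later in the paper.

import numpy as np
from scipy.optimize import minimize

def fit_lr(X, y, anchor=None, lam=0.0):
    """Logistic regression whose weights are pulled toward `anchor`."""
    X1 = np.hstack([X, np.ones((len(X), 1))])        # append intercept column
    anchor = np.zeros(X1.shape[1]) if anchor is None else anchor

    def nll(w):
        z = X1 @ w
        ll = y @ z - np.logaddexp(0.0, z).sum()      # Bernoulli log-likelihood
        return -ll + lam * np.sum((w - anchor) ** 2) # tie to the base model

    return minimize(nll, x0=anchor.copy(), method="BFGS").x

def supervised_distribution_difference(X1, y1, X2, y2, lam=1.0):
    """Distance between the two constrained models (an SDD-style score)."""
    base = fit_lr(np.vstack([X1, X2]), np.concatenate([y1, y2]))
    w1 = fit_lr(X1, y1, anchor=base, lam=lam)
    w2 = fit_lr(X2, y2, anchor=base, lam=lam)
    return np.linalg.norm(w1 - w2)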
In this paper, we propose a new framework for constrained learning of predictive models that can simultaneously predict and measure the differences between datasets by enforcing additional constraints in such a way that the induced models are as similar as possible. Data can be represented by many forms of a predictive model, but not all of these versions perform well in terms of their predictive ability. Each predictive modeling algorithm will heuristically, geometrically, or probabilistically optimize a specific criterion and obtain an optimal model in the model space. There are other models in the model space that are also optimal or close to the optimal model in terms of the specific performance metric (such as accuracy or error rate). Each of these models will be different and yet will be a good representation of the data as long as its predictive accuracy is close to that of the optimal induced model. In our approach, we search for two such models corresponding to the two datasets under the constraint that they must be as similar as possible. The distance between these two models can then be used to quantify the difference between the underlying data distributions. Such constrained model building has been extensively studied in unsupervised scenarios [3] but is relatively unexplored in supervised settings. We chose to develop our framework using LR models due to their popularity, simplicity, and interpretability, which are critical factors for the problem that we are dealing with in this paper.
The rest of the paper is organized as follows: Section 2 discusses previous work related to the problem described. Section 3 introduces the notations and concepts useful for understanding the proposed algorithm. Section 4 introduces the proposed constrained LR framework for mining distribution changes. The experimental results on both synthetic and real-world datasets are presented in Section 5. Finally, Section 6 concludes our discussion.

2 Related Work
In this section, we describe some of the related topics available in the literature and
highlight some of the primary contributions of our work.
(1) Dataset Distribution Differences - Despite the importance of the problem, only a small amount of work is available on describing the differences between two data distributions. Earlier approaches for measuring the deviation between two datasets used simple data statistics after decomposing the feature space into smaller regions using tree-based models [22,12]. However, the final result obtained is a data-dependent measure and does not give any understanding of the features responsible for that difference. One of the main drawbacks of such approaches is that they construct a representation that is independent of the other dataset, thus making it hard for any sort of comparison. On the contrary, if we incorporate the knowledge of the other class while building models for both subgroups, the models provide more information about the similarities and dissimilarities in the distributions. This is the basic idea of our approach. Some other statistical and probabilistic approaches [25] measure the differences in the data distributions in an unsupervised setting without the use of class labels.
(2) Discriminative Pattern Mining - The majority of pattern-based mining for different, unusual statistical characteristics [19] of the data falls into the categories of contrast set mining [4,14], emerging pattern mining [8], discriminative pattern mining [10,21] and subgroup discovery [11,16]. Applying most of these methods on a given dataset with two subgroups will only give us the difference in terms of attribute-value pair combinations (or patterns) without any quantitative measure, i.e., the difference of class distribution within a small region of the data; it does not provide a global view of the overall difference. In essence, though these approaches attempt to capture statistically significant rules that define the differences, they do not measure the data distribution differences and also do not provide any classification model. The above pattern mining algorithms do not take into account the change in the distribution of class labels; instead, they define the difference in terms of change in attribute-value combinations only.
(3) Change Detection and Mining - There has been some work on change detection [17] and change mining [26,24] algorithms, which typically assume that some previous knowledge about the data is known and measure the change of the new model from a data stream. The rules that are not the same in the two models are used to indicate changes in the dataset. These methods assume that we have a particular model/data at a given snapshot and then measure the changes for the new snapshot. The data at the new snapshot will typically have some correlation with the previous snapshot in order to find any semantic relations in the changes detected.
(4) Multi-task Learning and Transfer Learning - The other seemingly related family of methods proposed in the machine learning community is transfer learning [7,23], which adapts a model built on a source domain DS (or distribution) to make a prediction on the target domain DT. Some variants of transfer learning have been pursued under different names: learning to learn, knowledge transfer, inductive transfer, and multi-task learning. In multi-task learning [5], different tasks are learned simultaneously and may benefit from common (often hidden) features benefiting each task. The primary goal of our work is significantly different from transfer learning and multi-task learning, since these methods do not aim to quantify the difference in the data distributions and are primarily aimed at improving the performance on a specific target domain. These transfer learning tasks look for commonality between the features to enable knowledge transfer, or assume an inherent distribution difference to benefit the target task.

2.1 Our Contributions

The major distinction of our work compared to the above-mentioned methods is that none of the existing methods explore the distribution difference based on a 'model' built on the data. The primary focus of the research available in the literature for computing the difference between two data distributions has been 'data-based', whereas our method is strictly 'model-based'. In other words, all of the existing methods utilize the data to measure the differences in the distributions. On the contrary, our method computes the difference using constrained predictive models induced on the data. Such constrained models have the potential to simultaneously model the data and compare multiple data distributions. Hence, a systematic way to build a continuum of predictive models is developed in such a manner that the models for the corresponding two groups are at the extremes of the continuum and the model corresponding to the original data lies somewhere on this continuum. It should be highlighted that we compute the distance between two datasets from the models alone, without referring back to the original data. The major contributions of this work are:

– Develop a measure of the distance between two data distributions using the difference between predictive models, without referring back to the original data.
– Develop a constrained version of the logistic regression algorithm that can capture the differences in data distributions.
– Provide experimental justification that the results from the proposed algorithm quantitatively capture the differences in data distributions.

3 Preliminaries

The notations used in this paper are described in Table 1. In this section, we also describe some of the basic concepts of Logistic Regression and explain the notion of supervised distribution difference.

Table 1. Notations used in this paper

Notation   Description
Di         ith dataset
Fi         ith dataset classification boundary
C          Regularization factor
L          Objective function
wk         kth component of weight vector w
Wj         jth weight vector
diag(v)    Diagonal matrix of vector v
sN         Scaled modified Newton step
Z          Scaling matrix
H          Hessian matrix
Jv         Jacobian matrix of |v|
ε          Constraint on weight values
eps        Very small value (1e-6)

3.1 Logistic Regression

In the LR model, a binary classification problem is expressed by a logit function which is a linear combination of the attributes [13]. This logit function is also considered as the log-odds of the class probabilities given an instance. Let us denote an example by x and its k-th feature by x_k. If each example is labeled either +1 or −1, and there are l features in each example, the logit function can be written as follows:

$$\log \frac{\Pr(y=+1 \mid x)}{\Pr(y=-1 \mid x)} = \sum_{k=0}^{l} w_k x_k = z \qquad (1)$$
Here, x_0 = 1 is an additional feature called 'bias', and w_0 is the corresponding 'bias weight'. From Eq. (1), we have

$$\Pr(y=+1 \mid x) = \frac{e^z}{1+e^z} = g(z) \qquad (2)$$

where $g(z) = \frac{1}{1+e^{-z}}$. Let $(x_1, x_2, \ldots, x_n)$ denote a set of training examples and $(y_1, y_2, \ldots, y_n)$ be the corresponding labels; $x_{ik}$ is the k-th feature of the i-th sample. The joint distribution of the probabilities of the class labels of all n examples is:

$$\Pr(y=y_1 \mid x_1)\,\Pr(y=y_2 \mid x_2)\cdots\Pr(y=y_n \mid x_n) = \prod_{i=1}^{n} \Pr(y=y_i \mid x_i) \qquad (3)$$

LR will learn the weights by maximizing the log-likelihood of Eq. (3):

$$L(w) = \sum_{i=1}^{n} \log \Pr(y=y_i \mid x_i) = \sum_{i=1}^{n} \log g(y_i z_i) \qquad (4)$$

where $z_i = \sum_{k=0}^{l} w_k x_{ik}$. To maximize Eq. (4), Newton's method, which iteratively updates the weights using the following update equation, is applied:

$$w^{(t+1)} = w^{(t)} - \left(\frac{\partial^2 L}{\partial w \, \partial w}\right)^{-1} \frac{\partial L}{\partial w} \qquad (5)$$

$$\frac{\partial L}{\partial w_k} = \frac{\partial}{\partial w_k} \sum_{i=1}^{n} \log g(y_i z_i) = \sum_{i=1}^{n} y_i x_{ik} \, g(-y_i z_i) \qquad (6)$$

$$\frac{\partial^2 L}{\partial w_j \, \partial w_k} = -\sum_{i=1}^{n} x_{ij} x_{ik} \, g(y_i z_i) \, g(-y_i z_i) \qquad (7)$$

To avoid overestimation of the parameters and to reduce over-fitting, a regularization term is added to the objective function. By adding the squared L2 norm and negating Eq. (4), the problem is converted to a minimization problem with the following objective function:

$$L = -\sum_{i=1}^{n} \log g(y_i z_i) + \frac{C}{2} \sum_{k=1}^{l} w_k^2 \qquad (8)$$

$$\frac{\partial L}{\partial w_k} = -\sum_{i=1}^{n} y_i x_{ik} \, g(-y_i z_i) + C w_k \qquad (9)$$

$$\frac{\partial^2 L}{\partial w_k \, \partial w_k} = \sum_{i=1}^{n} x_{ik}^2 \, g(y_i z_i) \, g(-y_i z_i) + C \qquad (10)$$
3.2 Supervised Distribution Difference

Let D1, D2 be two datasets having the same number of features, and let the curves F1 and F2 represent the decision boundaries for the datasets D1 and D2, respectively. D represents the combined dataset (D1 ∪ D2) and F is the decision boundary for the combined dataset. For the LR model, these boundaries are defined as a linear combination of attributes, resulting in a linear decision boundary. We induce constrained LR models for D1, D2 which are as close as possible to that of D and yet have significant accuracy for D1, D2, respectively. In other words, F1 and F2 have minimum angular distance from F. Since there exist many such decision boundaries, we optimize for the one with minimum angular distance from F that has higher accuracy. Supervised Distribution Difference (SDD) is defined as the change in the classification criteria in terms of measuring the deviation in the classification boundary while classifying as accurately as possible.

Definition 1. Let w^A and w^B be the weight vectors corresponding to the constrained LR models for D1 and D2; then SDD is defined as follows:

$$\mathrm{SDD}(w^A, w^B) = \sqrt{\sum_{k} \left(w_k^A - w_k^B\right)^2} \qquad (11)$$
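Reading Eq. (11) as the Euclidean distance between the two weight vectors (the reading suggested by the "Euclidean distance" axes of Figs. 4 and 5), a one-function sketch is:

import numpy as np

def sdd(w_a, w_b):
    # Eq. (11): Euclidean distance between the two constrained weight vectors
    return float(np.sqrt(np.sum((np.asarray(w_a) - np.asarray(w_b)) ** 2)))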

4 Proposed Algorithm

We will now develop a constrained LR model which can measure the supervised distribution difference between multivariate datasets. Figure 2 shows the overall framework of the proposed algorithm. We start by building an LR model (using Eq. (8)) for the combined dataset D; the weight vector obtained for this base model is denoted by R. The regularization factor C for D is obtained using the best performance in 10-fold cross validation (CV), and then the complete model is obtained using the best value of C. Similarly, LR models on the datasets D1 and D2 are also obtained. For datasets D1 and D2, the CV accuracy for the best C is denoted by Acc for each dataset. The best value of C obtained for each dataset is used while building the constrained model. After all
the required input parameters are obtained, constrained LR models are learnt individually for the datasets D1 and D2, satisfying the following constraint: the weight vector of these new constrained models must be close to that of R (it should not deviate much from R). To enforce this constraint, we change the underlying implementation of the LR model to satisfy the following condition:

$$|R_k - w_k| \le \varepsilon \qquad (12)$$

where ε is the deviation we allow from the individual weight vectors of the model obtained from D. The upper and lower bounds for each individual component of the weight vector are obtained from the above equation. To solve this problem, we use a constrained optimization algorithm in the implementation of the constrained LR models.
The first derivative obtained for the LR model (Eq. (9)) is set to zero. In our model, a scaled modified Newton step replaces the unconstrained Newton step [6].

Fig. 2. Illustration of our approach to obtain the Supervised Distribution Difference between two multivariate datasets

The scaled modified Newton step arises from examining the Kuhn-Tucker necessary conditions for Equations (8) and (12):

$$(Z(w))^{-2} \, \frac{\partial L}{\partial w} = 0 \qquad (13)$$

Thus, we have an extra term $(Z(w))^{-2}$ multiplying the first partial derivative of the optimization problem (L). This term can be defined as follows:

$$Z(w) = \mathrm{diag}\left(|v(w)|^{-1/2}\right) \qquad (14)$$

The underlying term v(w) is defined below for 1 ≤ i ≤ k:

$$v_i = w_i - (R_i + \varepsilon) \quad \text{if } \frac{\partial L_i(w)}{\partial w} < 0 \text{ and } (R_i + \varepsilon) < \infty$$
$$v_i = w_i - (R_i - \varepsilon) \quad \text{if } \frac{\partial L_i(w)}{\partial w} \ge 0 \text{ and } (R_i - \varepsilon) > -\infty$$

Thus, we can see that the epsilon constraint is used in modifying the first partial derivative of L. The scaled modified Newton step for the nonlinear system of equations given by Eq. (13) is defined as the solution to the linear system

$$\hat{A} Z s_N = -\frac{\partial \hat{L}}{\partial w} \qquad (15)$$

$$\frac{\partial \hat{L}}{\partial w} = Z^{-1} \frac{\partial L}{\partial w} \qquad (16)$$

$$\hat{A} = Z^{-2} H + \mathrm{diag}\left(\frac{\partial L}{\partial w}\right) J_v \qquad (17)$$
Reflections are used to increase the step size, and a single reflection step is defined as follows. Given a step η that intersects a bound constraint, consider the first bound constraint crossed by η; assume it is the i-th bound constraint (either the i-th upper or lower bound). Then the reflection step η^R = η except in the i-th component, where η_i^R = −η_i. In summary, our approach can be termed constrained minimization with box constraints. It differs from LR, which essentially performs an unconstrained optimization. After the constrained models for the two datasets D1 and D2 are induced, we can capture the model distance by Eq. (11). Algorithm 1 outlines our approach for generating constrained LR models.

Algorithm 1. Constrained Logistic Regression

Input: Data (D), Accuracy of LR model on D (Acc), Threshold for accuracy loss (τ), Threshold for deviation (ε), Unconstrained LR model on D (R)
Output: Final model weight vector (W)
1: maxAcc ← 0, s ← 0, modelFound ← false
2: while modelFound = false do
3:   for a ← s+0.01 to s+0.05 step 0.01 do
4:     ε ← a × R
5:     lower ← R − ε
6:     upper ← R + ε
7:     i ← 0
8:     Li ← L(Wi)
9:     repeat
10:      argmin_W −ln L(W) to compute Wi+1 with constraints lower ≤ Wi+1 ≤ upper
11:      Li+1 ← L(Wi+1)
12:      i ← i + 1
13:    until (Li+1 − Li)/Li < eps
14:    if (Acc − Acc(Wi+1))/Acc ≤ τ and maxAcc < Acc(Wi+1) then
15:      W ← Wi+1
16:      maxAcc ← Acc(Wi+1)
17:      modelFound ← true
18:    end if
19:  end for
20:  s ← s + 0.05
21: end while

Most of the input parameters for the constrained LR algorithm are dataset-dependent and are obtained before running the algorithm, as can be seen in the flowchart in Fig. 2. The only parameter required is τ, which is set to 0.15. However, depending on the application domain of the dataset used, this value can be adjusted, as its main purpose is to allow for tolerance by losing some accuracy while comparing datasets. The constraint ε is varied systematically using the variable a on line 4. This way, we gradually set bounds for the weight vector to be obtained (lines 5, 6). The weight vector for the optimization is initialized with uniform weights (line 8). Line 10 employs constrained optimization using the bounds provided earlier and terminates when the condition on line 13 is satisfied. The tolerance value eps is set to 1e-6. After the weight vector for a particular constraint is obtained, we would like to see if this model can be considered for representing the dataset. Line 14 checks whether the accuracy of the current model is within the threshold. It also compares the accuracy of the current model with that of the previously obtained model, and the better one is chosen for further analysis. The best model in the range of 1% to 5% constraint of the base weight vector R is selected. If no such model is found within this range, then we gradually increase the constraint range (line 20) until we obtain the desired model. The final weight vector is updated on line 15 and is returned after the completion of the full iteration. The convergence proof for the termination of the constrained optimization on line 10 is similar to the one given in [6].
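Line 10 of Algorithm 1 is a box-constrained minimization of Eq. (8). As an illustrative sketch, the following substitutes an off-the-shelf bound-constrained optimizer (SciPy's L-BFGS-B) for the interior trust-region method of [6]; the function name, the use of a·|R| for ε (so that lower ≤ upper even for negative components of R), the starting point R, and the regularization of all weight components are our simplifications, not the authors' exact setup:

import numpy as np
from scipy.optimize import minimize

def constrained_lr(X, y, R, a, C=1.0):
    # Fit an LR model whose weights stay within eps = a * |R| of the base
    # weight vector R, i.e. the box constraint |R_k - w_k| <= eps_k of Eq. (12).
    g = lambda z: 1.0 / (1.0 + np.exp(-z))
    eps = a * np.abs(R)
    bounds = list(zip(R - eps, R + eps))       # lines 5-6 of Algorithm 1

    def neg_log_lik(w):                        # Eq. (8), simplified penalty
        return -np.sum(np.log(g(y * (X @ w)))) + 0.5 * C * np.sum(w ** 2)

    def grad(w):                               # Eq. (9)
        return -X.T @ (y * g(-y * (X @ w))) + C * w

    # Start from R, which is feasible by construction.
    res = minimize(neg_log_lik, R, jac=grad, method="L-BFGS-B",
                   bounds=bounds, options={"ftol": 1e-6})
    return res.x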

5 Experimental Results
We conducted our experiments on five synthetic and five real-world datasets [2]. The binary datasets are represented by the triplet (dataset, attributes, instances). The UCI datasets used are (blood, 5, 748), (liver, 6, 345), (diabetes, 8, 768), (gamma, 11, 19020), and (heart, 22, 267). The synthetic datasets used in our work have 500,000 to 1 million tuples.

5.1 Results on Synthetic Datasets


First, a synthetic dataset with 10 attributes is generated using a Gaussian distribution with predefined mean and standard deviation (μ, σ). Here, two datasets D1, D2 are created with the same feature space, but the features that are important for class separation are different in the two datasets. These "significant" features are known a priori. Obtaining unconstrained LR models on each of the datasets independently will provide weights for the features, but only the highly significant features can be found using such models (normally the features that are already familiar in the application domain). Identifying the 'differential features', which we define as features that are more important in one dataset but less so in the other, was not possible using unconstrained models. Using our constrained models, we were able to identify the differential features by calculating the difference between the weight vectors for each feature and ranking the features by its magnitude. Since the ground truth is known in this case, we were able to verify that most of the differential features were identified correctly. We repeated our experiments by varying the differential features in both datasets to remove any bias toward a particular experiment.
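The ranking step just described admits a direct sketch: given the constrained weight vectors induced for the two datasets, features are ordered by the magnitude of the component-wise weight difference (names are illustrative):

import numpy as np

def differential_features(w1, w2):
    # Larger |w1_k - w2_k| marks a more 'differential' feature.
    diff = np.abs(np.asarray(w1) - np.asarray(w2))
    return np.argsort(-diff)   # feature indices, most differential first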
Table 2 highlights the difference in the weight vectors obtained from one such experiment. The difference is between the individual components of the weight vectors for the LR and constrained LR models on the two datasets. Bold numbers correspond to the highly differential features obtained from constrained LR based on the ground truth. We can notice that the LR model does not necessarily produce high absolute scores for these attributes and gives higher absolute scores to other attributes, while our method accurately captures the differential features.
Based on the ground truth available to us, we highlight (in Fig. 3) the significance of the features given by the LR and constrained LR methods based on Table 2. The features are ranked from 10 to 1, where 10 is most highly differential and 1 is least differential. From the figure, it can be observed that LR models were only able to capture features 2, 5 and 1 in agreement with the ground truth. The constrained LR model, on the other hand, was much closer to the ground truth most of the time.
Table 2. Difference between weight vectors for constrained and unconstrained LR models

LR        Constrained LR     LR        Constrained LR
-3.3732   -0.8015            1.2014    0.4258
-0.8693    0                 0.0641    0.0306
-1.2061   -0.0158           -0.5393    0.1123
-1.6274    0                -3.5901    0
 5.0797    0.9244            0.7765    0.0455

Fig. 3. Feature Ranking vs. Number of Features

5.2 The Comparison of the Distance Measure

Let NM.Fnum denote a dataset with N million tuples generated by classification function num. After computing the distance between the datasets, the main issue to be addressed is: how large should the distance be in order to conclude that the two datasets were generated by different underlying processes? The technique proposed in [12] answers this question as follows: if we assume that the distribution G of distance values (under the hypothesis that the two datasets are generated by the same process) is known, then we can compute G using the bootstrapping technique [9], and we can use standard statistical tests to compute the significance of the distance d between the two datasets. The datasets were generated using the functions F1, F2, and F4. One of the datasets is constructed by unifying D with a new block of 50,000 instances generated by F4, where D = 1M.F1. D1 = D ∪ 0.05M.F4, D2 = 0.5M.F1, D3 = 1M.F2, and D4 = 1M.F4.
Prior work [12] devised a "data-based" distance measure along with a method for measuring the statistical significance of the derived distance. The experiments conducted on synthetic datasets are explained in [1]. The distance values computed on these datasets by [12] can be taken as the ground truth, and our experiments on these datasets follow the same pattern as the earlier results. Table 3 highlights that the relative ranking among datasets by distance is the same. Note that the distances are not directly comparable ([12] and Constrained LR); only the rankings based on the computed distances can be compared.

Table 3. The distances of all four datasets by constrained LR and Ganti's method [12]

Dataset   Dist by [12]   Dist by Constrained LR
D1        0.0689         0.00579
D2        0.0022         0.004408
D3        1.2068         0.022201
D4        1.4819         0.070124

Fig. 4. Log(Euclidean distance) vs. sampling percentage for the datasets D1, D2, D3 and D4

5.3 The Sensitivity of the Distance Measure


We first show that the distance measure calculated by our algorithm precisely captures the differences between data distributions. In [20], the authors developed a systematic way of generating datasets with varying degrees of differences in data distributions. To achieve a similar goal, we generate datasets exhibiting varying degrees of similarity. We created random subsamples of a given dataset D of size p, where p is varied as 10%, 20%, ..., 100%, with a step size of 10%. Each subsample is randomly chosen 5 times and the calculated model distances are averaged to remove bias in the models. Each of these subsamples is denoted by Dp, where D100 = D. Then, we calculated the distance between D and Dp using Algorithm 1. We expect the calculated distance between D and Dp to decrease as p increases and to approach zero when p = 100%. Figure 4 shows the results on the synthetic datasets used above (Sec. 5.2). These datasets are big and thus there is a significant change in the class distribution even at small sampling levels. However, the distance is still small, as expected, and decreases monotonically. In Figure 5, we plot the model distances against the sampling size p for the real-world datasets. Here, we can observe that the class distribution is nearly uniform and thus the SDD metric does not change much, except for samples of less than 10%. The constant model distance and the sudden drop in SDD after 10% sampling indicate that samples of more than 10% of the data closely resemble the class distribution, since the induced models are nearly similar with low values of SDD. Thus, we can say that our metric captures the class distribution difference accurately.
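The protocol just described translates into a short driver; fit_pair_fn below is a hypothetical helper that induces the two constrained models on D and D_p and returns their SDD:

import numpy as np

def sensitivity_curve(X, y, fit_pair_fn, n_repeats=5, seed=0):
    # For p = 10%, 20%, ..., 100%, draw a random subsample D_p of (X, y),
    # compute the model distance between D and D_p, and average over
    # n_repeats random draws, as described above.
    rng = np.random.default_rng(seed)
    n = len(y)
    curve = []
    for p in range(10, 101, 10):
        size = max(1, n * p // 100)
        dists = [fit_pair_fn(X, y, X[idx], y[idx])
                 for idx in (rng.choice(n, size=size, replace=False)
                             for _ in range(n_repeats))]
        curve.append((p, float(np.mean(dists))))
    return curve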
Fig. 5. Euclidean distance vs. sampling percentage for the real-world datasets (Liver, Diabetes, Heart, Gamma Telescope, Blood Transfusion)

6 Conclusion
Standard predictive models induced on multivariate datasets capture certain characteristics of the underlying data distribution. In this paper, we developed a novel constrained logistic regression framework which produces accurate models of the data and simultaneously measures the difference between two multivariate datasets. These models are built by enforcing additional constraints on the standard logistic regression model. We demonstrated the advantages of the proposed algorithm using both synthetic and real-world datasets. We also showed that the distance between the models obtained from the proposed method accurately captures the distance between the original multivariate data distributions.

References
1. Agrawal, R., Imielinski, T., Swami, A.: Database mining: A performance perspective. IEEE
Trans. Knowledge Data Engrg. 5(6), 914–925 (1993)
2. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://archive.ics.uci.edu/ml/
3. Basu, S., Davidson, I., Wagstaff, K.L.: Constrained Clustering: Advances in Algorithms,
Theory, and Applications. CRC Press, Boca Raton (2008)
4. Bay, S.D., Pazzani, M.J.: Detecting group differences: Mining contrast sets. Data Mining and
Knowledge Discovery 5(3), 213–246 (2001)
5. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
6. Coleman, T.F., Li, Y.: An interior trust region approach for nonlinear minimizations subject
to bounds. Technical Report TR 93-1342 (1993)
7. Dai, W., Yang, Q., Xue, G., Yu, Y.: Boosting for transfer learning. In: ICML 2007: Proceed-
ings of the 24th International Conference on Machine Learning, pp. 193–200 (2007)
8. Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences.
In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discov-
ery and Data Mining, pp. 43–52 (1999)
Constrained Logistic Regression for Discriminative Pattern Mining 107

9. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman and Hall, London
(1993)
10. Fang, G., Pandey, G., Wang, W., Gupta, M., Steinbach, M., Kumar, V.: Mining low-support
discriminative patterns from dense and high-dimensional data. IEEE Transactions on Knowl-
edge and Data Engineering (2011)
11. Gamberger, D., Lavrac, N.: Expert-guided subgroup discovery: methodology and applica-
tion. Journal of Artificial Intelligence Research 17(1), 501–527 (2002)
12. Ganti, V., Gehrke, J., Ramakrishnan, R., Loh, W.: A framework for measuring differences in
data characteristics. J. Comput. Syst. Sci. 64(3), 542–578 (2002)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd edn. Springer, Heidelberg (2009)
14. Hilderman, R.J., Peckham, T.: A statistically sound alternative approach to mining contrast
sets. In: Proceedings of the 4th Australasian Data Mining Conference (AusDM), pp. 157–172
(2005)
15. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86
(1951)
16. Lavrač, N., Kavšek, B., Flach, P., Todorovski, L.: Subgroup discovery with CN2-SD. Journal of Machine Learning Research 5, 153–188 (2004)
17. Liu, B., Hsu, W., Han, H.S., Xia, Y.: Mining changes for real-life applications. In: Data Ware-
housing and Knowledge Discovery, Second International Conference (DaWaK) Proceedings,
pp. 337–346 (2000)
18. Massey, F.J.: The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46(253), 68–78 (1951)
19. Novak, P.K., Lavrac, N., Webb, G.I.: Supervised descriptive rule discovery: A unifying sur-
vey of contrast set, emerging pattern and subgroup mining. Journal of Machine Learning
Research 10, 377–403 (2009)
20. Ntoutsi, I., Kalousis, A., Theodoridis, Y.: A general framework for estimating similarity of
datasets and decision trees: exploring semantic similarity of decision trees. In: SIAM Inter-
national Conference on Data Mining (SDM), pp. 810–821 (2008)
21. Odibat, O., Reddy, C.K., Giroux, C.N.: Differential biclustering for gene expression analysis.
In: Proceedings of the First ACM International Conference on Bioinformatics and Compu-
tational Biology (BCB), pp. 275–284 (2010)
22. Palit, I., Reddy, C.K., Schwartz, K.L.: Differential predictive modeling for racial dispari-
ties in breast cancer. In: IEEE International Conference on Bioinformatics and Biomedicine
(BIBM), pp. 239–245 (2009)
23. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering 22(10), 1345–1359 (2010)
24. Pekerskaya, I., Pei, J., Wang, K.: Mining changing regions from access-constrained snap-
shots: a cluster-embedded decision tree approach. Journal of Intelligent Information Sys-
tems 27(3), 215–242 (2006)
25. Wang, H., Pei, J.: A random method for quantifying changing distributions in data streams.
In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS
(LNAI), vol. 3721, pp. 684–691. Springer, Heidelberg (2005)
26. Wang, K., Zhou, S., Fu, A.W.C., Yu, J.X.: Mining changes of classification by correspon-
dence tracing. In: Proceedings of the Third SIAM International Conference on Data Mining
(SDM), pp. 95–106 (2003)
27. Webb, G.I., Butler, S., Newlands, D.: On detecting differences between groups. In: Proceed-
ings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD), pp. 256–265 (2003)
α -Clusterable Sets

Gerasimos S. Antzoulatos and Michael N. Vrahatis

Computational Intelligence Laboratory (CILAB)


Department of Mathematics
University of Patras Artificial Intelligence Research Center (UPAIRC)
University of Patras, GR-26110 Patras, Greece
antzoulatos@upatras.gr, vrahatis@math.upatras.gr

Abstract. In spite of the increasing interest in clustering research within the last decades, a unified clustering theory that is independent of a particular algorithm, of the underlying data structure, and even of the objective function has not been formulated so far. In the paper at hand, we take the first steps towards a theoretical foundation of clustering by proposing a new notion of "clusterability" of data sets based on the density of the data within a specific region. Specifically, we give a formal definition of what we call an "α-clusterable" set and we utilize this notion to prove that the principles proposed in Kleinberg's impossibility theorem for clustering [25] are consistent. We further propose an unsupervised clustering algorithm which is based on the notion of the α-clusterable set. The proposed algorithm exploits the ability of the well-known and widely used particle swarm optimization [31] to maximize the recently proposed window density function [38]. The obtained clustering quality compares favorably to the corresponding clustering quality of various other well-known clustering algorithms.

1 Introduction
Cluster analysis is an important human process associated with the human ability to distinguish between different classes of objects. Furthermore, clustering is a fundamental aspect of data mining and knowledge discovery. It is the process of detecting homogeneous groups of objects without any a priori knowledge about the clusters. A cluster is a group of objects or data that are similar to one another within the particular cluster and are dissimilar to the objects that belong to other clusters [9, 19, 20].
Over the last decades, there has been increasing scientific interest in clustering, and numerous applications in different scientific fields have appeared, including statistics [7], bioinformatics [37], text mining [43], marketing and finance [10, 26, 33], image segmentation and computer vision [21] as well as pattern recognition [39], among others. Many clustering algorithms have been proposed in the literature, which can be categorised into two major categories, hierarchical and partitioning [9, 22].
Partitioning algorithms consider clustering as an optimization problem. There are two directions. The first one discovers clusters by optimizing a goodness criterion based on the distances of the dataset's points. Such algorithms are k-means [27], ISODATA [8] and fuzzy c-means [11]. The second one utilizes the notion of density and considers clusters as high-density regions. The most characteristic algorithms of this approach are DBSCAN [18], CLARANS [28] and k-windows [41].

Recent approaches to clustering apply population-based global search algorithms, exploiting the capacity (cognitive and social behaviour) of swarms and the ability of an organism to survive and adjust in a dynamically changing and competitive environment [1, 6, 12, 13, 14, 29, 32]. Evolutionary Computation (EC) refers to computer-based methods that simulate the evolution process. Genetic Algorithms (GA), Differential Evolution (DE) and Particle Swarm Optimization (PSO) are the main algorithms of EC [16]. The principal issues of these methods are the representation of the solution of the problem and the choice of the objective function.
Despite the considerable progress and innovations of the last decades, there is a gap between the practical and theoretical foundations of clustering [2, 3, 25, 30]. The problem is made worse by the lack of a unified definition of what a cluster is that would be independent of the measure of similarity/dissimilarity or the clustering algorithm. Going a step further, it is difficult to answer questions such as how many clusters exist in a dataset without any a priori knowledge of the underlying structure of the data, or whether a k-clustering of a dataset is meaningful.
All these weaknesses led to the study of the theoretical background of clustering, aiming to develop a general theory. Thus, Puzicha et al. [35] considered proximity-based data clustering as a combinatorial optimisation problem, and moreover their proposed theory aimed to address two fundamental problems: (i) the specification of suitable objective functions, and (ii) the derivation of efficient optimisation algorithms.
In 2002 Kleinberg [25] developed an axiomatic framework for clustering and showed that there is no clustering function that simultaneously satisfies three simple properties: the scale-invariance, the richness and the consistency condition. Kleinberg's goal was to develop a theory of clustering that would not be dependent on any particular algorithm, cost function or data model. To accomplish that, a set of axioms was set up, aiming to define what a clustering function is. Kleinberg's result was that there is no clustering function satisfying all three requirements.
Some years later, Ackerman and Ben-David [2] disagreed with Kleinberg's impossibility theorem, claiming that Kleinberg's result was, to a large extent, the outcome of a specific formalism rather than an inherent feature of clustering. They focused on the clustering-quality framework rather than attempting to define what a clustering function is. They developed a formalism and consistent axioms for the quality of a given data clustering. This led to a further investigation of interesting measures of the clusterability of data sets [3]. Clusterability is a measure of clustered structure in a data set. Although several notions of clusterability [17, 35] have been proposed in the literature, and although they share the same intuitive concept, these notions are pairwise incompatible, as Ackerman et al. have proved in [3]. Furthermore, they concluded that finding a close-to-optimal clustering for a well-clusterable data set is a computationally easy task compared with the general clustering task, which is NP-hard [3].

Contribution: All the aforementioned theoretical approaches refer to distance-based clustering and implicitly assume that the dissimilarity measure is a distance function. Thus, the concept of clusterability is inherent in the concept of a distance. In the paper at hand, we investigate whether the notion of clusterability can be extended to a density-based notion of clusters. To attain this goal, we introduce the notion of the α-clusterable set, which is based on the window density function [38]. We aim to capture the dense regions of points in the data set, given an arbitrary parameter α, which represents the size of a D-range, where D is the dimensionality of the data set. Intuitively, a cluster can be considered as a dense area of data points which is separated from other clusters by sparse areas of data or areas without any data points. Under this consideration, a cluster can be seen as an α-clusterable set or as a union of all intersecting α-clusterable sets. Then, a clustering, called an α-clustering, will be comprised of the set of all the clusters. In this theoretical framework, we are able to show that the properties of Kleinberg's impossibility theorem are satisfied. In particular, we prove that in the class of window density functions there exist clustering functions satisfying the properties of scale-invariance, richness and consistency. Furthermore, a clustering algorithm can be designed utilising this theoretical framework, having as its goal the detection of the α-clusterable sets.
Thus, we propose an unsupervised clustering algorithm that exploits the benefits of a population-based algorithm, known as particle swarm optimisation, in order to detect the centres of the dense regions of data points. These regions are actually what we call α-clusterable sets. When all the α-clusterable sets have been identified, a merging procedure is executed in order to merge the regions that overlap each other. After this process, the final clusters will have been formed and the α-clustering will have been detected.
The rest of the paper is organized as follows. In the next section we briefly present the background work on which our theoretical framework, analysed in Section 3, is based. In more detail, we present and analyse the proposed definitions of the α-clusterable set and α-clustering, and furthermore we show that, using these concepts, the conditions of Kleinberg's impossibility theorem hold and are consistent. Section 4 gives a detailed analysis of the experimental framework and the proposed algorithm. In Section 5 the experimental results are demonstrated. Finally, the paper ends in Section 6 with conclusions.

2 Background Material

For completeness purposes, let us briefly describe Kleinberg's axioms [25] as well as the window density function [38].

2.1 Kleinberg’s Axioms

As we have already mentioned above, Kleinberg, in [25], proposed three axioms for clustering functions and claimed that this set of axioms is inconsistent, meaning that no clustering function satisfies all three axioms. Let X = {x1, x2, ..., xN} be a data set with cardinality N and let d : X × X → R be a distance function over X, meaning that ∀ xi, xj ∈ X, d(xi, xj) > 0 if and only if xi ≠ xj, and d(xi, xj) = d(xj, xi). It is worth observing that the triangle inequality is not required to be fulfilled, i.e., the distance function need not be a metric. Furthermore, a clustering function is a function f which, given a distance function d, partitions the data set X into a set Γ of clusters.
The first axiom, scale-invariance, concerns the requirement that the clustering function has to be invariant to changes in the units of the distance measure. Formally, for any distance function d and any λ > 0, a clustering function f is scale-invariant if f(d) = f(λd).
The second property, called richness, deals with the outcome of the clustering function; it requires that every possible partition of the data set can be obtained. Formally, a function f is rich if for each partition Γ of X, there exists a distance function d over X such that f(d) = Γ.
The consistency property requires that if the distances between points lying in the same cluster are decreased and the distances between points lying in different clusters are increased, then the clustering result does not change. Kleinberg gave the following definition:
Definition 1. Let Γ be a partition of X and let d, d′ be two distance functions on X. Then the distance function d′ is a Γ-transformation of d if (a) ∀ xi, xj ∈ X belonging to the same cluster of Γ, it holds that d′(xi, xj) ≤ d(xi, xj), and (b) ∀ xi, xj ∈ X belonging to different clusters of Γ, it holds that d′(xi, xj) ≥ d(xi, xj). Furthermore, a function f is consistent if f(d) = f(d′) whenever a distance function d′ is a Γ-transformation of d.
Using the above axioms, Kleinberg stated the impossibility theorem [25]:
Theorem 1 (Impossibility Theorem). For each N ≥ 2, there is no clustering function f that satisfies scale-invariance, richness and consistency.

2.2 Window Density Function


In [38] the authors proposed the window density function as an objective function for discovering the optimum clustering. Assume that the data set comprises a set X = {x1, x2, ..., xN}, where xj is a data point in the D-dimensional Euclidean space R^D. Then we give the following definition:

Definition 2 (Window Density Function). Let a D-range of size α ∈ R and center z ∈ R^D be the orthogonal range [z1 − α, z1 + α] × ... × [zD − α, zD + α]. Assume further that the set Sα,z, with respect to the set X, is defined as:

$$S_{\alpha,z} = \{\, y \in X : z_i - \alpha \le y_i \le z_i + \alpha, \ \forall i = 1, 2, \ldots, D \,\}.$$

Then the Window Density Function (WDF) for the set X, with respect to a given size α ∈ R, is defined as:

$$\mathrm{WDF}_{\alpha}(z) = |S_{\alpha,z}|, \qquad (1)$$

where | · | indicates the cardinality of the set Sα,z.

WDF is a non-negative function that expresses the density of the region (orthogonal range) around a point. The points that are included in this region can be efficiently determined using Computational Geometry methods [5, 34]. For a given α, the value of WDF increases continuously as the density of the region within the window increases. Furthermore, for low values of α, WDF has many local maxima. As the value of α increases, WDF reveals a number of local maxima that corresponds to the number of clusters. However, for higher values of the parameter, WDF becomes smoother and the clusters are not distinguished.
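A direct NumPy reading of Definition 2 (function and variable names are ours):

import numpy as np

def wdf(z, X, alpha):
    # WDF_alpha(z) = |S_{alpha,z}|: the number of points of X falling in
    # the D-range [z_1 - alpha, z_1 + alpha] x ... x [z_D - alpha, z_D + alpha].
    inside = np.all(np.abs(np.asarray(X) - np.asarray(z)) <= alpha, axis=1)
    return int(np.count_nonzero(inside))

For example, wdf(np.zeros(2), X, 0.2) counts the points of a 2-dimensional dataset X inside the square of half-width 0.2 centred at the origin.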
Thus, it is obvious that the determination of the dense regions depends on the size of the window. Actually, the parameter α captures our inherent view of the size of the dense regions that exist in the data set. To illustrate the effect of the parameter α, we employ the dataset Dset1, which contains 1600 data points in the 2-dimensional Euclidean space (Fig. 1(a)).
The following figures exhibit the behaviour of the WDF function for distinct values of the α parameter. As we can conclude, as the value of the parameter α increases, denser and smoother regions of data points are detected. When α = 0.05 or α = 0.075 there are many maxima inside the real clusters of data points (Fig. 1(b), Fig. 1(c) respectively). As α increases there is a clear improvement in the formation of groups, namely the dense regions become more distinct and separated, so between the values α = 0.1 and α = 0.25 we can detect the four real clusters of data points (Fig. 1(d), Fig. 1(e) respectively). If the parameter α continues to grow, then the four maxima of the WDF function corresponding to the four clusters of data points, which were detected previously, merge into one single maximum, leading to the formation of one cluster (Fig. 1(f)).

3 Proposed Theoretical Framework

In this section, we give the definitions needed to support the proposed theoretical framework for clustering. Based on the observation that a good clustering is one that places the data points in high-density areas which are separated by areas of sparse points or areas with no points, we define the notion of an α-clusterable set as well as the notion of an α-clustering. To do this, we exploit the benefits of the window density function and its ability to find local dense regions of data points without investigating the whole dataset.

Definition 3 (α-Clusterable Set). Let X be the data set comprised of the points {x1, x2, ..., xN}. A set of data points xm ∈ X is defined as an α-clusterable set if there exist a positive real value α ∈ R, a hyper-rectangle Hα of size α and a point z ∈ Hα at which the window density function centered at z is unimodal. Formally,

$$C_{\alpha,z} = \{\, x_m \mid x_m \in X \ \wedge \ \exists\, z \in H_{\alpha} : \mathrm{WDF}_{\alpha}(z) \ge \mathrm{WDF}_{\alpha}(y), \ \forall y \in H_{\alpha} \,\}. \qquad (2)$$

Remark 1. It is worth mentioning that although the points y and z lie in the hyper-rectangle Hα, they are not necessarily points of the data set. Also, the hyper-rectangle Hα is a bounding box of the data set X and the set Cα,z is a subset of X. In addition, the α-clusterable set is a highly dense region due to the fact that the value of the WDF function is maximised. Furthermore, the point z can be considered as the centre of the α-clusterable set. Thus, given an α and a sequence of points zi ∈ Hα, a set that comprises a number of α-clusterable sets can be considered as a close-to-optimal clustering of X.
Fig. 1. WDF with different values of parameter α: (a) dataset DSet1 of 1600 points; (b) WDF with α = 0.05; (c) WDF with α = 0.075; (d) WDF with α = 0.1; (e) WDF with α = 0.25; (f) WDF with α = 0.5

Definition 4 (α-Clustering). Given a real value α, an α-clustering of a data set X is a partition of X, that is, a set of k disjoint α-clusterable sets of X such that their union is X. Formally, an α-clustering is a set:

$$\mathcal{C} = \{\, C_{\alpha,z_1}, C_{\alpha,z_2}, \ldots, C_{\alpha,z_k} \,\},$$

where zi ∈ Hα ⊂ R^D, i = 1, 2, ..., k, are the centres of the dense regions Cα,zi.

We explain the above notions by giving an example. Let X be a dataset of 1000 random data points drawn from the normal (Gaussian) distribution (Figure 2). The four clusters have the same cardinality, thus each of them contains 250 points. As we can notice, there exists a proper value of the parameter α, α = 0.2, such that the hyper-rectangles Hα capture the whole clusters of points. These hyper-rectangles can be considered as the α-clusterable sets. Also, it is worth mentioning that there is only one point z inside each α-clusterable set such that the window density function is unimodal.
Fig. 2. Dataset of 1000 points. Parameter value is α = 0.2

Furthermore, we define an α-clustering function for a data set X that takes a window density function, with respect to a given size α, on X and returns a partition C of α-clusterable sets of X.
Definition 5 (α-Clustering Function). A function fα(WDFα, X) is an α-clustering function if, for a given window density function with respect to a real-valued parameter α, it returns a clustering C of X such that each cluster of C is an α-clusterable set of X.
Next, we prove that the clustering function fα fulfils the properties of scale-invariance, consistency and richness. Intuitively, the scale-invariance property states that under any uniform change in the scale of the domain space of the data, the high-density areas will be maintained and they will still be separated by sparse regions of points. Richness means that there exist a parameter α and points z such that an α-clustering function f can be constructed with the property of partitioning the dataset X into α-clusterable sets. Finally, consistency means that if we shrink the dense areas (α-clusterable sets) and simultaneously expand the sparse areas between them, then we get the same clustering solution.
Lemma 1 (Scale-Invariance). Every α-clustering function is scale-invariant.
Proof. According to the definition of scale-invariance, a clustering function has this property if for every distance measure dist and any λ > 0 it holds that f(dist) = f(λ dist). In our case, an α-clustering function fα is scale-invariant since it holds that:

$$f_{\alpha}(\mathrm{WDF}_{\alpha}(z), X) = f_{\lambda\alpha}(\mathrm{WDF}_{\lambda\alpha}(\lambda z), X),$$

for every positive number λ. This is so because if the data set X is scaled by a factor λ > 0, then the window density function of each point will remain the same. Indeed, if a uniform scale is applied to the dataset, then we can find a scale factor λ such that a scaled window, with size λα, contains the same number of points as the window of size α. More specifically, for each data point y ∈ X that belongs to a window with center the point z and size α, it holds that:

$$z - \alpha \le y \le z + \alpha \ \Leftrightarrow\ \lambda z - \lambda\alpha \le \lambda y \le \lambda z + \lambda\alpha.$$

So, if the point y ∈ X belongs to the window of size α and center z, then the point y′ = λy, y′ ∈ X′, will belong to the scaled window, which has size λα and center the point z′ = λz. Thus the lemma is proved.
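A quick numerical illustration of the lemma (not a proof), redefining the WDF sketch from Sec. 2.2 for self-containment:

import numpy as np

def wdf(z, X, alpha):
    return int(np.all(np.abs(X - z) <= alpha, axis=1).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
z, alpha, lam = np.zeros(2), 0.5, 3.7
# Scaling points, center and window size by lambda preserves the count.
assert wdf(z, X, alpha) == wdf(lam * z, lam * X, lam * alpha)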
Lemma 2 (Richness). Every α-clustering function satisfies the richness property.
Proof. It is obvious that for each non-trivial α-clustering C of X, there exists a window density function for the set X, with respect to a size α, such that:

$$f(\mathrm{WDF}_{\alpha}(z), X) = \mathcal{C}.$$

In other words, given a data set of points X we can find a WDF and a size α such that each window of size α and center z is an α-clusterable set. Thus the lemma is proved.

Lemma 3 (Consistency). Every α-clustering function is consistent.
Proof. Suppose that fα is an α-clustering function. By definition, there exist α-clusterable sets of X that constitute a set

$$\mathcal{C} = \{C_{\alpha,z_1}, C_{\alpha,z_2}, \ldots, C_{\alpha,z_k}\},$$

where each zi ∈ Hα, i = 1, 2, ..., k, is the centre of the α-clusterable set Cα,zi. According to the definition of an α-clusterable set, the window density function is unimodal for each set Cα,zi. Thus, for each y ∈ Hα it holds that WDFα(zi) ≥ WDFα(y).
Furthermore, if we reduce the value of the window density function by decreasing the value of the parameter α to a smaller value α′, then for the set Cα′,zi, where α′ < α, the WDF is also unimodal centered at the point zi. Assume that there exists another point z′i ≠ zi such that WDFα′(z′i) ≥ WDFα′(zi); then the WDF function would be a multimodal function for the set Cα,zi, implying that the set Cα,zi is not an α-clusterable set, which is contrary to our assumption. So, Cα′,zi is an α-clusterable set for each value α′ < α, which means

$$f_{\alpha'}(\mathrm{WDF}_{\alpha'}(z_i), X) = \mathcal{C},$$

which implies that fα is consistent. Thus the lemma is proved.

In contrast to the general framework of Kleinberg's impossibility theorem, we obtain the following theorem:
Theorem 2. For each N ≥ 2 there is an α-clustering function that satisfies the properties of scale-invariance, richness and consistency.

Proof. The proof follows from Lemmata 1, 2 and 3.

4 Experimental Framework

In this section we propose an unsupervised algorithm, in the sense that it does not require a predefined number of clusters, in order to detect the α-clusterable sets lying in the dataset X. Defining the correct number of clusters is a critical open issue in cluster analysis; Dubes refers to it as "the fundamental problem of cluster analysis" [15], because the number of clusters is often tough to determine or, even worse, impossible to define.
Thus, the main goal of the algorithm is to identify the dense regions of points in which the window density function is unimodal. These regions constitute the α-clusterable sets that enclose the real clusters of the dataset. The algorithm runs iteratively, identifying the centre of an α-clusterable set and removing the data points that lie within it. This process continues until no data points are left in the dataset. In order to detect the centres of the dense regions we utilise a well-known population-based optimisation algorithm called Particle Swarm Optimisation (PSO) [23]. PSO is inspired by swarm behaviour, such as flocks of birds collaboratively searching for food. In the last decades there has been a rapid increase of scientific interest in Swarm Intelligence, and particularly in Particle Swarm Optimization, and numerous approaches have been proposed in many application fields [16, 24, 31]. Recently, Swarm Intelligence and especially Particle Swarm Optimisation have been utilised in Data Mining and Knowledge Discovery, producing promising results [1, 40].
In [6] an algorithm called IUC was proposed, which utilises the window density function as the objective function and the Differential Evolution algorithm in order to evolve the clustering solution of the data set, reaching the best positions of the data set. It also uses an enlargement procedure in order to detect all the points lying in the same cluster. In the paper at hand, we exploit the benefits of the Particle Swarm Optimisation algorithm to search the space of potential solutions efficiently, so as to find the global optimum of the window density function. Each particle represents the centre of a dense region of the dataset, so the particles fly through the search space forming flocks around peaks of the window density function. Thus, the algorithm detects the centres of the α-clusterable sets one at a time.
It is worth noting that the choice of the value of the parameter α plays an important role in the identification of the real number of clusters and depends on several factors. For instance, the value of the parameter α may be too small for the hyper-rectangle to capture a whole cluster, or, if the data points form dense regions of various cardinalities, hyper-rectangles with a constant size α may again fail to capture the whole clusters of the dataset. The following figures describe these cases more clearly. We conclude that a small choice of the parameter α leads to the detection of small dense regions that are the α-clusterable sets. However, as can be noticed, even the detection of small clusters of data points needs more than one α-clusterable set (Fig. 3(a)). On the other hand, increasing α causes the detection of the small clusters of the data set by using only one α-clusterable set. However, the detection of the big cluster needs more α-clusterable sets, the union of which describes the whole cluster. It has to be mentioned here that the union of overlapping α-clusterable sets is still an α-clusterable set; hence we can find a point z which will be the centre of the set and whose window density function value is maximal in a hyper-rectangle of size α′ > α, meaning that WDFα′(z) is unimodal.
In order to avoid the above situations, we propose and implement a merging procedure that merges the overlapping α-clusterable sets, so that the outcome of the algorithm represents the real number of clusters in the data set. Specifically, two dense regions (α-clusterable sets) are merged if and only if the overlap between them contains at least one data point, as in the sketch below.
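The merging test admits a one-function sketch (names are illustrative): two windows of size α are merged iff at least one data point lies in both.

import numpy as np

def windows_share_point(z1, z2, alpha, X):
    # True iff at least one point of X lies in both windows of size alpha
    # centred at z1 and z2, i.e. the two regions should be merged.
    in_both = (np.all(np.abs(X - z1) <= alpha, axis=1)
               & np.all(np.abs(X - z2) <= alpha, axis=1))
    return bool(in_both.any())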
Fig. 3. Dataset WDF with different values of parameter α: (a) effect of parameter value α = 0.2; (b) effect of parameter value α = 0.25

Subsequently, we summarise the above analysis and propose the new clustering algorithm. It is worth noting that the detection of α-clusterable sets, which are highly dense regions of the dataset, through the window density function is a maximization problem, whereas Particle Swarm Optimisation is formulated here as a minimization algorithm; hence −WDFα(z) is utilised as the fitness function.

Algorithm 1. PSO for the Unsupervised Detection of α-Clusterable Sets

repeat
  Create a data structure that holds all unclustered points
  Perform the PSO algorithm, returning the center z of an α-clusterable set
  Mark the points that lie in the window w as clustered
  Remove the clustered points from the dataset
until no unclustered points are left
Mark the points that lie in overlapping windows as members of the same cluster and merge these windows to form the clusters.
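A minimal sketch of this loop, assuming a textbook global-best PSO over the fitness −WDFα(z); the particle count, the coefficient values (0.72, 1.49, 1.49) and all function names are our own illustrative choices, not the authors' exact configuration, and the final merging step is omitted:

import numpy as np

def wdf(z, X, alpha):
    return np.count_nonzero(np.all(np.abs(X - z) <= alpha, axis=1))

def pso_window_center(X, alpha, n_particles=30, iters=100, seed=None):
    # Basic global-best PSO minimizing the fitness -WDF_alpha(z).
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)          # bounding box of the data
    pos = rng.uniform(lo, hi, size=(n_particles, X.shape[1]))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([-wdf(p, X, alpha) for p in pos])
    gbest = pbest[np.argmin(pbest_f)].copy()
    w, c1, c2 = 0.72, 1.49, 1.49                   # common PSO coefficients
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        f = np.array([-wdf(p, X, alpha) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        gbest = pbest[np.argmin(pbest_f)].copy()
    return gbest

def detect_alpha_clusterable_sets(X, alpha):
    # Repeat: find a window centre, then remove the covered points.
    remaining, centres = X.copy(), []
    while len(remaining):
        z = pso_window_center(remaining, alpha)
        in_win = np.all(np.abs(remaining - z) <= alpha, axis=1)
        if not in_win.any():                       # safeguard for the sketch
            in_win[np.argmin(np.abs(remaining - z).max(axis=1))] = True
        centres.append(z)
        remaining = remaining[~in_win]
    return centres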

It needs to be stressed that the proposed algorithm clusters a dataset in an unsupervised manner, since it detects the clusters without a priori knowledge of their number. It is based solely on the density of a region. Still, for the execution of the algorithm a user must determine the parameter α; this user-defined parameter is easily regulated, in contrast with the number of clusters, which is an invariant feature characterising the underlying structure of the dataset and is furthermore difficult to define. Also, Particle Swarm Optimization's search space dimension is fixed to the dimensionality of the dataset, in contrast to the majority of other approaches in the literature that increase the dimensionality of the optimisation problem by a factor of the maximum number of estimated clusters.

5 Experimental Results
The objective of the conducted experiments was three-fold. First, we want to investigate
the behaviour of the algorithm with respect to the resizing of the window. Second, we
compare the proposed algorithm with well-known partitioning clustering algorithms:
k-means, DBSCAN, k-windows, DEUC and IUC. Third, we want to examine the scalability
of the proposed algorithm.
In order to evaluate the performance of the clustering algorithms, the Entropy and
Purity measures are utilised. The Entropy function [43] represents the dissimilarity of
the points lying in a cluster; higher homogeneity means that the entropy values converge
to zero. However, the usage of the entropy function requires knowledge of the
real classification/categorization of the points. Let C = {C1, C2, . . . , Ck}
be a clustering provided by a clustering algorithm and L = {L1, L2, . . . , Lm} be the
target classification of the patterns; then the entropy of each cluster Ci is defined as

    Hi = −∑_{j=1}^{m} P(x ∈ Lj | x ∈ Ci) log P(x ∈ Lj | x ∈ Ci).

For a given set of n patterns, the entropy of the entire clustering is the weighted average
of the entropies of the clusters. The Purity is defined as r = (1/n) ∑_{i=1}^{k} αi, where k denotes
the number of clusters found in the dataset and αi represents the number of patterns of
the class to which the majority of the points in cluster i belongs. The larger the value of
purity, the better the clustering solution is [42].
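For concreteness, here is a small Python sketch (hypothetical function name) computing the weighted entropy and the purity from arrays of predicted cluster indices and true class labels:

```python
import numpy as np

def entropy_and_purity(pred, true):
    # pred[i]: cluster index of pattern i; true[i]: target class of pattern i.
    n = len(pred)
    total_entropy, majority_sum = 0.0, 0
    for c in np.unique(pred):
        members = true[pred == c]
        probs = np.bincount(members) / len(members)
        probs = probs[probs > 0]
        h_c = -np.sum(probs * np.log(probs))        # H_i of cluster c
        total_entropy += (len(members) / n) * h_c   # weighted average
        majority_sum += np.bincount(members).max()  # majority-class count of cluster c
    return total_entropy, majority_sum / n          # (entropy, purity r)

# Example: a perfect clustering has entropy 0 and purity 1.
pred = np.array([0, 0, 1, 1])
true = np.array([0, 0, 1, 1])
print(entropy_and_purity(pred, true))  # (0.0, 1.0)
```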

5.1 Investigate the Effect of the Parameter α


The aim of these experiments is to investigate the effect of the parameter α on the performance
of the proposed clustering algorithm. To do this, we utilised three 2-dimensional
artificial datasets (Fig. 4). The first one, Dset1, has 1600 points that form four spherical
clusters, each of a different size. The second one, Dset2, has 2761 points grouped
into four arbitrarily shaped clusters, three of them convex and one non-convex. The
final dataset, Dset3, contains 5000 points forming two randomly created clusters, whereby
one cluster is located at the centre area of a quadratic grid and the other surrounds it, as
described in [1].
The three plots of Fig. 5 present the entropy and the purity of the clustering
plotted against the increase of the window size. As we can conclude, the clustering quality
worsens as the parameter α takes higher values. This is rational, due to
the fact that higher values of the parameter α lead to the creation of sets that contain data
from different groups.

Fig. 4. Datasets Dset1 , Dset2 and Dset3


Fig. 5. Entropy and Purity vs Window Size α: (a) for Dset1, (b) for Dset2, (c) for Dset3

5.2 Comparing Proposed Algorithm against Well-Known Clustering Algorithms


In this series of experiments, we compare the performance of the proposed algorithm
with that of other well-known clustering algorithms: k-means [27], DBSCAN [18],
k-windows [41], and two evolutionary clustering algorithms called DEUC [38] and IUC [6].
All algorithms were implemented in the C++ programming language on the Linux
operating system. For each dataset, the algorithms were executed 100 times, except for
DBSCAN, which, due to its deterministic nature, was executed once. For the k-means
algorithm, the parameter k is set equal to the real number of clusters in each dataset.
For the other algorithms, the parameters were determined heuristically. Finally, for the
algorithms DEUC and IUC, all the mutation operators are utilized, in order to investigate
their effects on clustering quality.
In this series of experiments, apart from the two datasets Dset1 and Dset2, we utilise
two more datasets of 15000 points each, which are randomly generated from
multivariate normal distributions. The first of them, denoted as Dset4, was created as
described in [32], with unit covariance matrix and different mean vectors, forming six
clusters of various cardinality. To form the latter dataset (Dset5) we utilised random
parameters based on [36]; it contains eight clusters. All the datasets are normalized to
the [0, 1]^D range.
The experimental results (Table 1) for the datasets show that the proposed algorithm,
called PSOα-Cl, manages to find a good clustering solution in the majority of the experiments,
as the average entropy tends to zero and the average purity tends to 100%.

5.3 Investigate the Scalability of the Algorithm


In order to examine the scalability of the proposed algorithm, we created artificial
datasets randomly generated from multivariate normal distributions with different
mean vectors and covariance matrices. The data of each of these datasets are
clustered into eight groups of various cardinality, and the dimensionality of the datasets
varies between 3 and 10. All the datasets contain 15000 points and are normalized to the
[0, 1]^D range. We tested the performance of the proposed algorithm for different values
of the parameter α, and in each case we calculated the entropy and purity measures.
Observing the results (Table 2), we can conclude that the proposed algorithm exhibits
good scalability properties, since the entropy tends to zero and the purity tends to one
as the dimensionality and the cardinality of the datasets increase. Moreover, it is worth
noting that for the higher-dimensional datasets

Table 1. The mean values and standard deviation of entropy and purity for each algorithm over
the four datasets

                 Dset1                    Dset2
Method           Entropy       Purity    Entropy       Purity
IUC DE1 8.55e-3(0.06) 99.7%(0.02) 4.54e-2(0.11) 98.9%(0.03)
IUC DE2 1.80e-2(0.1) 99.4%(0.03) 3.08e-2(0.09) 99.2%(0.03)
IUC DE3 1.94e-4(0.002) 100%(0.0) 7.16e-2(0.13) 98.2%(0.03)
IUC DE4 6.01e-3(0.06) 99.8%(0.01) 4.21e-2(0.10) 99.0%(0.02)
IUC DE5 2.46e-2(0.01) 99.2%(0.03) 6.95e-2(0.13) 98.3%(0.03)
DEUC DE1 1.70e-1(0.1) 91.0%(0.05) 3.39e-2(0.02) 90.5%(0.01)
DEUC DE2 1.36e-1(0.09) 92.3%(0.05) 3.22e-2(0.02) 90.3%(0.01)
DEUC DE3 1.66e-1(0.09) 90.4%(0.05) 2.90e-2(0.02) 90.8%(0.01)
DEUC DE4 1.45e-1(0.09) 91.1%(0.04) 3.16e-2(0.02) 90.4%(0.01)
DEUC DE5 1.39e-1(0.1) 92.9%(0.05) 2.88e-2(0.02) 90.6%(0.01)
k-means 1.10e-1(0.21) 96.7%(0.06) 3.45e-1(0.06) 90.5%(0.03)
k-windows 0.00e-0(0.0) 99.2%(0.02) 2.20e-2(0.08) 95.4%(0.01)
DBSCAN 0.00e-0(—) 100%(—) 3.74e-1(—) 100.0%(—)
PSO 0.05-Cl 0.00e-0(0.0) 100%(0.0) 6.44e-2(0.11) 98.2%(0.03)
PSO 0.075-Cl 0.00e-0(0.0) 100%(0.0) 1.86e-1(0.16) 95.1%(0.04)
PSO 0.1-Cl 0.00e-0(0.0) 92.048%(0.01) 3.07e-1(0.08) 92.0%(0.0)
PSO 0.2-Cl 5.54e-2(0.17) 98.2%(0.06) 3.68e-1(0.01) 91.4%(0.0)
PSO 0.25-Cl 4.3e-2(0.15) 98.6%(0.05) 3.66e-1(0.01) 91.4%(0.0)
                 Dset4                    Dset5
Method           Entropy       Purity    Entropy       Purity
IUC DE1 2.52e-3(0.02) 94.7%(0.05) 2.7e-3(0.03) 99.7%(0.01)
IUC DE2 7.59e-3(0.04) 96.0%(0.04) 7.9e-3(0.04) 99.5%(0.02)
IUC DE3 1.02e-2(0.05) 95.5%(0.04) 8.0e-3(0.04) 99.6%(0.02)
IUC DE4 0.00e+0(0.0) 96.6%(0.01) 1.06e-3(0.05) 99.4%(0.02)
IUC DE5 5.04e-3(0.03) 97.0%(0.01) 2.12e-3(0.07) 99.0%(0.02)
DEUC DE1 6.86e-3(0.01) 90.7%(0.02) 2.63e-3(0.21) 87.4%(0.07)
DEUC DE2 6.04e-3(0.01) 91.0%(0.02) 2.90e-3(0.19) 86.4%(0.06)
DEUC DE3 6.16e-3(0.07) 91.2%(0.01) 2.94e-3(0.21) 86.4%(0.07)
DEUC DE4 7.17e-3(0.01) 89.9%(0.02) 3.09e-3(0.24) 86.0%(0.07)
DEUC DE5 6.38e-3(0.01) 90.1%(0.02) 2.79e-3(0.22) 86.8%(0.07)
k-means 2.69e-1(0.18) 89.9%(0.07) 3.99e-3(0.25) 86.8%(0.09)
k-windows 4.18e-5(0.0) 98.3%(0.003) 0.00e-0(0.0) 99.7%(0.006)
DBSCAN 8.54e-4(—) 99.2%(—) 0.00e-0(0.0) 100%(—)
PSO 0.05-Cl 0.00e-0(0.0) 99.9%(0.0) 0.00e-0(0.0) 99.0%(0.0)
PSO 0.075-Cl 1.03e-2(0.05) 99.5%(0.02) 0.00e-0(0.0) 100.0%(0.0)
PSO 0.1-Cl 7.95e-2(0.12) 96.9%(0.05) 0.00e-0(0.0) 100.0%(0.0)
PSO 0.2-Cl 4.62e-1(0.15) 81.9%(0.06) 1.02e-2(0.05) 99.5%(0.02)
PSO 0.25-Cl 1.76e-0(0.17) 45.5%(0.05) 1.30e-1(0.06) 94.7%(0.06)

Table 2. The mean values and standard deviation of entropy, purity and number of clusters

                     Size of the window α
D   N    Measure    0.2          0.25        0.3         0.35         0.4          0.45         0.5
3   15K  Entropy    0.01(0.06)   0.129(0.16) 0.485(0.07) 0.890(0.21)  2.340(0.47)  2.24(0.49)   2.68(0.0)
         Purity     0.995(0.02)  0.946(0.07) 0.834(0.02) 0.703(0.06)  0.337(0.09)  0.356(0.1)   0.266(0.0)
         #Clusters  8.2(0.56)    7.6(0.56)   5.6(0.48)   3.9(0.39)    1.3(0.5)     1.45(0.5)    1.0(0.0)
5   15K  Entropy    0.0(0.0)     0.0(0.0)    0.0(0.0)    0.264(0.0)   0.380(0.02)  0.655(0.0)   0.780(0.07)
         Purity     0.999(0.0)   1.0(0.0)    1.0(0.0)    0.906(0.0)   0.874(0.006) 0.793(0.0)   0.769(0.02)
         #Clusters  8.3(0.51)    8.2(0.46)   8.0(0.17)   7.0(0.0)     6.04(0.19)   5.0(0.0)     3.9(0.02)
10  15K  Entropy    0.0(0.0)     0.0(0.0)    0.0(0.0)    0.0(0.0)     0.043(0.04)  0.078(0.03)  0.114(0.09)
         Purity     0.996(0.007) 0.999(0.0)  0.999(0.0)  0.999(0.003) 0.990(0.01)  0.982(0.007) 0.971(0.03)
         #Clusters  8.1(0.40)    8.09(0.29)  8.05(0.22)  8.03(0.22)   7.57(0.56)   7.17(0.40)   6.97(0.30)

the performance of the clustering increases when the size of the window becomes
large. However, if the size of the window exceeds a specific value, related to the dataset,
the quality of the clustering deteriorates.
The scalability of the algorithm depends on the window density function, and specifically
on the complexity of determining the points that lie in a specific window.
This is the well-known orthogonal range search problem, which has been studied extensively,
and many algorithms have been proposed in the literature to address it [4, 34]. A preprocessing
phase is employed to construct the data structure that stores the data points.
For high-dimensional applications, data structures like the Multidimensional Binary Tree
[34] are preferable, while for low-dimensional applications with a large number of points
Alevizos's approach [4] is more suitable. In this work, we utilise the Multidimensional
Binary Tree, so the preprocessing time is O(DN log N), while the data structure requires
O(s + DN^{1−1/D}) time to answer a query [38].
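As a rough illustration of this range-search step (a toy sketch only, not the algorithms of [4, 34]), a minimal multidimensional binary tree supporting orthogonal range counting could look as follows:

```python
import numpy as np

class KDNode:
    __slots__ = ("point", "axis", "left", "right")
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    # Multidimensional binary tree: split on the median along a cycling axis.
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    pts = points[np.argsort(points[:, axis])]
    mid = len(pts) // 2
    return KDNode(pts[mid], axis,
                  build(pts[:mid], depth + 1),
                  build(pts[mid + 1:], depth + 1))

def range_count(node, lo, hi):
    # Count the points inside the axis-aligned window [lo, hi].
    if node is None:
        return 0
    count = int(np.all((node.point >= lo) & (node.point <= hi)))
    if lo[node.axis] <= node.point[node.axis]:   # window may reach the left half
        count += range_count(node.left, lo, hi)
    if hi[node.axis] >= node.point[node.axis]:   # window may reach the right half
        count += range_count(node.right, lo, hi)
    return count

pts = np.random.default_rng(1).random((1000, 2))
tree = build(pts)
lo, hi = np.array([0.4, 0.4]), np.array([0.6, 0.6])
assert range_count(tree, lo, hi) == np.sum(np.all((pts >= lo) & (pts <= hi), axis=1))
```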

6 Conclusions
Although clustering is a fundamental process for discovering knowledge from data,
it is still difficult to give a clear, coherent and general definition of what a cluster is, or
of whether a dataset is clusterable or not. Furthermore, much research has focused on the practical
aspects of clustering, leaving the theoretical background almost untouched. In this
study, we have presented a theoretical framework for clustering and introduced a
new notion of clusterability, called the "α–clusterable set", which is based on the notion of
the window density function. In particular, an α–clusterable set is a dense
region of points of a dataset X inside which the window density function
is unimodal. The set of these α–clusterable sets forms a clustering solution, denoted
as an α–clustering. Moreover, we prove, in contrast to the general framework of Kleinberg's
impossibility theorem, that this α–clustering solution of a dataset X satisfies
the properties of scale-invariance, richness and consistency. Furthermore, to validate
the theoretical framework, we propose an unsupervised algorithm based on particle
swarm optimisation. The experimental results are promising, since its performance is
better than or similar to that of other well-known algorithms, and in addition the proposed
algorithm exhibits good scalability properties.

References

[1] Abraham, A., Grosan, C., Ramos, V.: Swarm Intelligence in Data Mining. Springer, Hei-
delberg (2006)
[2] Ackerman, M., Ben-David, S.: Measures of clustering quality: A working set of axioms for
clustering. In: Advances in Neural Information Processing Systems (NIPS), pp. 121–128.
MIT Press, Cambridge (2008)
[3] Ackerman, M., Ben-David, S.: Clusterability: A theoretical study. Journal of Machine
Learning Research - Proceedings Track 5, 1–8 (2009)
[4] Alevizos, P.: An algorithm for orthogonal range search in d ≥ 3 dimensions. In: Proceed-
ings of the 14th European Workshop on Computational Geometry (1998)
[5] Alevizos, P., Boutsinas, B., Tasoulis, D.K., Vrahatis, M.N.: Improving the orthogonal range
search k-windows algorithms. In: 14th IEEE International Conference on Tools and Artifi-
cial Intelligence, pp. 239–245 (2002)
[6] Antzoulatos, G.S., Ikonomakis, F., Vrahatis, M.N.: Efficient unsupervised clustering through
intelligent optimization. In: Proceedings of the IASTED International Conference Artificial
Intelligence and Soft Computing (ASC 2009), pp. 21–28 (2009)
[7] Arabie, P., Hubert, L.: An overview of combinatorial data analysis. In: Clustering and Clas-
sification, pp. 5–64. World Scientific Publishing Co., Singapore (1996)
[8] Ball, G., Hall, D.: A clustering technique for summarizing multivariate data. Behavioral
Sciences 12, 153–155 (1967)
[9] Berkhin, P.: Survey of data mining techniques. Technical report, Accrue Software (2002)
[10] Berry, M.J.A., Linoff, G.: Data mining techniques for marketing, sales and customer sup-
port. John Willey & Sons Inc., USA (1996)
[11] Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Kluwer Aca-
demic Publishers, Norwell (1981)
[12] Chen, C.Y., Ye, F.: Particle swarm optimization algorithm and its application to clustering
analysis. In: IEEE International Conference on Networking, Sensing and Control, vol. 2,
pp. 789–794 (2004)
[13] Cohen, S.C.M., Castro, L.N.: Data clustering with particle swarms. In: IEEE Congress on
Evolutionary Computation, CEC 2006, pp. 1792–1798 (2006)
[14] Das, S., Abraham, A., Konar, A.: Automatic clustering using an improved differential evo-
lution algorithm. IEEE Transactions on Systems, Man and Cybernetics 38, 218–237 (2008)
[15] Dubes, R.: Cluster Analysis and Related Issue. In: Handbook of Pattern Recognition and
Computer Vision, pp. 3–32. World Scientific, Singapore (1993)
[16] Engelbrecht, A.P.: Computational Intelligence: An Introduction. John Wiley & Sons, Ltd.,
Chichester (2007)
[17] Epter, S., Krishnamoorthy, M., Zaki, M.: Clusterability detection and initial seeds selection
in large datasets. Technical Report 99-6, Rensselaer Polytechnic Institute, Computer Sci-
ence Dept. (1999)
[18] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clus-
ters in large spatial databases with noise. In: Proceedings of 2nd International Conference
on Knowledge Discovery and Data Mining, pp. 226–231 (1996)
[19] Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publish-
ers, San Francisco (2006)
[20] Jain, A.K., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs
(1988)
[21] Jain, A.K., Flynn, P.J.: Image segmentation using clustering. In: Advances in Image Under-
standing: A Festschrift for Azriel Rosenfeld, pp. 65–83. Willey - IEEE Computer Society
Press, Singapore (1996)

[22] Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Sur-
veys 31, 264–323 (1999)
[23] Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of IEEE Interna-
tional Conference on Neural Networks, vol. 4, pp. 1942–1948 (1995)
[24] Kennedy, J., Eberhart, R.C.: Swarm Intelligence. Morgan Kaufmann Publishers, San Fran-
cisco (2001)
[25] Kleinberg, J.: An impossibility theorem for clustering. In: Advances in Neural Information
Processing Systems (NIPS), pp. 446–453. MIT Press, Cambridge (2002)
[26] Lisi, F., Corazza, M.: Clustering financial data for mutual fund management. In: Mathe-
matical and Statistical Methods in Insurance and Finance, pp. 157–164. Springer, Milan
(2007)
[27] MacQueen, J.B.: Some methods for classification and analysis of multivariate observations.
In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probabil-
ity, vol. 1, pp. 281–297. University of California Press (1967)
[28] Ng, R., Han, J.: CLARANS: A method for clustering objects for spatial data mining. IEEE
Transactions on Knowledge and Data Engineering 14(5), 1003–1016 (2002)
[29] Omran, M.G.H., Engelbrecht, A.P.: Self-adaptive differential evolution methods for un-
supervised image classification. In: Proceedings of IEEE Conference on Cybernetics and
Intelligent Systems, pp. 1–6 (2006)
[30] Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, S.: The effectiveness of lloyd-type meth-
ods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on
Foundations of Computer Science, pp. 165–176. IEEE Computer Society, Washington, DC
(2006)
[31] Parsopoulos, K.E., Vrahatis, M.N.: Particle Swarm Optimization and Intelligence: Ad-
vances and Applications. Information Science Publishing (IGI Global), Hershey (2010)
[32] Paterlini, S., Krink, T.: Differential evolution and particle swarm optimisation in partitional
clustering. Computational Statistics & Data Analysis 50, 1220–1247 (2006)
[33] Pavlidis, N., Plagianakos, V.P., Tasoulis, D.K., Vrahatis, M.N.: Financial forecasting
through unsupervised clustering and neural networks. Operations Research - An Interna-
tional Journal 6(2), 103–127 (2006)
[34] Preparata, F., Shamos, M.: Computational Geometry: An Introduction. Springer, New York
(1985)
[35] Puzicha, J., Hofmann, T., Buhmann, J.: A theory of proximity based clustering: Structure
detection by optimisation. Pattern Recognition 33, 617–634 (2000)
[36] Tasoulis, D.K., http://stats.ma.ic.ac.uk/d/dtasouli/public_html
[37] Tasoulis, D.K., Plagianakos, V.P., Vrahatis, M.N.: Unsupervised clustering in mRNA expression profiles. Computers in Biology and Medicine 36, 1126–1142 (2006)
[38] Tasoulis, D.K., Vrahatis, M.N.: The new window density function for efficient evolution-
ary unsupervised clustering. In: IEEE Congress on Evolutionary Computation, CEC 2005,
vol. 3, pp. 2388–2394. IEEE Press, Los Alamitos (2005)
[39] Theodoridis, S., Koutroubas, K.: Pattern Recognition. Academic Press, London (1999)
[40] van der Merwe, D.W., Engelbrecht, A.P.: Data clustering using particle swarm optimiza-
tion. In: Proceedings of the 2003 IEEE Congress on Evolutionary Computation, pp. 215–
220 (2003)
[41] Vrahatis, M.N., Boutsinas, B., Alevizos, P., Pavlides, G.: The new k-windows algorithm for
improving the k-means clustering algorithm. Journal of Complexity 18, 375–391 (2002)
[42] Xiong, H., Wu, J., Chen, J.: K-means clustering versus validation measures: A data-
distribution perspective. IEEE Transactions on Systems, Man and Cybernetics - Part B:
Cybernetics 39(2), 318–331 (2009)
[43] Zhao, Y., Karypis, G.: Criterion Functions for Clustering on High-Dimensional Data. In:
Grouping Multidimensional Data Recent Advances in Clustering, pp. 211–237. Springer,
Heidelberg (2006)
Privacy Preserving Semi-supervised Learning for Labeled Graphs

Hiromi Arai1 and Jun Sakuma1,2

1 Department of Computer Science, University of Tsukuba, 1–1–1 Tenoudai, Tsukuba, Japan
  arai.hiromi.ga@u.tsukuba.ac.jp, jun@cs.tsukuba.ac.jp
2 Japan Science and Technology Agency, 5–3, Yonban-cho, Chiyoda-ku, Tokyo, Japan
Abstract. We propose a novel privacy preserving learning algorithm
that achieves semi-supervised learning in graphs. In real-world networks,
such as disease infection over individuals, links (contacts) and labels (infection)
are often highly sensitive information. Although traditional semi-supervised
learning methods play an important role in network data
analysis, they fail to protect such sensitive information. Our solutions
enable the prediction of labels in partially labeled graphs without disclosure
of labels and links, by incorporating cryptographic techniques into the
label propagation algorithm. Even when the labels included in the graph are
kept private, the accuracy of our PPLP is equivalent to that of label
propagation that is allowed to observe all labels in the graph. Empirical
analysis showed that our solution is scalable compared with existing
privacy preserving methods. The results with human contact networks
showed that our protocol takes only about 10 seconds for computation
and that no sensitive information is disclosed through the protocol execution.

Keywords: privacy preserving data mining, semi-supervised learning.

1 Introduction

Label prediction of partially labeled graphs is one of the major machine learning
problems. Graph-based semi-supervised learning is useful when link information
is obtainable with lower cost than label information. Prediction of protein func-
tions is one familiar example [12]. In this problem, the functions and similarities
of proteins correspond to the labels and node similarities, respectively. Amassing
information about the protein functions requires expensive experimental analy-
ses, while protein similarities are often obtained computationally, with a lower
cost. Taking advantage of this gap, the semi-supervised approach successfully
achieves better classification accuracy, even when only a limited number of la-
beled examples is obtainable.
The key observation of this study is that the difficulty of preparing label
information is not only its cost, but also its privacy. Even when a large number
of labels have already been collected, the labels might not be observed or may
require extreme caution in handling. In this paper, we consider label prediction


in a situation where the entire graph cannot be observed by any entities, due to
privacy reasons. Such a situation is often found in networks among social entities,
such as individuals or enterprises, in the real world. The following scenarios pose
intuitive examples where privacy preservation is required in label prediction.
Consider a physical contact network of individuals, in which individuals and
their contacts correspond to nodes and links, respectively. Suppose an infectious
disease is transmitted by contact. Some of the individuals have tested their infec-
tion states. Regarding infection states as node labels, semi-supervised learning is
expected to predict the infection states of untested individuals, by exploiting the
existing test results and the contact network. However, the contact information
between individuals and their infection states can be too sensitive to disclose.
In this scenario, both the labels and the links must be kept private. In order
to formulate such situations, we consider three types of typical privacy models.
The first model, referred to as the public model, assumes that each node discloses
its label and links to all other nodes. One example of this model is social network
services such as facebook or LinkedIn, when the privacy policy is set as “everyone
can see everything”. Users (nodes) disclose their friends (links) and status (node
labels), such as their education or occupations, to every user.
The second model, referred to as the label-aware model, assumes each node
discloses its label and links only to the nodes that it is linked to. Let the node
labels correspond to the flu infection states in the contact network. We can
naturally assume that each individual (node) is aware of the individuals with
whom he/she had contact before (links) and whether or not they had the flu
(labels of the nodes it is linking to), but he/she would never know the contact
information or the infection states of individuals he/she has never met.
The third model, referred to as the label-unaware model, assumes that each
node does not disclose any links or labels to others. Consider the contact net-
work again. Let links and labels correspond to sexual relationships and sexually
transmitted infections, respectively. In such a case, no one would disclose their
links and label to others; the label may not be disclosed even to individuals with
whom he/she had a relationship.
In addition, asymmetries of relationships need to be considered. For example,
in a business network whose links correspond to the ratio of the stock holdings,
a directed graph should be used to represent the scope of observation. Thus, the
privacy model to be employed depends on the nature and the sensitivity of the
information in the graphs.
Related Works. Existing label prediction methods are designed with the im-
plicit assumption that a supervisor exists who can view anything in the graph
(the public model). If we could introduce a trusted third party (TTP)1 as a
supervisor, then any label prediction algorithms, such as TSVM [8] or Clus-
ter kernel [12], would immediately work in the label-(un)aware model; however,
facilitating such a party is unrealistic in general (Table 1, the first line).

1 A TTP is a party which never deviates from the specified protocol and does not reveal any auxiliary information.

Table 1. Comparison of label prediction methods. Δ: the maximum number of links of a node; T: the number of iterations of each algorithm.

method                 privacy model    comp. cost
TSVM, Cluster kernel   public           —
k-NN                   label-aware      O(Δ log Δ)
LP                     label-aware      O(ΔT)
LP/SFE                 label-unaware    O(poly)
PPLP (proposal)        label-unaware    O(ΔT)

The k-nearest neighbor (kNN) method predicts labels from the labels of the
k-nearest nodes; kNN works in the label-aware model (Table 1, the second line).
Label propagation (LP) [15] achieves label prediction in the label-aware model,
when algorithms are appropriately decentralized (Table 1, the third line, see
Sect. 3.2 for details). Note that even if it is decentralized, each node has to
disclose its labels to the neighboring nodes. That is, kNN and LP do not work
in the label-unaware model.
Secure function evaluation (SFE) [13], also referred to as Yao’s garbled cir-
cuits, is a methodology for secure multiparty computation. Using SFE, any func-
tion, including label prediction methods, can be carried out without revealing
any information except for the output value. That is, SFE allows execution of
label prediction in the label-unaware model (Table 1, the fourth line). Although
the computational cost of SFE is polynomially bounded, it can be too inefficient
for practical use. We implemented label prediction on SFE (LP/SFE), and the
efficiency of SFE is discussed with experiments in Sect. 6.
Privacy-preserving computations for graph mining are discussed in [4,10].
Both of them calculate HITS and PageRank from networks containing private
information. They compute the principal eigenvector of the probability transi-
tion matrix of networks without learning the link structure of the network. Our
solution for label prediction is similar to the above link analyses, in the sense that
both consider computation over graphs containing private graph structures.
However, the targets of their protocols and ours differ: their protocols
compute node rankings, while our protocol aims at node label prediction.
Our Contribution. As discussed above, no label prediction methods that work
efficiently in the label-unaware model have been presented. We propose a novel
solution for privacy preserving label prediction in the label-unaware model. Comparisons
between our proposal and existing solutions are summarized in Table
1. First, we formulate typical privacy models for labeled graphs (Sect. 2). Then,
1. First, we formulate typical privacy models for labeled graphs (Sect. 2). Then,
a label propagation algorithm that preserves the privacy in the label-unaware
model is presented (Sect. 5). Our proposal theoretically guarantees that (1) no
information about links and labels is leaked through the entire process, (2) the
predicted labels as the final output are disclosed only to the node itself, and
(3) the final output is exactly equivalent to that of label propagation. Connections
between our protocol and differential privacy are also discussed in Sect. 5.3.

The experiments show that the computational time of our proposal is more than
100 times shorter than that of LP/SFE. We also examined our proposal using
real social network data (Sect. 6).

2 Privacy in Labeled Graphs


In this section, we first formulate the label prediction problems with graphs.
Then we define privacy models of graphs; the secure label prediction problem is
formulated using these models.

2.1 Labeled Graph


In a network of social entities, the links and labels can be private. In order to
formulate the privacy in graphs, we introduce graph privacy models, assuming
that each node in the graph is an independent party whose computational power
is polynomially bounded.
Let G = (V, E) be a graph, where V = {1, . . . , n} is a set of nodes and
E = {e_ij} is a set of links. Link e_ij has a nonnegative weight w_ij; the weight
matrix is W = (w_ij), with w_ij = 0 if e_ij ∉ E. In directed graphs, we denote the
nodes linked from node i as N_out(i) = {j | j ∈ V, e_ij ∈ E}, and the nodes linking to
node i as N_in(i) = {j | j ∈ V, e_ji ∈ E}. In undirected graphs, the
nodes linking to node i and the nodes linked from node i are identical, so we
write N(i) = N_in(i) = N_out(i).
In this study we consider weighted labeled graphs without disconnected singleton
nodes. Typically, a node label is represented as an element of a finite
set {1, . . . , h}. To facilitate the computations introduced later, we introduce the
label matrix F = (f_ik) ∈ R^{n×h} to represent the node labels: the i-th row of F
represents the label of node i, and the node is labeled as s if s = arg max_{1≤k≤h} f_ik.
Thus, the links, link weights, and labels of graphs are represented as matrices. In
order to define which part of a matrix is observable from a node and which is
not, three different matrix partitioning models are introduced; the graph
privacy models are then defined based on these matrix partitioning models.

2.2 Matrix Partitioning Model


Let there be a set V of n nodes and a matrix M ∈ R^{n×r}. Denote the ith row of M
by m_i∗ and the ith column by m_∗i. Suppose M is partitioned into n parts, and each
node in V is allowed to observe a different part of the matrix privately. We use
the following typical patterns of partitioning [10].
Definition 1. (row private) Let there be a matrix M ∈ R^{n×r} and n nodes. For
all i, if the ith node knows the ith row vector m_i∗, but does not know the other row
vectors m_p∗ where p ≠ i, then M is row private.
Definition 2. (symmetrically private) Let there be a square matrix M ∈ R^{n×n}
and n nodes. For all i, if the ith node knows the i-th row vector m_i∗ and the
i-th column vector m_∗i, but does not know the other elements m_pq where p ≠ i and q ≠ i,
then M is symmetrically private.

Fig. 1. Graph privacy models (label-aware/label-unaware, undirected/directed PWGs). Links, link weights wij, and labels fik that are private from node i are depicted as gray.

We define an expansion of row private, in which node i is allowed to observe
specified rows of M.
Definition 3. (U(i)-row private) Let there be a matrix M ∈ R^{n×r} and n nodes,
and let U(i) ⊆ V. For all i, if the ith node knows the jth row vector m_j∗ for
j ∈ U(i) ∪ {i}, but does not know the other row vectors m_p∗ where p ∉ U(i) ∪ {i},
then M is U(i)-row private.
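To make these partitioning models concrete, here is a small illustrative Python sketch (all names hypothetical) of the view that node i obtains of a matrix under each model:

```python
import numpy as np

def row_private_view(M, i):
    # Node i sees only its own row m_{i*}.
    return {i: M[i].copy()}

def symmetrically_private_view(M, i):
    # Node i sees its own row m_{i*} and its own column m_{*i}.
    return {"row": M[i].copy(), "col": M[:, i].copy()}

def u_row_private_view(M, i, U):
    # Node i sees row j for every j in U(i) ∪ {i}; with U(i) = N(i) this is
    # the label-aware case, and U(i) = {} recovers plain row privacy.
    visible = sorted(set(U) | {i})
    return {j: M[j].copy() for j in visible}

W = np.random.default_rng(0).random((4, 4))
print(sorted(u_row_private_view(W, 0, U=[1, 3]).keys()))  # rows 0, 1, 3 visible
```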

2.3 Graph Privacy Model


Graph privacy models define the restriction of the observation of link weights W
and labels F from each node, based on these matrix partitioning models.

Definition 4. (label-aware undirected PWG) Let G = (V, E) be an undirected


graph with weight matrix W and label matrix F. If W is symmetrically private
and F is N (i)-row private, then G is a label-aware undirected private weighted
graph (PWG).

In this model, W is symmetrically private; it follows that node i is allowed to
learn the nodes that it is linking to or linked from, together with the corresponding link
weights. Moreover, the label matrix F is N(i)-row private; it follows that node i is
allowed to learn the labels of the nodes that it is linking to or linked from.
Fig. 1 gives a schematic description of this graph privacy model.

Table 2. Matrix partitioning of the graph privacy models

graph privacy model            W                       F
label-aware undirected PWG     symmetrically private   N(i)-row private
label-unaware undirected PWG   symmetrically private   row private
label-aware directed PWG       row private             N_out(i)-row private
label-unaware directed PWG     row private             row private

Definition 5. (label-unaware undirected PWG) In Definition 4, if F is row private,
then G is a label-unaware undirected private weighted graph.
The difference between the above two models lies in the matrix partitioning model of the label
matrix F: node i is allowed to learn its neighbors' labels in label-aware undirected
PWGs, while it is allowed to learn only its own node label in label-unaware
undirected PWGs. Recall the contact network whose nodes are labeled with infection
states. If the labels correspond to the flu state, then the label-aware undirected
PWG would be appropriate; on the other hand, if the labels correspond to the
disease state of a sexually transmitted infection, then the label information should
be handled in the label-unaware undirected PWG.
Table 2 summarizes the definitions of graph privacy models. In the following,
we consider undirected graphs, unless specifically mentioned. The treatment of
label-(un)aware directed PWGs is revisited in Sect. 5.4.

3 Our Approach
Our aim is to develop label prediction that securely works under given graph
privacy models. First, we state the problem, based on graph privacy models and
label propagation [15]. Then, we show that a decentralization of label propa-
gation makes the computation secure in label-aware PWGs, but not secure in
label-unaware PWGs. At the end of this section, we clarify the issues to be
addressed to achieve secure label prediction in label-unaware PWGs.

3.1 Problem Statement


We consider the label prediction problem of node labeled graphs. Suppose some
of the nodes are labeled (initial labels), while the remaining nodes are unlabeled.
Then, the problem of label prediction is to predict the labels of the unlabeled
nodes, using the initial labels and the weight matrix as input.
Let P = D^{−1}W, which corresponds to the probability transition matrix of
the random walk on graph G with weight matrix W. Here D is the degree matrix,
D = diag(d_1, . . . , d_n) with d_i = ∑_{j∈V} w_ij. Let Y = (y_ik) ∈ R^{n×h} be an
initial label matrix, where y_ik = 1 if the label of node i is k, and y_ik = 0 otherwise.
Label propagation predicts the labels of unlabeled nodes by propagating the
label information of the labeled nodes through the links, following the transition
probabilities. The label matrix is repeatedly updated by

    F^(t+1) = αPF^(t) + (1 − α)Y    (1)

where 0 ≤ α < 1. If α is closer to 1, more label information is propagated
from the neighbors (the first term); if α is closer to 0, the prediction is
affected more by the initial labels (the second term). Under the condition 0 ≤ α <
1, the following lemma shows the convergence property of label propagation [14].

Lemma 1. Let F∗ = (1 − α)(I − αP)^{−1}Y. When an arbitrary matrix F^(0) is
updated iteratively by eq. 1, lim_{t→∞} F^(t) = F∗ holds.

The convergent matrix F∗ gives the prediction for the unlabeled nodes; F∗ can be
viewed as a certain type of quantity that refers to the probabilities of a node
belonging to each label. This label propagation follows neither the label-aware PWGs
nor the label-unaware PWGs, because W and Y need to be observable from a single node
in order to execute eq. 1.
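As a plain, non-private reference for eq. 1 and Lemma 1, the following NumPy sketch (the toy graph and parameter values are illustrative assumptions) iterates the update and checks the result against the closed form F∗ = (1 − α)(I − αP)^{−1}Y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy symmetric weight matrix for 6 nodes (zero diagonal).
W = rng.random((6, 6)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)

# Initial labels: nodes 0 and 5 labeled with classes 0 and 1, rest unlabeled.
Y = np.zeros((6, 2)); Y[0, 0] = 1.0; Y[5, 1] = 1.0

alpha = 0.95
P = W / W.sum(axis=1, keepdims=True)   # P = D^{-1} W, row-stochastic

# Iterate eq. (1): F <- alpha * P F + (1 - alpha) * Y
F = np.zeros_like(Y)
for _ in range(1000):
    F = alpha * P @ F + (1 - alpha) * Y

# Closed form of Lemma 1: F* = (1 - alpha) (I - alpha P)^{-1} Y
F_star = (1 - alpha) * np.linalg.solve(np.eye(6) - alpha * P, Y)
assert np.allclose(F, F_star, atol=1e-8)
print(F_star.argmax(axis=1))           # predicted label of each node
```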
Suppose a partially labeled graph has to follow the label-(un)aware PWG,
and the nodes in the graph wish to perform label prediction. In this study,
we define the security of label prediction in the manner of secure multiparty
computation [6], using the graph privacy models defined previously.
Statement 1. Secure label prediction: Let there be a label-(un)aware PWG,
where W and Y are the private link weights and labels, respectively. After the execution
of secure label prediction among the nodes, the final output F∗ is correctly
evaluated and distributed such that F∗ is row private. Each node learns nothing
but the number of iterations; the graph is still kept as a label-(un)aware PWG.
For label prediction to be secure in the sense of Statement 1, the graph privacy
model must remain unchanged before and after label prediction.

3.2 Decentralized Label Propagation


Before presenting our approach, we first review a decentralization of eq. 1 in
terms of privacy. Looking at an element of F, the update of f_ik is described as:

    f_ik^(t) ← α ∑_{j∈N(i)} p_ij f_jk^(t−1) + (1 − α) y_ik.    (2)

In order for node i to update eq. 2, node i needs f_jk^(t−1) for j ∈ N(i), y_ik for all
k, and p_ij for all j. Noting that p_ij = w_ij / ∑_{j∈N(i)} w_ij, node i can compute p_ij
for all j, even if W is row private. y_ik is also available if Y is row private. On
the other hand, f_jk^(t−1) has to be obtained from the neighbor nodes N(i); thus
F has to be N(i)-row private. It follows that node i can securely update eq. 2
in label-aware PWGs. Consequently, decentralized label propagation is secure in
label-aware PWGs, while it is not secure in label-unaware PWGs, in the
sense of Statement 1.
In order to perform label prediction securely in the label-unaware PWG, node
i needs to evaluate eq. 2 while keeping F row private. In other words, node i needs
to compute the first term of eq. 2 without observing f_jk for j ∈ N(i). This
is seemingly impossible, but it can be achieved by introducing a cryptographic
tool, homomorphic encryption, which enables the evaluation of additions of encrypted
values without decryption. In the next section, we review homomorphic encryption.

4 Cryptographic Tools

In a public key cryptosystem, encryption uses a public key that can be known
to everyone, while decryption requires knowledge of the corresponding private
key. Given a corresponding pair (sk, pk) of private and public keys and a
message m, c = Enc_pk(m; r) denotes the random encryption of m, and m =
Dec_sk(c) denotes the decryption. The encrypted value c distributes uniformly
over Z_N = {0, ..., N−1} if r is taken randomly from Z_N. An additively homomorphic
cryptosystem allows the addition of encrypted values without knowledge of the
private key: there is some operation · such that, for any plaintexts m1 and m2,

    Enc_pk(m1 + m2 mod N; r) = Enc_pk(m1; r1) · Enc_pk(m2; r2),

where r is uniformly random, provided that at least one of r1 and r2 is uniformly
random. Based on this property, given a constant k and the encryption Enc_pk(m; r),
we can also compute the multiplication of m by k via repeated application of ·,
denoted as Enc_pk(km mod N) = Enc_pk(m; r)^k. The random
encryption prevents inference about intermediate computations from the messages
passed in our solution. In a non-probabilistic public-key cryptosystem, m corresponds
one-to-one with its cipher c; therefore, when the message space is small,
as in label prediction, a regular public-key cryptosystem is not adequate for privacy
preserving protocols, including label prediction. In probabilistic encryption,
the encryption of a message m is randomized, so that the encryption releases no
information about m. Since the random number is not required for the decryption or
the addition of messages, in the following we omit the random number r from
our encryptions, for simplicity.
In an (n, θ)-threshold cryptosystem, the n nodes share a common public key pk,
while node i holds its own private key sk_i (i = 1, ..., n). Each node can encrypt any message
with the common public key. Decryption cannot be performed by fewer
than θ nodes, but can be performed by any group of at least θ nodes, using
a recovery algorithm based on the public key and their decryption shares
Dec_sk1(c), ..., Dec_skn(c). Our solution makes use of an (n, θ)-threshold additively
homomorphic cryptosystem, such as the generalized Paillier cryptosystem [2].
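To ground these properties, here is a deliberately toy, insecure Paillier-style sketch in Python (tiny hard-coded primes, no threshold decryption, not the generalized Paillier cryptosystem of [2]), demonstrating the homomorphic addition and the multiplication-by-a-constant used later in eq. 3:

```python
import math, random

# Toy Paillier keys (insecure demo primes; real keys are ~1024-bit).
p, q = 293, 433
n = p * q                  # public modulus (the N of the plaintext space Z_N)
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # private key: lcm(p-1, q-1)
mu = pow(lam, -1, n)

def enc(m, r=None):
    # Enc_pk(m; r) = g^m * r^n mod n^2, with fresh randomness r.
    r = r or random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    # Dec_sk(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n.
    return ((pow(c, lam, n2) - 1) // n) * mu % n

m1, m2, k = 1234, 5678, 7
assert dec((enc(m1) * enc(m2)) % n2) == (m1 + m2) % n   # homomorphic addition
assert dec(pow(enc(m1), k, n2)) == (k * m1) % n          # multiplication by k
```

With 1024-bit keys, as used in Sect. 6, the plaintext space Z_N is large enough that the L-scaled integers of the protocol rarely wrap around before normalization.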

5 The Main Protocol

As discussed in Sect. 3.2, the decentralized label prediction algorithm is not
secure in the label-unaware undirected PWG. We show privacy preserving label
propagation (PPLP) for label-unaware undirected PWGs, obtained by incorporating
homomorphic encryption into the decentralized label propagation.
The main body of our protocol is presented in Fig. 2. In principle, our protocol
behaves almost identically to the decentralized label propagation, except that
every message is encrypted, so that it outputs a result identical to that of regular
label propagation. In what follows, we explain the details of our protocol and
its security in label-unaware undirected PWGs.

5.1 Privacy Preserving Label Propagation

The protocol is presented in Fig. 2.

Setup. All nodes jointly prepare a key set, so that pk is commonly known to
all nodes and sk_i is held only by the ith node. The key set is securely prepared
by means of distributed key generation schemes; see [3] for examples. In this
protocol, we assume that all nodes can refer to a global clock, so that they can execute
the iterations synchronously.
Recall that W is symmetrically private, and F and Y are row private, in
the label-unaware undirected PWG. The private inputs of node i are w_i∗ and y_i∗;
with these, node i can compute p_ij locally in Step 1 (a). Since αp_ij can be rational
and the homomorphic encryption takes only integers as inputs, a large
integer L is multiplied in, so that p̃_ij ∈ Z_N. Likewise, (1 − α)y_ik is magnified so
that ỹ_ik ∈ Z_N, and the initial values f̃_ik^(0) are set, in Step 1 (b); the latter are
then encrypted in Step 1 (c).
Iteration and Convergence. Step 2 updates the label vectors under the homomorphic
encryption. In Step 2 (a), node i sends its encrypted label vector c_ik^(t−1)
to its neighboring nodes N(i); in Step 2 (b), node i receives the encrypted label
vectors c_jk^(t−1) from the nodes it is linked to. Then the label vectors are updated
by the following equation:

    Enc_pk(f̃_ik^(t)) ← ∏_{j∈N(i)} (c_jk^(t−1))^{p̃_ij} · Enc_pk(L^{t−1} ỹ_ik).    (3)

Let F̃^(t) = (f̃_ik^(t)). The convergence of the iterations of eq. 3 is shown as follows.

Lemma 2. Let F∗ = (1 − α)(I − αP)^{−1}Y. When an arbitrary matrix F̃^(0) is
updated iteratively by eq. 3, lim_{t→∞} F̃^(t)/L^t = F∗ holds.

Proof. From the homomorphic property of the cryptosystem, eq. 3 gives:

    Enc_pk(f̃_ik^(t)) ← ∏_{j∈N(i)} (Enc_pk(f̃_jk^(t−1)))^{p̃_ij} · Enc_pk(L^{t−1} ỹ_ik)
                     = ∏_{j∈N(i)} Enc_pk(p̃_ij f̃_jk^(t−1)) · Enc_pk(L^{t−1} ỹ_ik)
                     = Enc_pk(∑_{j∈N(i)} p̃_ij f̃_jk^(t−1) + L^{t−1} ỹ_ik).    (4)

Although eq. 4 cannot be decrypted by any single node in the actual protocol, assume here
that a node could decrypt both sides of eq. 4. Then we would have

    f̃_ik^(t) ← ∑_{j∈N(i)} p̃_ij f̃_jk^(t−1) + L^{t−1} ỹ_ik = L(∑_{j∈N(i)} αp_ij f̃_jk^(t−1) + L^{t−1}(1 − α)y_ik).    (5)

We prove that f̃_ik^(t) = L^t f_ik^(t) holds by induction. When t = 1, f̃_ik^(1) = L f_ik^(1)
obviously holds. Assuming f̃_ik^(u) = L^u f_ik^(u) for some u ∈ Z, f̃_ik^(u+1) = L^{u+1} f_ik^(u+1)
is readily derived using eq. 5 and the assumption. Thus f̃_ik^(t) = L^t f_ik^(t) holds.
Consequently, F̃^(t)/L^t = F^(t) holds, and the lemma is proved by Lemma 1.

Privacy-preserving label propagation

– Public input: α and L ∈ Z_N such that αLp_ij ∈ Z_N and (1 − α)Ly_ik ∈ Z_N for all i, j, k.
– Private input of node i: link weights w_i∗ and label vector y_i∗ (W and Y are row private).
– Key setup: all nodes share the public key pk; node i holds the secret key sk_i for threshold decryption.

1. (Initialization) For all j ∈ N(i), node i computes:
   (a) p_ij ← w_ij / ∑_{j∈N(i)} w_ij and p̃_ij ← αLp_ij,
   (b) f̃_ik^(0) ← y_ik and ỹ_ik ← (1 − α)Ly_ik for k = 1, ..., h,
   (c) c_ik^(0) ← Enc_pk(f̃_ik^(0)) for all k ∈ {1, ..., h}, and sets t ← 1.
2. (Iteration) For all k ∈ {1, ..., h}:
   (a) node i sends c_ik^(t−1) to all j ∈ N(i),
   (b) node i receives c_jk^(t−1) from all j ∈ N(i) and updates c_ik^(t) by eq. 3,
   (c) all nodes jointly perform the convergence detection and normalization using SFE, if needed. If convergence is detected, go to Step 3; else set t ← t + 1 and go to Step 2 (a).
3. (Decryption) Node i and arbitrary (θ − 1) other nodes jointly perform the recovery scheme, and node i outputs f∗_i∗ = (f∗_i1, . . . , f∗_ih).

Fig. 2. Privacy-preserving label propagation
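To illustrate the arithmetic of this protocol end to end, here is a toy Python sketch (a single-key toy Paillier stands in for the (n, θ)-threshold scheme, there is no SFE-based normalization or convergence detection, and the graph, labels, and parameter values are hypothetical):

```python
import math, random

# Toy single-key Paillier (insecure; stands in for threshold decryption).
p_, q_ = 293, 433
n = p_ * q_; n2 = n * n; g = n + 1
lam = (p_ - 1) * (q_ - 1) // math.gcd(p_ - 1, q_ - 1)
mu = pow(lam, -1, n)
enc = lambda m: (pow(g, m % n, n2) * pow(random.randrange(1, n), n, n2)) % n2
dec = lambda c: ((pow(c, lam, n2) - 1) // n) * mu % n

# A 3-node path graph 0-1-2 with binary labels; nodes 0 and 2 are labeled.
N = {0: [1], 1: [0, 2], 2: [1]}
Y = {0: [1, 0], 1: [0, 0], 2: [0, 1]}
alpha, L, h = 0.5, 4, 2   # L chosen so alpha*L*p_ij and (1-alpha)*L*y_ik are integers

p_tilde = {i: {j: int(alpha * L / len(N[i])) for j in N[i]} for i in N}  # Step 1 (a)
y_tilde = {i: [int((1 - alpha) * L * y) for y in Y[i]] for i in N}       # Step 1 (b)
c = {i: [enc(Y[i][k]) for k in range(h)] for i in N}                     # Step 1 (c)

T = 3
for t in range(1, T + 1):  # a few rounds of Step 2, computing eq. (3)
    c_new = {}
    for i in N:
        c_new[i] = []
        for k in range(h):
            acc = enc(L ** (t - 1) * y_tilde[i][k])        # Enc(L^{t-1} * y~_ik)
            for j in N[i]:                                 # prod_j (c_jk)^{p~_ij}
                acc = (acc * pow(c[j][k], p_tilde[i][j], n2)) % n2
            c_new[i].append(acc)
    c = c_new

# Step 3: each node decrypts only its own row; f~_ik / L^t approximates f*_ik.
for i in N:
    print(i, [dec(ci) / L ** T for ci in c[i]])
```

Dividing the decrypted value by L^T recovers F^(T) as in Lemma 2; with threshold decryption, only node i could perform this final step for its own row.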

Two computations that are not executable under the homomorphic
encryption remain to be mentioned. One is convergence detection [11]. From Lemma 2,
the convergent value F∗ is identical to that of label propagation. The convergence
of eq. 3 has to be determined by testing whether |f̃_ik^(t)/L^t − f̃_ik^(t−1)/L^{t−1}| < ε for all i
and k in Step 2 (c). The other is normalization [11]. As shown in Lemma 2, f̃_ik^(t)
is multiplied by L at each iteration (although no node can observe this). Since
the addition of the homomorphic encryption is modulo-N arithmetic, f̃_ik^(t) has to be
normalized so as not to exceed N after addition. The normalization does not
have to be invoked very often, since the parameter is typically set to a big integer, such
as N ≈ 2^1024. To realize this normalization, we use private division, which divides the
encryption without spilling the value of f̃_ik^(t).
The values exchanged in the protocol are all encrypted, while SFE requires unencrypted
values as inputs. To securely evaluate normalization and convergence
detection on encrypted values, we use the following scheme:
1. Encrypted values are randomly partitioned, by means of the homomorphic
property of the encryption, so that they form random shares when decrypted,
2. the shares are independently decrypted and taken as inputs of the SFE,
3. the SFE for normalization or convergence detection is executed, recovering
the values inside the SFE execution.

Note that the computation time of SFE is usually large; however, since neither computation
is required at every update, the cost per iteration can be kept small and does
not create a bottleneck. We discuss this further in Sect. 6.

5.2 Security of the Protocol


The security of privacy preserving label prediction is proved as follows:
Lemma 3. Let there be a label-unaware undirected PWG whose nodes behave
semi-honestly, i.e., the nodes follow the specified protocol properly, but might use
their records of intermediate computations in order to attempt to learn other nodes'
private information. Assuming that no more than θ nodes collude, node i learns
the ith row of F∗ and the number of iterations, but nothing else, after the execution
of privacy preserving label prediction. Furthermore, the graph is kept as a
label-unaware undirected PWG.
The proof should show the indistinguishability of the simulated view and the nodes'
views, as in [6]. However, due to space limitations, we explain the security
of our protocol in an intuitive way. All of the messages exchanged among the nodes
throughout the execution of the PPLP protocol are:

1. Step 2(a): c_jk^(t) = Enc_pk(f̃_jk^(t)),
2. Step 2(c): random shares of f̃_jk^(t) as inputs of the SFE,
3. Step 2(c) and Step 3: decryption shares of Enc_pk(f̃_jk^∗) for j = 1, . . . , n and
k = 1, . . . , h.
If nothing can be learned from these messages other than the output, we can
conclude that the protocol is privacy preserving. The messages exchanged in
Step 2(a) are all encrypted; they cannot be decrypted without the collusion of
more than θ parties. The random shares of Step 2(c) obviously leak nothing.
In Step 3 or Step 2(c), node i receives (θ − 1) decryption shares of Enc_pk(f̃_jk^∗). If
i = j, then node i can recover f̃_jk^∗, but this is the final output of node i
itself. If i ≠ j, then Enc_pk(f̃_jk^∗) cannot be decrypted by node i, unless more than
θ nodes collude.
Based on the above discussion, node i can learn nothing but its final output
and the number of iterations throughout the execution of the PPLP protocol.
From Lemma 2 and Lemma 3, the following theorem is readily proved.

Theorem 1. Assuming all nodes behave semi-honestly and there is no collu-


sion of more than θ nodes, PPLP executes label propagation in label-unaware
undirected PWG, in the sense of Statement 1.

5.3 Output Privacy of Label Propagation


Our protocol enables the computation of label prediction without sharing links and
labels; however, the output of our protocol might still leak information about the links or
labels of neighboring nodes, especially when the neighboring nodes have only
a small number of links. Differential privacy provides a theoretical privacy definition
in terms of outputs. Output perturbation using the Laplace mechanism can
guarantee differential privacy for a specified function under some conditions.
Each node can then infer only a limited amount of information, even when it has
strong background knowledge about the private information. The security of the
computation and the secure disclosure of the outputs are mutually independent in
label prediction. Therefore, output perturbation can readily be combined with our protocol,
although the design of the perturbation for label propagation is not straightforward.
We do not pursue this topic any further in this paper; this problem remains for
future work.

5.4 Expansion to Directed Graphs


Let us expand the PPLP protocol so that it can be applied to directed graphs. In
label-unaware directed PWGs, the labels are propagated only to the nodes that a node
links to. Then the following update is used instead of eq. 3:

    Enc_pk(f̃_ik^(t)) ← ∏_{j∈N_out(i)} (c_jk^(t−1))^{p̃_ij} · Enc_pk(L^{t−1} ỹ_ik).

In order to perform this update, note that: (1) node i has to send c_ik^(t−1) to the
nodes in N_in(i) in Step 2 (a), and (2) node i has to receive c_jk^(t−1) from all
j ∈ N_out(i) in Step 2 (b). In a directed graph, since W is row private, node i can
know to whom it links (N_out(i)), but cannot know by whom it is linked (N_in(i)). Each
node in N_in(i) could send a request for connection to node i; however, the request
itself would violate the given privacy model. Thus, in directed graphs, the problem lies
not only in the secrecy of the messages but also in the anonymity of the connections. For
this, we make use of onion routing [7], which provides anonymous connections
over a public network. By replacing every message passing that occurs in the
protocol with onion routing, we obtain the PPLP protocol for label-unaware
directed PWGs (PPLP-D). Techniques to convert a protocol in the label-aware
model into one in the label-unaware model can also be found in [10].

6 Experimental Analysis
We compared the label prediction methods in terms of accuracy, privacy loss, and
computational cost.
Datasets. Two labeled graphs were taken from real-world examples. Romantic
Network (ROMN) is a network in which students and their sexual contacts
correspond to the nodes and links, respectively [1]. We used the largest component
(288 nodes) of the original network. 5 randomly selected nodes and the
nodes within 5 steps of them were labeled as "infected" (80 nodes in total); the other
nodes were labeled as "not-infected" (208 nodes). (Although this setting might not
be appropriate in terms of epidemiology, we employed it in order to examine the
efficiency and practicality of our solution.) ROMN is undirected; the
weight matrix has wij = 1 if there is a link between i and j, and wij = 0 otherwise.
MITN is a network in which mobile phone users, their physical proximities,
and their affiliations correspond to the nodes, links, and node labels, respectively.
The proximity was measured by the Bluetooth devices of the mobile phones in the MIT
Reality Mining Project [5]. 43 nodes are labeled as "Media Lab" and 24 nodes
are labeled as "Sloan" (67 nodes in total). MITN is a directed graph; the weight
wij is set as the length of time during which user i detected user j.
Settings. Three types of label propagation were implemented: decentralized
label propagation (LP), label propagation implemented by SFE (LP/SFE), and
our proposals, PPLP and PPLP-D. For comparison, kNN was also tested. In
PPLP and PPLP-D, we set the parameters to L = 10^3 and α = 0.05, and
normalization was performed once per 100 updates of eq. 3. For kNN, we set
k = 1, 3, 5. Results were averaged over 10 trials.
In Sect. 6.1, we evaluate the trade-off between prediction accuracy and privacy
loss. In Sect. 6.2, we evaluate the computational efficiency of PPLP, PPLP-D,
and LP/SFE with a complexity analysis. The generalized Paillier cryptosystem [2]
with 1024-bit keys was used in PPLP and PPLP-D. For the SFE implementation,
FairPlay [9] was used. Experiments were performed on Linux with a 2.80 GHz
CPU and 2 GB of RAM.
We compared the computational costs of our solutions empirically, because complexity
analysis of computations implemented by SFE is difficult. Further
implementation details are as follows. In PPLP and PPLP-D, the whole computational
procedure, including the normalization using SFE, is implemented; in the computational
cost of an update, one percent of the normalization cost is accounted for, because
normalization is executed once per 100 updates. If LP/SFE were implemented in a
naive manner, its computational time would be so large that a computational analysis
is unrealistic. Instead, we decomposed LP/SFE into the single summations of the
decentralized label propagation (eq. 2) and implemented each summation using
SFE. This implementation allows the nodes to leak elements of the label matrix
computed in the middle of label propagation, but it largely reduces the required
computation time. We regard the computation time of this relaxed
LP/SFE as a lower bound on the computation time of LP/SFE in the experiments.

6.1 Privacy-Accuracy Trade-Off

As discussed, PPLP works securely in the label-unaware model; in other words,
PPLP does not require observation of the labels of the neighbor nodes N(i) for
label prediction. On the other hand, both kNN and LP require the labels of the
neighbor nodes N(i) for label prediction. Although it is impossible to run kNN
and LP in the label-unaware model, they can be executed if not all but some of the
nodes in N(i) dare to disclose their labels to node i.

Fig. 3. Accuracies and computational costs of PPLP and PPLP-D vs. other methods. (a) Error rate in the ROMN dataset and (b) error rate in the MITN dataset: accuracy changes with respect to the number of label disclosures ℓ. (c) Single update in an undirected graph: scalability with respect to the maximum number of links per node Δ. (d) Single update in a directed graph: scalability of PPLP-D with respect to Δ (8, 64, 512) and the network size n.

Considering this, we also tested intermediate models between the label-aware and
label-unaware models. Let ℓ_ini be the number of nodes initially labeled. In these
intermediate models, ℓ (0 ≤ ℓ ≤ ℓ_ini) nodes disclose their labels to their neighbor
nodes. If ℓ = ℓ_ini, this corresponds to the label-aware model, because all initially
labeled nodes disclose their labels; if ℓ = 0, it corresponds to the label-unaware model,
because all nodes keep their labels secret.
Fig. 3 (a) and (b) show the change of the prediction errors in ROMN and
MITN with respect to ℓ, respectively. kNN and LP require a larger ℓ for better
prediction, and cannot be executed when ℓ = 0. kNN shows a higher error rate
than the other methods for all k; kNN cannot predict labels for nodes that are not
connected to any node that discloses its label. Since we treated this as
misclassification, the error rate of kNN can be large. The results of 3NN and
5NN are not shown in Fig. 3 (a), because they cannot predict labels for most of the
nodes. The error rates of 3NN and 5NN in Fig. 3 (b) increase with ℓ when ℓ is large.
The reason for these results is considered to be as follows: a large k accounts for more
distant data points than a small k, and therefore kNN with a large k may not perform
well on unbalanced data.
Thus, if label information can be partially disclosed and the running time is crucial,
LP could be a good choice. In contrast, when the privacy preservation of labels and
links is essential, PPLP or LP/SFE could be a good choice, because the accuracies of
PPLP and LP/SFE, which observe no labels, are always equivalent to that of LP,
which is allowed to observe all labels.

Table 3. The computation time until convergence (10 iterations) in ROMN (Δ = 9) and the disclosed information

Method             disclosed info.               comp. time
kNN                y_jk (j ∈ N(i))               7.4 × 10^−5 (ms)
LP                 f_jk^(t) (j ∈ N(i))           3.0 × 10^−4 (ms)
LP/SFE (relaxed)   p_ij f_jk^(t) (j ∈ N(i))      88.4 (min.)
LP/SFE (strict)    none                          > 88.4 (min.)
PPLP               none                          10.7 (sec.)

labels and links is essential, PPLP or LP/SFE could be a good choice, because
the accuracies of PPLP and LP/SFE, which observe no labels, are always equivalent
to that of LP, which is allowed to observe all labels.

6.2 Computational Efficiency


Complexity and Scalability. We evaluated the computation times of LP/SFE,
PPLP, and PPLP-D on artificially generated networks where the maximum
number of links per node is Δ. In PPLP, the computational complexity of a single update
of Eq. 3 is O(Δ). In PPLP-D, the computational cost of onion routing, O(log n),
is additionally required (O(Δ + log n) in total).
Fig. 3 (c) shows the change of the computation times for a 2-class classification
of artificially generated labeled graphs, with respect to Δ. We evaluated
PPLP and PPLP-D including normalization and convergence detection. The results
show that the computational cost of LP/SFE is large; at least a few hours
are required for a single update when Δ = 2^10. On the other hand, the computation
time of PPLP is a few seconds even when Δ = 2^10. Fig. 3 (d) shows the
change of the computation times of PPLP-D with respect to network size n.
The single update of PPLP-D needs a few minutes when Δ = 2^3, but a few hours
when Δ = 2^9. This fact indicates that PPLP-D can work efficiently even in
large-scale networks if the networks are sparse. The computational cost of LP/SFE in
directed graphs becomes larger than that in undirected graphs because of onion
routing. In addition, onion routing for LP/SFE has to be executed by SFE; its
computational cost would become unrealistically large, so LP/SFE for
directed graphs is not evaluated. We can see in Fig. 3 (c) and (d) that the computational
cost of PPLP-D is even smaller than that of LP/SFE for undirected graphs.
Completion Time in the Real World Network Data. Table 3 summarizes
the computational costs and the information disclosed until convergence for the label
prediction methods in ROMN. In the experiments, all labels of the 30 initially labeled
nodes are disclosed to other nodes for the execution of LP and kNN. The results
show that PPLP achieved both efficient computation and privacy preservation,
while the others did not. Recall that the computational cost of PPLP does not
increase drastically with the maximum number of links, as shown in Fig. 3 (c). In
addition, the maximum number of links is often not very large, particularly in
networks containing private information, such as sexual contacts. From the above,
we can conclude that PPLP is practical even for large-scale networks.

7 Conclusion
In this paper, we introduced novel privacy models for labeled graphs, and then
stated secure label prediction problems with these models. We proposed solu-
tions for secure label prediction, PPLP and PPLP-D, which allow us to execute
label prediction without sharing private links and node labels. Our methods are
scalable compared to existing privacy-preserving methods. We experimentally
showed that our protocol completed the label prediction of a graph with 288
nodes in about 10 seconds. In undirected graphs, the complexity is proportional
to the maximum number of links, rather than to the network size. We can conclude
that our protocol achieves both privacy preservation and scalability, even
in large-scale networks. The computational cost in directed graphs is relatively larger than
that in undirected graphs. Our future work will involve the application of
our proposal to real social problems.

References
1. Bearman, P., Moody, J., Stovel, K.: Chains of affection: The structure of adolescent
romantic and sexual networks. American J. of Sociology 110(1), 44–91 (2004)
2. Damgård, I., Jurik, M.: A Generalisation, a Simplification and Some Applications
of Paillier’s Probabilistic Public-Key System. In: Kim, K.-c. (ed.) PKC 2001. LNCS,
vol. 1992, pp. 119–136. Springer, Heidelberg (2001)
3. Damgård, I.B., Koprowski, M.: Practical threshold RSA signatures without a
trusted dealer. In: Pfitzmann, B. (ed.) EUROCRYPT 2001. LNCS, vol. 2045, pp.
152–165. Springer, Heidelberg (2001)
4. Duan, Y., Wang, J., Kam, M., Canny, J.: Privacy preserving link analysis on dy-
namic weighted graph. Comp. & Math. Organization Theory 11(2), 141–159 (2005)
5. Eagle, N., Pentland, A., Lazer, D.: Inferring social network structure using mobile
phone data. In: PNAS (2007)
6. Goldreich, O.: Foundations of cryptography: Basic applications. Cambridge Uni-
versity Press, Cambridge (2004)
7. Goldschlag, D., Reed, M., Syverson, P.: Onion routing. Communications of the
ACM 42(2), 39–41 (1999)
8. Joachims, T.: Transductive inference for text classification using support vector
machines. In: Proc. ICML (1999)
9. Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay: secure two-party computation
system. In: Proc. of the 13th USENIX Security Symposium, pp. 287–302 (2004)
10. Sakuma, J., Kobayashi, S.: Link analysis for private weighted graphs. In: Proceed-
ings of the 32nd International ACM SIGIR, pp. 235–242. ACM, New York (2009)
11. Sakuma, J., Kobayashi, S., Wright, R.: Privacy-preserving reinforcement learning.
In: Proceedings of the 25th International Conference on Machine Learning, pp.
864–871. ACM, New York (2008)
12. Weston, J., Leslie, C., Ie, E., Zhou, D., Elisseeff, A., Noble, W.: Semi-supervised
protein classification using cluster kernels. Bioinformatics 21(15), 3241–3247 (2005)
13. Yao, A.: How to generate and exchange secrets. In: Proc. of the 27th IEEE Annual
Symposium on Foundations of Computer Science, pp. 162–167 (1986)
14. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local
and global consistency. In: Advances in Neural Information Processing Systems
16: Proceedings of the 2003 Conference, pp. 595–602 (2004)
15. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using gaussian
fields and harmonic functions. In: ICML (2003)
Novel Fusion Methods for Pattern Recognition

Muhammad Awais, Fei Yan, Krystian Mikolajczyk, and Josef Kittler

Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
{m.rana,f.yan,k.mikolajczyk,j.kittler}@surrey.ac.uk

Abstract. Over the last few years, several approaches have been pro-
posed for information fusion, including different variants of classifier level
fusion (ensemble methods), stacking, and multiple kernel learning (MKL).
MKL has become a preferred choice for information fusion in object
recognition. However, in the case of highly discriminative and comple-
mentary feature channels, it does not significantly improve upon its triv-
ial baseline which averages the kernels. Alternatives are stacking and
classifier level fusion (CLF), which rely on a two-phase approach. There
is a significant amount of work on linear programming formulations of
ensemble methods, particularly in the case of binary classification.
In this paper we propose a multiclass extension of binary ν-LPBoost,
which learns the contribution of each class in each feature channel. The
existing approaches of classifier fusion promote sparse features combina-
tions, due to regularization based on 1 -norm, and lead to a selection of
a subset of feature channels, which is not good in the case of informative
channels. Therefore, we generalize existing classifier fusion formulations
to arbitrary p -norm for binary and multiclass problems which results
in more effective use of complementary information. We also extended
stacking for both binary and multiclass datasets. We present an extensive
evaluation of the fusion methods on four datasets involving kernels that
are all informative and achieve state-of-the-art results on all of them.

1 Introduction
The goal of this paper is to investigate machine learning methods for combin-
ing different feature channels for pattern recognition. Due to the importance
of complementary information in feature combination, much research has been
undertaken in the field of low level feature design to diversify kernels, leading to
a large number of feature channels (kernels) in typical pattern recognition tasks.
Kernels are often computed independently of each other, thus may be highly
redundant. On the other hand, different kernels capture different aspects of in-
traclass variability while being discriminative at the same time. Proper selection
and fusion of kernels is, therefore, crucial to optimizing the performance and to
addressing the efficiency issues in large scale pattern recognition applications.
The key idea of MKL [10,15,20], in the case of the SVM, is to learn a linear com-
bination of given base kernels by maximizing the soft margin between classes
using ℓ1-norm regularization on the weights. In contrast to MKL, the main idea of
classifier level fusion [8] is to construct a set of base classifiers and then classify


a new test sample by a weighted combination of their predictors. CLF methods
attracted much attention, with AdaBoost [5] in particular, after being successful
in many practical applications; this led to a linear programming (LP) formulation
of AdaBoost [16]. Inspired by the soft margin SVM, a soft margin LP for boosting,
ν-LPBoost, was proposed in [16]. Similar to ensemble methods, the aim of
stacking [2] is to combine the prediction labels of multiple base classifiers using
another classifier, often referred to as a meta-level classifier.
Information fusion methods for MKL and CLF favor sparse feature/kernel
selection due to ℓ1-norm regularization, arguing that sparse models have an
intuitive interpretation [10] as a method of filtering out irrelevant information.
However, in practical applications sparse models do not always perform well
(cf. [9] and references therein). In fact, ℓ1 regularization hardly outperforms
trivial baselines, such as the average of kernels. Furthermore, sparseness may lead
to poor generalization due to discarding useful information, especially in the case of
features encoding orthogonal characteristics of a problem. On the other hand, ℓ∞
regularization promotes combinations with equal emphasis on all feature chan-
nels, which leads to poor performance in the case of noisy channels. To address these
problems, different regularization norms [9] have been considered for MKL. Similarly,
among the classifier fusion approaches, ν-LPBoost with ℓ1 regularization favors
sparse solutions, or suffers from noisy channels in the case of ℓ∞ regulariza-
tion. In contrast to MKL, there is a lack of intermediary solutions with different
regularization norms in ensemble methods.
In this paper, we present a novel multiclass classifier fusion scheme (NLP-
νMC) based on binary ν-LPBoost, which incorporates arbitrary norms {ℓp, p ≥
1} and optimizes the contribution from each class in each feature channel. The
proposed optimization problem is a nonlinear separable convex problem which
can be solved using off-the-shelf solvers. We also incorporate nonlinear con-
straints in the previously proposed binary ν-LPBoost and multiclass LPBoost [6]
and show empirically that the nonlinear variants perform consistently better than
their sparse counterparts, as well as baseline methods. It is important to note
that both LP-β and LP-B [6] are different from NLP-νMC. In particular, the
number of constraints in the optimization problems and the concept of margin
are significantly different (see Section 3.1 for more details). For example, LP-B
is not applicable to large multiclass datasets due to the large number of constraints.
We use the SVM as a base classifier in stacking and, instead of using prediction
labels from the base classifier, we propose to use its real-valued output. We also
incorporate the SVM as a base learner for stacking in the case of multiclass datasets. We
finally use an SVM with an RBF kernel as the meta-level classifier. The last contribution
is an extensive evaluation and comparison of state-of-the-art fusion approaches.
We perform experiments on multi-label and multiclass problems using standard
benchmarks. Our multiclass formulation and nonlinear extensions of CLF consis-
tently outperform the state-of-the-art MKL and sparse CLF schemes. The best
results are achieved with stacking, especially when the stacking kernel is com-
bined with base kernels using CLF. Note that the datasets used for evaluation
are visual category recognition datasets; however, the proposed fusion schemes

can be applied to any underlying pattern recognition problem, provided that
multiple feature channels are available. The proposed methods can also be applied to
multi-modal pattern recognition problems.
The remainder of this paper is organized as follows. We start with a review
of two widely used information fusion schemes: multiple kernel learning in
Section 2, and the linear programming (LP) formulation of ensemble methods for
classifier fusion in Section 3, which also extends the LP formulation of binary classifier
fusion to incorporate arbitrary norms. Our proposed multiclass classifier fusion
schemes are presented in Section 3.1 and Section 4. In Section 5 we present
the evaluation results and conclude in Section 6.

2 Multiple Kernel Learning


In this section, we review state-of-the-art MKL methods for classification. Con-
sider m training samples (x_i, y_i), where x_i is a sample in the input space and y_i is its
label, y_i ∈ {±1} for binary classification and y_i ∈ {1, . . . , N_C} for multiclass classi-
fication. We are given n training kernels (one kernel corresponding to each feature
channel) K_r of size m × m, and corresponding n test kernels K̇_r of size m × l,
with l being the number of test samples. Each kernel, K_r(x_i, x_j) = ⟨Φ_r(x_i), Φ_r(x_j)⟩,
implicitly maps samples from the input space to a feature space with mapping
function Φ_r and gives the similarity between the corresponding samples x_i and x_j
in the feature space. In the case of the SVM, the decision function for a single kernel
is the sign of the real-valued output g_r(x):

\[ g_r(x) = \dot{K}_r(x)^\top Y \alpha + b, \tag{1} \]

where \(\dot{K}_r(x)\) is the column corresponding to the test sample x, Y is an m × m matrix
with the labels y_i on the diagonal, and α is a vector of Lagrange multipliers.
In MKL, the aim is to find a convex combination of kernels \(K = \sum_{r=1}^{n} \beta_r K_r\)
by maximizing the soft margin [1,10,15,20,25] using the following program:

\[ \min_{w_r,\,\xi,\,b,\,\beta}\ \frac{1}{2}\sum_{r=1}^{n} w_r^\top w_r + C\sum_{i=1}^{m}\xi_i \tag{2} \]
\[ \text{s.t.}\quad y_i\Big(\sum_{r=1}^{n}\langle w_r,\,\beta_r\Phi_r(x_i)\rangle + b\Big) \ge 1-\xi_i, \qquad \xi \succeq 0,\ \beta \succeq 0,\ \|\beta\|_p^p \le 1 \]

The dual of Eq. (2) can be derived easily using Lagrange multiplier techniques.
The MKL primal for a linear combination and its corresponding dual are derived
for different formulations in [1,9,10,15,20,24] and compared in [25], which also
extends MKL to the multiclass case. The dual problem can be solved us-
ing several existing MKL approaches, e.g., SDP [10], SMO [1], SILP [20] and
SimpleMKL [15]. The decision function for the MKL SVM is the sign of f(x):

\[ f(x) = \sum_{r=1}^{n} \beta_r \dot{K}_r(x)^\top Y \alpha + b. \tag{3} \]

The weight vector β ∈ R^n, the Lagrange multipliers α ∈ R^m, and the bias b are learnt
together by maximizing the soft margin. We can consider f(x) as a linear com-
bination of the real-valued outputs g_r(x) of the base classifiers, with the same α and b
shared across all base classifiers.
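To make Eq. (3) concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper; the function name and array layout are assumptions) that evaluates the MKL decision values given already-learnt β, α and b:

    import numpy as np

    def mkl_decision(test_kernels, beta, alpha, y_train, b):
        """Evaluate the MKL decision values of Eq. (3) on l test points.

        test_kernels: list of n train-by-test kernel blocks K_r, each (m, l);
        beta: (n,) kernel weights; alpha: (m,) Lagrange multipliers;
        y_train: (m,) labels in {-1, +1}; b: scalar bias.
        """
        # f(x) = sum_r beta_r K_r(x)^T Y alpha + b, with Y = diag(y_train)
        f = b + sum(b_r * K_r.T @ (y_train * alpha)
                    for b_r, K_r in zip(beta, test_kernels))
        return np.sign(f), f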

3 Classifier Fusion with Non-Linear Constraints

In this section we review the linear programming formulation of ensemble meth-
ods for classifier level fusion (CLF) based on boosting. We also extend the ν-LP-
AdaBoost [16] formulation for binary classification with nonlinear constraints.
This is a significant extension, as it avoids discarding channels with complemen-
tary information while remaining robust to noisy feature channels.
Empirical work has shown that boosting, and other related ensemble
methods [5,6,16] for combining predictors, can lead to a significant reduction in
the generalization error and, hence, improve performance. Our focus is on the
linear programming (LP) formulation of AdaBoost [5] and its soft margin LP
formulations [16] over a set of base classifiers G = {g_r : x → ±1, ∀r = 1, . . . , n}.
For a test example x, the output label generated by such an ensemble is a weighted
majority vote and is given by the sign of f(x):

\[ f(x) = \sum_{r=1}^{n} \beta_r g_r(x). \tag{4} \]

Note that for the SVM, f(x) is a linear combination of the real-valued outputs
of n SVMs, where g_r(x) is given by Eq. (1). The decision function of MKL in
Eq. (3) shows that the same set of parameters {α, b} is shared by all participating
kernels. In contrast to MKL, the decision function of the CLF methods in Eq. (4)
uses separate sets of SVM parameters, since a different {α, b} embedded in g_r(x)
can be used for each base learner. In that sense, MKL can be considered a
restricted version of CLF [6]. The aim of ensemble learning is to find the optimal
weight vector β for the linear combination of base classifiers given by Eq. (4).
We define the margin (or classification confidence) for an example x_i as
\(\rho_i := y_i f(x_i) = y_i \sum_{r=1}^{n}\beta_r g_r(x_i)\) and the normalized (smallest) margin as:

\[ \rho := \min_{1 \le i \le m} y_i f(x_i) = \min_{1 \le i \le m} y_i \sum_{r=1}^{n}\beta_r g_r(x_i). \tag{5} \]

It has been argued that AdaBoost maximizes the smallest margin ρ on the train-
ing set [16]. Based on this idea and on the soft margin SVM formulation,
the ν-LP-AdaBoost formulation was proposed in [16]. ν-LPBoost per-
forms a sparse selection of feature channels due to ℓ1 regularization, which is
suboptimal if all feature channels carry complementary information. Similarly,
in the case of the ℓ∞-norm, noisy feature channels may have a significant impact on
the results. To address these problems, we generalize binary classifier fusion to
arbitrary norms {ℓp, p ≥ 1}.

The input to classifier fusion are the predictions corresponding to each feature
channel, which are real-valued outputs of the base classifiers. To obtain these pre-
dictions for a training set we can use leave-one-out or v-fold cross validation.
In contrast to AdaBoost, we consider n to be a fixed number of base classifiers
{g_r, ∀r = 1, . . . , n} which are trained independently. Given the base classifiers,
we learn the optimal weights β_r for their linear combination (Eq. (4)) by maxi-
mizing the smallest margin ρ in the following optimization problem:

\[ \max_{\beta,\,\xi,\,\rho}\ \rho - \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i \tag{6} \]
\[ \text{s.t.}\quad y_i \sum_{r=1}^{n}\beta_r g_r(x_i) \ge \rho - \xi_i\quad \forall\, i = 1, \dots, m, \qquad \|\beta\|_p^p \le 1,\ \beta \succeq 0,\ \xi \succeq 0,\ \rho \ge 0 \]

where the ξ_i are slack variables which accommodate negative margins. The regular-
ization constant is given by 1/(νm), which corresponds to the C constant in the SVM.
Problem (6) is a nonlinear separable convex optimization problem and can be
solved efficiently to a globally optimal solution by standard optimization toolboxes¹.
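For illustration, problem (6) can be handed almost verbatim to an off-the-shelf convex solver. The sketch below uses CVXPY as a stand-in for the MATLAB/MOSEK setup mentioned in the footnote; the function name and defaults are hypothetical, and it exploits the equivalence ‖β‖_p^p ≤ 1 ⟺ ‖β‖_p ≤ 1 for p ≥ 1:

    import cvxpy as cp

    def nu_lp_fusion(G, y, p=2.0, nu=0.5):
        """Sketch of problem (6): learn the channel weights beta.

        G: (m, n) matrix with G[i, r] = g_r(x_i), the real-valued output of the
        r-th base classifier on held-out training example x_i; y: (m,) in {-1, +1}.
        """
        m, n = G.shape
        beta = cp.Variable(n, nonneg=True)
        xi = cp.Variable(m, nonneg=True)
        rho = cp.Variable(nonneg=True)
        margins = cp.multiply(y, G @ beta)           # y_i * sum_r beta_r g_r(x_i)
        constraints = [margins >= rho - xi,
                       cp.norm(beta, p) <= 1]        # same feasible set as ||beta||_p^p <= 1
        cp.Problem(cp.Maximize(rho - cp.sum(xi) / (nu * m)), constraints).solve()
        return beta.value

Each value of p then defines a separate fusion scheme, as in the experiments of Section 5.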

3.1 Multiclass Classifier Fusion with Non-Linear Constraints

In this section we propose a novel multiclass extension of ν-LP-AdaBoost and
compare it with other existing multiclass variants. We also incorporate nonlinear
constraints in two existing multiclass classifier fusion schemes: LP-β [6] and LP-
B [6]. The empirical results show that the nonlinear constraints improve the
performance of these methods.
Nonlinear Programming ν-Multiclass (NLP-νMC): We consider the one-vs-
all formulation for the multiclass case with N_C classes, i.e., for each feature channel
we solve N_C binary problems, one corresponding to each class. Therefore, the
set of base classifiers G = {g_r : x → R^{N_C}, ∀r = 1, . . . , n} consists of n base
hypotheses (weak learners) g_r, where each base classifier maps into an N_C-di-
mensional space, g_r(x) → R^{N_C}. The output of g_r corresponding to the c-th class
is denoted by g_{r,c}(x). Recently it has been shown that one-vs-all is as good
as any other approach [18]; moreover, it fits naturally into the proposed classifier fusion,
and the computational complexity of other methods is higher, even prohibitive in the case
of many classes. Note that in practice the predictions of all base classifiers can
be computed in parallel as they are independent of each other, which makes
this approach appealing. We learn the weights for every class in each feature
channel and, therefore, instead of an n-dimensional weight vector β ∈ R^n as in
the case of binary classifier fusion, we have an n × N_C-dimensional weight vector
β ∈ R^{n×N_C}. The first N_C entries of the vector β correspond to the weights of the classes
¹ We have used MATLAB and MOSEK (http://www.mosek.com) and found that
the interior-point based separable convex solver in MOSEK is faster by an order of
magnitude.

in the first feature channel, and the last N_C entries correspond to the weights in feature
channel n. After finding the optimal weights, the decision function for a test
sample x corresponding to each class is given by the weighted sum, and the overall
decision function of multiclass classifier fusion is obtained by picking the class
with the maximum response.
We extend the definition of the margin (classification confidence) for binary clas-
sifier fusion given in Eq. (5) to the multiclass case as follows:

\[ \rho_i(x_i,\beta) := \sum_{r=1}^{n}\beta_{(N_C(r-1)+y_i)}\, g_{r,y_i}(x_i) - \sum_{r=1}^{n}\sum_{y_j \neq y_i}\beta_{(N_C(r-1)+y_j)}\, g_{r,y_j}(x_i) \tag{7} \]

The classification confidence for example x_i depends upon β and the scores from the
base classifiers. The main difference between the two margins is that here we
take the responses (scores multiplied by the corresponding weights) of all nega-
tive classes, sum them, and subtract this sum from the response of the positive class.
This is done for all n feature channels. The normalized (smallest) margin can then
be defined as ρ := min_{1≤i≤m} ρ_i(x_i, β). Inspired by LP formulations of AdaBoost
(cf. [16] and references therein), we propose to maximize the normalized margin
ρ to learn the linear combination of base classifiers. However, the generalization perfor-
mance of the LP formulation of AdaBoost based on maximizing only the normalized
margin is inferior to AdaBoost for noisy problems [16]. Moreover, Theorem 2
in [16] highlights the fact that the minimum bound on the generalization error is not
necessarily achieved with a maximum margin. To address these issues, a soft mar-
gin SVM based formulation with slack variables is introduced in Eq. (8). This
formulation does not force all the margins to be greater than zero. To avoid
penalization of informative channels and to gain robustness against noisy fea-
ture channels, we change the regularization norm to handle any arbitrary norm
ℓp, ∀p ≥ 1. The final optimization problem is (replacing ρ_i with Eq. (7)):

\[ \max_{\beta,\,\xi,\,\rho}\ \rho - \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i \tag{8} \]
\[ \text{s.t.}\quad \sum_{r=1}^{n}\beta_{(N_C(r-1)+y_i)}\, g_{r,y_i}(x_i) - \sum_{r=1}^{n}\sum_{y_j \neq y_i}\beta_{(N_C(r-1)+y_j)}\, g_{r,y_j}(x_i) \ge \rho - \xi_i, \quad i = 1, \dots, m, \tag{9} \]
\[ \|\beta\|_p^p \le 1, \quad \rho \ge 0, \quad \beta \succeq 0, \quad \xi \succeq 0 \quad \forall\, i = 1, \dots, m \]
where 1/(νm) is the regularization constant and gives a trade-off between the minimum
classification confidence ρ and the margin errors. This formulation looks similar
to Eq. (6); in fact, we use the same objective function, but the main dif-
ference is the definition of the margin which is used in the constraints in Eq. (9).
Eq. (9) employs a lower bound on the difference between the classification con-
fidence (margin) of the true class and the joint confidence of all other classes.
It is important to note that the total number of constraints is equal to the
number of training examples m plus one regularization constraint for the ℓp-norm

(ignoring the positivity constraints on the variables). Therefore, the difference in
complexity, compared to binary classifier fusion, is the increased number of
variables in the weight vector β, while having the same number of constraints. Note
that the problem in Eq. (8) is a nonlinear separable convex optimization problem
and can be solved efficiently using MOSEK. We now extend LP-β and LP-B by
introducing arbitrary regularization norms ℓp, ∀p ≥ 1, which avoids the rejection of
informative feature channels while being robust against noisy feature channels.
The generalized optimization problems for LP-β and LP-B are separable convex
programs and can be solved efficiently by MOSEK.
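To make the margin of Eq. (7) concrete, the following NumPy sketch (our own illustration; the (m, n, N_C) score layout is an assumption) computes ρ_i for all examples at once, with the flat weight vector reshaped into an n × N_C matrix:

    import numpy as np

    def multiclass_margins(G, y, beta):
        """Margins rho_i of Eq. (7).

        G: (m, n, C) scores with G[i, r, c] = g_{r,c}(x_i);
        y: (m,) true class indices in {0, ..., C-1};
        beta: (n, C), the paper's flat n*C weight vector reshaped per channel.
        """
        m = G.shape[0]
        weighted = G * beta[None, :, :]                   # weighted class/channel responses
        pos = weighted[np.arange(m), :, y].sum(axis=1)    # true-class response over all channels
        total = weighted.sum(axis=(1, 2))                 # response of all classes, all channels
        return pos - (total - pos)                        # positive class minus all negative classes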
Nonlinear Programming-β (NLP-β): We generalize LP-β [6] by incorpo-
rating ℓp-norm constraints, ∀p ≥ 1. The optimization problem is given by:

\[ \min_{\beta,\,\xi,\,\rho}\ -\rho + \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i \tag{10} \]
\[ \text{s.t.}\quad \sum_{r=1}^{n}\beta_r g_{r,y_i}(x_i) - \max_{y_j \neq y_i}\sum_{r=1}^{n}\beta_r g_{r,y_j}(x_i) \ge \rho - \xi_i, \quad \forall\, i = 1, \dots, m, \tag{11} \]
\[ \|\beta\|_p^p \le 1, \quad \beta_r \ge 0, \quad \xi_i \ge 0, \quad \rho \ge 0, \quad \forall r = 1, \dots, n,\ \forall i = 1, \dots, m. \]

Note that the weight vector β lies in an n-dimensional space, β ∈ R^n, as in binary
classifier fusion. After finding the weight vector β, the decision function of gen-
eralized LP-β is simply the maximum response of the weighted sum over all classes
in all feature channels.
Nonlinear Programming-B (NLP-B): We also propose an extension of the mul-
ticlass LP-B [6] with arbitrary regularization norms ℓp, ∀p ≥ 1. Instead of a
weight vector β, LP-B has a weight matrix B ∈ R^{n×N_C}. To learn the weights
in the matrix B, we propose the following convex optimization problem:

\[ \min_{B,\,\xi,\,\rho}\ -\rho + \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i \tag{12} \]
\[ \text{s.t.}\quad \sum_{r=1}^{n} B_{r y_i}\, g_{r,y_i}(x_i) - \sum_{r=1}^{n} B_{r y_j}\, g_{r,y_j}(x_i) \ge \rho - \xi_i, \quad i = 1, \dots, m,\ \forall\, y_j \neq y_i, \tag{13} \]
\[ \|B\|_p^p \le 1, \quad B_{rc} \ge 0, \quad \xi \succeq 0, \quad \rho \ge 0, \quad \forall\, r = 1, \dots, n,\ c = 1, \dots, N_C \]

The first set of constraints (Eq. (13)) gives a lower bound on the pairwise differ-
ence between the classification confidences (margins) of the true class and a non-target
class. Note that in this formulation N_C − 1 constraints are added for every train-
ing example, and the total number of constraints is m × (N_C − 1) + 1.
Discussion: The main difference between the three multiclass approaches dis-
cussed in this section lies in the definition of the feasible region, which is defined
by Eq. (9), Eq. (11) and Eq. (13) for NLP-νMC, NLP-β and NLP-B, respec-
tively. In NLP-β and LP-β [6], the feasible region depends on the difference
between the classification confidence of the true class and the closest non-target

class only. The total number of constraints in this case is m + 1. The feasible
region of NLP-B and LP-B [6] is defined by the pairwise differences between the class
confidence of the true class and each non-target class, added one constraint at a
time. In other words, each difference pair is added as an independent constraint
without any interaction among each other. There are N_C − 1 constraints for
each example and the total number of constraints is m × (N_C − 1) + 1. The large
number of constraints makes this approach less attractive for datasets with a
large number of classes. For example, for Caltech101 [4] with only 15 images per
class for training, the number of constraints for LP-B is more than 150 thousand
(15 × 101 × 100 + 1 ≈ 1.5 × 10^5). In the case of our NLP-νMC, the feasible re-
gion depends upon the joint classification confidence of all the non-target classes
subtracted from the class confidence of the true class. Thus, the feasible region
of NLP-νMC is much smaller than the feasible region of NLP-B. Due to these
joint constraints the total number of constraints for NLP-νMC is m + 1, e.g., for
Caltech101 [4] with 15 images per class for training, the number of constraints
for NLP-νMC is only 1516 (15 × 101 + 1), which is only about 1% of the constraints in
NLP-B. We can therefore apply NLP-νMC to large multiclass datasets, as op-
posed to NLP-B, especially for norms greater than 1. Note that the difference
in complexity between NLP-νMC and NLP-β or binary classifier fusion is the
extended weight vector β.

4 Extended Stacking

In this section we give a brief overview of stacking as proposed in [21], and then
present an extension to the stacking framework. The main aim of stacking [2]
is to combine the prediction labels of multiple base classifiers C_r using another
classifier, often referred to as a meta-level classifier. In the first phase, the prediction
labels y_i^r of the base classifiers for example x_i are obtained by leave-one-out or by
v-fold cross validation on the training set. The input to the meta-level classifier
is these prediction labels together with the output label for example i, forming
a tuple of the form ((y_i^1, . . . , y_i^n), y_i). By the use of the meta-level classifier, stacking
tries to infer which base classifiers are reliable and which are not. The performance of
stacking can be improved by using the output probabilities corresponding to each label;
the size of the meta-level training tuple is multiplied by the number of classes in
this case. It has been shown empirically that stacking does not perform better
than selecting the best classifier in the ensemble by cross validation [2]. To improve
the performance of stacking, the authors of [2] replaced the meta-level classifier by a new multi-
response model tree and empirically showed an enhancement in performance as
compared to stacking or to selecting the best base classifier by cross validation.
We use the SVM as a base classifier. Instead of using the prediction labels,
we use the real-valued outputs g_r(x_i) of the SVM classifier. The input training
tuple for the meta-level classifier is of the form (g_1(x_i), . . . , g_n(x_i), y_i). For the multiclass
case we use a one-vs-all formulation within the base classifiers; therefore g_r maps into
an N_C-dimensional space, g_r(x) → R^{N_C}. The input tuple in this case is multiplied
by the number of classes. We concatenate the outputs of all base SVM classifiers

corresponding to example x_i and consider it as an input feature vector for a
meta-level SVM classifier. We build an RBF kernel using the Euclidean distance
between these feature vectors. We refer to this as the stacking kernel. To the best of
our knowledge, the use of real-valued SVM outputs for the base classifiers in stacking
is novel, for both binary and multiclass datasets. We consider the stacking kernel
as a separate feature channel and can then apply MKL or any proposed CLF
scheme, discussed in Section 3, to combine it with the base kernels.
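A minimal sketch of this construction, under our own assumptions: precomputed base kernel matrices, scikit-learn's SVC with its one-vs-rest decision values standing in for the real-valued outputs g_r, and the mean squared distance as a heuristic RBF bandwidth:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def stacking_kernel(kernels, y, n_folds=5, C=1.0):
        """Build the stacking kernel from cross-validated real-valued SVM outputs.

        kernels: list of n precomputed (m, m) training kernel matrices K_r;
        y: (m,) labels. Returns the (m, m) stacking (RBF) kernel.
        """
        m = len(y)
        blocks = []
        for K in kernels:
            scores = None
            for tr, te in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(np.arange(m)):
                clf = SVC(kernel="precomputed", C=C, decision_function_shape="ovr")
                clf.fit(K[np.ix_(tr, tr)], y[tr])
                d = clf.decision_function(K[np.ix_(te, tr)])  # real-valued g_r on held-out fold
                d = d.reshape(len(te), -1)                    # (.,1) binary / (.,C) multiclass
                if scores is None:
                    scores = np.zeros((m, d.shape[1]))
                scores[te] = d
            blocks.append(scores)
        Z = np.hstack(blocks)                                 # meta-level feature vectors
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
        return np.exp(-sq / sq.mean())                        # RBF with mean-distance bandwidth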

5 Experiments and Discussion


This section presents the experimental evaluation of the methods investigated in
this paper on different object recognition benchmarks. These datasets include a
large variety of objects under different poses, scales and lighting conditions, with
cluttered backgrounds in real-world scenarios. We first discuss the results on the
multi-label dataset, namely Pascal VOC 2007, and then present the results
for three multiclass datasets, namely Flower17, Flower102 and Caltech101. In
multi-label classification, each example can be associated with a set of labels
as opposed to a single label. We use binary relevance [17], a well-known method
for multi-label classification, as it is recommended by the organizers of the Pascal
VOC challenge [3]. The MKL results on Pascal VOC 2007 are reported using the
binary MKL from the SHOGUN toolbox², and for CLF we have used ν-LP-AdaBoost
as given in Eq. (6). For the multiclass datasets we have used the multiclass MKL from the
SHOGUN toolbox. For classifier level fusion we use the three CLF schemes proposed
in this paper, namely NLP-νMC, NLP-β and NLP-B, given by Eq. (8), Eq. (10)
and Eq. (12), respectively. We do not have results for higher values of the norm
in the case of NLP-B, and for some values of the norm in the case of MKL, because their
optimization problems take several days. On the other hand, NLP-β and NLP-
νMC are very fast as compared to multiclass MKL and NLP-B, and take a few
seconds and a few minutes, respectively. Stacking results are presented using the
approach described in Section 4. Finally, we present results obtained by combining the
stacking kernel with the base kernels using MKL, NLP-β and NLP-νMC.

5.1 Pascal VOC 2007


Pascal VOC 2007 [3] is a challenging dataset consisting of 20 object classes with
9963 image examples (2501 training, 2510 validation, and 4952 testing images).
Images include indoor and outdoor scenes, truncated and occluded objects at
various scales and different lighting conditions. Classification of 20 object cate-
gories is handled as 20 independent binary classification problems. We present
results using average precision (AP) [3] and mean average precision (MAP).
In general, kernels can be obtained from various feature extractors. To pro-
duce state-of-the-art results we use 5 kernels from various descriptors introduced
in [12,19] computed for 2 sampling strategies (i.e., dense and interest points) and
spatial location grids [11]: entire image (1x1), horizontal bars (1x3), vertical bars
2
http://www.shogun-toolbox.org/
Novel Fusion Methods for Pattern Recognition 149

(3x1) and image quarters (2x2). The descriptors are clustered using k-means to
form a codebook of 4000 visual words. Each spatial grid is then represented by
histograms of codebook occurrences and a separate kernel matrix is computed
for each grid. The kernel function used to compute entry (i, j) of the kernel matrix is
based on the χ² distance between features F_i and F_j:

\[ K(F_i, F_j) = e^{-\frac{1}{A}\,\mathrm{dist}(F_i, F_j)} \tag{14} \]

where A is a scalar for normalizing the distance, set to the average χ² distance
between all features.
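For concreteness, a short NumPy sketch of Eq. (14), assuming the common χ² distance dist(u, v) = Σ_k (u_k − v_k)² / (u_k + v_k) between histogram rows (the text does not spell out the exact χ² variant):

    import numpy as np

    def chi2_rbf_kernel(F, eps=1e-10):
        """Kernel matrix of Eq. (14) from an (N, d) matrix of row histograms F."""
        diff = F[:, None, :] - F[None, :, :]
        denom = F[:, None, :] + F[None, :, :] + eps    # eps guards empty bins
        D = (diff ** 2 / denom).sum(axis=-1)           # pairwise chi-square distances
        A = D.mean()                                   # normalizer: average distance
        return np.exp(-D / A)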
We apply Support Vector Machines (SVMs) as base classifiers for the nonlinear
classifier level fusion schemes and the stacking proposed in this paper, and com-
pare them with MKL schemes. The regularization parameter for the SVM is chosen from the
set {2^−2, 2^0, 2^3, 2^7, 2^10, 2^15}. The regularization parameter ν for the different CF methods
is in the range ν ∈ [0.05, 0.95] with a step size of 0.05. Both the SVM and CF regu-
larization parameters are selected on the validation set. The values of the norms for
generalized classifier fusion are in the range p ∈ {1, 1+2^−5, 1+2^−3, 1+2^−1, 2, 3, 4, 8, 10^4}.
We consider each value of p as a separate fusion scheme. Note that for p = 10^4
we get uniform weights, which corresponds to the unweighted sum or ℓ∞. Figure 1
shows the weights learnt on the training set of the aeroplane category of Pascal VOC
2007 for several values of p using CLF. The plotted weights correspond to
the optimal value of the regularization parameter C of the SVM. The sparsity of the learnt
weights can easily be observed for low values of p. The sparsity decreases with
increasing p, up to the uniform weights (corresponding to ℓ∞) achieved at p = 10^4.
Weights can also be learnt corresponding to the best performing p on the validation set.
The mean average precision for several fusion methods is given in Table 1.
Row MKL shows the results for nine MKL methods with different regularization
norms applied to the 5 base kernels. Note that the MAP increases with the decrease
in sparsity at higher values of the norm. A similar trend can be found for CLF. The low
performance of MKL with the ℓ1-norm, which leads to a sparse selection, indicates that
the base kernels carry complementary information. Therefore, the non-sparse MKL
or CLF methods, such as the ℓ2-norm and ℓ∞-norm, give better results, as reported in
Table 1. The unweighted sum in the case of MKL performs better than any other
MKL method, which reflects that in the case of all-informative channels, learning
the weights for MKL does not improve much on this dataset. The proposed
non-sparse CLF (ℓ2) schemes outperform the state-of-the-art MKL (ℓ2-norm,
ℓ∞-norm) by 2% and 1.1%, respectively. Stacking performs the best
among all the methods and outperforms MKL by 1.5%. Further improvements
can be gained by fusing the stacking kernel together with the 5 base kernels, in the case
of both MKL and CLF. The combination of the base kernels plus the stacking kernel under
MKL produced the state-of-the-art result on this dataset with a MAP of 66.24%,
and outperforms MKL and CLF by 3.3% and 2.3%, respectively.

5.2 Flower 17
Flower 17 [14] consists of 17 categories of flowers common in the UK, with 80 im-
ages in each category. The dataset is split into training (40 images per class),

Table 1. Mean Average Precision of PASCAL VOC 2007

Fusion Methods \ norms   1      1+2^−3  1+2^−2  1+2^−1  2      3      4      8      ∞
MKL                      55.42  56.42   58.53   61.07   61.98  62.45  62.61  62.81  62.93
CLF                      63.71  63.94   63.97   63.98   63.97  63.97  63.77  63.69  63.11
Stacking                 64.44
MKL (Base + Stacking)    64.39  64.55   65.06   65.75   66.06  66.23  66.24  66.09  65.93
CLF (Base + Stacking)    65.18  65.20   65.45   65.57   65.65  65.63  65.59  65.54  65.48

[Fig. 1 panels: bar plots of the learned weights of the 5 feature channels, one panel per norm p = 1, 1+2^−5, 1+2^−3, 1+2^−1, 2, 3, 4, 8, 10^4.]
Fig. 1. Pascal VOC 2007. Feature channel weights learned with various p for CLF (ℓp).

validation (20 images per class) and test (20 images per class) sets using 3 prede-
fined random splits provided by the authors of the dataset. There are large appearance
variations within each category and similarities with other categories. For the ex-
periments we have used 7 RBF kernels built from the 7 χ² distance matrices provided
online³. The features used to compute these distance matrices include different
types of shape, texture and color based descriptors, whose details can be found
in [14]. We have used the SVM as a base classifier, and its regularization parameter
is chosen from the range {10^−2, 10^−1, . . . , 10^3}. The regularization parameter for the different CLF methods
is in the range ν ∈ {0.05, 0.1, . . . , 0.95}. Both the SVM and CLF regularization pa-
rameters are selected on the validation set. To carry out a fair comparison, the
regularization parameters and other settings are the same as in [6].
The results given in Table 2 show that the baseline for MKL, i.e., MKL-
avg (ℓ∞), gives 84.9% [6], and the baseline for classifier level fusion, i.e., CLF (ℓ∞),
gives 86.7%. The MKL results are obtained using the SHOGUN multiclass MKL
implementation for different norms. The nonlinear versions of classifier fusion per-
form better than their sparse counterparts, as well as the state-of-the-art MKL. The
best results in CLF are obtained by the proposed NLP-νMC (ℓ2) and NLP-β (ℓ4).
They outperform the MKL baseline by more than 2.5% and multiclass MKL
by 0.6%. Stacking yields the best results on this dataset, outperforming the MKL
baseline by more than 4.5%, MKL by more than 2% and the best CLF method
by more than 1.5%. Combining the stacking kernel with the 7 base kernels us-
ing multiclass MKL also shows similar results. Note that the performance drops
when the stacking kernel is combined with the 7 base kernels using MKL (ℓ∞)
or CLF (ℓ∞). This highlights the importance of learning in fusion methods.
However, when the stacking kernel is combined with the 7 base kernels using
classifier fusion, it produces state-of-the-art results on this dataset, and outper-
forms MKL, the best in CLF and stacking by 3%, 2.3% and 0.8%, respectively.
The second half of Table 2 shows a comparison with published state-of-the-art
results. To our knowledge, the best performing method using the 7
³ http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html

Table 2. Classification Rate on Flower17

ML-Methods                 1         1+2^−3    1+2^−1    2         3         4         8
MKL                        87.2±2.7  74.9±1.7  72.2±3.6  71.2±2.7  70.6±3.8  73.1±3.9  81.0±4.0
NLP-β                      86.5±3.3  86.6±3.4  86.6±1.1  86.7±1.2  87.4±1.5  87.9±1.8  87.8±2.1
NLP-νMC                    85.5±1.3  87.6±2.2  87.7±2.6  87.8±2.1  86.6±2.0  87.7±2.0  87.8±1.9
NLP-B                      84.6±2.5  84.6±2.4  84.8±2.6  84.8±2.5  85.5±3.7  86.9±2.7  87.3±2.7
Stacking                   89.4±0.5
MKL (Base + Stacking)      89.3±0.9  79.7±2.7  77.6±1.2  74.7±2.4  73.8±2.6  77.8±4.3  86.3±1.9
NLP-β (Base + Stacking)    90.2±1.5  89.3±0.7  89.6±0.5  89.2±1.6  89.3±1.2  89.1±1.4  89.0±1.0
NLP-νMC (Base + Stacking)  86.1±2.5  87.3±1.4  88.5±0.5  88.6±0.9  88.6±0.9  88.8±1.1  88.9±1.2

Comparison with State-of-the-Art
MKL-prod [6] (7 kernels)                      85.5±1.2
MKL-avg (ℓ∞) [6] (7 kernels)                  84.9±1.9
CLF (ℓ∞) (7 kernels)                          86.7±2.7
MKL-avg (ℓ∞) (7 kernels + Stacking kernel)    88.5±1.1
CLF (ℓ∞) (7 kernels + Stacking kernel)        88.8±1.4
CG-Boost [6] (7 kernels)                      84.8±2.2
MKL (SILP or Simple) [6] (7 kernels)          85.2±1.5
LP-β [6] (7 kernels)                          85.5±3.0
LP-B [6] (7 kernels)                          85.4±2.4
MKL-FDA (ℓp) [23] (7 kernels)                 86.7±1.2
L1-BRD [22] (30 kernels)                      89.0±0.6

distance matrices provided by the authors gives 86.7%, which is similar to the
CLF baseline. Our best CLF method outperforms it by 1.2%, while our stacking
approach outperforms it by 2.7% and our CLF combination of base plus stacking kernels
outperforms it by 3.5%. It is important to note that when comparing fusion
methods, the base feature channels (kernels) must be the same across the different
schemes. For example, the comparison of Flower 17 with the state-of-the-art in [22]
is not justified, as it uses 30 kernels, while normally the results are reported using
the 7 kernels provided online. Nevertheless, our best method outperforms it
by 1.2%, which can be considered a significant improvement in spite of using
4 times fewer feature channels.

5.3 Flower 102


Flower 102 [13] is an extended multiclass dataset containing 102 flower categories
commonly found in the UK. It consists of 8189 images, with 40 to 250 images in
each class. The dataset is split into training (10 images per class), validation
(10 images per class) and test (with a minimum of 20 images per class) sets using a
split predefined by the authors of the dataset. For the experiments we have used
the 4 χ² distance matrices provided online⁴. The details of the features used to
compute these distance matrices can be found in [13]. RBF kernels are computed
using Eq. (14) and these four distance matrices. The experimental setup is the
same as for Flower 17.
The results are given in Table 3. We have not reported the variance of the
results, as the authors of the dataset have provided only 1 split online and, for a fair
⁴ http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html

Table 3. Mean accuracy on Flower 102 dataset

ML-Methods                 1     1+2^−3  1+2^−1  2     3     4     8     ∞
MKL                        69.9  64.7    65.3    65.9  65.7  -     -     73.4
NLP-β                      61.2  75.7    73.5    74.7  73.0  73.9  74.6  73.0
NLP-νMC                    72.6  73.1    73.2    73.3  73.4  73.4  73.4  73.0
NLP-B                      73.6  -       -       -     -     -     -     73.0
Stacking                   77.7
MKL (Base + Stacking)      79.8  65.9    66.2    65.8  65.5  -     68.9  76.4
NLP-β (Base + Stacking)    79.2  77.8    77.8    78.3  79.0  79.4  80.3  77.2
NLP-νMC (Base + Stacking)  77.6  77.3    77.1    77.2  77.2  77.2  77.2  77.2

Comparison with State-of-the-Art
MKL-prod   73.8
MKL-avg    73.4
MKL [13]   72.8

comparison with previously published results, we use the same split as used by
other authors. The baseline for MKL gives 73.4%, and the baseline for CLF gives
73.0%. Multiclass MKL does not perform well on this dataset; the best
result is achieved by MKL (ℓ1), which performs 3.5% lower than the trivial baseline.
The best among the classifier level fusion schemes is NLP-β with p = 1 + 2^−3. It performs
5.8% better than multiclass MKL and 2.3% and 2.7% better than the MKL and CLF
baselines, respectively. Note that NLP-νMC performs worse than NLP-β, as
it has to estimate N_C times more parameters than NLP-β in the presence of few
training examples per category. We expect NLP-νMC to perform better in the
presence of more training data. Stacking achieves the best results on this dataset:
it performs 7.8% better than multiclass MKL and 4.3% and 4.7% better than
the MKL and CLF baselines, respectively. The results can be further improved by
combining the stacking kernel with the 4 base kernels using MKL or CLF.
However, the performance drops when the stacking kernel is combined with the
4 base kernels using MKL (ℓ∞) or CLF (ℓ∞). This highlights the importance of
learning in fusion methods. We achieve state-of-the-art results on this dataset
by combining the stacking kernel with the 4 base kernels using CLF. This com-
bination performs 10% better than multiclass MKL and 6.6%, 7% and 2.3%
better than the MKL baseline, the CLF baseline and stacking, respectively. Note that
we are unable to compute the mean accuracy for NLP-B, especially for ℓp-norms
greater than 1, due to the large number of constraints in the optimization problem.
The results for MKL are reported from [13] for comparison. In comparison to
the published results, our best method achieves an improvement of 7.2%, which is a
significant gain given that we are not using any new information.

5.4 Caltech101

Caltech101 [4] is a multiclass dataset consisting of 101 object categories and a
background category. There are 31 to 800 images per category, of medium reso-
lution (200 × 300). We follow the common practice used on this dataset, i.e., we use
15 randomly selected images per category for training and validation, while up to

Table 4. Mean accuracy on Caltech101 dataset

ML-Methods                 1         1+2^−3    1+2^−1    2         3         4         8
MKL                        68.6±2.2  61.2±1.1  58.1±0.8  57.4±0.7  57.0±0.6  -         63.9±0.9
NLP-β                      69.0±1.8  68.6±2.2  69.1±1.2  69.0±1.4  69.2±1.5  69.0±1.3  69.0±1.3
NLP-νMC                    67.4±2.4  68.7±1.8  68.4±1.0  68.5±0.8  68.4±0.7  68.4±0.7  68.4±0.7
NLP-B                      64.1±0.7  -         -         -         -         -         -
Stacking                   68.0±2.4
MKL (Base + Stacking)      68.6±2.2  68.9±2.4  68.5±2.5  68.5±2.6  68.5±2.5  -         69.6±2.2
NLP-β (Base + Stacking)    69.7±1.7  69.3±2.3  70.0±1.7  70.6±1.8  70.4±1.4  70.7±1.9  70.6±1.9
NLP-νMC (Base + Stacking)  68.1±3.0  69.0±1.3  69.4±1.3  69.5±1.4  69.6±1.4  69.6±1.3  69.7±1.3

MKL-prod                            62.2±0.6
MKL-avg (ℓ∞)                        67.4±1.1
CLF (ℓ∞)                            68.4±0.7
MKL-avg (ℓ∞) (Base + Stacking)      69.0±1.3
CLF (ℓ∞) (Base + Stacking)          69.7±1.3

50 images per category are randomly selected for testing. The average accuracy
is computed over all 101 object classes. This process is repeated 3 times, and the
mean accuracy over the 3 splits is reported for each method. In this experiment, we
combine 10 feature channels based on the features introduced in [12,19] with
dense sampling strategies. The RBF kernel function used to compute the kernel matrices
from the χ² distance matrices is given in Eq. (14). The experimental setup is
the same as for Flower 17.
The results of the proposed methods are presented in Table 4 and compared
with other techniques. The baseline for MKL gives 67.4%, and the baseline for
CLF gives 68.5%. The best result among the MKL methods is achieved by multiclass MKL
(ℓ1). It performs 1.2% better than the MKL baseline and similar to the CLF
baseline. Stacking does not perform well on this dataset: it performs 0.6% better
than the MKL baseline, but worse than both the CLF baseline
and multiclass MKL. Classifier level fusion achieves the best results on this dataset
(NLP-β (ℓ3)). It performs 1.8% and 0.7% better than the MKL and CLF baselines and
0.6% better than multiclass MKL. The results can be further improved
by using the stacking kernel with the 10 base kernels. We achieve state-of-the-art
results on this dataset by combining the stacking kernel with the 10 base kernels
using CLF. This combination performs 3.3%, 2.7%, 2.2% and 2.1% better than
the MKL baseline, stacking, the CLF baseline and multiclass MKL, respectively. Note that
we are unable to compute the mean accuracy for NLP-B, especially for ℓp-norms
greater than 1, due to the large number of constraints in the optimization problem.
It is well known that the type and the number of kernels have a large impact
on the overall performance. Therefore, a direct comparison of scores with the
published methods is not entirely fair. Nonetheless, it can be noted that the best
performing methods on Caltech101 in [7] and [6] using a single kernel give
60% and 61%, respectively. The performance in [6] using 8 kernels is close to 63%,
while the performance using 39 feature channels is 70.4%. Note that our best
method gives 70.7% using only 10 feature channels, which can be considered a
significant improvement, given that we have used 4 times fewer feature channels.

6 Conclusions
In this paper we proposed a nonlinear separable convex optimization formula-
tion for multiclass classifier fusion (NLP-νMC) which learns the weight for each
class in every feature channel. We have also extended the linear programming formulations of
binary and multiclass classifier fusion (ensemble methods) to nonlinear separa-
ble convex classifier fusion by incorporating arbitrary norms. Unlike the existing
methods, these formulations do not reject informative feature channels and make
the classifier fusion robust to both noisy and redundant feature channels, which
results in improved performance.
We also extended stacking to the case of both binary and multiclass datasets.
By considering stacking as a separate feature channel, we can combine the stack-
ing kernel with the base kernels using any proposed fusion method. We have per-
formed comparative experiments on challenging object recognition benchmarks
for both the multi-label and multiclass cases. Our results show that the optimal ℓp is
an intrinsic property of the kernel set and can be different for different datasets.
It can be learnt systematically using a validation set. In general, if some channels
are noisy, the ℓ1-norm is better (sparse weights). For carefully designed features, non-
sparse solutions, e.g., the ℓ2-norm, are better. Note that both are special cases of
our approaches. The proposed methods perform better than the state-of-the-art
MKL methods. In addition, the non-sparse version of classifier fu-
sion performs better than a sparse selection of feature channels. We achieve
state-of-the-art performance on all datasets by combining the stacking kernel
with the base kernels using classifier level fusion.
The two-step training of classifier fusion may seem like an overhead. However,
the first step is independent for each feature channel as well as for each class, and can
be performed in parallel. Independent training also makes the system applicable
to large datasets. Moreover, in MKL one has to train an SVM classifier in the α-step
before obtaining the optimal weights. As MKL optimizes the parameters jointly,
one may argue that the independent optimization of weights in the case of classifier
fusion is less effective. However, as our consistently better results show, these
schemes seem to be more suitable for visual recognition problems. The proposed
classifier fusion schemes appear to be attractive alternatives to the state-of-the-
art MKL approaches for both binary and multiclass problems, and address the
complexity issues of MKL.

Acknowledgements. This research was supported by UK EPSRC grants
EP/F003420/1 and EP/F069421/1, and by BBC R&D grants.

References
1. Bach, F., Lanckriet, G., Jordan, M.: Multiple Kernel Learning, Conic Duality, and
the SMO Algorithm. In: ICML (2004)
2. Džeroski, S., Ženko, B.: Is combining classifiers with stacking better than selecting
the best one? ML 54(3), 255–273 (2004)
3. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The
pascal visual object classes (voc) challenge. IJCV 88(2), 303–338 (2010)
Novel Fusion Methods for Pattern Recognition 155

4. Fei-Fei, L., Fergus, R., Perona, P.: One-shot Learning of Object Categories. PAMI,
594–611 (2006)
5. Freund, Y., Schapire, R.: A Desicion-Theoretic Generalization of On-Line Learning
and an Application to Boosting. In: CLT (1995)
6. Gehler, P., Nowozin, S.: On Feature Combination for Multiclass Object Classifica-
tion. In: ICCV (2009)
7. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Tech. Rep.
7694, California Institute of Technology (2007)
8. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. PAMI 20(3),
226–239 (1998)
9. Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A., Laskov, P., Müller, K.: Efficient
and Accurate lp-norm MKL. In: NIPS (2009)
10. Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., Jordan, M.: Learning the
Kernel Matrix with Semidefinite Programming. JMLR 5, 27–72 (2004)
11. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid
Matching for Recognizing Natural Scene Categories. In: CVPR (2006)
12. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors.
PAMI 27(10), 1615–1630 (2005)
13. Nilsback, M.E., Zisserman, A.: Automated Flower Classification over a Large Num-
ber of Classes. In: ICCVGIP (2008)
14. Nilsback, M., Zisserman, A.: A visual Vocabulary for Flower Classification. In:
CVPR (2006)
15. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. JMLR 9,
2491–2521 (2008)
16. Rätsch, G., Schölkopf, B., Smola, A., Mika, S., Müller, K., Onoda, T.: Robust
Ensemble Learning for Data Analysis. In: PACKDDM (2000)
17. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label
classification. In: MLKDD, pp. 254–269 (2009)
18. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. JMLR 5, 101–141
(2004)
19. van de Sande, K., Gevers, T., Snoek, C.: Evaluation of color descriptors for object
and scene recognition. In: CVPR (2008)
20. Sonnenburg, S., Rätsch, G., Schafer, C., Schölkopf, B.: Large Scale Multiple Kernel
Learning. JMLR 7, 1531–1565 (2006)
21. Wolpert, D.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
22. Xie, N., Ling, H., Hu, W., Zhang, Z.: Use bin-ratio information for category and
scene classification. In: CVPR (2010)
23. Yan, F., Mikolajczyk, K., Barnard, M., Cai, H., Kittler, J.: Lp norm multiple kernel
fisher discriminant analysis for object and image categorisation. In: CVPR (2010)
24. Ying, Y., Huang, K., Campbell, C.: Enhanced protein fold recognition through a
novel data integration approach. BMCB 10(1), 267 (2009)
25. Zien, A., Ong, C.: Multiclass Multiple Kernel Learning. In: ICML (2007)
A Spectral Learning Algorithm for Finite State
Transducers

Borja Balle, Ariadna Quattoni, and Xavier Carreras

Universitat Politècnica de Catalunya


{bballe,aquattoni,carreras}@lsi.upc.edu

Abstract. Finite-State Transducers (FSTs) are a popular tool for mod-
eling paired input-output sequences, and have numerous applications in
real-world problems. Most training algorithms for learning FSTs rely
on gradient-based or EM optimizations which can be computationally
expensive and suffer from local optima issues. Recently, Hsu et al. [13]
proposed a spectral method for learning Hidden Markov Models (HMMs)
which is based on an Observable Operator Model (OOM) view of HMMs.
Following this line of work we present a spectral algorithm to learn FSTs
with strong PAC-style guarantees. To the best of our knowledge, ours is
the first result of this type for FST learning. At its core, the algorithm
is simple, and scalable to large data sets. We present experiments that
validate the effectiveness of the algorithm on synthetic and real data.

1 Introduction

Probabilistic Finite-State Transducers (FSTs) are a popular tool for modeling
paired input-output sequences, and have found numerous applications in areas
such as natural language processing and computational biology. Most training
algorithms for learning FSTs rely on gradient-based or EM optimizations which
can be computationally expensive and suffer from local optima issues [8,10].
There are also methods that are based on grammar induction techniques [5,3],
which have the advantage of inferring both the structure of the model and the
parameters.
At the same time, for the closely-related problem of learning Hidden Markov
Models (HMMs), different algorithms based on the Observable Operator Model
(OOM) representation of the HMM have been proposed [6,18,13,23]. The main
idea behind OOMs is that the probabilities over sequences of observations gener-
ated by an HMM can be expressed as products of matrix operators [22,4,11,14].
This view of HMMs allows for the development of learning algorithms which
are based on eigen-decompositions of matrices. Broadly speaking, these spectral

This work was partially supported by the EU PASCAL2 Network of Excellence (FP7-
ICT-216886), and by a Google Research Award. B.B. was supported by an FPU
fellowship (AP2008-02064) of the Spanish Ministry of Education. The Spanish Min-
istry of Science and Innovation supported A.Q. (JCI-2009-04240) and X.C. (RYC-
2008-02223 and “KNOW2” TIN2009-14715-C04-04).


decompositions can reveal relationships between observations and hidden states
by analyzing the dynamics between observations. Other approaches to language
learning, also based on linear algebra, can be found in the literature, e.g. [9,2].
In this paper we show that the OOM idea can also be used to derive learning
algorithms for parameter estimation of probabilistic non-deterministic FSTs.
While learning FSTs is in general hard, here we show that a certain class of FSTs
can be provably learned. Generalizing the work by Hsu et al. [13], we present
a spectral learning algorithm for a large family of FSTs. For this algorithm
we prove strong PAC-style guarantees, which is the first such result for FST
learning to the best of our knowledge. Our sample complexity bound depends
on some natural parameters of the input distribution, including the spread of
the distribution and the expected length of sequences. Furthermore, we show
that for input distributions that follow a Markov process, our general bound
can be improved. This is important for practical applications of FSTs where the
input distribution is well approximated by Markov models, such as in speech
and language processing [15].
Like in the case for HMMs [6,18,13], our learning algorithm is based on spec-
tral decompositions of matrices derived from estimated probabilities over triples
of symbols. The method involves two simple calculations: first, counting frequen-
cies on the training set; and second, performing SVD and inversions on matrices.
The size of the input alphabet only has an impact on the first step, i.e. comput-
ing frequencies. Therefore, the algorithm scales well to very large training sets.
Another good property of the method is that only one parameter needs to be
tuned, namely the number of hidden states of the FST. Our theoretical analysis
points to practical ways to narrow down the range of this parameter.
We present synthetic experiments that illustrate the properties of the algo-
rithm. Furthermore, we test our algorithm for the task of transliterating names
between English and Russian. In these experiments, we compare our method
with an Expectation Maximization algorithm, and we confirm the practical util-
ity of the spectral algorithm at learning FSTs on a real task.
The rest of the paper is organized as follows. Section 2 presents background
materials on FSTs, together with their OOM representation. Section 3 presents
the spectral learning algorithm, and Section 4 gives the main theoretical results
of the paper. In sections 5 and 6 we present experiments on synthetic data and
on a transliteration task. Section 7 concludes the paper.

2 Probabilistic Finite State Transducers


In this paper we use Finite-State Transducers (FSTs) that model the conditional
probability of an output sequence y = y1 · · · yt given an input sequence x = x1 · · · xt .
Symbols xs belong to an input alphabet X = {a1 , . . . , ak }, symbols ys belong
to an output alphabet Y = {b1 , . . . , bl }, and t is the length of both sequences.1
1 For convenience we assume that input and output sequences are of the same length.
Later in the paper we overcome this limitation by introducing special empty symbols
in the input and output alphabets.

We denote the cardinalities of these alphabets as k = |X | and l = |Y|. Through-


out the paper, we use x and y to denote two input-output sequences of length t,
we use a and a′ to denote arbitrary symbols in X and b to denote an arbitrary
symbol in Y. Finally, we use xr:s to denote the subsequence xr · · · xs .
A probabilistic non-deterministic FST — which we simply call FST — defines
a conditional distribution P of y given x using an intermediate hidden state
sequence h = h1 · · · ht , where each hs belongs to a set of m hidden states H =
{c1 , . . . , cm }. Then, the FST defines:

P(y|x) = \sum_{h ∈ H^t} Pr_P[y, h | x] = \sum_{h ∈ H^t} Pr_P[h_1] Pr_P[y_1 | h_1] \prod_{s=2}^{t} Pr_P[h_s | x_{s−1}, h_{s−1}] Pr_P[y_s | h_s]   (1)

The independence assumptions are that Pr[hs |x, h1:s−1 ] = Pr[hs |xs−1 , hs−1 ] and
Pr[ys |x, h, y1:s−1 ] = Pr[ys |hs ]. That is, given the input symbol at time s − 1 and
the hidden state at time s − 1 the probability of the next state is independent
of anything else in the sequence, and given the state at time s the probability
of the corresponding output symbol is independent of anything else. We usually
drop the subscript when the FST is obvious from the context.
Equation (1) shows that the conditional distribution defined by an FST P can
be fully characterized using standard transition, initial and emission parameters,
which we define as follows. For each symbol a ∈ X , let Ta ∈ Rm×m be the state
transition probability matrix, where Ta (i, j) = Pr[Hs = ci |Xs−1 = a, Hs−1 = cj ].
Write α ∈ Rm for the initial state distribution, and let O ∈ Rl×m be the emission
probability matrix where O(i, j) = Pr[Ys = bi |Hs = cj ]. Given bi ∈ Y we write Dbi
to denote an m × m diagonal matrix with the ith row of O as diagonal elements.
Similarly, we write Dα for the m × m diagonal matrix with α as diagonal values.
To calculate probabilities of output sequences with FSTs we employ the notion
of observable operators, which is commonly used for HMMs [22,4,11,14]. The
following lemma shows how to express P(y|x) in terms of these quantities using
the observable operator view of FSTs.

Lemma 1. For each a ∈ X and b ∈ Y define A^b_a = T_a D_b. Then, the following
holds:

P(y|x) = 1^\top A^{y_t}_{x_t} · · · A^{y_1}_{x_1} α .   (2)

To understand the lemma, consider a state-distribution vector α_s ∈ R^m, where
α_s(i) = Pr[y_{1:s−1}, H_s = c_i | x_{1:s−1}]. Initially, α_1 is set to α. Then α_{s+1} = A^{y_s}_{x_s} α_s
updates the state distribution from position s to s + 1 by applying the appro-
priate operator, i.e. by emitting symbol y_s and transitioning with respect to x_s.
The lemma computes α_{t+1} by applying a chain of computations to the sequence
pair x and y. Then, the probability of y given x is given by \sum_i α_{t+1}(i). Matrices
A^b_a are the observable operators that relate input-output observations with state
dynamics.
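
To make the operator view concrete, the following is a minimal Python/numpy sketch of the forward computation of Lemma 1 (ours, not the authors' code); the toy dimensions, the random initialization and the names T, O, alpha are illustrative assumptions.

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, k, l = 2, 2, 3  # hidden states, input symbols, output symbols (toy sizes)

# Random FST parameters; columns of each T[a] and of O are distributions.
T = [rng.dirichlet(np.ones(m), size=m).T for _ in range(k)]  # T[a][i,j] = Pr[H_s=c_i | X_{s-1}=a, H_{s-1}=c_j]
O = rng.dirichlet(np.ones(l), size=m).T                      # O[i,j] = Pr[Y=b_i | H=c_j]
alpha = rng.dirichlet(np.ones(m))                            # initial state distribution

def A(a, b):
    # Observable operator A^b_a = T_a D_b of Lemma 1.
    return T[a] @ np.diag(O[b])

def prob_y_given_x(y, x):
    # P(y|x) = 1^T A^{y_t}_{x_t} ... A^{y_1}_{x_1} alpha, Eq. (2).
    state = alpha.copy()
    for a, b in zip(x, y):
        state = A(a, b) @ state  # emit y_s, transition w.r.t. x_s
    return state.sum()

# Sanity check: output probabilities for a fixed input sum to one.
x = (0, 1, 0)
print(sum(prob_y_given_x(y, x) for y in product(range(l), repeat=len(x))))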

Our learning algorithms will learn model parameterizations that are based
on this observable operators view of FSTs. As input they will receive a sample
of input-output pairs sampled from an input distribution D over X ∗ ; the joint
distribution will be denoted by D ⊗ P. In general, learning FSTs is known to be
hard. Thus, our learning algorithms need to make some assumptions about the
FST and the input distribution. Before stating them we introduce some notation.
For any a ∈ X let p_a = Pr[X_1 = a], and define an “average” transition matrix
T = \sum_a p_a T_a for P and μ = min_a p_a, which characterizes the spread of D.
Assumptions. An FST can be learned when D and P satisfy the following:
(1) l ≥ m, (2) Dα and O have rank m, (3) T has rank m, (4) μ > 0.
Assumptions 1 and 2 on the nature of the target FST have counterparts in HMM
learning. In particular, the assumption on Dα requires that no state has zero
initial probability. Assumption 3 is an extension for FSTs of a similar condition
for HMM, but in this case depends on D as well as on P. Condition 4 on the input
distribution ensures that all input symbols will be observed in a large sample.
For more details about the implications of these assumptions, see [13].

3 A Spectral Learning Algorithm


In this section we present a learning algorithm for FSTs based on spectral de-
compositions. The algorithm will find a set of operators B for an FST which
are equivalent to the operators A presented above, in the sense that they de-
fine the same distribution. In section 4 we will present a theoretical analysis for
this algorithm, proving strong generalization guarantees when the assumptions
described in the previous section are fulfilled.
In addition, we also present an algorithm to directly retrieve the observation,
initial and transition probabilities. The algorithm is based on a joint decompo-
sition method which, to the best of our knowledge, has never been applied to
OOM learning before.
We start by defining probabilities over bigrams and trigrams of output sym-
bols generated by an FST. Let P ∈ R^{l×l} be a matrix of probabilities over bigrams
of output symbols, where

P(i, j) = Pr[Y_{1:2} = b_j b_i] .   (3)

Furthermore, for each two input-output symbols a ∈ X and b ∈ Y we define a
matrix P^b_a ∈ R^{l×l} of marginal probabilities over output trigrams as follows:

P^b_a(i, j) = Pr[Y_{1:3} = b_j b b_i | X_2 = a] .   (4)

Some algebraic manipulations show the following equivalences (recall that
T = \sum_a T_a Pr[X_1 = a]):

P = O T D_α O^\top ,   (5)
P^b_a = O T_a D_b T D_α O^\top .   (6)
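
The equivalences (5)–(6) can be checked numerically by building both sides from randomly drawn parameters; the sketch below is illustrative and assumes, for simplicity, i.i.d. input symbols with distribution p.

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
m, k, l = 2, 2, 3
T = [rng.dirichlet(np.ones(m), size=m).T for _ in range(k)]  # transition matrices T_a
O = rng.dirichlet(np.ones(l), size=m).T                      # emission matrix
alpha = rng.dirichlet(np.ones(m))                            # initial distribution
p = rng.dirichlet(np.ones(k))                                # Pr[X_s = a], i.i.d. for this check

Tbar = sum(p[a] * T[a] for a in range(k))  # "average" transition matrix
D_alpha = np.diag(alpha)

# Right-hand side of Eq. (5).
P = O @ Tbar @ D_alpha @ O.T

# Left-hand side from the generative definition: P(i,j) = Pr[Y_{1:2} = b_j b_i].
P_brute = np.zeros((l, l))
for a, h1, h2 in product(range(k), range(m), range(m)):
    P_brute += p[a] * alpha[h1] * T[a][h2, h1] * np.outer(O[:, h2], O[:, h1])
print(np.allclose(P, P_brute))  # True

# Eq. (6), for one pair of symbols: P^b_a = O T_a D_b Tbar D_alpha O^T.
a, b = 0, 1
Pab = O @ T[a] @ np.diag(O[b]) @ Tbar @ D_alpha @ O.T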

Algorithm LearnFST(X, Y, S, m)

Input:
– X and Y are input-output alphabets
– S = {(x^1, y^1), . . . , (x^n, y^n)} is a training set of input-output sequences
– m is the number of hidden states of the FST
Output:
– Estimates of the observable parameters β_1, β_∞ and B^b_a for all a ∈ X and b ∈ Y

1. Use S to compute an empirical estimate of the probability matrices \hat ρ, \hat P, and
   \hat P^b_a for each pair of input-output symbols a ∈ X and b ∈ Y
2. Take \hat U to be the matrix of top m left singular vectors of \hat P
3. Compute the observable FST parameters as: \hat β_1 = \hat U^\top \hat ρ, \hat β_∞ = \hat ρ^\top (\hat U^\top \hat P)^+,
   and \hat B^b_a = (\hat U^\top \hat P^b_a)(\hat U^\top \hat P)^+ for each a ∈ X and b ∈ Y

Fig. 1. An algorithm for learning FST
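
The three steps of Fig. 1 translate almost line by line into numpy; the following is an illustrative reconstruction under the conventions above, not the authors' implementation, and it reads the required statistics off the first three positions of each training pair.

import numpy as np

def learn_fst(S, k, l, m):
    # Spectral learning of an FST (sketch of the algorithm in Fig. 1).
    # S: list of (x, y) pairs of equal-length integer sequences over [0,k) and [0,l).
    n = len(S)
    rho = np.zeros(l)             # rho(i)        ~ Pr[Y_1 = b_i]
    P = np.zeros((l, l))          # P(i,j)        ~ Pr[Y_{1:2} = b_j b_i]
    Pab = np.zeros((k, l, l, l))  # Pab[a,b][i,j] ~ Pr[Y_{1:3} = b_j b b_i | X_2 = a]
    na = np.zeros(k)
    # Step 1: empirical frequencies of initial symbols, bigrams and trigrams.
    for x, y in S:
        rho[y[0]] += 1.0 / n
        if len(y) >= 2:
            P[y[1], y[0]] += 1.0 / n
        if len(y) >= 3:
            na[x[1]] += 1                      # x[1] is the second input symbol X_2
            Pab[x[1], y[1], y[2], y[0]] += 1.0
    for a in range(k):
        if na[a] > 0:
            Pab[a] /= na[a]
    # Step 2: top m left singular vectors of the estimated P.
    U = np.linalg.svd(P)[0][:, :m]
    # Step 3: observable parameters, Eqs. (7)-(9), with plug-in estimates.
    UP_pinv = np.linalg.pinv(U.T @ P)
    beta1 = U.T @ rho
    beta_inf = rho @ UP_pinv
    B = [[(U.T @ Pab[a, b]) @ UP_pinv for b in range(l)] for a in range(k)]
    return beta1, beta_inf, B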

Note that if Assumptions 1–3 are satisfied, then P has rank m. In this case we
can perform an SVD on P = U Σ V^∗ and take U ∈ R^{l×m} to contain its top m
left singular vectors. It is shown in [13] that under these conditions the matrix
U^\top O is invertible. Finally, let ρ ∈ R^l be the initial symbol probabilities, where
ρ(i) = Pr[Y_1 = b_i].
Estimations of all these matrices can be efficiently computed from a sam-
ple obtained from D ⊗ P. Now we use them to define the following observable
representation for P:

β_1 = U^\top ρ ,   (7)
β_∞ = ρ^\top (U^\top P)^+ ,   (8)
B^b_a = (U^\top P^b_a)(U^\top P)^+ .   (9)
The next lemma shows how to compute FST probabilities using these new observable
operators.
Lemma 2 (Observable FST representation). Assume D and P obey As-
sumptions 1–3. For any a ∈ X, b ∈ Y, x ∈ X^t and y ∈ Y^t, the following hold.

β_1 = (U^\top O) α ,   (10)
β_∞^\top = 1^\top (U^\top O)^{−1} ,   (11)
B^b_a = (U^\top O) A^b_a (U^\top O)^{−1} ,   (12)
P(y|x) = β_∞^\top B^{y_t}_{x_t} · · · B^{y_1}_{x_1} β_1 .   (13)
The proof is analogous to that of Lemma 3 of [13]. We omit it for brevity.
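
Eq. (13) turns prediction with the learned observable parameters into a chain of matrix-vector products; the following minimal sketch assumes the output format of the hypothetical learn_fst above.

import numpy as np

def fst_prob(y, x, beta1, beta_inf, B):
    # P(y|x) = beta_inf^T B^{y_t}_{x_t} ... B^{y_1}_{x_1} beta_1, Eq. (13).
    state = np.asarray(beta1, dtype=float)
    for a, b in zip(x, y):
        state = B[a][b] @ state
    return float(np.dot(beta_inf, state))

At finite sample sizes the resulting values may fall slightly outside [0, 1], a known artifact of spectral methods; in practice they are truncated or renormalized.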

3.1 Recovering the Original FST Parameters


We now describe an algorithm for recovering the standard FST parameters,
namely O, α and Ta for a ∈ X . Though this is not necessary for computing

sequence probabilities, it may be an appealing approach for applications that


require computing quantities which are not readily available from the observable
representation, e.g. state marginal probabilities.
Similar to before, we will define some probability matrices. Let P^b_3 ∈ R^{l×l} be
a matrix of probabilities over output trigrams, where

P^b_3(i, j) = Pr[Y_{1:3} = b_j b b_i] .   (14)

Let P_3 ∈ R^{l×l} account for probabilities of output trigrams, marginalizing the
middle symbol,

P_3(i, j) = Pr[Y_1 = b_j, Y_3 = b_i] .   (15)

This matrix can be expressed as

P_3 = \sum_a \sum_b P^b_a Pr[X_2 = a] .   (16)

Finally, let P_a ∈ R^{l×l} be probabilities of output bigrams, where

P_a(i, j) = Pr[Y_{1:2} = b_j b_i | X_1 = a] .   (17)

Now, for every b ∈ Y define Q^b = P^b_3 P_3^+. Writing T_2 = \sum_a T_a Pr[X_2 = a], one
can see that

Q^b = (O T_2) D_b (O T_2)^+ .   (18)

The equation above is an eigenvalue-eigenvector decomposition of the matrices
Q^b. These matrices allow for a joint eigen-decomposition where the eigenvalues
of Q^b correspond to the row of O associated with b.
Our algorithm first computes empirical estimates \hat ρ, \hat P^b_3, \hat P_3 and \hat P_a, and builds
\hat Q^b. Then it performs a Joint Schur Decomposition of the matrices \hat Q^b to retrieve
the joint eigenvalues and compute \hat O. We use the optimization algorithm from
[12] to perform the joint Schur decomposition. Finally, estimates of transition
matrices for all a ∈ X and initial state probabilities are obtained as:

\hat α = \hat O^+ \hat ρ ,   (19)
\hat T_a = \hat O^+ \hat P_a (D_{\hat α} \hat O^\top)^+ .   (20)

The correctness of these expressions in the error-free case can be easily verified.
Though this method is provided without an error analysis, some experiments
in Section 5 demonstrate that in some cases the parameters recovered with this
algorithm can approximate the target FST better than the observable represen-
tation obtained with LearnFST.
Note that this method is different from those presented in [18,13] for recov-
ering parameters of HMMs. Essentially, their approach requires finding a set of
eigenvectors, while our method recovers a set of joint eigenvalues. Furthermore,
our method could also be used to recover parameters from HMMs.
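
To illustrate the recovery idea in the exact (error-free) case, the sketch below replaces the joint Schur decomposition of [12] with an ordinary eigendecomposition of a single Q^b, reusing its eigenvectors to read off the joint eigenvalues of the others; this simplification assumes distinct, well-separated eigenvalues and is not the full algorithm.

import numpy as np

def recover_O(Q_list, m):
    # Recover the emission matrix O (up to column permutation) from the
    # matrices Q^b = (O T_2) D_b (O T_2)^+ of Eq. (18); exact case only.
    l = len(Q_list)
    w, V = np.linalg.eig(Q_list[0])
    V = V[:, np.argsort(-np.abs(w))[:m]]  # keep the m informative eigendirections
    V_pinv = np.linalg.pinv(V)
    O_hat = np.zeros((l, m))
    for b in range(l):
        # The joint eigenvalues of Q^b form row b of O.
        O_hat[b] = np.real(np.diag(V_pinv @ Q_list[b] @ V))
    return O_hat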

4 Theoretical Analysis

In this section the algorithm LearnFST is analyzed. We show that, under some
assumptions on the target FST and the input distribution, it will output a good
hypothesis with high probability whenever the sample is large enough. First we
discuss the learning model, then we state our main theorem, and finally we sketch
the proof. Our proof schema follows closely that of [13]; therefore, only the key
differences with their proof will be described, and, in particular, the lemmas
which are stated without proof can be obtained by mimicking their techniques.

4.1 Learning Model


Our learning model resembles that in [1] for learning stochastic rules, but uses a
different loss function. As in the well-known PAC model, we have access to ex-
amples (x, y) drawn i.i.d. from D⊗P. The difference with concept learning is that
now, instead of a deterministic rule, the learning algorithm outputs a stochastic
rule modelling a conditional distribution P that given an input sequence x can be
used to predict an output sequence y. As in all models that learn input-output
relations, the accuracy of the hypothesis is measured relatively to the same input
distribution that was used to generate the training sample. In our case, we are
interested in minimizing

d_D(P, \hat P) = E_{X∼D} [ \sum_y |P(y|X) − \hat P(y|X)| ] .   (21)


This loss function corresponds to the L1 distance between D ⊗ P and D ⊗ \hat P.

4.2 Results
Our learning algorithm will be shown to work whenever D and P satisfy Assump-
tions 1–4. In particular, note that 2 and 3 imply that the mth singular values of
O and P , respectively σO and σP , are positive.
We proceed to state our main theorem. There, instead of restricting ourselves
to input-output sequences of some fixed length t, we consider the more general,
practically relevant case where D is a distribution over X ∗ . In this case the bound
depends on λ = EX∼D [|X|], the expected length of input sequences.

Theorem 1. For any 0 < ε, δ < 1, if D and P satisfy Assumptions 1–4, and
LearnFST receives as input m and a sample with n ≥ N examples for some N in

O( (λ^2 m l)/(ε^4 μ σ_O^2 σ_P^4) \log(k/δ) ) ,   (22)

then, with probability at least 1 − δ, the hypothesis \hat P returned by the algorithm
satisfies d_D(P, \hat P) ≤ ε.

Note that the function N in Theorem 1 depends on D through λ, μ and σ_P — this
situation is quite different from the setting found in HMMs. The price we pay
for choosing a setting with input strings of arbitrary length is a dependence of
type O(λ^2/ε^4) in the bound. A dependence of type O(t^2/ε^2) can be obtained,
using similar techniques, in the setting where input strings have fixed length t.
However, we believe the latter setting to be less realistic for practical applica-
tions. Furthermore, a better dependence on  can be proved for the following
particular case of practical interest.
Recall that if X ∼ D is modeled by an HMM with an absorbing state —
equivalently, an HMM with stopping probabilities — the random variable |X|
follows a phase-type (PH) distribution [19]. It is well known that after a transient
period, these distributions present an exponential rate of decay. In particular,
for any such D there exist positive constants τ_1, τ_2 such that if t ≥ τ_1, then
Pr[|X| ≥ t] = O(e^{−t/τ_2}). In this case our techniques yield a bound of type
O(τ_1^2 τ_2^2/ε^2 · \log(1/ε)). In many practical problems it is not uncommon to assume
that input sequences follow some kind of Markovian process; this alternative
bound can be applied in such cases.
Though it will not be discussed in detail here due to space reasons, the de-
pendence on l in Equation 22 can be relaxed to take into account only the most
probable symbols; this is useful when D exhibits a power law behavior. Fur-
thermore, our algorithm provably works (cf. [13]) with similar guarantees in the
agnostic setting where P cannot be exactly modelled with an FST, but is close
to some FST satisfying Assumptions 1–3. Finally, in application domains where
k is large, there may be input symbols with very low probability that are not
observed in a sample. For these cases, we believe that it may be possible to soften
the (implicit) dependence of N on k through 1/μ using smoothing techniques.
Smoothing procedures have been used in practice to solve these issues in many
related problems [7]; theoretical analyses have also proved the validity of this
approach [21].
We want to stress here that our result goes beyond a naive application of
the result by Hsu et al. [13] to FST learning. One could try to learn an HMM
modeling the joint distribution D ⊗ P, but their result would require that this
distribution can be modeled by some HMM; we do not need this assumption in
our result. Another approach would be to learn k distinct HMMs, one for each
input symbol; this approach would miss the fact that the operator O is the same
in all these HMMs, while our method is able to exploit this fact to its advantage
by using the same U  for every operator. In Section 5 we compare our algorithm
against these two baselines and show that it behaves better in practice.

4.3 Proofs

The main technical difference between algorithm LearnFST and spectral tech-
niques for learning HMM is that in our case the operators Bab depend on the
input symbol as well as the output symbol. This implies that estimation er-
rors of Bab for different input symbols will depend on the input distribution; the
occurrence of μ in Equation 22 accounts for this fact.

First we introduce some notation. We will use ‖·‖_p to denote the usual ℓ_p
norms for vectors and the corresponding induced norms for matrices, and ‖·‖_F
will be used to denote the Frobenius norm. Given n examples (x^1, y^1), . . . , (x^n, y^n)
drawn i.i.d. from D ⊗ P, we denote by n_a the number of samples such that x_2 = a,
which measures how well P^b_a is estimated. Also define the following estimation
errors: ε_ρ = ‖ρ − \hat ρ‖_2, ε_P = ‖P − \hat P‖_F and ε_a = \sum_b ‖P^b_a − \hat P^b_a‖_F. The first lemma
bounds these errors in terms of the sample size. The results follow from a simple
analysis using Chernoff bounds and McDiarmid’s inequality.

Lemma 3. With probability at least 1 − δ, the following hold simultaneously:

ε_ρ ≤ \sqrt{1/n} (1 + \sqrt{\log(4/δ)}) ,   (23)
ε_P ≤ \sqrt{1/n} (1 + \sqrt{\log(4/δ)}) ,   (24)
∀a   ε_a ≤ \sqrt{l/n_a} (1 + \sqrt{\log(4/δ)}) ,   (25)
∀a   n_a ≥ n p_a − \sqrt{2 n p_a \log(4k/δ)} .   (26)

The next lemma is almost identical to Lemma 10 in [13], and is repeated here for
completeness. Both this and the following one require that D and P satisfy
Assumptions 1–3 described above. These three quantities are used in both state-
ments:

δ_1 = ‖(\hat U^\top O)^{−1} (\bar β_1 − \hat β_1)‖_1 ,   (27)
δ_∞ = ‖(\hat U^\top O)^\top (\bar β_∞ − \hat β_∞)‖_∞ ,   (28)
Δ_a = \sum_b ‖(\hat U^\top O)^{−1} (\bar B^b_a − \hat B^b_a)(\hat U^\top O)‖_1 .   (29)

Here, the definitions of \bar β_1, \bar β_∞ and \bar B^b_a correspond to substituting U by \hat U in the
expressions for β_1, β_∞ and B^b_a respectively.

Lemma 4. If ε_P ≤ σ_P/3, then

δ_1 ≤ (2/\sqrt{3}) ε_ρ \sqrt{m}/σ_O ,   (30)
δ_∞ ≤ 4 (ε_P/σ_P^2 + ε_ρ/(3σ_P)) ,   (31)
Δ_a ≤ (8/\sqrt{3}) (\sqrt{m}/σ_O) (ε_P/σ_P^2 + ε_a/(3σ_P)) .   (32)

The lemma follows from a perturbation analysis on the singular values of \hat U^\top O,
\hat U^\top P and \hat U^\top \hat P. In particular, the condition on ε_P ensures that \hat U^\top O is invertible.
Our next lemma gives two inequalities useful for bounding the error between
P and the output from LearnFST. The proof extends that of Lemmas 11 and 12
from [13] and is omitted in this version; the main difference is that now bounds
depend on the input sequence. Note that the second inequality is a consequence
of the first one.


Lemma 5. For all x ∈ X t , let εx = ts=1 (1 + xs ). The following hold:

(U  O)−1 (B  y β∞ − B
y β ∞ )1 ≤ (1 + 1 )εx − 1 , (33)
x x
y∈Y t


|P(y|x) − P(y|x)| ≤ (1 +
1 )(1 +
∞ )εx − 1 . (34)
y∈Y t

Now we proceed to prove our main theorem.


Proof (Proof of Theorem 1). First note that by the assumptions on D and P
all the above lemmas can be used. In particular, by Lemmas 3 and 4 we have
that, for some constants c_1, c_2, c_3, c_4 > 0, the following hold simultaneously with
probability 1 − δ:
1. n ≥ c_1/σ_P^2 · \log(1/δ) implies ε_P ≤ σ_P/3,
2. n ≥ c_2 m/(ε^2 σ_O^2) · \log(1/δ) implies δ_1 ≤ ε/40,
3. n ≥ c_3/(ε^2 σ_P^4) · \log(1/δ) implies δ_∞ ≤ ε/40, and
4. n ≥ c_4 λ^2 l m/(ε^4 μ σ_O^2 σ_P^2) · \log(k/δ) implies ∀a Δ_a ≤ ε^2/(20λ).

Item 4 above uses the fact that u − \sqrt{cu} ≥ u/2 for u ≥ 4c. Finally, we use that
(1 + u/t)^t ≤ 1 + 2u for all t ≥ 0 and u ≤ 1/2 to obtain the bound:

d_D(P, \hat P) ≤ \sum_{|x|<4λ/ε} D(x) \sum_{y ∈ Y^{|x|}} |P(y|x) − \hat P(y|x)| + 2 \sum_{|x|≥4λ/ε} D(x) ≤ ε ,   (35)

where the first term is at most ε/2 by Lemma 5, and the second is bounded
using Markov’s inequality.

5 Synthetic Experiments
In this section we present experiments using our FST learning algorithm with
synthetic data. We are interested in four different aspects. First, we want to eval-
uate how the estimation error of the learning algorithm behaves as we increase
the training set size and the difficulty of the target. Second, how the estimation
error degrades with the length of test sequences. In the third place, we want to
compare our algorithms with other, more naive, spectral methods for learning
FST. And four, we compare LearnFST with our other algorithm for recovering
the parameters of an FST using a joint Schur decomposition.
For our first experiment, we generated synthetic data of increasing difficulty
as predicted by our analysis, as follows. First, we randomly selected a distribu-
tion over input sequences of length three, for input alphabet sizes ranging from 2
to 10, and choosing among uniform, gaussian and power distributions with ran-
dom parameters. Second, we randomly selected an FST, choosing from output
alphabet sizes from 2 to 10, choosing a number of hidden states and randomly
generating initial, transition and observation parameters. For a choice of input
distribution and FST, we computed the quantities appearing in the bound ex-
cept for the logarithmic term, and defined c = (λ^2 m l)/(μ σ_O^2 σ_P^4). According to

[Figure 2: plots of L1 distance vs. # samples for models of increasing difficulty c ~ 10^6, 10^9, 10^10, 10^11 (left), and L1 distance vs. sequence length for 32K/128K/512K training samples (right)]

Fig. 2. (Left) Learning curves for models at increasing difficulties, as predicted by our
analysis. (Right) L1 distance with respect to the length of test sequences, for models
trained with 32K, 128K and 512K training examples (k = 3, l = 3, m = 2).

our analysis, the quantity c is an estimate of the difficulty of learning the FST. In
this experiment we considered four random models, that fall into different orders
of c. For each model, we generated training sets of different sizes, by sampling
from the corresponding distribution. Figure 2 (left) plots d_D(P, \hat P) as a function
of the training set size, where each curve is an average of 10 runs. The curves
follow the behavior predicted by the analysis.
The results from our second experiment can be seen in Figure 2 (right), which
plots the error of learning a given model (with k = 3, l = 3 and m = 2) as a
function of the test sequence lengths t, for three training set sizes. The plot
shows that increasing the number of training samples has a clear impact in the
performance of the model on longer sequences. It can also be seen that, as we
increase the number of training samples, the curve seems to flatten faster, i.e.
the growth rate of the error with the sequence length decreases nicely.
In the third experiment we compared LearnFST to another two baseline spec-
tral algorithms. These baselines are naive applications of the algorithm by Hsu
et al. [13] to the problem of FST learning. The first baseline (HMM) learns an
HMM that models the joint distribution D ⊗ P. The second baseline (k-HMM)
learns k different HMMs, one for each input symbol. This corresponds to learning
an operator Bab for each pair (a, b) ∈ X × Y using only the observations where
X2 = a and Y2 = b, ignoring the fact that one can use the same U , computed
with all samples, for every operator Bab . In this experiment, we randomly created
an input distribution D and a target FST P (using k = 3, l = 3, m = 2). Then
we randomly sampled training sequence pairs from D ⊗ P, and trained models
using the three spectral algorithms. To evaluate the performance we measured
the L1 distance on all sequence pairs of length 3. Figure 3 (left) plots learn-
ing curves resulting from averaging performance across 5 random runs of the
experiment. It can be seen that with enough examples the baseline algorithms
are outperformed by our method. Furthermore, the fact that the joint HMM

0.1
svd
0.08
joint schur

L1 distance
0.06
0.7
HMM
0.6 k−HMM 0.04
FST
0.5 0.02
L1 distance

0.4 0
8 16 32 64 128 256 512 1024 2048 4096
# training samples (in thousands)
0.3
0.8
svd
0.2 joint schur
0.6

L1 distance
0.1

0.4
0
32 128 512 2048 8192 32768
# training samples (in thousands)
0.2

0
8 16 32 64 128
# training samples (in thousands)

Fig. 3. (Left) Comparison with spectral baselines. (Right) Comparison with joint de-
composition method.

outperforms the conditional FST with small sample sizes is consistent with the
well-known phenomenon in classification where generative models can outperform
discriminative models with small sample sizes [20].
Our last experiment’s goal is to showcase the behavior of the algorithm pre-
sented in Section 3.1 for recovering the parameters of an FST using a joint
Schur decomposition. Though we do not have a theoretical analysis of this algo-
rithm, several experiments indicate that its behavior tends to depend more on
the particular model than that of the rest of spectral methods. In particular, in
many models we observe an asymptotic behavior similar to the one presented by
LearnFST, and for some of them we observe better absolute performance. Two
examples of this can be found in Figure 3 (right), where the accuracy versus
the number of examples is plotted for two different, randomly selected models
(with k = 3, l = 3, m = 2).

6 Experiments on Transliteration
In this section we present experiments on a real task in Natural Language Pro-
cessing, machine transliteration. The problem consists of mapping named entities
(e.g. person names, locations, etc.) between languages that have different alpha-
bets and sound systems, by producing a string in the target language that is
phonetically equivalent to the string in the source language. For example, the
English word “brooklyn” is transliterated into Russian as “ ”. Because

Table 1. Properties of the transliteration dataset. “length ratio” is the average ratio
between lengths of input and output training sequences. “equal length” is the percent-
age of training sequence pairs of equal length.

number of training sequences   6,000     average length x   7.84
number of test sequences         943     average length y   8.20
size of X                         82     length ratio       0.959
size of Y                         34     equal length       53.42%

orthographic and phonetic systems across languages differ, the lengths of paired
strings also differ in general. The goal of this experiment is to test the perfor-
mance of our learning algorithm in real data, and to compare it with a standard
EM algorithm for training FSTs.
We considered the English to Russian transliteration task of the News shared
task [17]. Training and test data consists of pairs of strings. Table 1 gives addi-
tional details on the dataset.
A standard metric to evaluate the accuracy of a transliteration system is the
normalized edit distance (ned) between the correct and predicted transliter-
ations. It counts the minimum number of character deletions, insertions and
substitutions that need to be made to transform the predicted string into the
correct one, divided by the length of the correct string and multiplied by 100.
In order to apply FSTs to this task we need to handle sequence pairs of un-
equal lengths. Following the classic work on transliteration by Knight and Graehl
[16] we introduced special symbols in the output alphabet which account for an
empty emission and every combination of two output symbols; thus, our FSTs
can map an input character to zero, one or two output characters. However,
the correct character alignments are not known. To account for this, for every
training pair we considered all possible alignments as having equal probability.2
It is easy to adjust our learning algorithm such that when computing the prob-
ability estimates (step 1 in the algorithm of Figure 1) we consider a distribution
over alignments between training pairs. This can be done efficiently with a sim-
ple extension to the classic dynamic programming algorithm for computing edit
distances.
At test, predicting the best output sequence (summing over all hidden se-
quences) is not tractable. We resorted to the standard approach of sampling,
where we used the FST to compute conditional estimates of the next output
symbol (see [13] for details on these computations).
The only free parameter of the FST is the number of hidden states (m).
There is a trade-off between increasing the number of hidden states, yielding
2 The alignments between sequences are a missing part in the training data, and
learning such alignments is in fact an important problem in FST learning (e.g.,
see [16]). However, note that our focus is not on learning alignments, but instead
on learning non-deterministic transductions between aligned sequences. In practice,
our algorithm could be used with an iterative EM method to learn both alignment
distributions and hidden states, and we believe future work should explore this line.

Table 2. Normalized edit distance at test (ned) of a model as a function of the number
of hidden states (m), using all training samples. σ is the mth singular value of P̂ .

m 1 2 3 4 5
σ 0.0929 0.0914 0.0327 0.0241 0.0088
ned 21.769 21.189 21.224 26.227 71.780

Table 3. Running times (averaged, in seconds) of a single EM iteration, for different


number of training pairs and two different values for m. The number in parentheses is
the number of iterations it takes to reach the best test performance (see Figure 4).

75 350 750 1500 3000 6000


m=2 1.15 (120) 3.53 (70) 6.73 (140) 10.60 (120) 19.8 (50) 37.74 (40)
m=3 1.16 (50) 3.55 (80) 6.75 (40) 10.62 (180) 19.9 (180) 37.78 (30)

lower approximation error, and increasing the estimation error. In particular,


our analysis states that the estimation error depends on the mth singular value
of P̂ . Table 2 illustrates this trade-off. Clearly, the singular values are a good
indicator of the proper range for m.
We also trained FST models using EM, for different number of hidden states.
We tried multiple random initializations and ran EM for a sufficiently large
number of iterations (200 in our experiments). We evaluated each EM model at
every 10 iterations on the test set, and chose the best test performance.
Figure 4 shows the best learning curves using the spectral algorithm and EM,
for m = 2 and m = 3. The performance of the spectral method is similar to
that of EM for large training sizes while for smaller training sizes EM seems
to be unable to find a good model. Our experience at running EM was that for
large training sizes, the performance of different runs was similar, while for small
training sets the error rates of different runs had a large variance. The spectral
method seems to be very stable at finding good solutions.
We also compared the two learning methods in terms of computation time.3
Table 3 shows the time it takes to complete an iteration under EM, together
with the number of iterations it takes to reach the best error rates at tests.
In comparison, the spectral method takes about 13 seconds to compute the
statistics on the larger training set (step 1 on algorithm 1) and about 13 seconds
to perform the SVD and compute the operators (steps 2 and 3 on algorithm 1)
which gives a total of 26 seconds for the largest setting.
Comparing to state-of-the-art on the News data, our model obtains 87% on
the F1 metric, while the range of performances goes from 86.5% to 93% [17].4
It should be noted that transliteration systems exploit combinations of several
models optimized for the task. In contrast, we use out-of-the-box FSTs.
3 We used Matlab on an Intel Xeon 2.40GHz machine with 12Gb of RAM running
Linux.
4 Normalized edit distance was not measured in the News’09 Shared Task.

[Figure 4: normalized edit distance vs. # training sequences, for Spectral and EM with m=2 and m=3]

Fig. 4. Learning curves for transliteration experiments using the spectral algorithm
and EM, for different number of hidden states. Error is measured as Normalized Edit
Distance.

7 Conclusions
In this paper we presented a spectral learning algorithm for probabilistic non-
deterministic FSTs. The main results are strong PAC-style guarantees, which, to
our knowledge, are the first for FST learning. Furthermore, we present extensive
experiments demonstrating the effectiveness of the proposed method in practice,
when learning from synthetic and real data.
An attractive property of our algorithm is its speed and scalability at training.
Experiments on a transliteration task show that, in practice, it is an effective
algorithm for learning FSTs. Our models could be used as building blocks to
solve complex tasks, such as parsing and translation of natural languages, and
planning in reinforcement learning.
Future work should improve the behavior of our algorithm in large input
alphabets by means of smoothing procedures. In practice, this should improve
the robustness of the method and make it applicable to a wider set of tasks.
Other lines of future research include: conducting a theoretical analysis of the
joint Schur approach for recovering parameters of HMM and FST, and exploring
the power of our algorithm for learning more general families of transductions.

References
1. Abe, N., Takeuchi, J., Warmuth, M.: Polynomial Learnability of Stochastic Rules
with Respect to the KL-Divergence and Quadratic Distance. IEICE Transactions
on Information and Systems 84(3), 299–316 (2001)
2. Bailly, R., Denis, F., Ralaivola, L.: Grammatical inference as a principal component
analysis problem. In: Proc. ICML (2009)

3. Bernard, M., Janodet, J.-C., Sebban, M.: A discriminative model of stochastic edit
distance in the form of a conditional transducer. In: Sakakibara, Y., Kobayashi, S.,
Sato, K., Nishino, T., Tomita, E. (eds.) ICGI 2006. LNCS (LNAI), vol. 4201, pp.
240–252. Springer, Heidelberg (2006)
4. Carlyle, J.W., Paz, A.: Realization by stochastic finite automaton. Journal of Com-
puter and System Sciences 5, 26–40 (1971)
5. Casacuberta, F.: Inference of finite-state transducers by using regular grammars
and morphisms. In: Oliveira, A.L. (ed.) ICGI 2000. LNCS (LNAI), vol. 1891, pp.
1–14. Springer, Heidelberg (2000)
6. Chang, J.T.: Full reconstruction of Markov models on evolutionary trees: Identifi-
ability and consistency. Mathematical Biosciences 137, 51–73 (1996)
7. Chen, S., Goodman, J.: An empirical study of smoothing techniques for language
modeling. In: Proc. of ACL, pp. 310–318 (1996)
8. Clark, A.: Partially supervised learning of morphology with stochastic transducers.
In: Proc. of NLPRS, pp. 341–348 (2001)
9. Clark, A., Costa Florêncio, C., Watkins, C.: Languages as hyperplanes: grammat-
ical inference with string kernels. Machine Learning, 1–23 (2010)
10. Eisner, J.: Parameter estimation for probabilistic finite-state transducers. In: Proc.
of ACL, pp. 1–8 (2002)
11. Fliess, M.: Matrices de Hankel. Journal de Mathematiques Pures et Appliquees 53,
197–222 (1974)
12. Haardt, M., Nossek, J.A.: Simultaneous Schur decomposition of several nonsymmet-
ric matrices to achieve automatic pairing in multidimensional harmonic retrieval
problems. IEEE Transactions on Signal Processing 46(1) (1998)
13. Hsu, D., Kakade, S.M., Zhang, T.: A spectral algorithm for learning hidden Markov
models. In: Proc. of COLT (2009)
14. Jaeger, H.: Observable operator models for discrete stochastic time series. Neural
Computation 12, 1371–1398 (2000)
15. Jelinek, F.: Statistical Methods for Speech Recognition (Language, Speech, and
Communication). MIT Press, Cambridge (1998)
16. Knight, K., Graehl, J.: Machine transliteration. Computational Linguistics 24(4),
599–612 (1998)
17. Li, H., Kumaran, A., Pervouchine, V., Zhang, M.: Report of NEWS 2009 machine
transliteration shared task. In: Proc. Named Entities Workshop (2009)
18. Mossel, E., Roch, S.: Learning nonsingular phylogenies and hidden Markov models.
In: Proc. of STOC (2005)
19. Neuts, M.F.: Matrix-geometric solutions in stochastic models: an algorithmic ap-
proach. Johns Hopkins University Press, Baltimore (1981)
20. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of
logistic regression and naive Bayes. In: NIPS (2002)
21. Ron, D., Singer, Y., Tishby, N.: On the learnability and usage of acyclic proba-
bilistic finite automata. In: Proc. of COLT, pp. 31–40 (1995)
22. Schützenberger, M.: On the definition of a family of automata. Information and
Control 4, 245–270 (1961)
23. Siddiqi, S.M., Boots, B., Gordon, G.J.: Reduced-Rank Hidden Markov Models. In:
Proc. AISTATS, pp. 741–748 (2010)
An Analysis of Probabilistic Methods for Top-N
Recommendation in Collaborative Filtering

Nicola Barbieri1,2 and Giuseppe Manco2


1
Department of Electronics, Informatics and Systems - University of Calabria,
via Bucci 41c, 87036 Rende (CS) - Italy
nbarbieri@deis.unical.it
2
Institute for High Performance Computing and Networks (ICAR)
Italian National Research Council
via Bucci 41c, 87036 Rende (CS) - Italy
{barbieri,manco}@icar.cnr.it

Abstract. In this work we perform an analysis of probabilistic ap-


proaches to recommendation upon a different validation perspective,
which focuses on accuracy metrics such as recall and precision of the
recommendation list. Traditionally, state-of-the-art approaches to recom-
mendation consider the recommendation process from a “missing value pre-
diction” perspective. This approach simplifies the model validation phase
that is based on the minimization of standard error metrics such as
RMSE. However, recent studies have pointed out several limitations of this
approach, showing that a lower RMSE does not necessarily imply im-
provements in terms of specific recommendations. We demonstrate that
the underlying probabilistic framework offers several advantages over tra-
ditional methods, in terms of flexibility in the generation of the recom-
mendation list and consequently in the accuracy of recommendation.

Keywords: Recommender Systems, Collaborative Filtering, Probabilis-


tic Topic Models, Performance.

1 Introduction

Recommender systems (RS) play an important role in several domains as they


provide users with potentially interesting recommendations within catalogs of
available information/products/services. Recommendations can rely either on
static information about the content of the available catalogs [16], or on a-
posteriori analysis of past behavior through collaborative filtering approaches
(CF) [7]. CF techniques are effective with huge catalogs when information about
past interactions is available.
To improve the accuracy of CF-based recommendation engines, researchers
have focused on the development of accurate techniques for rating prediction.
The recommendation problem has been interpreted as a missing value prediction
problem [19], in which, given an active user, the system is asked to predict her
preference for a set of items. Since a user is more prone to access items for which


she will likely provide a positive feedback, a recommendation list can hence be
built by drawing upon the (predicted) highly-rated items.
Under this perspective, a common approach to evaluate the predictive skills
of a recommender systems is to minimize statistical error metrics, such as the
Root Mean Squared Error (RMSE). The common assumption is that small im-
provements in RMSE would reflect into an increase of the accuracy of the recom-
mendation lists. This assumption, however does not necessarily hold. In [4], the
authors review the most common approaches to CF-based recommendation, and
compare them according to a new testing methodology which focuses on the ac-
curacy of the recommendation lists rather than on the rating prediction accuracy.
Notably, cutting-edge approaches characterized by low RMSE values achieve
performance comparable to naive techniques, whereas simpler approaches, such
as the pure SVD, consistently outperform the other techniques. In an attempt
to find an explanation, the authors attribute the contrasting behavior to a “lim-
itation of RMSE testing, which concentrates only on the ratings that the user
provided to the system” and consequently “misses much of the reality, where all
items should count, not only those actually rated by the user in the past” [4].
The point is that pure SVD rebuilds the original rating matrix in terms of
latent factors, rather than trying to minimize the error on observed data. In
practice, the underlying optimization problem is quite different, since it takes
into account the whole rating matrix considering both observed and unobserved
preference values. To summarize, it is likely to better identify the latent factors
and the hidden relationships between both factors/users and factors/items. It
is natural then to ask whether more sophisticated latent factor models confirm
this trend, and are able to guarantee better results in terms of recommendation
accuracy, even when they provide poor RMSE performances.
Among the state-of-the art latent factor models, probabilistic techniques of-
fer some advantages over traditional deterministic models: notably, they do not
minimize a particular error metric but are designed to maximize the likelihood
of the model given the data which is a more general approach; moreover, they
can be used to model a distribution over rating values which can be used to de-
termine the confidence of the model in providing a recommendation; finally, they
allow the possibility to include prior knowledge into the generative process, thus
allowing a more effective modeling of the underlying data distribution. However,
previous studies on recommendation accuracy do not take into consideration
such probabilistic approaches to CF, which instead appear rather promising un-
der the above devised perspective.
In this paper we adopt the testing methodology proposed in [4], and discuss
also other metrics [6] for assessing the accuracy of the recommendation list. Based
on these settings, we perform an empirical study of some paradigmatic probabilis-
tic approaches to recommendation. We study different techniques to rank items
in a probabilistic framework, and evaluate their impact in the generation of a
recommendation list. We shall consider approaches for both implicit and explicit
preference values, and show that latent factor models, equipped with the proper
ranking functions, achieve competitive advantages over traditional techniques.

The rest of the paper is organized as follows: the testing methodology and
the accuracy metrics are discussed in Sec. 2. Section 3 introduces the proba-
bilistic approaches to CF that we are interested in evaluating. The approaches
we include can be considered representative of wider classes which share the
same roots. In this context, our results can be extended to more sophisticated
approaches. Finally, in Sec. 4 we compare the approaches and assess their effec-
tiveness according to the selected testing methodology.

2 Evaluating Recommendations: A Review


To begin with, we introduce some notation to be used throughout the paper.
User preferences can be represented by using an m×n rating matrix R, where m
is the cardinality of the user-set U_R = {u_1, · · · , u_m} and n is the cardinality of
the item-set I_R = {i_1, · · · , i_n}. We denote by r_i^u (resp. R_{ui} when the reference
to the matrix needs to be made explicit) the rating value associated to the
pair u, i. Values fall within a fixed integer range V = {0, · · · , V}, where 0
denotes “rating unknown”, and V represents the highest interest value. Implicit
feedback assumes that V = 1. When V > 1, we will denote by \bar r_R the average
rating among all those ratings r_i^u > 0 in R. Users tend to express their interest
only on a restricted number of items; thus, the rating matrix is characterized by
an exceptional sparseness factor (e.g., more than 95%). Let I_R(u) denote the
set of products rated by the user u: I_R(u) = {i ∈ I : r_i^u ≠ 0}; symmetrically,
we will denote by UR (i) the set of users who have expressed their preference on
the item i.
The general framework for the generation of a recommendation list can be
modeled as follows. We will denote by Lju the recommendation list provided by
the system to the user u during a generic session j. Then, the following protocol
applies:
– Let C_u^j be a list of D candidate random items unrated by the user u in the past
sessions 1, . . . , j − 1;
– Associate to each item i ∈ C_u^j a score p_i^{u,j} which represents the user’s interest
for i in session j;
– Sort C_u^j in descending order given the values p_i^{u,j};
– Add the first N items from C_u^j to L_u^j and return the latter to the user.
Simple scoring functions can be obtained considering non-personalized baseline
models which take into account the popularity or the average rating of an items.
More specifically, Top Popular (Top-Pop) recommends items with the highest
number of ratings, while Item Average (Item-Avg) selects items with the highest
average rating. For the purposes of this paper, we assume that each RS is capable
of providing a specific scoring p_i^{u,j}. Thus, the testing methodology basically relies
on the evaluation of the capability of the RS in providing higher scores for the items
of interest in C_u^j.
A common framework in the evaluation of the predictive capabilities of a RS
algorithm is to split the rating matrix R into matrices T and S: the first one is

used to train the RS, while the latter is used for validation. It is worth noticing
that, while both T and S share the same dimensions as R, for each pair (u, i) we
have that Sui > 0 implies Tui = 0, i.e. no incompatible values overlap between
training and test set. By selecting a user in S, the set Cuj is obtained by drawing
upon IR − IT (u). Next, we ask the system to predict a set of items which he/she
may like and then measure the accuracy of the provided recommendation. Here,
the accuracy is measured by comparing the top-N items selected by resorting to
the RS, with those appearing in IS (u).

Precision and Recall of the Recommendation List. A first, coarse-grained ap-


proach to evaluation, can be obtained by employing standard classification-based
accuracy metrics such as precision and recall, which require the capability to dis-
tinguish between relevant and not relevant recommendations. Given a user, we
assume a unique session of recommendation, and we compare the recommen-
dation list of N items provided by the RS, according to the protocol described
above, with those relevant items in I_S(u). In particular, assuming we can identify
a subset T_u^r ⊆ I_S(u) of relevant items, we can compute precision and recall as:

Recall(N) = (1/M) \sum_{u=1}^{M} |L_u ∩ T_u^r| / |T_u^r|

Precision(N) = (1/M) \sum_{u=1}^{M} |L_u ∩ T_u^r| / N

Relevance can be measured in several different ways. Here we adopt two alter-
native definitions. When V > 1 (i.e., an explicit preference value is available)
we denote as relevant all those items which received a rating greater than the
average ratings in the training set, i.e.,

T_u^r = {i ∈ I_S(u) | S_{ui} > \bar r_T}

Implicit preferences assume instead that all items in IS (u) are relevant.
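
A reference implementation of the two metrics is straightforward; the sketch below assumes recommendation lists and relevant-item sets are given as plain Python containers (names illustrative).

def precision_recall_at_n(rec_lists, relevant, N):
    # rec_lists: dict user -> ranked list of item ids (length >= N).
    # relevant:  dict user -> set of relevant test items T_u^r.
    users = [u for u in rec_lists if relevant.get(u)]
    hits = {u: len(set(rec_lists[u][:N]) & relevant[u]) for u in users}
    recall = sum(hits[u] / len(relevant[u]) for u in users) / len(users)
    precision = sum(hits[u] / N for u in users) / len(users)
    return precision, recall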

Evaluating User Satisfaction. The above definitions of precision and recall aim
at evaluating the amount of useful recommendations in a single session. A dif-
ferent perspective can be considered by assuming that a recommendation meets
user satisfaction if he/she can find in the recommendation list at least an item
which meets his/her interests. This perspective can be better modeled by a dif-
ferent approach to measure accuracy, as proposed in [5,4]. The approach relies
on a different definition of relevant items, namely:

T_u^r = {i ∈ I_S(u) | S_{ui} = V}

Then, the following testing protocol can be applied:


– For each user u and for each positively-rated item i ∈ T_u^r:
  • Generate the candidate list C_u by randomly drawing from I_R − (I_T(u) ∪ {i});
  • add i to C_u and sort the list according to the scoring function;
• Record the position of the item i in the ordered list:
– if i belongs to the top-k items, we have a hit
– otherwise, we have a miss
Practically speaking, we ask the RS to rank an initial random sample which also
contains i. If i is actually recommended, we have a hit, otherwise the RS has
failed in detecting an item of high interest for the considered user. Recall and
precision can hence be tuned accordingly:
USRecall(N) = #hits / |T^r| ,   (1)

USPrecision(N) = #hits / (N · |T^r|) = USRecall(N) / N ,   (2)

where |T^r| = \sum_u |T_u^r| is the overall number of relevant test items.

Notice that the above definition of precision does not penalize false positives:
the recommendation is considered successful if it matches at least an item of
interest. However, neither the amount of non-relevant“spurious” items, nor the
position of the relevant item within the top-N is taken into account.
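
A sketch of this protocol follows, assuming the recommender exposes a scoring function score(u, i) and that a pool of items unrated by each user is available for sampling; the candidate size D = 1000 used in [4] is a common but not mandatory choice.

import random

def us_recall_precision(score, test_relevant, unrated_pool, N, D=1000, seed=0):
    # User-satisfaction recall/precision of Eqs. (1)-(2).
    # score(u, i) -> float; test_relevant: dict u -> set of top-rated test items;
    # unrated_pool: dict u -> list of >= D items unrated by u (test items excluded).
    rng = random.Random(seed)
    hits = trials = 0
    for u, items in test_relevant.items():
        for i in items:
            candidates = rng.sample(unrated_pool[u], D) + [i]
            candidates.sort(key=lambda j: score(u, j), reverse=True)
            hits += i in candidates[:N]  # hit iff i lands in the top-N
            trials += 1
    recall = hits / trials
    return recall / N, recall  # (USPrecision(N), USRecall(N))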

3 Collaborative Filtering in a Probabilistic Framework


Probabilistic approaches assume that each preference observation is randomly
drawn from the joint distribution of the random variables which model users,
items and preference values (if available). Typically, the random generation pro-
cess follows a bag of words assumption and preference observations are assumed
to be generated independently. A key difference between probabilistic and deter-
ministic models relies in the inference phase: while the latter approaches try to
minimize directly the error made by the model, probabilistic approaches do not
focus on a particular error metric; parameters are determined by maximizing the
likelihood of the data, typically employing an Expectation Maximization pro-
cedure. In addition, background knowledge can be explicitly modeled by means
prior probabilities, thus allowing a direct control on overfitting within the infer-
ence procedure [8]. By modeling prior knowledge, they implicitly solve the need
for regularization which affects traditional gradient-descent based latent factors
approaches.
Further advantages of probabilistic models can be found in their easy inter-
pretability: they can often be represented by using a graphical model, which
summarizes the intuition behind the model by underlying causal dependencies
between users, items and hidden factors. Also, they provide an unified frame-
work for combining collaborative and content features [1,21,11], to produce more
accurate recommendations even in the case of new users/items. Moreover, as-
suming that an explicit preference value is available, probabilistic models can

be used to model a distribution over rating values which can be used to infer
confidence intervals and to determine the confidence of the model in providing
a recommendation.
In the following we will briefly introduce some paradigmatic probabilistic ap-
proaches to recommendation, and discuss how these probabilistic model can be
used for item ranking, which is then employed to produce the top-N recommen-
dation list. The underlying idea of probabilistic models based on latent factors
is that each preference observation u, i is generated by one of k possible states,
which informally model the underlying reason why u has chosen/rated i. Based
on the mathematical model, two different inferences can be then supported to
be exploited in item ranking, where the main difference [9] lies in a different
way of modeling data according to the underlying model:
– Forced Prediction: the model provides estimate of P (r|u, i), which represents
the conditional probability that user u assign a rating value r given the item
i;
– Free prediction: the item selection process is included in the model, which is
typically based on the estimate of P (r, i|u). In this case we are interested in
predicting both the item selection and the preference of the user for each se-
lected item. P (r, i|u) can be factorized as P (r|i, u)P (i|u); the resulting model
still includes a component of forced prediction which however is weighted by
the item selection component and thus allows a more precise estimate of
user’s preferences.

3.1 Modeling Preference Data


In the simplest model, we assume that a user u is associated with a latent
factor Z, and ratings for an item i are generated according to this factor. The
generative model for this mixture is given in Fig. 1(a). The θ parameter here is
the prior probability distribution P(Z), whereas β_{i,z} is the prior for the rating
generation P(R = r|i, z). We shall refer to the Multinomial Mixture Model
(MMM, [14]) to denote that β_{i,z} is a multinomial over V. Forced prediction can
be achieved by

P(r|i, u) = \sum_z β_{z,i,r} P(z|u)   (3)

where
P(z|u) ∝ P(u_{obs}|z) θ_z
and u_{obs} represents the observed values (u, i, r) in R.
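
In matrix form, Eq. (3) is a single contraction over the latent states; below is a minimal illustrative sketch with assumed shapes and names.

import numpy as np

rng = np.random.default_rng(0)
k_states, n_items, n_ratings = 4, 100, 5
beta = rng.dirichlet(np.ones(n_ratings), size=(k_states, n_items))  # beta[z,i,r] = P(R=r | i, z)
p_z_given_u = rng.dirichlet(np.ones(k_states))                      # P(z | u), from inference

# Eq. (3): P(r | i, u) = sum_z beta[z,i,r] P(z|u), for all items at once.
p_r_given_iu = np.einsum('zir,z->ir', beta, p_z_given_u)
assert np.allclose(p_r_given_iu.sum(axis=1), 1.0)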
The probabilistic Latent Semantic Analysis approach (PLSA, [9]) spec-
ifies a co-occurence data model in which the user u and item i are conditionally
independent given the state Z of the latent factor. Differently from the previous
mixture model, where a single latent factor is associated with every user u, the
PLSA model associates a latent variable with every observation triplet (u, i, r).
Hence, different ratings of the same user can be explained by different latent

[Figure 1: plate diagrams of the generative models: (a) Mixture Model, (b) User Community Model, (c) pLSA, (d) Aspect Model, (e) LDA, (f) URP, (g) Probabilistic Matrix Factorization]

Fig. 1. Generative models for preference data



causes in PLSA (modeled as priors {θu }1,...,m in Fig. 1(c)), whereas a mixture
model assumes that all ratings involving the same user are linked to the same
underlying community. PLSA directly supports item selection:

P(i|u) = \sum_z φ_{z,i} θ_{u,z}   (4)

where φz represents a multinomial distribution over items. The main drawback


of the PLSA approach is that it cannot directly model new users, because the
parameters θu,z = P (z|u) are specified only for those users in the training set.
We consider two further variants for the PLSA, where explicit preferences are
modeled by an underlying distribution βz,i . In the Aspect Model (AM, [10])
β_{z,i} is a multinomial over V. In this case, the rating probability can be modeled
as

P(r|u, i) = \sum_z β_{r,i,z} θ_{u,z}   (5)

Conversely, the Gaussian Mixture Model (G-PLSA, [8]) models β_{z,i} = (μ_{iz}, σ_{iz})
as a Gaussian distribution, and provides a normalization of ratings through
the user’s mean and variance, thus allowing to model users with different rating
patterns. The corresponding rating probability is

P(r|u, i) = \sum_z N(r; μ_{iz}, σ_{iz}) θ_{u,z}   (6)

The Latent Dirichlet Allocation [3] is designed to overcome the main draw-
back in the PLSA-based models, by introducing Dirichlet priors, which provide
a full generative semantics at user level and avoid overfitting. Again, two differ-
ent formulations are available, based on whether we are interested in modeling
implicit (LDA) or explicit (User Rating Profile, URP [13]) preference values.
In the first case, we have:

P(i|u) = \int \sum_z φ_{z,i} θ_z P(θ|u_{obs}) dθ   (7)

(where P(θ|u_{obs}) is estimated in the inference phase). Analogously, for the URP
we have

P(r|u, i) = \int \sum_z β_{z,i,r} θ_z P(θ|u_{obs}) dθ   (8)

The User Communities Model (UCM, [2]) adopts the same inference for-
mula Eq. 3 of the multinomial model. Nevertheless, it introduces some key fea-
tures, that combine the advantages of both the AM and the MMM, as shown
in Fig. 1(b). First, the exploitation of a unique prior distribution θ over the
user communities helps in preventing overfitting. Second, it adds flexibility in the
prediction by modeling an item as an observed (and hence randomly generated)
component. UCM directly supports a free-prediction approach.

Finally, the Probabilistic Matrix Factorization approach (PMF, [18]) re-


formulates the rating assignment as a matrix factorization. Given the latent
user and item K-dimensional feature vectors γ_u and δ_i (where K denotes the number of
features employed in the factorization), the preference value is generated by
assuming a Gaussian distribution over rating values conditioned on the interac-
tions between the user and the considered item in the latent space, as shown in
Fig. 1(g). In practice, P(r|u, i) is modeled as a Gaussian distribution, with mean
γ_u^\top δ_i and fixed variance σ^2:

P(r|u, i) = N(r; γ_u^\top δ_i, σ^2)   (9)

Both the original approach and its bayesian generalizations [17,20] are charac-
terized by high prediction accuracy.

3.2 Item Ranking


In this section we discuss how the above described models can be used to provide
the ranking pui for a given user u and an item i in the protocol described in Sec. 2.

Predicted Rating. The most intuitive way to provide item ranking in the recom-
mendation process relies on the analysis of the distribution over preference values
P (r|u, i) (assuming that we are modeling explicit preference data). Given this
distribution, there are several methods for computing the ranking for each pair
u, i; the most commonly used is the expected value E[R|u, i], as it minimizes
the MSE and thus the RMSE:

pui = E[R|u, i] (10)

We will show in Sec. 4 that this approach fails to provide accurate recommendations and discuss potential causes.

Item Selection. For co-occurrence preference approaches, the rank of each item i with regard to the user u can be computed as the mixture:

pui = P(i|u) = Σz P(z|u) P(i|z)    (11)

where P(i|z) is the probability that i will be selected by users represented by the abstract pattern z. This distribution is a key feature of co-occurrence preference approaches and models based on free-prediction. When P(i|z) is not directly inferred by the model, we can still estimate it by averaging over all the users who selected i:

P(i|z) ∝ Σu δT(u,i) P(z|u)

where δT(u,i) = 1 if Tui ≠ 0.

Item Selection And Relevance. In order to force the selection process to concentrate on relevant items, we can extend the ranking discussed above by including a component that represents the "predicted" relevance of an item with respect to a given user:

pui = P(i, r > rT | u) = P(i|u) P(r > rT | u, i) = Σz P(z|u) P(i|z) P(r > rT | i, z)    (12)

where P(r > rT | i, z) = Σ_{r>rT} P(r|i, z). In practice, an item is ranked on the basis of its score, giving high priority to high-score items.
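To make the three ranking functions concrete, the following sketch (in Python with NumPy, our own illustration rather than the authors' code) computes pui from the posterior quantities a fitted model would expose; all array names and shapes are assumptions of this example.

import numpy as np

# Assumed model outputs for one user u (illustrative shapes):
# p_z_given_u : (K,)      P(z|u), mixture over latent factors
# p_i_given_z : (K, n)    P(i|z), item selection per factor
# p_r_given_iz: (K, n, V) P(r|i,z), rating distribution per factor and item
# ratings     : (V,)      the rating scale, e.g. np.arange(1, 6)

def rank_by_expected_rating(p_z_given_u, p_r_given_iz, ratings):
    # P(r|u,i) = sum_z P(r|i,z) P(z|u), then pui = E[R|u,i] as in Eq. (10)
    p_r_given_ui = np.einsum('k,knv->nv', p_z_given_u, p_r_given_iz)
    return p_r_given_ui @ ratings

def rank_by_item_selection(p_z_given_u, p_i_given_z):
    # pui = P(i|u) = sum_z P(z|u) P(i|z) as in Eq. (11)
    return p_z_given_u @ p_i_given_z

def rank_by_selection_and_relevance(p_z_given_u, p_i_given_z, p_r_given_iz,
                                    ratings, r_threshold):
    # pui = sum_z P(z|u) P(i|z) P(r > r_T | i, z) as in Eq. (12)
    p_rel = p_r_given_iz[:, :, ratings > r_threshold].sum(axis=2)
    return np.einsum('k,kn,kn->n', p_z_given_u, p_i_given_z, p_rel)

def estimate_p_i_given_z(T, p_z_given_users):
    # Fallback when P(i|z) is not part of the model:
    # P(i|z) proportional to sum_u delta_T(u,i) P(z|u)
    delta = (T != 0).astype(float)        # (m, n) observed-rating indicator
    unnorm = p_z_given_users.T @ delta    # (K, n)
    return unnorm / unnorm.sum(axis=1, keepdims=True)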

4 Evaluation
In this section we evaluate the testing protocols presented in Sec. 2 on the probabilistic approaches defined in the previous section. We use the MovieLens-1M¹ dataset, which consists of 1,000,209 ratings given by 6,040 users on approximately 3,706 movies, with a sparseness coefficient of 96% and an average of 132 ratings per user and 216 per item. In the evaluation phase, we adopt a Monte Carlo 5-fold validation, where each fold contains about 80% of the overall ratings and the remaining data (20%) is used as the test set. The final results are reported by averaging the values achieved in each fold.
In order to make our results comparable with the ones reported in [4], we con-
sider Top-Pop and Item-Avg algorithms as baseline, and Pure-SVD as a main
competitor. Notice that there are some differences between our evaluation and
the one performed in the above cited study, namely: (i) we decided to employ
bigger test-sets (20% of the overall data vs 1.4%) and to cross-validate the re-
sults; (ii) for lack of space we concentrate on MovieLens only, and omit further
evaluations on the Netflix data (which however, in the original paper [4], confirm
Pure-SVD as the top-performer); (iii) we decided to omit the “long tail” test,
aimed at evaluating the capability of suggesting non-trivial items, as it is out of
the scope of this paper.²
In the following we study the effects of the ranking function on the accuracy of the recommendation list. The results we report are obtained by varying the length of the recommendation list in the range 1–20, while the dimension of the random sample is fixed to D = 1000. In a preliminary test, we found the optimal number of components for Pure-SVD to be 50.

4.1 Predicted Rating


We start our analysis from the evaluation of the recommendation accuracy
achieved by approaches that model explicit preference data, namely PMF, MMM,
¹ http://www.grouplens.org/system/files/ml-data-10M100K.tar.gz
² Notice, however, that it is still possible to perform an indirect measurement of the non-triviality and correctness of the discussed approaches by measuring the gain in recommendation accuracy wrt. the Top-Pop recommendation algorithm.

URP, UCM and G-PLSA, where the predicted rating is employed as ranking
function. First of all, the following table summarizes the RMSE obtained by
these approaches:
Approach RMSE #Latent Factors
Item Avg 0.9784 -
MMM 1.0000 20
G-PLSA 0.9238 70
UCM 0.9824 10
URP 0.8989 10
PMF 0.8719 30

The results for Recall and Precision are given in Fig. 2, where the respective number of latent factors is given in brackets. Considering user satisfaction, almost all the probabilistic approaches fall between the two baselines. Pure-SVD significantly outperforms the best probabilistic performers, namely URP and PMF. The trend for probabilistic approaches does not change when considering Recall and Precision, but in this case not even Pure-SVD is able to outperform Top-Pop, which exhibits a consistent gain over all the considered competitors.
A first summary can be obtained as follows. First, we can confirm that there is no monotonic relationship between RMSE and recommendation accuracy. All
Fig. 2. Recommendation accuracy achieved by probabilistic approaches considering E[r|u, i] as ranking function: (a) US-Recall; (b) US-Precision; (c) Recall; (d) Precision.

the approaches tend to have a non-deterministic behavior, and even the best approaches provide unstable results depending on the size N. Further, ranking by the expected value exhibits unacceptable performance on the probabilistic approaches, which prove totally inadequate from this perspective. More generally, any variant of this approach (which we do not report here for space limitations) does not substantially change the results.

4.2 Item Selection and Relevance

Things radically change when item occurrence is taken into consideration. Fig. 3 shows the recommendation accuracy achieved by probabilistic models which employ Item-Selection (LDA, PLSA, UCM and URP) and Item-Selection&Relevance (UCM and URP). The LDA approach significantly outperforms all the available approaches. Surprisingly, UCM is the runner-up, as opposed to the behavior exhibited with the expected value ranking. It is clear that the component P(i|z) plays a crucial role here, which is further strengthened by the relevance ranking component.
Also surprising is the behavior of URP, which still achieves a satisfactory performance compared to Pure-SVD. However, it does not compare to LDA. The reason can be found in the fact that the inference procedure in the LDA directly
Fig. 3. Recommendation accuracy achieved by probabilistic approaches considering P(i|u) or P(i, r > 3|u) as ranking functions: (a) US-Recall; (b) US-Precision; (c) Recall; (d) Precision.

Fig. 4. Recommendation accuracy achieved by probabilistic approaches considering K = 20 and varying the dimension of the random sample: (a) US-Recall; (b) US-Precision.

estimates P(i|z), whereas such a component in the URP model is approximated a posteriori. This is also proved by the unsatisfactory performance of the MMM approach, which falls short of expectations. Since the UCM is an extension of the MMM, it is clear that explicitly inferring the φ component in the model helps in achieving a stronger accuracy.
The PLSA model also seems to suffer from overfitting issues, as it is not able to reach the performance of Pure-SVD. On the other hand, if user satisfaction is not taken into account, PLSA outperforms Pure-SVD, as it follows the general trend of the Top-Pop model. More generally, models equipped with Item-Selection&Relevance outperform their respective versions which make recommendations based only on the Item-Selection component.
We also performed an additional test to evaluate the impact of the size of the random sample in the testing methodology employed to measure user satisfaction. Results achieved by LDA, Pure-SVD, and UCM/URP (Selection&Relevance Ranking) are given in Fig. 4. Probabilistic approaches systematically outperform Pure-SVD for each value of D.

4.3 Discussion
There are two main considerations arising from the above figures. One is that rating prediction fails to provide accurate recommendations. The second observation is the unexpectedly strong impact of the item selection component, when properly estimated.
In an attempt to carefully analyze the rating prediction pitfalls, we plot in Fig. 5(a) the contribution to the RMSE of each single evaluation in V by the probabilistic techniques under consideration. Item-Avg acts as baseline here. While predictions are accurate for values 3–4, they are rather inadequate for border values, namely 1, 2 and 5. This is mainly due to the nature of RMSE, which penalizes larger errors. This clearly supports the thesis that low RMSE

Fig. 5. Analysis of prediction accuracy: (a) RMSE by rating value; (b) distribution of rating values in MovieLens.

Fig. 6. (a) Comparison of different ranking functions on UCM(10) (US-Recall); (b) probabilistic ensemble (US-Recall).

does not necessarily induce good accuracy, as the latter is mainly influenced by the items in class 5 (where the approaches are more prone to fail). It is clear that a better tuning of the ranking function should take this component into account.
Also, by looking at the distribution of the rating values, we can see that the dataset is biased towards the mean values and, more generally, the low rating values represent a lower percentage. This explains, on one side, the tendency of the expected value to flatten towards a mean value (and hence to fail in providing an accurate prediction). On the other side, the lack of low-rating values suggests an interpretation of the dataset as a Like/DisLike matrix, for which the item selection tuning provides a better modeling.
Notably, the rating information, combined with item selection, provides a marginal improvement, as shown in Fig. 6(a). Here, a closer look at the UCM approach is taken, by plotting three curves relative to the three different approaches to item ranking. Large recommendation lists tend to be affected by the rating prediction.

Our experiments have shown that the item selection component plays the most important role in recommendation ranking. However, better results can be achieved by also considering a rating prediction component. To empirically prove the effectiveness of such an approach, we performed a final test in which item ranking is performed by employing an ensemble approach based on item selection and relevance ranking. In this case, the components of the ranking come from different models: the selection probability is computed according to an LDA model, while the relevance ranking is computed by employing the URP model. Fig. 6(b) shows that this approach outperforms LDA, achieving the best result in recommendation accuracy (due to the lack of space we show only the trend corresponding to US-Recall).
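For illustration, the ensemble score can be sketched as follows, assuming the LDA model exposes P(i|u) and the URP model exposes the distributions P(r|u,i); the array names are ours, not part of either model's standard interface.

import numpy as np

def ensemble_rank(p_i_given_u_lda, p_r_given_ui_urp, ratings, r_threshold=3):
    # pui = P_LDA(i|u) * P_URP(r > r_T | u, i): selection probability from
    # the LDA model, relevance from the URP model, as described above
    relevance = p_r_given_ui_urp[:, ratings > r_threshold].sum(axis=1)
    return p_i_given_u_lda * relevance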

5 Conclusion and Future Work

We have shown that probabilistic models, equipped with the proper ranking function, exhibit competitive advantages over state-of-the-art RS in terms of recommendation accuracy. In particular, we have shown that strategies based on item selection guarantee significant improvements, and we have investigated the motivations behind the failure of prediction-based approaches. The advantage of probabilistic models lies in their flexibility, as they allow switching between both methods in the same inference framework. The nonmonotonic behavior of RMSE also finds its explanation in the distribution of errors along the rating values, thus suggesting different strategies for prediction-based recommendation.
Besides those mentioned above, there are other significant advantages in the adoption of probabilistic models for recommendation. Recent studies pointed out that there is more to recommendation than just rating prediction. A successful recommendation should answer the simple question 'What is the user actually looking for?', which is strictly tied to dynamic user profiling. Moreover, prediction-based recommender systems do not consider one of the most important applications from the retailer point of view: suggesting to users products they would not otherwise have discovered.
In [15] the authors argued that the popular testing methodology based on prediction accuracy is rather inadequate and does not capture important aspects of the recommendations, such as non-triviality, serendipity, and user needs and expectations, and their studies have scaled down the usefulness of achieving a lower RMSE [12]. In short, the evaluation of a recommender cannot rely exclusively on prediction accuracy but must take into account what is really displayed to the user, i.e., the recommendation list, and its impact on his/her navigation.
Clearly, probabilistic graphical models, like the ones discussed in this paper, provide several components which can be fruitfully exploited for the estimation of such measures. Latent factors, the probability of item selection and the rating probability can help to better specify usefulness in recommendation. We plan to extend the framework of this paper in these promising directions, by providing subjective measures for such features and measuring the impact of such models.

References
1. Agarwal, D., Chen, B.-C.: flda: matrix factorization through latent dirichlet allo-
cation. In: WSDM, pp. 91–100 (2010)
2. Barbieri, N., Guarascio, M., Manco, G.: A probabilistic hierarchical approach for
pattern discovery in collaborative filtering data. In: SMD (2011)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. The Journal of
Machine Learning Research 3, 993–1022 (2003)
4. Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on
top-n recommendation tasks. In: ACM RecSys, pp. 39–46 (2010)
5. Cremonesi, P., Turrin, R., Lentini, E., Matteucci, M.: An evaluation methodology
for collaborative recommender systems. In: AXMEDIS, pp. 224–231 (2008)
6. Ge, M., Delgado-Battenfeld, C., Jannach, D.: Beyond accuracy: evaluating recom-
mender systems by coverage and serendipity. In: ACM RecSys, pp. 257–260 (2010)
7. Goldberg, D., Nichols, D., Oki, B.M., Terry, D.: Using collaborative filtering to
weave an information tapestry. Communications of the ACM 35(12), 61–70 (1992)
8. Hofmann, T.: Collaborative filtering via gaussian probabilistic latent semantic anal-
ysis. In: SIGIR (2003)
9. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Transactions
on Information Systems (TOIS) 22(1), 89–115 (2004)
10. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. IJCAI,
688–693 (1999)
11. Jin, X., Zhou, Y., Mobasher, B.: A maximum entropy web recommendation system:
combining collaborative and content features. In: KDD, pp. 612–617 (2005)
12. Koren, Y.: How useful is a lower rmse? (2007),
http://www.netflixprize.com/community/viewtopic.php?id=828
13. Marlin, B.: Modeling user rating profiles for collaborative filtering. In: NIPS (2003)
14. Marlin, B.: Collaborative filtering: A machine learning perspective. Tech. rep., Department of Computer Science, University of Toronto (2004)
15. McNee, S.M., Riedl, J., Konstan, J.A.: Being accurate is not enough: How accuracy
metrics have hurt recommender systems. In: ACM SIGCHI Conference on Human
Factors in Computing Systems, pp. 1097–1101 (2006)
16. Pazzani, M.J., Billsus, D.: Content-based recommendation systems. The Adaptive
Web: Methods and Strategies of Web Personalization, 325–341 (2007)
17. Salakhutdinov, R., Mnih, A.: Bayesian probabilistic matrix factorization using
markov chain monte carlo. In: ICML, pp. 880–887 (2008)
18. Salakhutdinov, R., Mnih, A.: Probabilistic matrix factorization. In: NIPS, pp.
1257–1264 (2008)
19. Sarwar, B., Karypis, G., Konstan, J., Reidl, J.: Item-based collaborative filtering
recommendation algorithms. In: WWW, pp. 285–295 (2001)
20. Shan, H., Banerjee, A.: Generalized probabilistic matrix factorizations for collab-
orative filtering. In: ICDM (2010)
21. Stern, D.H., Herbrich, R., Graepel, T.: Matchbox: large scale online bayesian rec-
ommendations. In: WWW, pp. 111–120 (2009)
Learning Good Edit Similarities with
Generalization Guarantees

Aurélien Bellet¹, Amaury Habrard², and Marc Sebban¹

¹ Laboratoire Hubert Curien UMR CNRS 5516,
University of Jean Monnet, 42000 Saint-Etienne Cedex 2, France
{aurelien.bellet,marc.sebban}@univ-st-etienne.fr
² Laboratoire d'Informatique Fondamentale UMR CNRS 6166,
University of Aix-Marseille, 13453 Marseille Cedex 13, France
amaury.habrard@lif.univ-mrs.fr

Abstract. Similarity and distance functions are essential to many learning algorithms, thus training them has attracted a lot of interest. When it comes to dealing with structured data (e.g., strings or trees), edit similarities are widely used, and there exist a few methods for learning them. However, these methods offer no theoretical guarantee as to the generalization performance and discriminative power of the resulting similarities. Recently, a theory of learning with (ε, γ, τ)-good similarity functions was proposed. This new theory bridges the gap between the properties of a similarity function and its performance in classification. In this paper, we propose a novel edit similarity learning approach (GESL) driven by the idea of (ε, γ, τ)-goodness, which allows us to derive generalization guarantees using the notion of uniform stability. We experimentally show that edit similarities learned with our method induce classification models that are both more accurate and sparser than those induced by the edit distance or edit similarities learned with a state-of-the-art method.

Keywords: Edit Similarity Learning, Good Similarity Functions.
1 Introduction
Similarity and distance functions between objects play an important role in
many supervised and unsupervised learning methods, among which the popular
k-nearest neighbors, k-means and support vector machines. For this reason, a lot
of research has gone into automatically learning similarity or distance functions
from data, which is usually referred to as metric learning. When data consists
in numerical vectors, a common approach is to learn the parameters (i.e., the
transformation matrix) of a Mahalanobis distance [1–4].
Because they involve more complex procedures, less work has been devoted to learning such functions from structured data (for example strings or trees). Still, there exist a few methods for learning edit distance-based functions. Roughly

* We would like to acknowledge support from the ANR LAMPADA 09-EMER-007-02 project and the PASCAL 2 Network of Excellence.


speaking, the edit distance between two objects is the cost of the best sequence of
operations (insertion, deletion, substitution) required to transform an object into
another, where an edit cost is assigned to each possible operation. Most general-
purpose methods for learning the edit cost matrix maximize the likelihood of the
data using EM-based iterative methods [5–9], which can imply a costly learning
phase. Saigo et al. [10] manage to avoid this drawback in the context of remote
homologies detection in protein sequences by applying gradient descent to a
specific objective function. Some of the above methods do not guarantee to
find the optimal parameters and/or are only based on a training set of positive
pairs: they do not take advantage of pairs of examples that have different labels.
Above all, none of these methods offer theoretical guarantees that the learned
edit functions will generalize well to unseen examples (while it is the case for
some Mahalanobis distance learning methods [4]) and lead to good performance
for the classification or clustering task at hand.
Recently, Balcan et al. [11, 12] introduced a theory of learning with so-called (ε, γ, τ)-good similarity functions that gives intuitive, sufficient conditions for a similarity function to allow one to learn well. Essentially, a similarity function K is (ε, γ, τ)-good if an ε proportion of examples are on average 2γ more similar to reasonable examples of the same class than to reasonable examples of the opposite class, where a τ proportion of examples must be reasonable. K does not have to be a metric nor positive semi-definite (PSD). They show that if K is (ε, γ, τ)-good, then it can be used to build a linear separator in an explicit projection space that has margin γ and error arbitrarily close to ε. This separator can be learned efficiently using a linear program and is supposedly sparse.
In this article, we propose a novel edit similarity learning procedure driven by the notion of good similarity function. Our approach (GESL, for Good Edit Similarity Learning) is formulated as an efficient convex programming approach allowing us to learn the edit costs so as to optimize the (ε, γ, τ)-goodness of the resulting similarity function. We provide a bound based on the notion of
uniform stability [13] that guarantees that our learned similarity will generalize
well and induce low-error classifiers. This bound is independent of the size of the
alphabet, making GESL suitable for handling problems with large alphabet. To
the best of our knowledge, this work is the first attempt to establish a theoretical
relationship between a learned edit similarity function and its generalization and
discriminative power. We show in a comparative experimental study that GESL
has fast convergence and leads to more accurate and sparser classifiers than other
edit similarities.
This paper is organized as follows. In Section 2, we introduce a few notations,
and review the theory of Balcan et al. as well as some prior work on edit simi-
larity learning. In Section 3, which is the core of this paper, we present GESL,
our approach to learning good edit similarities. We then propose a theoretical
analysis of GESL based on uniform stability, leading to the derivation of a gen-
eralization bound. An experimental evaluation of our approach is provided in
Section 4. Finally, we conclude this work by outlining promising lines of research
on similarity learning.

2 Notations and Related Work

We consider the following binary classification problem: we are given some labeled examples (x, ℓ) drawn from an unknown distribution P over X × {−1, 1}, where X is the instance space. We want to learn a classifier h : X → {−1, 1} whose error rate is as low as possible using pairwise similarities according to a similarity function K : X × X → [−1, 1]. We say that K is symmetric if for all x, x′ ∈ X, K(x, x′) = K(x′, x). K is a valid (or Mercer) kernel if it is symmetric and PSD.

2.1 Learning with Good Similarity Functions

In recent work, Balcan et al. [11, 12] introduced a new theory of learning with good similarity functions. Their motivation was to overcome two major limitations of kernel theory. First, a good kernel is essentially a good similarity function, but the theory talks in terms of margin in an implicit, possibly unknown projection space, which can be a problem for intuition and design. Second, the PSD and symmetry requirement often rules out natural similarity functions for the problem at hand. As a consequence, Balcan et al. proposed the following definition of good similarity function.

Definition 1 (Balcan et al. [12]). A similarity function K is an (ε, γ, τ)-good similarity function in hinge loss for a learning problem P if there exists a (random) indicator function R(x) defining a (probabilistic) set of "reasonable points" such that the following conditions hold:

1. E(x,ℓ)∼P [ [1 − ℓ g(x)/γ]+ ] ≤ ε, where g(x) = E(x′,ℓ′)∼P [ ℓ′ K(x, x′) | R(x′) ] and [1 − c]+ = max(0, 1 − c) is the hinge loss,
2. Prx′ [R(x′)] ≥ τ.

Thinking of this definition in terms of number of margin violations, we can interpret the first condition as "an ε proportion of examples x are on average 2γ more similar to random reasonable examples of the same class than to random reasonable examples of the opposite class" and the second condition as "at least a τ proportion of the examples should be reasonable". Note that other definitions are possible, like those proposed in [14] for unbounded dissimilarity functions. Yet Definition 1 is very interesting in two respects. First, it includes all good kernels as well as some non-PSD similarity functions. In that sense, this is a strict generalization of the notion of good kernel [12]. Second, these conditions are sufficient to learn well, i.e., to induce a linear separator α in an explicit space that has low error relative to L1-margin γ. This is formalized in Theorem 1.

Theorem 1 (Balcan et al. [12]). Let K be an (ε, γ, τ)-good similarity function in hinge loss for a learning problem P. For any ε1 > 0 and 0 ≤ δ ≤ γε1/4, let S = {x′1, . . . , x′d} be a (potentially unlabeled) sample of

d = (2/τ)(log(2/δ) + 16 log(2/δ)/(ε1 γ)²)

landmarks drawn from P. Consider the mapping φS : X → Rd defined as follows: φSi(x) = K(x, x′i), i ∈ {1, . . . , d}. Then, with probability at least 1 − δ over the random sample S, the induced distribution φS(P) in Rd has a linear separator α of error at most ε + ε1 at margin γ.
Therefore, if we are given an (ε, γ, τ)-good similarity function for a learning problem P and enough (unlabeled) landmark examples, then with high probability there exists a low-error linear separator α in the explicit "φ-space", which is essentially the space of the similarities to the d landmarks. As Balcan et al. mention, using du unlabeled examples and dl labeled examples, we can efficiently find this separator α ∈ Rdu by solving the following linear program (LP):¹

min_α Σ_{i=1..dl} [1 − Σ_{j=1..du} αj ℓi K(xi, x′j)]+ + λ‖α‖₁    (1)

Note that Problem (1) is essentially a 1-norm SVM problem [15] with an empirical similarity map [11], and can be efficiently solved. The L1-regularization induces sparsity in α: it allows us to automatically select useful landmarks (the reasonable points), ignoring the others, whose corresponding coordinates in α will be set to zero during learning. We can also control the sparsity of the solution directly: the larger λ, the sparser α. Therefore, one does not need to know in advance the set of reasonable points R; it is automatically worked out while learning α.
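As an illustration, once the similarities to the du landmarks are computed, Problem (1) can be solved with an off-the-shelf LP solver; the sketch below (ours, assuming scipy is available) splits α into nonnegative parts and adds one slack variable per hinge term, so that at the optimum the slacks equal the hinge losses and the LP objective coincides with (1).

import numpy as np
from scipy.optimize import linprog

def learn_alpha(K_mat, labels, lam):
    # K_mat : (d_l, d_u) similarities K(x_i, x'_j) to the landmarks
    # labels: (d_l,) in {-1, +1};  lam: the larger, the sparser alpha
    dl, du = K_mat.shape
    # Variables: [alpha_plus (du), alpha_minus (du), slacks (dl)], all >= 0
    c = np.concatenate([lam * np.ones(2 * du), np.ones(dl)])
    # Hinge constraints: l_i sum_j alpha_j K(x_i, x'_j) + slack_i >= 1
    LK = labels[:, None] * K_mat
    A_ub = np.hstack([-LK, LK, -np.eye(dl)])
    b_ub = -np.ones(dl)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method='highs')
    v = res.x
    return v[:du] - v[du:2 * du]   # alpha = alpha_plus - alpha_minus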
Our objective in this paper is to make use of the theory of Balcan et al. to
efficiently learn (ε, γ, τ)-good edit similarities from data that will lead to effective
classifiers. In the next section, we review some past work on edit cost learning.

2.2 String Edit Similarity Learning

The classic edit distance, known as the Levenshtein distance, is defined as follows.

Definition 2. The Levenshtein distance eL(x, x′) between strings x = x1 . . . xt and x′ = x′1 . . . x′v is the minimum number of edit operations to change x into x′. The allowable operations are insertion, deletion and substitution of a symbol.

eL can be computed in O(|x| · |x′|) time using dynamic programming. Instead of only counting the minimum number of required operations, we can set a cost (or probability) for each edit operation. These parameters are usually represented as a positive cost matrix C of size (A + 1) × (A + 1), where A is the size of the alphabet x and x′ have been generated from (the additional row and column account for insertion and deletion costs respectively). Ci,j gives the cost of the operation changing the symbol ci into cj, with ci and cj taken from the alphabet extended with the empty symbol $. Given C, a generalized edit similarity eC can be defined as the cost of the sequence of operations of minimum cost. This sequence is called the optimal edit script.
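As an illustration of Definition 2, here is a toy dynamic-programming sketch (ours, in Python) that computes eL and recovers one optimal edit script by backtracking; it also records matches, under the assumption that a cost may later be attached to identity operations as well.

def levenshtein_script(x, y):
    # Levenshtein distance between x and y plus one optimal edit script as
    # (source, target) symbol pairs, '$' being the empty symbol.
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]  # D[i][j] = e_L(x[:i], y[:j])
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # (mis)match
                          D[i - 1][j] + 1,                           # deletion
                          D[i][j - 1] + 1)                           # insertion
    script, i, j = [], n, m
    while i > 0 or j > 0:  # backtrack one optimal path
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (x[i - 1] != y[j - 1]):
            script.append((x[i - 1], y[j - 1]))  # substitution or match
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            script.append((x[i - 1], '$'))       # deletion
            i -= 1
        else:
            script.append(('$', y[j - 1]))       # insertion
            j -= 1
    return D[n][m], list(reversed(script))

For instance, levenshtein_script("abc", "adc") returns (1, [('a','a'), ('b','d'), ('c','c')]).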
¹ The original formulation proposed in [12] was actually L1-constrained. We transformed it into an equivalent L1-regularized one.

Using a matrix C that is appropriately tuned to the considered task can lead to significant improvements in performance. For some applications, such
matrices may be available, like BLOSUM in the context of protein sequence
alignment [16]. However, in most domains it is not the case, and tuning the costs
is difficult. For this reason, methods for learning C from data have attracted a
lot of interest. Most general-purpose approaches take the form of probabilistic
models. Parameter estimation methods of edit transducers were used to infer
generative models [5, 6, 9], discriminative models [7] or tree edit models [8].
Note that the above approaches usually use an Expectation-Maximization
(EM)-based algorithm to estimate the parameters of probabilistic models. Be-
yond the fact that EM is not guaranteed to converge to a global optimum, it can
also cause two major drawbacks in the context of edit distance learning. First,
since EM is iterative, parameter estimation and distance calculations must be
performed several times until convergence, which can be expensive to compute,
especially when the size of the alphabet and/or the length of the strings are
large. Second, by maximizing the likelihood of the data, one only considers pairs
of strings of the same class (positive pairs) while it may be interesting to make
use of the information brought by pairs of strings that have a different label
(negative pairs). As a consequence, the above methods “move closer together”
examples of the same class, without trying to also “move away from each other”
examples of different classes. In [17], McCallum et al. consider discriminative con-
ditional random fields, dealing with positive and negative pairs in specific states,
but still using EM for parameter estimation. To overcome the drawback of itera-
tive approaches for the task of detecting remote homology in protein sequences,
Saigo et al. [10] optimize by gradient descent an objective function meant to fa-
vor the discrimination between positive and negative examples. But this is done
by only using positive pairs of distant homologs.
Despite their diversity, a common feature shared by all of the above ap-
proaches is that they do not optimize similarity functions to be (, γ, τ )-good
and thus do not take advantage of the theoretical results of Balcan et al.’s
framework. In other words, there is no theoretical guarantee that the learned
edit functions will work well for the classification or clustering task at hand. In
the next section, we propose a novel approach that bridges this gap.

3 Learning (ε, γ, τ)-Good Edit Similarity Functions

What makes the edit costs C hard and expensive to optimize is the fact that the
edit distance is based on an optimal script which depends on the edit costs them-
selves. This is the reason why, as we have seen earlier, iterative approaches are
very commonly used to learn C from data. In this section, we take a novel convex
programming approach based on the theory of Balcan et al. to learn (ε, γ, τ)-
good edit similarity functions from both positive and negative pairs without
requiring a costly iterative procedure. Moreover, this new framework allows us
to derive a generalization bound establishing the convergence of our method and
a relationship between the learned similarities and their (, γ, τ )-goodness.
Learning Good Edit Similarities with Generalization Guarantees 193

3.1 An Exponential-Based Edit Similarity Function

Let #(x, x′) be a (A + 1) × (A + 1) matrix in which each component #i,j(x, x′) is the number of times the edit operation (i, j) is used to turn x into x′ in the optimal Levenshtein script, 0 ≤ i, j ≤ A. We define the following edit function:

eG(x, x′) = Σ_{0≤i,j≤A} Ci,j #i,j(x, x′).

Note that to compute eG, we do not extract the optimal script with respect to C: we use the Levenshtein script² and apply custom costs C to it. Therefore, since the edit script defined by #(x, x′) is fixed, eG(x, x′) is nothing more than a linear function of the edit costs and can be optimized directly.
Recall that in the framework of Balcan et al., a similarity function must be in [−1, 1]. To respect this requirement, we define our similarity function to be:

KG(x, x′) = 2 e^{−eG(x,x′)} − 1.

The motivation for this exponential form is related to the one for using exponential kernels in SVM classifiers: it can be seen as a way to introduce nonlinearity to further separate examples of opposite classes while moving closer those of the same class. Note that KG may not be PSD nor symmetric. However, as we have seen earlier, Balcan et al.'s theory does not require these properties, unlike SVM.
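The following sketch turns these definitions into code: it builds the operation-count matrix #(x, x′) from the fixed Levenshtein script (reusing the levenshtein_script sketch of Section 2.2) and evaluates eG and KG for a given cost matrix C; the symbol-to-index mapping is an assumption of this illustration.

import math
import numpy as np

def script_counts(x, y, alphabet):
    # Build #(x, y): entry (i, j) counts how often operation (i, j) occurs
    # in the Levenshtein script; index 0 stands for the empty symbol '$'.
    idx = {'$': 0}
    idx.update({c: k + 1 for k, c in enumerate(sorted(alphabet))})
    counts = np.zeros((len(alphabet) + 1, len(alphabet) + 1))
    _, script = levenshtein_script(x, y)   # sketch from Section 2.2
    for a, b in script:
        counts[idx[a], idx[b]] += 1
    return counts

def e_G(C, x, y, alphabet):
    # e_G(x, x') = sum_{i,j} C_ij #_ij(x, x'): linear in C, script held fixed
    return float((C * script_counts(x, y, alphabet)).sum())

def K_G(C, x, y, alphabet):
    # K_G(x, x') = 2 exp(-e_G(x, x')) - 1, in [-1, 1] since C >= 0
    return 2.0 * math.exp(-e_G(C, x, y, alphabet)) - 1.0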

3.2 Learning the Edit Costs: Problem Formulation

We aim at learning an edit cost matrix C so as to optimize the (ε, γ, τ)-goodness of KG. It would be tempting to try to find a way to directly optimize Definition 1. Unfortunately, this is very difficult for two reasons. First, it would result in a nonconvex formulation (summing/subtracting up exponential terms). Second, we do not know the set R of reasonable points in advance (R is inferred when learning the classifier). Instead, we propose to optimize the following criterion:

E(x,ℓ) [ E(x′,ℓ′) [ [1 − ℓℓ′ KG(x, x′)/γ]+ | R(x′) ] ] ≤ ε′.    (2)

Criterion (2) bounds that of Definition 1 due to the convexity of the hinge loss. It is harder to satisfy since the "goodness" is required with respect to each reasonable point instead of considering the average similarity to these points. Clearly, if KG satisfies (2), then it is (ε, γ, τ)-good with ε ≤ ε′.
Let us consider a training sample of NT labeled points T = {zi = (xi, ℓi)}_{i=1..NT} and a sample of NL landmark examples SL = {z′j = (x′j, ℓ′j)}_{j=1..NL}. Note that these examples must be labeled in order to allow us to move closer examples of the same class and to separate points of opposite classes. In practice, SL can be a subsample of the training sample T. Recall that the goodness of a similarity only

² In practice, one could use another type of script. We picked the Levenshtein script because it is a "reasonable" edit script, since it corresponds to a shortest script transforming x into x′.

relies on some relevant subset of examples: the reasonable points. Therefore, in the general case, a relevant strategy does not consist in optimizing the similarity with respect to all the landmarks, but rather to some particular ones allowing a high margin with low violation. In order to model this, we suppose the existence of an indicator matching function fland : T × SL → {0, 1} that associates to each element x ∈ T a non-empty set of landmark points. We say that x′ ∈ SL is a landmark point for x ∈ T if and only if fland(x, x′) = 1. We suppose that fland matches exactly NL landmark points to each x ∈ T. We will discuss in Section 3.4 how we can build fland.
Our formulation requires the goodness for each pair (xi, x′j) with fland(xi, x′j) = 1. Therefore, we want [1 − ℓiℓ′j KG(xi, x′j)/γ]+ = 0, hence ℓiℓ′j KG(xi, x′j) ≥ γ. A benefit from using this constraint is that it can easily be turned into an equivalent linear one, considering the following two cases.

1. If ℓi ≠ ℓ′j, we get:

−KG(xi, x′j) ≥ γ ⟺ e^{−eG(xi,x′j)} ≤ (1 − γ)/2 ⟺ eG(xi, x′j) ≥ −log((1 − γ)/2).

We can use a variable B1 ≥ 0 and write the constraint as eG(xi, x′j) ≥ B1, with the interpretation that B1 = −log((1 − γ)/2). In fact, B1 ≥ −log(1/2).

2. Likewise, if ℓi = ℓ′j, we get eG(xi, x′j) ≤ −log((1 + γ)/2). We can use a variable B2 ≥ 0 and write the constraint as eG(xi, x′j) ≤ B2, with the interpretation that B2 = −log((1 + γ)/2). In fact, B2 ∈ [0, −log(1/2)].

The optimization problem GESL can then be expressed as follows:

(GESL)  min_{C, B1, B2}  (1/(NT NL)) Σ_{1≤i≤NT, 1≤j≤NL, fland(xi,x′j)=1} V(C, zi, z′j) + β‖C‖²

s.t.  V(C, zi, z′j) = [B1 − eG(xi, x′j)]+ if ℓi ≠ ℓ′j
      V(C, zi, z′j) = [eG(xi, x′j) − B2]+ if ℓi = ℓ′j
      B1 ≥ −log(1/2),  0 ≤ B2 ≤ −log(1/2),  B1 − B2 = ηγ
      Ci,j ≥ 0, 0 ≤ i, j ≤ A,

where β ≥ 0 is a regularization parameter on edit costs, ‖·‖ denotes the Frobenius norm (which corresponds to the classical L2-norm when considering a matrix as a vector) and ηγ ≥ 0 a parameter corresponding to the desired "margin". The relationship between the margin γ and ηγ is given by γ = (e^{ηγ} − 1)/(e^{ηγ} + 1).
GESL is a convex program, thus we can efficiently find its global optimum. Using slack variables to express the hinge loss, it has O(NT NL + A²) variables and O(NT NL) constraints. Note that GESL is quite sparse: each constraint involves at most one string pair and a limited number of edit cost variables, making the problem faster to solve. It is also worth noting that our approach is very flexible. First, it is general enough to be used with any definition of eG that is based on an edit script (or even a convex combination of edit scripts). Second,

one can incorporate additional convex constraints, which offers the possibility of
including background knowledge or desired requirements on C (e.g., symmetry).
Lastly, it can be easily adapted to the multi-class case.
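To fix ideas, here is a sketch of GESL written with the cvxpy modeling library (an assumption of this illustration, not the authors' implementation); it takes the matched pairs already materialized as their count matrices #(xi, x′j) and the labels' agreement, together with given ηγ and β.

import numpy as np
import cvxpy as cp

def gesl(count_mats, same_label, eta_gamma, beta):
    # count_mats: list of #(x_i, x'_j) matrices, one per matched pair
    # same_label: list of booleans, True when l_i == l'_j
    A1 = count_mats[0].shape[0]               # alphabet size + 1
    C = cp.Variable((A1, A1), nonneg=True)    # edit cost matrix, C_ij >= 0
    B1, B2 = cp.Variable(), cp.Variable()
    losses = []
    for M, same in zip(count_mats, same_label):
        e_G = cp.sum(cp.multiply(C, M))       # linear in C (fixed script)
        losses.append(cp.pos(e_G - B2) if same else cp.pos(B1 - e_G))
    objective = sum(losses) / len(losses) + beta * cp.sum_squares(C)
    constraints = [B1 >= -np.log(0.5),
                   B2 >= 0, B2 <= -np.log(0.5),
                   B1 - B2 == eta_gamma]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return C.value, B1.value, B2.value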
In the next section, we derive a generalization bound guaranteeing not only
the convergence of our learning method but also the overall goodness of the
learned edit similarity function for the task at hand.

3.3 Theoretical Guarantees

The outline of this theoretical part is the following: considering that the pairs (zi, z′j) used to learn C in GESL are not i.i.d., the classic results of statistical learning theory do not directly hold. To derive a generalization bound, extending the ideas of [4, 13] to string edit similarity, we first prove that our learning method has a uniform stability. This is established in Theorem 2 using Lemmas 1 and 2. The stability property allows us to derive our generalization bound (Theorem 4) using the McDiarmid inequality (Theorem 3).
In the following, we suppose every string length bounded by a constant W > 0, which is not a strong restriction. This implies that for any string pair, ‖#(x1, x2)‖ ≤ W,³ since the Levenshtein script contains at most max(|x1|, |x2|) operations. We denote the objective function of GESL by:

FT(C) = (1/NT) Σ_{k=1..NT} (1/NL) Σ_{j=1..NL} V(C, zk, z′kj) + β‖C‖²,

where z′kj denotes the jth landmark associated to zk. The first term of FT(C), noted LT(C) in the following, is the empirical loss over the training sample T. Let us also define the loss over the true distribution P, called L(C), and the estimation error DT as follows:

L(C) = Ez,z′ [V(C, z, z′)] ;  DT = L(CT) − LT(CT),

where CT denotes the edit cost matrix learned by GESL from sample T. Our objective is to derive an upper bound on the generalization loss L(CT) with respect to the empirical loss LT(CT).
A learning algorithm is stable [13] when its output does not change significantly under a small modification of the learning sample. We consider the following definition of uniform stability, meaning that the replacement of one example must lead to a variation bounded in O(1/NT) in terms of infinite norm.

Definition 3 (Jin et al. [4], Bousquet and Elisseeff [13]). A learning algorithm has a uniform stability in κ/NT, where κ is a positive constant, if

∀(T, z), |T| = NT, ∀i,  sup_{z1,z2} |V(CT, z1, z2) − V(C_{T^{i,z}}, z1, z2)| ≤ κ/NT,

where T^{i,z} is the new set obtained by replacing zi ∈ T by a new example z.

³ Also denoted #(z1, z2) for the sake of convenience when using labeled strings.

To prove that GESL has the property of uniform stability, we need the following two lemmas (proven in Appendices A.1 and A.2).

Lemma 1. For any edit cost matrices C, C′ and any examples z, z′:

|V(C, z, z′) − V(C′, z, z′)| ≤ ‖C − C′‖ W.

Lemma 2. Let FT and F_{T^{i,z}} be the functions to optimize, CT and C_{T^{i,z}} their corresponding minimizers, and β the regularization parameter. Let ΔC = CT − C_{T^{i,z}}. For any t ∈ [0, 1]:

‖CT‖² − ‖CT − tΔC‖² + ‖C_{T^{i,z}}‖² − ‖C_{T^{i,z}} + tΔC‖² ≤ ((2NT + NL) t 2W / (β NT NL)) ‖ΔC‖.
Using Lemmas 1 and 2, we can now prove the stability of GESL.

Theorem 2. Let NT and NL be respectively the number of training examples and landmark points. Assuming that NL = αNT, α ∈ [0, 1], GESL has a uniform stability in κ/NT, where κ = 2(2 + α)W²/(βα).

Proof. Using t = 1/2 on the left-hand side of Lemma 2, we get

‖CT‖² − ‖CT − (1/2)ΔC‖² + ‖C_{T^{i,z}}‖² − ‖C_{T^{i,z}} + (1/2)ΔC‖² = (1/2)‖ΔC‖².

Then, applying Lemma 2, we get

(1/2)‖ΔC‖² ≤ ((2NT + NL)W / (βNT NL)) ‖ΔC‖  ⟹  ‖ΔC‖ ≤ 2(2NT + NL)W / (βNT NL).

Now, from Lemma 1, we have for any z, z′

|V(CT, z, z′) − V(C_{T^{i,z}}, z, z′)| ≤ ‖ΔC‖W ≤ 2(2NT + NL)W² / (βNT NL).

Replacing NL by αNT completes the proof. □
Now, using the property of stability, we can derive our generalization bound over L(CT). This is done by using the McDiarmid inequality [18].

Theorem 3 (McDiarmid inequality [18]). Let X1, . . . , Xn be n independent random variables taking values in X and let Z = f(X1, . . . , Xn). If for each 1 ≤ i ≤ n, there exists a constant ci such that

sup_{x1,...,xn,x′i ∈ X} |f(x1, . . . , xn) − f(x1, . . . , x′i, . . . , xn)| ≤ ci, ∀1 ≤ i ≤ n,

then for any ε > 0, Pr[|Z − E[Z]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1..n} ci²).

To derive our bound on L(CT), we just need to replace Z by DT in Theorem 3 and to bound ET[DT] and |DT − D_{T^{i,z}}|, which is shown by the following lemmas (proven in Appendices A.3 and A.5).

Lemma 3. For any learning method of estimation error DT and satisfying a uniform stability in κ/NT, we get ET[DT] ≤ 2κ/NT.

Lemma 4. For any edit cost matrix learned by GESL using NT training examples and NL landmarks, with Bγ = max(ηγ, −log(1/2)), we have the following bound:

∀i, 1 ≤ i ≤ NT, ∀z,  |DT − D_{T^{i,z}}| ≤ 2κ/NT + ((2NT + NL)(2W/√(βBγ) + 3)Bγ) / (NT NL).
We are now able to derive our generalization bound over L(CT).

Theorem 4. Let T be a sample of NT randomly selected training examples and let CT be the edit costs learned by GESL with stability κ/NT using NL = αNT landmark points. With probability 1 − δ, we have the following bound for L(CT):

L(CT) ≤ LT(CT) + 2κ/NT + (2κ + ((2 + α)/α)(2W/√(βBγ) + 3)Bγ) √(ln(2/δ)/(2NT))

with κ = 2(2 + α)W²/(αβ) and Bγ = max(ηγ, −log(1/2)).

Proof. Recall that DT = L(CT) − LT(CT). From Lemma 4, we get

|DT − D_{T^{i,z}}| ≤ sup_{T,z} |DT − D_{T^{i,z}}| ≤ (2κ + B)/NT  with  B = ((2 + α)/α)(2W/√(βBγ) + 3)Bγ.

Then by applying the McDiarmid inequality, we have

Pr[|DT − ET[DT]| ≥ ε] ≤ 2 exp(−2ε² / (NT ((2κ + B)/NT)²)) = 2 exp(−2ε²NT / (2κ + B)²).    (3)

By fixing δ = 2 exp(−2ε²NT / (2κ + B)²), we get ε = (2κ + B)√(ln(2/δ)/(2NT)). Finally, from (3), Lemma 3 and the definition of DT, we have with probability at least 1 − δ:

DT < ET[DT] + ε  ⟹  L(CT) < LT(CT) + 2κ/NT + (2κ + B)√(ln(2/δ)/(2NT)). □

This bound outlines three important features of our approach. First, it has a convergence in O(1/√NT), which is classical with the notion of uniform stability. Second, this rate of convergence is independent of the alphabet size, which means that our method should scale well to large alphabet problems. Lastly, thanks to the relation between the optimized criterion and Definition 1 that we established earlier, this bound also ensures the goodness in generalization of the learned similarity function. Therefore, by Theorem 1, it guarantees that the similarity will induce low-error classifiers for the classification task at hand.

3.4 Discussion on the Matching Function
The question of how one should define the matching function fland relates to the
open question of building the training pairs in many metric or similarity learning
problems. In some applications, it may be trivial: e.g., a misspelled word and its
correction. Otherwise, popular choices are to pair each example with its nearest
neighbor or to consider all possible pairs. In our case, matching each example
with every landmark may result in a similarity function that performs very
poorly, because requiring the goodness over all landmarks (including irrelevant
ones) defines an over-constrained problem and does not capture the essence of
Definition 1. Remember that on average, examples should be more similar to
reasonable points of the same class than to reasonable points of the opposite
class. In that sense, reasonable points “represent” the data well. Since classes
have intra-class variability, a given reasonable point can only account for a subset
of the data. Therefore, reasonable points must be somewhat complementary.
Keeping this in mind, we propose the following strategy, used in the experiments and sketched in the code below. Assuming an even distribution of classes in T, we use a positive parameter
P ≤ NT /2 to pair each example with its P nearest neighbors of the same class in
T and its P farthest neighbors of the opposite class in T , using the Levenshtein
distance. Therefore, we have NL = 2P with NL = αNT , 0 < α ≤ 1 (where α is
typically closer to 0 than to 1). In other words, we essentially take a few land-
marks that are already good representatives of a given example and optimize
the edit costs so that they become even better representatives. Note that the
choice of the Levenshtein distance to determine the neighbors is consistent with
our choice to define eG according to the Levenshtein script.
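A minimal sketch of this pairing strategy (ours) follows; it assumes a Levenshtein distance helper such as the levenshtein_script sketch of Section 2.2.

def build_landmark_pairs(strings, labels, P, dist=None):
    # Pair each example with its P nearest same-class neighbors and its
    # P farthest opposite-class neighbors in T (so N_L = 2P per example).
    if dist is None:
        dist = lambda a, b: levenshtein_script(a, b)[0]
    pairs = []   # (i, j) such that f_land(x_i, x_j) = 1
    for i in range(len(strings)):
        same = sorted((dist(strings[i], strings[j]), j)
                      for j in range(len(strings))
                      if j != i and labels[j] == labels[i])
        diff = sorted(((dist(strings[i], strings[j]), j)
                       for j in range(len(strings)) if labels[j] != labels[i]),
                      reverse=True)
        pairs.extend((i, j) for _, j in same[:P] + diff[:P])
    return pairs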

4 Experimental Results

In this section, we provide an experimental evaluation of the approach presented in Section 3. Using the learning rule (1) of Balcan et al., we compare three edit similarity functions:⁴ (i) KG, learned by GESL,⁵ (ii) the Levenshtein distance eL, and (iii) an edit similarity function pe learned with the method of Oncina and Sebban [7].⁶ The task is to learn a model to classify words as either English or French. We use the 2,000 top words lists from Wiktionary.⁷
First, we assess the convergence rate of the two considered edit cost learning methods (i and iii). We keep aside 600 words as a validation set to tune the parameters, using 5-fold cross-validation and selecting the value offering the best

⁴ A similarity function that is not in [−1, 1] can be normalized.
⁵ In this series of experiments, we constrained the cost matrices to be symmetric in order not to be dependent on the order in which the examples are considered.
⁶ We used their software SEDiL, available online at http://labh-curien.univ-st-etienne.fr/SEDiL/
⁷ These lists are available at http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists. We only considered unique words (i.e., not appearing in both lists) of length at least 4, and we also got rid of accent and punctuation marks. We ended up with about 1,300 words of each language over an alphabet of 26 symbols.

Fig. 1. Learning the costs: accuracy and sparsity results with respect to NT.

classification accuracy. We build bootstrap samples T from the remaining 2,000 words to learn the edit costs, as well as 600 words to train the separator α and
400 words to test its performance. Figure 1 shows the accuracy and sparsity
results of each method with respect to NT , averaged over 5 runs. We see that
KG leads to more accurate models than eL and pe for every size NT > 20. The
difference is statistically significant: the Student’s t-test yields p < 0.01. At the
same time, KG requires considerably less reasonable points (thus speeding up
classification). This clearly indicates that GESL leads to a better similarity than
(ii) and (iii). Moreover, the convergence rate of GESL is very fast, considering
that (26+1)2 = 729 costs must be learned: it needs very few examples to outper-
form the Levenshtein distance, and about 200 examples to reach convergence.
This provides experimental evidence that our method scales well with the size
of the alphabet, as suggested by the generalization bound derived in Section
3.3. On the other hand, (iii) seems to suffer from the large number of costs to
estimate: it needs a lot more examples to outperform Levenshtein (about 200)
and convergence seems to be only reached at 1,000.
Now, we assess the performance of the three edit similarities with respect
to the number of examples dl used to learn the separator α. For KG and pe ,
we use the matrix that performed best in the previous experiment. Taking our
set of 2,000 words, we keep aside 400 examples to test the models and build
bootstrap samples from the remaining 1,600 words to learn α. Figure 2 shows
the accuracy and sparsity results of each method with respect to dl , averaged
over 5 runs. Again, KG outperforms eL and pe for every size dl (the difference
is statistically significant with p < 0.01 using a Student’s t-test) while always
leading to much sparser models. Moreover, the size of the models induced by
KG stabilizes for dl ≥ 400 while the accuracy still increases. This is not the case
for the models induced by eL and pe , whose size keeps growing. To sum up, the
best matrix learned by GESL outperforms the best matrix learned by [7], which
had been proven to perform better than other state-of-the-art methods.
Finally, one may wonder what kind of words are selected as reasonable points
in the models. The intuition is that they should be some sort of “discriminative
prototypes” the classifier is based on. Table 1 gives an example of a set of 11

Fig. 2. Learning the separator: accuracy and sparsity results with respect to dl.

Table 1. Example of a set of 11 reasonable points

English: high, showed, holy, liked, hardly
French: economiques, americaines, decouverte, britannique, informatique, couverture

Table 2. Some discriminative patterns extracted from the reasonable points of Table 1 (^: start of word, $: end of word, ?: 0 or 1 occurrence of preceding letter)

Pattern  w    y    k   q   nn  gh  ai   ed$  ly$  es?$  ques?$  ^h
English  146  144  83  14  5   34  39   151  51   265   0       62
French   7    19   5   72  35  0   114  51   0    630   43      14

reasonable points obtained with KG using a set of 1,200 examples to learn α.⁸
This small set actually carries a lot of discriminative patterns (shown in Table
2 along with their number of occurrences in each class over the entire dataset).
For example, words ending with ly correspond to English words, while those
ending with que characterize French words. Note that Table 1 also reflects the
fact that English words are shorter on average (6.99) than French words (8.26)
in the dataset, but the English (resp. French) reasonable points are significantly
shorter (resp. longer) than the average (mean of 5.00 and 10.83 resp.), which
allows better discrimination.

5 Conclusion and Future Work

In this work, we proposed a novel approach to the problem of learning edit similarities from data that induces (ε, γ, τ)-goodness. We derived a generalization bound using the notion of uniform stability that is independent from the size of the alphabet, making it suitable for problems involving large vocabularies. This
⁸ We used a high λ value in order to get a small set, thus making the analysis easier.

bound is related to the goodness of the resulting similarity, which gives guar-
antees that the similarity will induce accurate models for the task at hand. We
experimentally showed that it is indeed the case and that the induced models
are also sparser than if we use other (standard or learned) edit similarities. Our
approach is flexible enough to be straightforwardly generalized to tree edit sim-
ilarity learning: one just has to redefine eG to be a tree edit script. Considering
that tree edit distances generally run in cubic time and that the methods for
learning tree edit similarities available in the literature are mostly EM-based
(thus requiring the distances to be recomputed many times), this seems a very
promising avenue to explore. Finally, learning (ε, γ, τ)-good Mahalanobis distances could also be considered.

A Appendices

A.1 Proof of Lemma 1

Proof. |V(C, z, z′) − V(C′, z, z′)| ≤ |Σ_{0≤i,j≤A} (Ci,j − C′i,j) #i,j(z, z′)| ≤ ‖C − C′‖ ‖#(z, z′)‖. The first inequality uses the 1-Lipschitz property of the hinge loss and the fact that the B1's and B2's cancel out. The second one comes from the Cauchy-Schwarz inequality.⁹ Finally, since ‖#(z, z′)‖ ≤ W, the lemma holds.

A.2 Proof of Lemma 2


Proof. Let $B = L_T(C_T + t\Delta C) - L_{T^{i,z}}(C_T + t\Delta C) - (L_T(C_T) - L_{T^{i,z}}(C_T))$. Since $L_T$, $F_T$, $L_{T^{i,z}}$ and $F_{T^{i,z}}$ are convex functions, and using the fact that $C_T$ and $C_{T^{i,z}}$ are minimizers of $F_T$ and $F_{T^{i,z}}$ respectively, we get¹⁰ for any $t \in [0,1]$:
$$\beta\left(\|C_T\|^2 - \|C_T - t\Delta C\|^2 + \|C_{T^{i,z}}\|^2 - \|C_{T^{i,z}} + t\Delta C\|^2\right) \le B.$$
Then, using the previous upper bound $B$, we get
$$B \le |L_T(C_T + t\Delta C) - L_{T^{i,z}}(C_T + t\Delta C) + L_{T^{i,z}}(C_T) - L_T(C_T)|$$
$$\le \frac{2(N_T - 1) + N_L}{N_T N_L} \sup_{\substack{z_1, z_2 \in T\\ z_3, z_4 \in T^{i,z}}} |V(C_T + t\Delta C, z_1, z_2) - V(C_T, z_1, z_2) + V(C_T, z_3, z_4) - V(C_T + t\Delta C, z_3, z_4)|$$
$$\le \frac{2(N_T - 1) + N_L}{N_T N_L}\, t\|\Delta C\| \sup_{\substack{z_1, z_2 \in T\\ z_3, z_4 \in T^{i,z}}} \big(\|\#(z_1, z_2)\| + \|\#(z_3, z_4)\|\big) \;\le\; \frac{(2N_T + N_L)\, t\, 2W}{N_T N_L}\, \|\Delta C\|.$$
The second line follows from the fact that every $z_k$ in $T$, $z_k \ne z_i$, has at most two landmark points different between $T$ and $T^{i,z}$, while $z$ and $z_i$ have at most $N_L$ different landmarks. To complete the proof, we reorder the terms and use the 1-lipschitz property, the Cauchy-Schwartz and triangle inequalities, and $\|\#(z, z')\| \le W$. □

9 1-lipschitz implies $|[X]_+ - [Y]_+| \le |X - Y|$; Cauchy-Schwartz: $|\sum_{i=1}^n x_i y_i| \le \|x\|\,\|y\|$.
10 Due to space limitations, the details of this construction are not presented in this paper. We advise the interested reader to have a look at Lemma 20 in [13].

A.3 Proof of Lemma 3


Proof.
$$E_T[D_T] \le E_T\left[E_{z,z'}[V(C_T, z, z')] - L_T(C_T)\right]$$
$$\le \frac{1}{N_T}\sum_{k=1}^{N_T} \frac{1}{N_L}\sum_{j=1}^{N_L} E_{T,z,z'}\left[|V(C_T, z, z') - V(C_T, z_k, z_{k_j})|\right]$$
$$\le \frac{1}{N_T}\sum_{k=1}^{N_T} \frac{1}{N_L}\sum_{j=1}^{N_L} E_{T,z,z'}\left[|V(C_T, z, z') - V(C_T, z_k, z') + V(C_T, z_k, z') - V(C_T, z_k, z_{k_j})|\right].$$
Since $T$, $z$ and $z'$ are i.i.d. from distribution $P$, we do not change the expected value by replacing one point with another, and thus
$$E_{T,z,z'}[|V(C_T, z, z') - V(C_T, z_k, z')|] = E_{T,z,z'}[|V(C_T, z, z') - V(C_{T^{k,z}}, z, z')|].$$
It suffices to apply this trick twice, combined with the triangle inequality and the property of stability in $\frac{\kappa}{N_T}$, to obtain the lemma. □

A.4 Lemma 5
In order to bound $|D_T - D_{T^{i,z}}|$, we need to bound $\|C_T\|$.

Lemma 5. Let $(C_T, B_1, B_2)$ be an optimal solution learned by GESL from a sample $T$, and let $B_\gamma = \max(\eta_\gamma, -\log(1/2))$; then $\|C_T\| \le \sqrt{B_\gamma/\beta}$.

Proof. Since $(C_T, B_1, B_2)$ is an optimal solution, the value reached by $F_T$ is lower than the one obtained with $(0, B_\gamma, 0)$, where $0$ denotes the null matrix:
$$\frac{1}{N_T}\sum_{k=1}^{N_T} \frac{1}{N_L}\sum_{j=1}^{N_L} V(C_T, z_k, z_{k_j}) + \beta\|C_T\|^2 \;\le\; \frac{1}{N_T}\sum_{k=1}^{N_T} \frac{1}{N_L}\sum_{j=1}^{N_L} V(0, z_k, z_{k_j}) + \beta\|0\|^2 \;\le\; B_\gamma.$$
For the last inequality, note that $V(0, z_k, z_{k_j})$ is bounded either by $B_\gamma$ or by 0. Since $\frac{1}{N_T}\sum_{k=1}^{N_T} \frac{1}{N_L}\sum_{j=1}^{N_L} V(C_T, z_k, z_{k_j}) \ge 0$, we get $\beta\|C_T\|^2 \le B_\gamma$. □


A.5 Proof of Lemma 4


Proof. First, we derive a bound on $|D_T - D_{T^{i,z}}|$:
$$|D_T - D_{T^{i,z}}| = |L(C_T) - L_T(C_T) - (L(C_{T^{i,z}}) - L_{T^{i,z}}(C_{T^{i,z}}))|$$
$$\le |L(C_T) - L(C_{T^{i,z}})| + |L_T(C_{T^{i,z}}) - L_T(C_T)| + |L_{T^{i,z}}(C_{T^{i,z}}) - L_T(C_{T^{i,z}})|$$
$$\le E_{z_1,z_2}\big[|V(C_T, z_1, z_2) - V(C_{T^{i,z}}, z_1, z_2)|\big] + \frac{1}{N_T}\sum_{k=1}^{N_T}\frac{1}{N_L}\sum_{j=1}^{N_L}|V(C_{T^{i,z}}, z_k, z_{k_j}) - V(C_T, z_k, z_{k_j})| + |L_{T^{i,z}}(C_{T^{i,z}}) - L_T(C_{T^{i,z}})|$$
$$\le 2\frac{\kappa}{N_T} + |L_{T^{i,z}}(C_{T^{i,z}}) - L_T(C_{T^{i,z}})| \quad \text{(by using the hypothesis of stability twice)}.$$
Now, proving Lemma 4 boils down to bounding the last term above. Using similar arguments to the proof of Lemma 2,
$$|L_{T^{i,z}}(C_{T^{i,z}}) - L_T(C_{T^{i,z}})| \le \frac{2N_T + N_L}{N_T N_L} \sup_{\substack{z_1, z_2 \in T\\ z_3, z_4 \in T^{i,z}}} |V(C_{T^{i,z}}, z_1, z_2) - V(C_{T^{i,z}}, z_3, z_4)|.$$
We study two cases, which require the 1-lipschitz property of the hinge loss and Lemma 5. If $\ell_{z_1}\ell_{z_2} = \ell_{z_3}\ell_{z_4}$, then $|V(C_{T^{i,z}}, z_1, z_2) - V(C_{T^{i,z}}, z_3, z_4)| \le \sqrt{B_\gamma/\beta}\, W$. Otherwise, if $\ell_{z_1}\ell_{z_2} \ne \ell_{z_3}\ell_{z_4}$, note that $|B_1 + B_2| = \eta_\gamma + 2B_2 \le 3B_\gamma$. Hence we get
$$|V(C_{T^{i,z}}, z_1, z_2) - V(C_{T^{i,z}}, z_3, z_4)| \le 2\sqrt{B_\gamma/\beta}\, W + 3B_\gamma. \quad \square$$

References
1. Yang, L., Jin, R.: Distance Metric Learning: A Comprehensive Survey. Technical
report, Dep. of Comp. Science and Eng., Michigan State University (2006)
2. Davis, J.V., Kulis, B., Jain, P., Sra, S., Dhillon, I.S.: Information-theoretic metric
learning. In: Proc. of the Int. Conf. on Machine Learning (ICML), pp. 209–216
(2007)
3. Weinberger, K.Q., Saul, L.K.: Distance Metric Learning for Large Margin Nearest
Neighbor Classification. J. of Mach. Learn. Res. (JMLR) 10, 207–244 (2009)
4. Jin, R., Wang, S., Zhou, Y.: Regularized distance metric learning: Theory and
algorithm. In: Adv. in Neural Inf. Proc. Sys. (NIPS), pp. 862–870 (2009)
5. Ristad, E.S., Yianilos, P.N.: Learning String-Edit Distance. IEEE Trans. on Pat-
tern Analysis and Machine Intelligence. 20, 522–532 (1998)
6. Bilenko, M., Mooney, R.J.: Adaptive Duplicate Detection Using Learnable String
Similarity Measures. In: Proc. of the Int. Conf. on Knowledge Discovery and Data
Mining (SIGKDD), pp. 39–48 (2003)
7. Oncina, J., Sebban, M.: Learning Stochastic Edit Distance: application in hand-
written character recognition. Pattern Recognition 39(9), 1575–1587 (2006)
8. Bernard, M., Boyer, L., Habrard, A., Sebban, M.: Learning probabilistic models of
tree edit distance. Pattern Recognition 41(8), 2611–2629 (2008)
9. Takasu, A.: Bayesian Similarity Model Estimation for Approximate Recognized
Text Search. In: Proc. of the Int. Conf. on Doc. Ana. and Reco., pp. 611–615
(2009)
10. Saigo, H., Vert, J.-P., Akutsu, T.: Optimizing amino acid substitution matrices
with a local alignment kernel. BMC Bioinformatics 7(246), 1–12 (2006)
11. Balcan, M.F., Blum, A.: On a Theory of Learning with Similarity Functions. In:
Proc. of the Int. Conf. on Machine Learning (ICML), pp. 73–80 (2006)
12. Balcan, M.F., Blum, A., Srebro, N.: Improved Guarantees for Learning via Simi-
larity Functions. In: Proc. of the Conf. on Learning Theory (COLT), pp. 287–298
(2008)
13. Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learn-
ing Research 2, 499–526 (2002)
14. Wang, L., Yang, C., Feng, J.: On Learning with Dissimilarity Functions. In: Proc.
of the Int. Conf. on Machine Learning (ICML), pp. 991–998 (2007)
15. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm Support Vector Machines.
In: Adv. in Neural Inf. Proc. Sys. (NIPS), vol. 16, pp. 49–56 (2003)
16. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks.
Proc. of the National Academy of Sciences of the United States of America 89,
10915–10919 (1992)
17. McCallum, A., Bellare, K., Pereira, F.: A Conditional Random Field for
Discriminatively-trained Finite-state String Edit Distance. In: Conference on Un-
certainty in AI, pp. 388–395 (2005)
18. McDiarmid, C.: On the method of bounded differences. In: Surveys in Combina-
torics, pp. 148–188. Cambridge University Press, Cambridge (1989)
Constrained Laplacian Score for
Semi-supervised Feature Selection

Khalid Benabdeslem and Mohammed Hindawi

University of Lyon 1 - GAMA Lab.
43 Bd du 11 Novembre, 69622 Villeurbanne, France
kbenabde@univ-lyon1.fr,
mohammed.hindawi@insa-lyon.fr

Abstract. In this paper, we address the problem of semi-supervised


feature selection from high-dimensional data. It aims to select the most
discriminative and informative features for data analysis. This is a recently addressed challenge in feature selection research when dealing with small labeled data sampled together with large unlabeled data in the same set.
We present a filter based approach by constraining the known Laplacian
score. We evaluate the relevance of a feature according to its locality pre-
serving and constraints preserving ability. The problem is then presented
in the spectral graph theory framework with a study of the complexity of
the proposed algorithm. Finally, experimental results will be provided for
validating our proposal in comparison with other known feature selection
methods.

Keywords: Feature selection, Laplacian score, Constraints.

1 Introduction and Motivation


Feature selection is an important task in machine learning for high dimensional
data mining. It is one of the effective means to identify relevant features for
dimension reduction [1]. This task has led to improved performance for several
UCI data sets [2] as well as for real-world applications over data such as digital
images, financial time series and gene expression microarrays [3].
Generally, feature selection methods can be classified in three types: filter,
wrapper or embedded. The filter model techniques examine intrinsic properties
of the data to evaluate the features prior to the learning tasks [4]. The wrapper
based approaches evaluate the features using the learning algorithm that will
ultimately be employed [5]. Thus, they “wrap” the selection process around the
learning algorithm. The embedded methods are locally specific to models during
their construction. They aim to learn the feature relevance with the associated
learning algorithm [6].
Moreover, feature selection can be done in three frameworks according to
class label information. The most addressed framework is the supervised one,
in which feature relevance can be evaluated by their correlation with the class
label [7]. In unsupervised feature selection, without label information, feature


relevance can be evaluated by their capability of keeping certain properties of


the data, such as the variance or the separability. It is considered as a much
harder problem, due to the absence of class labels that would guide the search
for relevant information [8].
The problem becomes more challenging when the labeled and unlabeled data
are sampled from the same population. It is more adapted with real-world appli-
cations where labeled data are costly to obtain. In this context, the effectiveness
of semi-supervised learning has been demonstrated [9]. The authors in [10] intro-
duced a semi-supervised feature selection algorithm based on spectral analysis.
Later, they exploited intrinsic properties underlying supervised and unsuper-
vised feature selection algorithms, and proposed a unified framework for feature
selection based on spectral graph theory [11]. The second known work in semi-
supervised selection deals with a wrapper-type forward based approach proposed
by [12] which introduced unlabeled examples to extend the initial labeled train-
ing set.
Furthermore, utilizing domain knowledge became an important issue in many
machine learning and data mining tasks [13,14,15]. Several recent works have
attempted to exploit pairwise constraints or other prior information in feature
selection. The authors in [16] proposed an efficient algorithm, called SSDR (with
different variants: SSDR-M, SSDR-CM, SSDR-CMU), which can simultaneously
preserve the structure of original high-dimensional data and the pairwise con-
straints specified by users. The main problem of these methods is that the pro-
posed objective function is independent of the variance, which is very important
for the locality preserving for the features. In addition, the similarity matrix
used in the objective function uses the same value for all pairs of data which are
not related by constraints. The same authors proposed a constraint score based
method [17,18] which evaluates the relevance of features according to constraints
only. The method carries out with little supervision information in labeled data
ignoring the unlabeled data part even if it is very large. The authors in [19]
proposed to solve the problem of semi-supervised feature selection by a simple
combination of scores computed on labeled data and unlabeled data respectively.
The method (called C4 ) tries to find a consensus between an unsupervised score
and a supervised one (by multiplying both scores). The combination is simple,
but can dramatically bias the selection for the features having best scores for
labeled part of data and bad scores for the unlabeled part and vice-versa.
In contrast to all the cited methods, our proposal uses a new score developed by constraining the well-known (unsupervised) Laplacian score [20], which we detail in the next section. The idea behind our proposal is to assess the ability of features to preserve the local geometric structure offered by unlabeled data, while respecting the constraints offered by labeled data.
Therefore, our semi-supervised feature selection algorithm is based on a filter
approach. We think that one important motivation to have a filter method for
feature selection is the specificity of the semi-supervised data. This is because,
in this paradigm, data may be used in the service of both unsupervised and
supervised learning. On the one hand, semi-supervised data could be used in

the goal of data clustering, then using the labels to generate constraints which could in turn improve the clustering. In this context, “good” features are
those which better describe the geometric structure of data. On the other hand,
semi-supervised data could be used for supervised learning, i.e. classification
or prediction of the unlabeled examples using a classifier constructed from the
labeled examples. In this context, “good” features are those which are better cor-
related with the labels. Subsequently, the use of a filter method makes the feature
selection process independent from the further learning algorithm whether it is
supervised or unsupervised. This is important to eliminate the bias of feature
selection in both cases, i.e. good features in this case would be those which com-
promise between better description of data structure and better correlation with
desired labels.

2 Related Work
In semi-supervised learning, a data set of N data points X = {x1 , ..., xN } consists
of two subsets depending on the label availability: XL = (x1 , ..., xl ) for which
the labels YL = (y1 , ..., yl ) are provided, and XU = (xl+1 , ..., xl+u ) whose labels
are not given. Here data point xi is a vector with m dimensions (features), and
label yi ∈ {1, 2, ..., C} (C is the number of different labels) and l + u = N (N
is the total number of instances). Let F1 , F2 , ..., Fm denote the m features of X
and f1 , f2 , ..., fm be the corresponding feature vectors that record the feature
value on each instance.
Semi-supervised feature selection is to use both XL and XU to identify the
set of most relevant features Fj1 , Fj2 , ..., Fjk of the target concept, where k ≤ m
and jr ∈ {1, 2, ..., m} for r ∈ {1, 2, ..., k}.

2.1 Laplacian Score


This score was used for unsupervised feature selection. It not only prefers those
features with larger variances which have more representative power, but it also
tends to select features with stronger locality preserving ability. A key assump-
tion in Laplacian Score is that data from the same class are close to each other.
The Laplacian score of the $r$th feature, which should be minimized, is computed as follows [20]:
$$L_r = \frac{\sum_{i,j} (f_{ri} - f_{rj})^2 S_{ij}}{\sum_i (f_{ri} - \mu_r)^2 D_{ii}} \qquad (1)$$
where $D$ is a diagonal matrix with $D_{ii} = \sum_j S_{ij}$, and $S_{ij}$ is defined by the neighborhood relationship between samples $x_i$ ($i = 1, \ldots, N$) as follows:
$$S_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|^2}{\lambda}} & \text{if } x_i \text{ and } x_j \text{ are neighbors} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
where $\lambda$ is a constant to be set, "$x_i$, $x_j$ are neighbors" means that $x_i$ is among the $k$ nearest neighbors of $x_j$, and $\mu_r = \frac{1}{N}\sum_i f_{ri}$.
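For concreteness, a minimal Python sketch of Eqs. (1)-(2) could look as follows (our illustration, not the authors' code; symmetrizing the k-NN graph by taking the elementwise maximum is an assumption, and the defaults match the values λ = 0.1, k = 10 used later in the experiments):

import numpy as np

def laplacian_score(X, k=10, lam=0.1):
    """Laplacian score sketch (Eq. 1-2). X: (N, m) data; lower is better."""
    N, m = X.shape
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    S = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(sq[i])[1:k + 1]       # k nearest neighbors (skip self)
        S[i, nn] = np.exp(-sq[i, nn] / lam)   # heat-kernel weights
    S = np.maximum(S, S.T)                    # assumed symmetrization
    D = S.sum(axis=1)
    scores = []
    for r in range(m):
        f = X[:, r]
        num = ((f[:, None] - f[None, :]) ** 2 * S).sum()
        den = ((f - f.mean()) ** 2 * D).sum()
        scores.append(num / den)
    return np.array(scores)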

2.2 Constraint Score


In general, domain knowledge can be expressed in diverse forms, such as class
labels, pairwise constraints or other prior information.
The constraint score guides feature selection according to pairwise instance-level constraints, which can be classified into two sets: Ω_ML (a set of Must-Link constraints) and Ω_CL (a set of Cannot-Link constraints):
– Must-Link constraint (ML): involving xi and xj , specifies that they have
the same label.
– Cannot-Link constraint (CL): involving xi and xj , specifies that they
have different labels.
The constraint score of the $r$th feature, which should be minimized, is computed as follows [17]:
$$C_r = \frac{\sum_{(x_i, x_j)\in\Omega_{ML}} (f_{ri} - f_{rj})^2}{\sum_{(x_i, x_j)\in\Omega_{CL}} (f_{ri} - f_{rj})^2} \qquad (3)$$

3 Constrained Laplacian Score


The main advantage of the Laplacian score is its locality preserving ability. However, its assumption that data from the same class are close to each other is not always true. In fact, there are several cases where the classes overlap in some instances: two close instances could naturally have two different labels, and vice versa. Furthermore, for the constraint score, the principle is mainly based on the constraint preserving ability. This little supervision information is certainly necessary for feature selection, but not sufficient when ignoring the unlabeled part of the data, especially if it is very large. For that, we propose a Constrained Laplacian Score (CLS) which constrains the Laplacian score for an efficient semi-supervised feature selection. Thus, we define CLS, which should be minimized, as follows:
$$CLS_r = \frac{\sum_{i,j} (f_{ri} - f_{rj})^2 S_{ij}}{\sum_i \sum_{j\,|\,\exists k,\,(x_k, x_j)\in\Omega_{CL}} (f_{ri} - \alpha^i_{rj})^2 D_{ii}} \qquad (4)$$
where:
$$S_{ij} = \begin{cases} e^{-\frac{\|x_i - x_j\|^2}{\lambda}} & \text{if } x_i \text{ and } x_j \text{ are neighbors or } (x_i, x_j)\in\Omega_{ML} \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
and:
$$\alpha^i_{rj} = \begin{cases} f_{rj} & \text{if } (x_i, x_j)\in\Omega_{CL} \\ \mu_r & \text{otherwise} \end{cases} \qquad (6)$$
Since the labeled and unlabeled data are sampled from the same population generated by the target concept, the basic idea behind our score is to generalize the Laplacian score for semi-supervised feature selection. Note that if there are no labels ($l = 0$, $X = X_U$), then $CLS_r = L_r$, and when $u = 0$ ($X = X_L$), CLS represents an adjusted $C_r$, where the ML and CL information would be weighted by $S_{ij}$ and $D_{ii}$ respectively in the formula.
With CLS, on the one hand, a relevant feature should be the one on which
those two samples (neighbors or related by an M L constraint) are close to each
other. On the other hand, the relevant feature should be the one with a larger
variance or on which those two samples (related by a CL constraint) are well
separated.
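To make the score concrete, the following Python sketch (ours, not the authors' code) computes $CLS_r$ following one literal reading of Eqs. (4)-(6); in particular, the treatment of the cannot-link index set in the denominator is our interpretation of the formula, and the sketch assumes $\Omega_{CL} \ne \emptyset$:

import numpy as np

def cls_score(X, S, ML, CL, lam=0.1):
    """CLS sketch (Eq. 4-6). S: precomputed k-NN weight matrix;
    ML, CL: lists of (i, j) index pairs. Lower score = better feature."""
    N, m = X.shape
    S = S.copy()
    for (i, j) in ML:                         # ML pairs also get kernel weights
        w = np.exp(-((X[i] - X[j]) ** 2).sum() / lam)
        S[i, j] = S[j, i] = w
    D = S.sum(axis=1)
    cl_nodes = {j for (_, j) in CL}           # nodes j with (x_k, x_j) in CL
    cl_pairs = set(CL) | {(j, i) for (i, j) in CL}
    scores = []
    for r in range(m):
        f, mu = X[:, r], X[:, r].mean()
        num = ((f[:, None] - f[None, :]) ** 2 * S).sum()
        den = 0.0
        for i in range(N):
            for j in cl_nodes:
                alpha = f[j] if (i, j) in cl_pairs else mu  # Eq. (6)
                den += (f[i] - alpha) ** 2 * D[i]
        scores.append(num / den)
    return np.array(scores)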

4 Spectral Graph Based Formulation


The spectral graph theory [11] represents a solid theoretical framework which
has been the basis of many powerful existing feature selection methods such as
ReliefF [21], Laplacian [20] , sSelect [10], SPEC[22] and Constraint score [17].
Similarly, we give a graph-based explanation for our proposed Constrained Laplacian Score (CLS). A reasonable criterion for choosing a relevant feature is to minimize the objective function represented by CLS. Thus, the problem is to minimize the first term $T_1 = \sum_{i,j} (f_{ri} - f_{rj})^2 S_{ij}$ and maximize the second one, $T_2 = \sum_i \sum_{j\,|\,\exists k,\,(x_k, x_j)\in\Omega_{CL}} (f_{ri} - \alpha^i_{rj})^2 D_{ii}$. By resolving these two optimization problems, we prefer those features respecting their respective pre-defined graphs. Thus, we construct a k-neighborhood graph $G_{kn}$ from X (the data set) and $\Omega_{ML}$ (the ML constraint set), and a second graph $G_{CL}$ from $\Omega_{CL}$ (the CL constraint set).
Given a data set X, let G(V, E) be the complete undirected graph constructed from X, where V is its node set and E its edge set. The $i$th node $v_i$ of G corresponds to $x_i \in X$, and there is an edge between each node pair $(v_i, v_j)$ whose weight $w_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\lambda}}$ measures the similarity between $x_i$ and $x_j$. $G_{kn}(V, E_{kn})$ is a subgraph of G where $E_{kn}$ is the edge set $\{e_{i,j}\} \subseteq E$ such that $e_{i,j} \in E_{kn}$ if $(x_i, x_j) \in \Omega_{ML}$ or $x_i$ is one of the k-neighbors of $x_j$. $G_{CL}(V_{CL}, E_{CL})$ is a subgraph constructed from G, with $V_{CL}$ its node set and $\{e_{i,j}\}$ its edge set such that $e_{i,j} \in E_{CL}$ if $(x_i, x_j) \in \Omega_{CL}$.
Once the graphs $G_{kn}$ and $G_{CL}$ are constructed, their weight matrices, denoted by $S^{kn}$ and $S^{CL}$ respectively, can be defined as:
$$S^{kn}_{ij} = \begin{cases} w_{ij} & \text{if } x_i \text{ and } x_j \text{ are neighbors or } (x_i, x_j)\in\Omega_{ML} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
$$S^{CL}_{ij} = \begin{cases} 1 & \text{if } (x_i, x_j)\in\Omega_{CL} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$
Then, we can define:
– for each feature $r$, its vector $f_r = (f_{r1}, \ldots, f_{rN})^T$;
– diagonal matrices $D^{kn}_{ii} = \sum_j S^{kn}_{ij}$ and $D^{CL}_{ii} = \sum_j S^{CL}_{ij}$;
– Laplacian matrices $L^{kn} = D^{kn} - S^{kn}$ and $L^{CL} = D^{CL} - S^{CL}$.
Constrained Laplacian Score for Semi-supervised Feature Selection 209

Algorithm 1. CLS
Input: Data set X
1: Construct the constraint sets Ω_ML and Ω_CL from Y_L
2: Construct graphs G_kn and G_CL from (X, Ω_ML) and Ω_CL respectively
3: Calculate the weight matrices S^kn, S^CL and their Laplacians L^kn, L^CL respectively
4: for r = 1 to m do
     Calculate CLS_r
   end for
5: Rank the features according to their CLS_r in ascending order

Following some simple algebraic steps, we see that:
$$T_1 = \sum_{i,j} (f_{ri} - f_{rj})^2 S^{kn}_{ij} = \sum_{i,j} (f_{ri}^2 + f_{rj}^2 - 2 f_{ri} f_{rj})\, S^{kn}_{ij} \qquad (9)$$
$$= 2\Big(\sum_{i,j} f_{ri}^2\, S^{kn}_{ij} - \sum_{i,j} f_{ri}\, S^{kn}_{ij}\, f_{rj}\Big) \qquad (10)$$
$$= 2\,(f_r^T D^{kn} f_r - f_r^T S^{kn} f_r) \qquad (11)$$
$$= 2\, f_r^T L^{kn} f_r \qquad (12)$$
Note that satisfying the graph structures is done according to $\alpha^i_{rj}$ in equation (6). In fact, when $\Omega_{CL} = \emptyset$, we should maximize the variance of $f_r$, which would be estimated as:
$$var(f_r) = \sum_i (f_{ri} - \mu_r)^2 D^{kn}_{ii} \qquad (13)$$
The optimization of (13) is well detailed in [20]. In this case, $CLS_r = L_r = \frac{f_r^T L^{kn} f_r}{f_r^T D^{kn} f_r}$. Otherwise, we develop the second term ($T_2$) as above and obtain $2\, f_r^T L^{CL} D^{kn} f_r$. Subsequently, $CLS_r = \frac{f_r^T L^{kn} f_r}{f_r^T L^{CL} D^{kn} f_r}$ seeks those features that respect $G_{kn}$ and $G_{CL}$. The whole procedure of the proposed CLS is summarized in Algorithm 1.
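The identity $T_1 = 2 f_r^T L^{kn} f_r$ in (12) can be checked numerically; the following small script (ours, added for the reader's convenience) verifies it on a random symmetric weight matrix:

import numpy as np

# Sanity check of Eq. (9)-(12): sum_{i,j} (f_i - f_j)^2 S_ij = 2 f' L f
rng = np.random.default_rng(0)
N = 6
S = rng.random((N, N)); S = (S + S.T) / 2; np.fill_diagonal(S, 0)
D = np.diag(S.sum(axis=1))
L = D - S                                   # graph Laplacian
f = rng.random(N)
T1 = sum((f[i] - f[j]) ** 2 * S[i, j] for i in range(N) for j in range(N))
assert np.isclose(T1, 2 * f @ L @ f)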
Lemma 1. Algorithm 1 is computed in time O(m × max(N 2 , Log m)).
Proof. The first step of the algorithm requires l2 operations. Steps 2-3 build
the graph matrices requiring N 2 operations. Step 4 evaluates the m features
requiring mN 2 operations and the last step ranks features according to their
scores with m Log(m) operations. 
Note that the “small-labeled” problem becomes an advantage in our case, be-
cause it supposes that the number of extracted constraints is smaller since it
depends on the number of labels, l. Thus, the cost of the algorithm depends
considerably on u, the size of unlabeled data XU .
To reduce this complexity, we propose to apply a clustering on $X_U$. The idea is to substitute this huge part of the data with a smaller one, $X'_U = (p_1, \ldots, p_K)$, by
preserving the geometric structure of XU , where K is the number of clusters. We


propose to use Self-Organizing Map (SOM) based clustering [23] that we briefly
present in the next section.
Lemma 2. By clustering $X_U$, the complexity of Algorithm 1 is reduced to $O(m \times \max(u, \log m))$.
Proof. The size of the labeled data is much smaller than that of the unlabeled data, $l \ll u < N$, and the clustering of $X_U$ provides at most $K = \sqrt{u}$ clusters. Therefore, Algorithm 1 is applied over a data set of size $\sqrt{u} + l \simeq \sqrt{u}$. This decreases the complexity to $O(m \times \max(u, \log m))$. □

4.1 SOM Algorithm


SOM is a very popular tool used for visualizing high-dimensional data spaces. It can be considered as doing vector quantization and/or clustering while preserving the spatial ordering of the input data, reflected by implementing an ordering of the codebook vectors (also called prototype vectors, cluster centroids or reference vectors) in a one- or two-dimensional output space. The SOM consists of nodes organized on a regular low-dimensional grid, called the map. More formally, the map is described by a graph (V, E). V is a set of K interconnected nodes having a discrete topology defined by E. For each pair of nodes (c, s) on the map, the distance δ(c, s) is defined as the shortest path between c and s on the graph. This distance imposes a neighborhood relation between nodes.
Each node c is represented by an m-dimensional reference vector $p_c = (p_c^1, \ldots, p_c^m)$ from M (the set of all map nodes), where m is equal to the dimension of the input vectors $x_i \in X_U$ (the unlabeled data set). The SOM training algorithm resembles K-means; the important distinction is that, in addition to the best matching reference vector, its neighbors on the map are updated.
More formally, we define an assignment function γ from $R^m$ (the input space) to M (the output space) that associates each element $x_i$ of $R^m$ to the node whose reference vector is "closest" to $x_i$. This function induces a partition $P = \{P_c;\; c = 1, \ldots, K\}$ of the set of observations, where each part $P_c$ is defined by $P_c = \{x_i \in X_U;\; \gamma(x_i) = c\}$.
Next, an adaptation step is performed when the algorithm updates the reference vectors by minimizing a cost function, noted E(γ, M). This function has to take into account the inertia of the partition P, while ensuring the topology preserving property. To achieve these two goals, it is necessary to generalize the inertia function of P by introducing the neighborhood notion attached to the map. In the case of individuals belonging to $R^m$, this minimization can be done in a straightforward way. Indeed, new reference vectors are calculated as:
$$p_s^{t+1} = \frac{\sum_{i=1}^u h_{sc}(t)\, x_i}{\sum_{i=1}^u h_{sc}(t)} \qquad (14)$$

where $c = \arg\min_s \|x_i - p_s\|$ is the index of the best matching unit of the data sample $x_i$, $\|\cdot\|$ is the distance measure (typically the Euclidean distance), and $t$ denotes the time; $h_{sc}(t)$ is the neighborhood function around the winner unit $c$.

Fig. 1. Semi-supervised feature selection framework

In practice, we often use $h_{sc} = e^{-\frac{\delta_{sc}}{2T^2}}$, where T represents the neighborhood radius in the map; it is decreased from an initial value $T_{max}$ to a final value $T_{min}$.
Subsequently, as explained above, SOM is applied on the unsupervised part of the data ($X_U$) to obtain $X'_U$, whose size equals the number of SOM nodes (K). CLS is then performed on the newly obtained data set ($X_L + X'_U$). Note that any other clustering method could be applied over $X_U$; here SOM is chosen for its ability to preserve the topological relationships of the data, and thus the geometric structure of their distribution. Finally, the feature selection framework is represented in Figure 1.
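A batch version of this adaptation step can be sketched as follows (our illustration, not the authors' implementation; we assume a rectangular map where the shortest-path distance δ reduces to the L1 distance between grid coordinates, and we omit the decreasing schedule of T):

import numpy as np

def som_epoch(X, P, grid, T):
    """One batch SOM update following Eq. (14).
    X: (u, m) unlabeled data; P: (K, m) reference vectors;
    grid: (K, 2) node coordinates on the map; T: neighborhood radius."""
    # best matching unit c(x_i) for every sample
    bmu = np.argmin(((X[:, None, :] - P[None, :, :]) ** 2).sum(-1), axis=1)
    # map distances delta(s, c): shortest path = L1 distance on a grid
    delta = np.abs(grid[:, None, :] - grid[None, :, :]).sum(-1)
    H = np.exp(-delta / (2 * T ** 2))       # neighborhood function h_sc
    W = H[:, bmu]                           # (K, u): h_{s, c(x_i)}
    return (W @ X) / W.sum(axis=1, keepdims=True)   # Eq. (14)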

5 Results

5.1 Data Sets and Methods

In this section, we present an empirical study on several databases downloaded from different repositories: “Iris”, “Wave”, “Ionosphere”, “Sonar” and “Soybean” from [2]; the microarray data sets “Leukemia” and “Colon cancer” from [24] and [25] respectively; and the face-image data sets “Pie10P” and “Pix10P”, which can be found at http://featureselection.asu.edu/datasets.php. The whole data set information is detailed in Table 1.
The data sets were deliberately chosen to evaluate the performance of our proposal, CLS, and to compare it with other state-of-the-art techniques. The concerned methods are listed below:

Table 1. Data sets

Data sets     N     m      #classes
Iris          150   4      3
Wave          5000  40     3
Ionosphere    351   34     2
Sonar         208   60     2
Soybean       47    35     4
Leukemia      72    7129   2
Colon cancer  62    2000   2
Pie10P        210   2420   10
Pix10P        100   10000  10

– Variance score: based on the variance of the features [26].
– Fisher score: based on the variance and all the labels [27].
– Laplacian score: based only on the geometric structure of the data [20].
– Constraint score (CS or CScore): selects features according to the little supervision information extracted from the labeled data [17].
– C4: a semi-supervised feature selection based on a simple combination of the Laplacian score and CS [19].
– ReliefF: estimates the significance of features according to how well their values distinguish between instances of the same and different classes that are near to each other [21].
– F2+r4 and F3+r (SPEC): spectral feature selection methods [22].

The experimental results are presented in three parts. First, we test our algorithm on data sets whose relevant features are known. Second, we make comparisons with known powerful feature selection methods, and finally we apply the algorithm on databases with a huge number of features. In most experiments, the λ value is set to 0.1 and k = 10 for building the neighborhood graph. For the semi-supervised setting, we chose the first labeled examples of all data sets (with different labels); we made no selection, neither on the examples to be labeled nor on the generated constraints.

5.2 Validation of Feature Selection


In this section, we are particularly interested in the first two data sets (“Iris” and “Wave”), which are popularly used in machine learning and data mining tasks.
In “Iris”, one class is linearly separable from the other two, which are not linearly separable from each other. Out of the four features, it is known that F3 (petal length) and F4 (petal width) are more important for the underlying clusters than F1 (sepal length) and F2 (sepal width) (Figure 2).
The sub-figure (c) shows the data projected on the subspace constructed by F3
and F4, whereas the sub-figure (b) shows the data projected on the subspace
of F1 and F2. In [20], it was reported that by using variance score [26], the

Fig. 2. 2D-Visualization of “Iris”

four features are sorted as (F3, F1, F4, F2). With k ≥ 15, the Laplacian score sorts these four features as (F3, F4, F1, F2); it sorts them as (F4, F3, F1, F2) when 3 ≤ k < 15. By using CLS, the features are sorted as (F3, F4, F1, F2) for any value of k (between 1 and 20). To explain the difference between the two scores, we chose for this data set l = 10, generating 45 constraints. Two CL-type constraints are constructed from the pairs (73th, 150th) and (78th, 111th) according to the labels of the points in Figure 2(a)¹ (the concerned points are drawn as circles). Since the data points within each pair are close, with the Laplacian score the edges e73,150 and e78,111 are constructed in the associated k-neighborhood graph and affect the feature selection process.
With our method, these edges never exist because of the CL constraint property
even if k is small. For that, the scores obtained by CLS are smaller than the
ones obtained by Laplacian score. We also observed an important gap on scores
between the relevant variables (CLS3 = 1.4 × 10−3 , CLS4 = 2.7 × 10−3 ) and
the irrelevant ones (CLS1 = 1.07 × 10−2, CLS2 = 1.77 × 10−2). In fact, in the region where the points belong to the two non-linearly separable classes,
Laplacian score is biased by the dissimilarity which could affect the ranking of
features for their selection, while CLS is able to control this problem with the
help of constraints.
Breiman's waveform data set “Wave” consists of 5000 instances divided into 3 classes. This data set is composed of 21 relevant features (the first ones) and 19 noise features with mean 0 and variance 1. Each class is generated from a combination of 2 of 3 “base” waves. We tested our feature selection algorithm

¹ Figure 2(a) is obtained by PCA.

Fig. 3. Results of CLS on features of “Wave” data set

with l = 8 (28 constraints) and a map of dimension 26 × 14 for the SOM algorithm. We can see in Figure 3 that the features 21 to 40 have high CLS values: the noise represented by these features is clearly detected.

5.3 Comparison of the Feature Selection Quality


To compare our feature selection approach with other ones, the nearest neighbor (1-NN) classifier with Euclidean distance is employed for classification after feature selection. For each data set, the classifier is learned on the first half of the samples of each class and tested on the remaining data. We tested the accuracy behavior of the feature ranking function represented by CLS, comparing it with those of the other methods cited in [17]. These experiments were applied on three data sets, “Ionosphere”, “Sonar” and “Soybean”, with 5 labeled instances for each one (so 10 pairwise constraints were generated).
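This evaluation protocol can be sketched as follows (our illustration, not the authors' code; the per-class half/half split and the Euclidean 1-NN rule are as described above, while taking samples in data set order is an assumption):

import numpy as np

def accuracy_after_selection(X, y, ranking, n_feats):
    """1-NN accuracy after keeping the n_feats best-ranked features."""
    Xs = X[:, ranking[:n_feats]]
    train, test = [], []
    for c in np.unique(y):                     # half/half split per class
        idx = np.where(y == c)[0]
        half = len(idx) // 2
        train.extend(idx[:half]); test.extend(idx[half:])
    train, test = np.array(train), np.array(test)
    correct = 0
    for i in test:                             # Euclidean 1-NN prediction
        d = ((Xs[train] - Xs[i]) ** 2).sum(axis=1)
        correct += y[train[np.argmin(d)]] == y[i]
    return correct / len(test)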
Figure 4 indicates that, in most cases, the performance of CLS is comparable
to Fisher Score [26] and significantly better than that of Variance, Laplacian and
Constraint scores. This verifies that merging supervision information of labeled
data with geometrical structure of unlabeled data is very useful in learning fea-
ture scores. Table 2 compares the averaged accuracy under different number of
selected features. Here the values after the symbol ± denote the standard devi-
ation. From Table 2 and Figure 4 we can find that, the performance of CLS is
almost always better than that of Variance, Laplacian score and Constraint score
and is comparable with Fisher Score. More specifically, CLS is superior to Fisher
Score on “Soybean” and “Ionosphere” and is inferior on “Sonar”. Note that Fisher
score uses all labels when CLS score uses just 5 labels for each data set.
Then, we compare the performance of CLS with that of Fisher and constraint
scores when different levels of supervision are used. Figure 5 shows the plots
for accuracy under desired number of selected features vs. different numbers of
labeled data (for Fisher Score) or pairwise constraints (for CScore and CLS)
on the three data sets (“Ionosphere”, “Sonar” and “Soybean”). Here the desired

Fig. 4. Accuracy vs different numbers of selected features

Table 2. Averaged accuracy of different algorithms on “Ionosphere”, “Sonar” and “Soybean”

Data sets    Variance    Laplacian   Fisher     CS          CLS
Ionosphere   82.2±3.8    82.6±3.6    86.3±2.5   85.1±2.9    86.73±2.1
Sonar        79.3±6.3    79.5±7.2    86.4±6.9   80.7±7.8    83.3±1.7
Soybean      88.9±12.7   79.4±28.4   94.5±12.1  93.5±11.6   95.06±1.3

Fig. 5. Accuracy vs. different numbers of labeled data (for Fisher Score) or pairwise
constraints (for CScore and CLS)

number of selected features is chosen as half of the original dimension of samples.


For all scores, the results are averaged over 100 runs. As shown in Figure 5, except
on “Sonar”, CLS is much better than the other two algorithms especially when
only a few labeled data or constraints are used. On “Sonar”, both CScore and
CLS are inferior to Fisher Score when the number of labeled data (or constraints)
is great; CLS is always better when this number is small. A closer study on
Figure 5 reveals that, generally, the accuracy of CLS increases steadily and
fast in the beginning (with few constraints) and slows down at the end (with
relatively more constraints). It implies that too many constraints won’t help
too much to further boost the accuracy, and only a few constraints are required
in CLS, which corresponds exactly to our initial problem concerning “small-labeled” data, while Fisher Score typically requires relatively more labeled data to obtain a satisfying accuracy.

5.4 Results on Gene Expression Data Sets


“Leukemia” and “Colon cancer” are gene expression databases with huge num-
ber of features. The microarray Leukemia data is constituted of a set of 72

Fig. 6. Accuracy vs. different numbers of selected features on gene expression data sets

Fig. 7. Accuracy vs. different numbers of selected features on face-image data sets

samples, corresponding to two types of Leukemia, called ALL (Acute Lymphocytic Leukemia) and AML (Acute Myelogenous Leukemia), with 47 ALL and 25 AML. The data set contains expressions for 7129 genes, while “Colon cancer” is a data set of 2000 genes measured on 62 tissues (40 tumors and 22 “normal”).
We present our results on these data sets in comparison with the Laplacian, Fisher, C4 and CS scores, in terms of accuracy vs. number of selected features. The results (Figure 6) show that CLS records a performance comparable with the other scores when the number of features is below 2500 for the “Leukemia” data set and 500 for the “Colon cancer” data set; the performance of CLS then becomes superior to that of the other scores as the number of features increases.

5.5 Results on Face-Image Data Sets


“Pie10P” and “Pix10P” are face-image data sets, each containing 10 persons.
The validation on these data sets is presented in comparison with the Laplacian and ReliefF scores on both data sets. In addition, results were compared with the (F2+r4) score on the “Pix10P” data set and with the (F3+r) score on the “Pie10P” data set. We chose to compare our results with (F3+r) and (F2+r4) because they achieved the best results among the variant scores proposed by the authors in [22]. The experimental results in Figure 7 show that, on “Pix10P”, CLS significantly outperforms the other scores whatever the exploited number of features. Meanwhile, on the “Pie10P” data set, CLS is higher than the Laplacian and (F3+r) scores and inferior to ReliefF. Nevertheless, CLS shows an excellent accuracy on the “Pix10P” data set and a very good one on “Pie10P”.

6 Conclusion
In this paper, we proposed a filter approach for semi-supervised feature selection. A new score function was developed to evaluate the relevance of features based on both the local geometric structure of unlabeled data and the constraint preserving ability on labeled data. In this way, we combined two powerful scores, one unsupervised and one supervised, into a new one which is more generic for the semi-supervised paradigm. The proposed score function was explained in the spectral graph theory framework, with a study of the complexity of the associated algorithm. To reduce this complexity, we proposed to cluster the unlabeled part of the data, preserving its geometric structure, before feature selection. Finally, experimental results on UCI, microarray and face-image data sets show that, with only a small number of constraints, the proposed algorithm significantly outperforms other filter-based feature selection methods.
There are a number of interesting potential avenues for future research. The
choice of (λ, k) is discussed in [20] and [10]; we kept the same values the authors used in their experiments in order to compare with their results. Even so, the study of the influence of (λ, k) on our score function (with the treatment of constraints) remains interesting.
Another line of future work is to study the utility of the constraints before integrating them into feature selection. In our proposal, we used the maximum number of constraints which could be generated from the labeled data. This could have ill effects on accuracy when constraints are incoherent or inconsistent. It would thus be interesting to investigate constraint selection for a more efficient semi-supervised feature selection.

References
1. Jain, A., Zongker, D.: Feature selection: Evaluation, application, and small sam-
ple performance. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 19(2), 153–158 (1997)
2. Frank, A., Asuncion, A.: Uci machine learning repository. Technical report, Uni-
versity of California (2010)
3. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal
of Machine Learning Research (3), 1157–1182 (2003)
4. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-
based filter solution. In: Proceedings of the Twentieth International Conference on
Machine Learning (2003)
5. Kohavi, R., John, G.: Wrappers for feature subset selection. Artificial Intelli-
gence 97(12), 273–324 (1997)
6. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by local linear em-
bedding. Science (290), 2323–2326 (2000)
7. Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analy-
sis 1(3), 131–156 (2000)
8. Dy, J., Brodley., C.E.: Feature selection for unsupervised learning. Journal of Ma-
chine Learning Research (5), 845–889 (2004)
9. Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised learning. The MIT Press,
Cambridge (2006)

10. Zhao, Z., Liu, H.: Semi-supervised feature selection via spectral analysis. In: Pro-
ceedings of SIAM International Conference on Data Mining (SDM), pp. 641–646
(2007)
11. Chung, F.: Spectral graph theory. AMS, Providence (1997)
12. Ren, J., Qiu, Z., Fan, W., Cheng, H., Yu, P.S.: Forward semi-supervised feature
selection. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008.
LNCS (LNAI), vol. 5012, pp. 970–976. Springer, Heidelberg (2008)
13. Basu, S., Davidson, I., Wagstaff, K.: Constrained clustering: Advances in algo-
rithms, theory and applications. Chapman and Hall/CRC Data Mining and Knowl-
edge Discovery Series (2008)
14. Xing, E., Ng, A., Jordan, M., Russel, S.: Distance metric learning, with application
to clustering with side-information. In: Advances in Neural Information Processing
Systems, vol. 15, pp. 505–512 (2003)
15. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning a mahalanobis metric
from equivalence constraints. Journal of Machine Learning Research 6, 937–965
(2005)
16. Zhang, D., Zhou, Z., Chen, S.: Semi-supervised dimensionality reduction. In: Pro-
ceedings of SIAM International Conference on Data Mining, SDM (2007)
17. Zhang, D., Chen, S., Zhou, Z.: Constraint score: A new filter method for feature
selection with pairwise constraints. Pattern Recognition 41(5), 1440–1451 (2008)
18. Sun, D., Zhan, D.: Bagging constraint score for feature selection with pairwise
constraints. Pattern Recognition 43(6), 2106–2118 (2010)
19. Kalakech, M., Biela, P., Macaire, L., Hamad, D.: Constraint scores for semi-
supervised feature selection: A comparative study. Pattern Recognition Let-
ters 32(5), 656–665 (2011)
20. He, X., Cai, D., Niyogi, P.: Laplacian score for feature selection. In: Advances in
Neural Information Processing Systems, vol. 17 (2005)
21. Robnik-Sikonja, M., Kononenko, I.: Theoretical and empirical analysis of relief and
relieff. Machine Learning 53, 23–69 (2003)
22. Zhao, Z., Liu, H.: Spectral feature selection for supervised and unsupervised learn-
ing. In: Proceedings of the Twenty Fourth International Conference on Machine
Learning (2007)
23. Kohonen, T.: Self Organizing Map. Springer, Berlin (2001)
24. Golub, T., Slonim, D., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller,
H., Loh, M., Downing, L., Caligiuri, M., Bloomfield, C., Lander, E.: Molecular
classification of cancer: Class discovery and class prediction by gene expression
monitoring. Science 15 286(5439), 531–537 (1999)
25. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., Levine, A.:
Broad patterns of gene expression revealed by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. Natl. Acad. Sci. 96(12),
6745–6750 (1999)
26. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press,
Oxford (1995)
27. Duda, R., Hart, P., Stork, D.: Pattern Classification. Wiley-Interscience, Hoboken
(2000)
COSNet: A Cost Sensitive Neural Network for
Semi-supervised Learning in Graphs

Alberto Bertoni, Marco Frasca, and Giorgio Valentini

DSI, Dipartimento di Scienze dell'Informazione,
Università degli Studi di Milano,
Via Comelico 39, 20135 Milano, Italia
{bertoni,frasca,valentini}@dsi.unimi.it

Abstract. The semi-supervised problem of learning node labels in


graphs consists, given a partial graph labeling, in inferring the unknown
labels of the unlabeled vertices. Several machine learning algorithms have
been proposed for solving this problem, including Hopfield networks and
label propagation methods; however, some issues have been only par-
tially considered, e.g. the preservation of the prior knowledge and the
unbalance between positive and negative labels. To address these items,
we propose a Hopfield-based cost sensitive neural network algorithm
(COSNet). The method factorizes the solution of the problem in two
parts: 1) the subnetwork composed by the labelled vertices is consid-
ered, and the network parameters are estimated through a supervised
algorithm; 2) the estimated parameters are extended to the subnetwork
composed of the unlabeled vertices, and the attractor reached by the dy-
namics of this subnetwork allows to predict the labeling of the unlabeled
vertices. The proposed method embeds in the neural algorithm the “a
priori” knowledge coded in the labelled part of the graph, and separates
node labels and neuron states, allowing to differentially weight positive
and negative node labels. Moreover, COSNet introduces an efficient cost-
sensitive strategy which allows to learn the near-optimal parameters of
the network in order to take into account the unbalance between pos-
itive and negative node labels. Finally, the dynamics of the network is
restricted to its unlabeled part, preserving the minimization of the over-
all objective function and significantly reducing the time complexity of
the learning algorithm. COSNet has been applied to the genome-wide
prediction of gene function in a model organism. The results, compared
with those obtained by other semi-supervised label propagation algo-
rithms and supervised machine learning methods, show the effectiveness
of the proposed approach.

1 Introduction

The growing interest of the scientific community in methods and algorithms for
learning network-structured data is motivated by emerging applications in sev-
eral domains, ranging from social to economic and biological sciences [1, 2]. In


this context a fundamental problem is represented by the supervised or semi-


supervised node classification, i.e. predicting node labels by exploiting the re-
lationships between labeled and unlabeled nodes of the network. Instances are
connected via a set of links, and a learner relies on the assumption that linked
entities tend to be assigned to the same class label. For example, in protein-
protein interaction networks genetic or physical interactions coded in the links
of the network bear witness to common biological processes or molecular func-
tion activities between linked proteins [3]; in social networks, people that are
friends often share similar characteristics or commons interests [4]; in document
classification, texts that are linked through common citations often share similar
topics [5].
Several approaches have been proposed in literature to classify networked
data. They usually represent data through an undirected graph G = (V, W ),
where nodes v ∈ V correspond to instances to be classified, and W defines
the weights of the edges according to the “strength” or the evidence of the
relationships between pairs of nodes.
The first and simplest algorithms proposed were based on “guilt-by-association”
methods, by which unlabeled nodes are set according to the majority or the
weighted majority of the labels in their neighborhoods [6, 7]. By extending this
approach, nodes can “propagate” their labels to their neighbors iteratively by re-
peating this “label propagation” process until convergence [8, 9]. In this context
Markov Random Walks can be applied to tune the amount of propagation we
allow in the graph, by setting the length of the walk across the graph [10, 11].
Other related methods are based on smoothness considerations that yields to
graph regularization [12, 13], or exploit the properties of the graph Laplacian
associated to the weight matrix of the network [14]. Algorithms based on the
evaluation of the functional flow in graphs [3, 15], on Markov [16] and Gaussian
Random Fields [17, 18] have been applied to the prediction of gene functions
in biological networks. Hopfield networks [19] share common elements with
label propagation algorithms. Indeed labels are iteratively propagated across
the neighbors of each node and a quadratic cost function related to the con-
sistency of the labeling of the nodes w.r.t. the network topology is minimized
by the network dynamics. From this standpoint Hopfield networks and most
of the proposed graph-based algorithms for the prediction of node labels can
be cast into a recently proposed common framework where a quadratic cost
objective function is minimized [20]. Nevertheless, there are some issues that
have been only partially considered in classifying networked data. Many of the
graph-based approaches do not preserve prior information coded in nodes label-
ing, and are unable to effectively predict node labels when data are unbalanced,
e.g. when negative nodes significantly outnumber positives. This issue is par-
ticularly relevant when label propagation algorithms are applied to predict the
functions of genes, since positive annotations are usually much less than negative
ones [18]. Although some cost-sensitive variants of Gaussian Random Fields have
been proposed, they are based on simple class rescaling so that their respective
weights over unlabeled examples match the prior class distribution estimated

from labeled examples [8, 18]. Finally, many approaches based on neural net-
works do not distinguish between the node labels and the values of the neuron
states [21], thus resulting in a lower predictive capability of the network.
To address these issues, we propose a cost-sensitive neural algorithm (COSNet ),
based on Hopfield networks, whose main characteristics are the following:
1. Available a priori information is embedded in the neural network and pre-
served by the network dynamics.
2. Labels and neuron states are conceptually separated. In this way a class of
Hopfield networks is introduced, having as parameters the values of neuron
states and the neuron thresholds.
3. The parameters of the network are learned from the data through an efficient
supervised algorithm, in order to take into account the unbalance between
positive and negative node labels.
4. The dynamics of the network is restricted to its unlabeled part, preserving
the minimization of the overall objective function and significantly reducing
the time complexity of the learning algorithm.
In sect. 2 the classification of nodes in networked data is formalized as a semi-
supervised learning problem. Hopfield networks and the main issues related to
this type of recurrent neural network are discussed in Sect. 3 and 4. The problem
of the restriction of network dynamics to a subset of nodes is analyzed in Sect 5,
while the proposed neural network algorithm, COSNet (COst Sensitive neural
Network), is discussed in Sect. 6. In the same section we show that COSNet
covers the main Hopfield networks learning issues, and in particular a statistical
analysis highlights that the network parameters selected by COSNet lead to
significantly lower values of the energy function w.r.t. the non cost-sensitive
version of the Hopfield network. In Section 7, to test the proposed algorithm on
a classical unbalanced semi-supervised classification problem, we applied COS-
Net to the genome-wide prediction of gene functions in a model organism, by
considering about 200 different functional classes of the FunCat taxonomy [22],
and five different types of biomolecular data. The conclusions end the paper.

2 Semi-supervised Learning in Graphs


Consider a weighted graph G = (V, W ), where V = {1, . . . , n} is the vertex set
and W = (wij ) is the symmetric weight matrix: the weight wij ∈ R denotes a
similarity index of node i with respect to node j. The vertices in V are labeled
with {+, −}, leading to the subsets P and N of positive and negative vertices, but
the labeling is known only for a subset S ⊂ V, while it is unknown for U = V \ S. Let S⁺ = S ∩ P and S⁻ = S ∩ N: we can refer to S⁺, S⁻ and W as the “prior information”.
The semi-supervised classification problem consists in finding a bipartition
(U + , U − ) of nodes in U relying on the prior information. Nodes in U + are then
considered candidates for the class P ∩ U .
A reasonable measure of the “correctness” in approximating P ∩ U by U⁺ and N ∩ U by U⁻, especially when unbalanced data are considered, is the $F_{score}$, defined as follows: by calling false positives the vertices $FP = U^+ \cap N$, false negatives $FN = U^- \cap P$ and true positives $TP = U^+ \cap P$, the $F_{score}$ is the harmonic mean of precision and recall, where $precision = \frac{|TP|}{|TP| + |FP|}$ and $recall = \frac{|TP|}{|TP| + |FN|}$. Note that $0 \le F_{score} \le 1$ and $F_{score} = 1$ iff $U^+ = P \cap U$.
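In code, the measure reads (a trivial sketch, added here for completeness):

def f_score(tp, fp, fn):
    """F-score as defined above: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)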

3 Hopfield Networks

By slightly generalizing the classical definition of discrete Hopfield networks


(DHNs) [19], a Hopfield network H with neurons V = {1, 2, . . . , n} can be de-
scribed by a triple H = < W, γ, α >, where:

- W is an n × n symmetric matrix in which $w_{ij} \in R$ is the connection strength between neurons i and j, with $w_{ii} = 0$ for each i
- $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_n) \in R^n$ is a vector of activation thresholds
- α is a real number in [0, π/2] that determines the two different values {sin α, − cos α} for neuron states.

At each discrete time t, each neuron i has a value $x_i(t) \in \{\sin\alpha, -\cos\alpha\}$, according to the following dynamics:

1. At time 0 an initial value $x_i(0) = a_i$ is given for each neuron i.
2. At time t + 1 each neuron is updated asynchronously in a random order by the following activation rule:

$$x_i(t+1) = \begin{cases} \sin\alpha & \text{if } \sum_{j=1}^{i-1} w_{ij}\, x_j(t+1) + \sum_{k=i+1}^{n} w_{ik}\, x_k(t) - \gamma_i > 0 \\ -\cos\alpha & \text{if } \sum_{j=1}^{i-1} w_{ij}\, x_j(t+1) + \sum_{k=i+1}^{n} w_{ik}\, x_k(t) - \gamma_i \le 0 \end{cases} \qquad (1)$$

The state of the network at time t is the vector x(t) = (x1 (t), x2 (t), . . . , xn (t)).
The main feature of a Hopfield network is the existence of a quadratic state
function, i.e. the energy function:

1
E(x) = − xT W x + xT γ (2)
2
This is a non increasing function w.r.t. the evolution of the network according
to the activation rules (1), i.e.

E(x(0)) ≥ E(x(1)) ≥ . . . ≥ E(x(t)) ≥ . . .

It is easy to show that every dynamics of the network converges to an equilibrium


state x̂ = (x̂1 , x̂2 , . . . , x̂n ), where, by updating each neuron i, the value x̂i doesn’t
change for any i ∈ {1, 2, . . . , n}. In this sense a DHN is a local minimizer of the
energy function, and x̂ is also called “attractor” of the dynamics.
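The asynchronous dynamics (1) and the energy function (2) can be sketched in a few lines of Python (our illustration, not the COSNet implementation; x0 is assumed to already contain values in {sin α, − cos α}):

import numpy as np

def hopfield_run(W, gamma, alpha, x0, max_iter=100):
    """Run the asynchronous dynamics of Eq. (1) until an equilibrium state."""
    x = x0.astype(float).copy()
    n = len(x)
    hi, lo = np.sin(alpha), -np.cos(alpha)
    for _ in range(max_iter):
        changed = False
        for i in np.random.permutation(n):     # random asynchronous order
            act = W[i] @ x - gamma[i]          # uses w_ii = 0 by assumption
            new = hi if act > 0 else lo
            if new != x[i]:
                x[i], changed = new, True
        if not changed:                        # attractor reached
            break
    return x

def energy(W, gamma, x):
    """Energy function of Eq. (2): E(x) = -x'Wx/2 + x'gamma."""
    return -0.5 * x @ W @ x + x @ gamma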

4 Learning Issues in Hopfield Networks


Hopfield networks have been used in many different applications, including
content-addressable memory [23, 24, 25], discrete nonlinear optimization [26],
binary classification [21]. In particular in [21] is described a binary classifier for
gene function prediction, named GAIN, that exploits DHNs as semi-supervised
learners. According to the semi-supervised set-up described in Section 2, GAIN
considers a set V of genes divided into (U , S), together with an index wij of
similarity between genes i and j, with 0 ≤ wij ≤ 1. Finally, S is divided into
the genes with positive labels S + and negative labels S − . The aim is to predict
a bipartition (U + , U − ) of genes U .
To solve the problem, a DHN with connection strengths $w_{ij}$, thresholds 0 and neuron states {1, −1} is considered; observe that, up to the multiplicative constant $\sqrt{2}$, in our setting these neuron states correspond to $\alpha = \frac{\pi}{4}$. The network is initialized with the state x = (u, s) by assigning 1 to neurons in S⁺, −1 to neurons in S⁻ and a random value to those in U (subvector u). The equilibrium state x̂ = (û, ŝ) reached by the asynchronous dynamics is used to infer the bipartition (U⁺, U⁻) of U by setting U⁺ = {i ∈ U | ûᵢ = 1} and U⁻ = {i ∈ U | ûᵢ = −1}.
This approach leads to three main drawbacks:
1. Preservation of the prior knowledge. During the network dynamics each neuron is updated, and the available prior information coded in the bipartition (S⁺, S⁻) of S may not be preserved. This happens when the reached state x̂ = (û, ŝ) is such that ŝ ≠ s.
2. Limit attractors problem. By assigning the value 1 to positive labels and −1 to negative ones, and by setting the threshold of each neuron to 0, when |S⁺| ≪ |S⁻| the network is likely to converge to a trivial state: in fact, the network dynamics in this case leads to the trivial attractor (−1, −1, ..., −1). Notably, this behaviour has been frequently observed in several real-world problems, e.g. the gene function prediction problem [27, 22].
3. Incoherence of the prior knowledge coding. Since the inference criterion is based on the minimization of the overall objective function, we expect the initial state s of the labeled neurons to be a subvector of a state (s, û) “close” to a minimum of the energy function. Unfortunately, in many cases this is not true.
To address these problems, we exploit a simple property which holds for sub-
networks of a DHN, and that we discuss in the next section.

5 Sub-network Property
Let $H = \langle W, \gamma, \alpha\rangle$ be a network with neurons V = {1, 2, ..., n}, having the following bipartitions: (U, S), a bipartition of V where, up to a permutation, U = {1, 2, ..., h} and S = {h + 1, h + 2, ..., n}; (S⁺, S⁻), a bipartition of S; (U⁺, U⁻), a bipartition of U.

According to (U, S), each network state x can be decomposed in x = (u, s),
where u and s are respectively the states of neurons in U and in S. The energy
function of H can be written by separating the contributions due to U and S:
E(u, s) = -\frac{1}{2}\left( u^T W_{uu}\, u + s^T W_{ss}\, s + u^T W_{us}\, s + s^T W_{us}^T\, u \right) + u^T \gamma^u + s^T \gamma^s \qquad (3)

where W = \begin{pmatrix} W_{uu} & W_{us} \\ W_{us}^T & W_{ss} \end{pmatrix} and γ = (γ^u , γ^s).
By setting to a given state s̃ the neurons in S, we consider the dynamics
obtained by updating only neurons in U , without changing the state of neurons
in S. Since
E(u, \tilde{s}) = -\frac{1}{2}\, u^T W_{uu}\, u + u^T (\gamma^u - W_{us}\, \tilde{s}) - \frac{1}{2}\, \tilde{s}^T W_{ss}\, \tilde{s} + \tilde{s}^T \gamma^s ,

the dynamics of the neurons in U is described by the sub-network HU|s̃ = ⟨Wuu , γ^u − Wus s̃, α⟩. It holds the following:

Fact 5.1 (Sub-network property). If s̃ is part of an energy global minimum
of H, and ũ is an energy global minimum of HU|s̃ , then (ũ, s̃) is an energy global
minimum of H.

In our setting, we associate the state x(S + , S − ) with the given bipartition
(S + , S − ) of S:

x_i(S^+, S^-) = \begin{cases} \sin\alpha & \text{if } i \in S^+ \\ -\cos\alpha & \text{if } i \in S^- \end{cases}

for each i ∈ S. By the sub-network property, if x(S + , S − ) is part of an energy
global minimum of H, we can predict the hidden part relative to the neurons in U by
minimizing the energy of HU|x(S + ,S − ) .

6 COSNet
In this section we propose COSNet (COst-Sensitive neural Network), a semi-
supervised learning algorithm whose main feature is the introduction of a super-
vised learning strategy which exploits the sub-network property to automatically
estimate the parameters α and γ of the network H = ⟨W, γ, α⟩. The main steps
of COSNet can be summarized as follows:
INPUT : symmetric connection matrix W : V × V −→ [0, 1], bipartition (U, S)
of V and bipartition (S + , S − ) of S.
OUTPUT : bipartition (U + , U − ) of U .

Step 1. Generate an initial temporary bipartition (U + , U − ) of U such that
|U + |/|U | ≃ |S + |/|S|.
Step 2. Find the optimal parameters (α̂, γ̂) of the Hopfield sub-network
HS|x(U + ,U − ) , such that the state x(S + , S − ) is “as close as possible” to an
equilibrium state.

Step 3. Extend the parameters (α̂, γ̂) to the whole network and run the sub-
network HU|x(S + ,S − ) until an equilibrium state û is reached. The final solu-
tion (U + , U − ) is:
U + = {i ∈ U | ûi = sin α̂}
U − = {i ∈ U | ûi = − cos α̂}.
Below we explain each step of the algorithm in more detail.

6.1 Generating a Temporary Solution


To build the sub-network HS|x(U + ,U − ) , we need to provide an initial bipartition
of U . The adopted procedure is the following:
- generate a random number m according to the binomial distribution B(|U |, |S + |/|S|)
- assign to U + m elements uniformly chosen in U
- assign to U − the set U \ U + .
This bipartition criterion comes from the probabilistic model described below.
Suppose that V contains some positive and negative examples, a priori un-
known, and that all bipartitions (U , S) of V are equiprobable, with |U | = h.
If S contains |S + | positive examples, while U is not observed, then by setting
P (x) = Prob{|U + | = x | S contains |S + | positives}, it is easy to see that the
following equality holds:

\frac{|S^+|}{|S|} \cdot h = \arg\max_x P(x).
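A minimal sketch of this sampling step (illustrative names, not the authors' code):

import numpy as np

rng = np.random.default_rng(0)

def temporary_bipartition(U, n_S_pos, n_S):
    # m ~ B(|U|, |S+|/|S|), then U+ is drawn uniformly from U
    U = np.asarray(U)
    m = rng.binomial(len(U), n_S_pos / n_S)
    U_pos = rng.choice(U, size=m, replace=False)
    U_neg = np.setdiff1d(U, U_pos)
    return U_pos, U_neg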

In the next section we exploit this labeling of U to estimate the parameters α
and γ of the network.

6.2 Finding the Optimal Parameters


By exploiting the temporary bipartition (U + , U − ) of U found in the previous step,
we consider the sub-network HS|x(U + ,U − ) = ⟨Wss , γ^s − Wus^T x(U + , U − ), α⟩,
where γi^s = γ ∈ IR for each i ∈ {h + 1, h + 2, . . . , n}. The aim is to find the values
of the parameters α and γ such that the state x(S + , S − ) is “as close as possible”
to an equilibrium state.
For each node k in S let us define Δ(k) ≡ (Δ+ (k), Δ− (k)), where

\Delta^+(k) = \sum_{j \in S^+ \cup\, U^+} w_{kj}, \qquad \Delta^-(k) = \sum_{j \in S^- \cup\, U^-} w_{kj}.

In this way, each element k ∈ S corresponds to a point Δ(k) in the plane. In
particular, let us consider the sets I + = {Δ(k), k ∈ S + } and I − = {Δ(k), k ∈ S − }.
It holds the following:

Fact 6.1. I + is linearly separable from I − if and only if there is a pair (α, γ)
such that x(S + , S − ) is an equilibrium state for the network HS|x(U + ,U − ) .
This fact suggests a method to optimize the parameters α and γ. Let fα,γ be a
straight line in the plane that separates the points I^+_{α,γ} = {Δ(k) | fα,γ (Δ(k)) ≥ 0}
from the points I^−_{α,γ} = {Δ(k) | fα,γ (Δ(k)) < 0}:

f_{\alpha,\gamma}(y, z) = \cos\alpha \cdot y - \sin\alpha \cdot z - \gamma = 0 \qquad (4)
Note that we assume that the positive half-plane is “above” the line fα,γ .
To optimize the parameters (α, γ) we adopt the F-score maximization crite-
rion, since it can be shown that Fscore (α, γ) = 1 iff x(S + , S − ) is an equilibrium
state of HS|x(U + ,U − ) . We obtain

(\hat{\alpha}, \hat{\gamma}) = \arg\max_{\alpha,\gamma} F_{score}(\alpha, \gamma). \qquad (5)

In order to reduce the computational complexity of this optimization, we propose
a two-step approximation algorithm that first computes the optimum line (in
terms of the Fscore criterion) among those crossing the origin of the axes, and
then computes the optimal intercept:
1. Compute α̂. The algorithm computes the slopes of the lines crossing the origin
and each point Δ(k) ∈ I + ∪ I − . Then it searches the line which maximizes
the Fscore criterion by sorting the computed lines according to their slopes in
an increasing order. Since all the points lie in the first quadrant, this assures
that the angle α̂ relative to the optimum line is in the interval [0, π/2].
2. Compute γ̂. Compute the intercepts of the lines whose slope is tan α̂ and
crossing each point belonging to I + ∪ I − . The optimum line is identified by
scanning the computed lines according to their intercept in an increasing
order. Let q̂ be the intercept of the optimum line, then we set γ̂ = −q̂ cos α̂.
Both step 1 and step 2 can be computed in O(n log n) computational time (due
to the sorting), where n is the number of points.
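The following sketch illustrates the two-step search (hypothetical helper names; for clarity it re-evaluates the F-score from scratch for every candidate line, costing O(n²) rather than the O(n log n) obtained by updating counts while scanning the sorted candidates, and it searches only the "positive half-plane above" orientation):

import numpy as np

def f_score(y_true, y_pred):
    # F-score = 2TP / (2TP + FP + FN), labels in {0, 1}
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0

def fit_line(delta, labels):
    # delta: (n, 2) points (Delta+(k), Delta-(k)); labels: 1 for S+, 0 for S-
    # prediction: positive iff cos(a)*y - sin(a)*z - g >= 0
    y, z = delta[:, 0], delta[:, 1]

    def best(cands, predict):
        return cands[int(np.argmax([f_score(labels, predict(c)) for c in cands]))]

    # Step 1: angles of lines through the origin and each point, plus midpoints
    angles = np.sort(np.arctan2(y, z))          # points lie in the first quadrant
    cands = np.concatenate([angles, (angles[:-1] + angles[1:]) / 2])
    a = best(cands, lambda a: (np.cos(a) * y - np.sin(a) * z >= 0).astype(int))

    # Step 2: offsets of the parallel lines with fixed angle through each point
    offs = np.sort(np.cos(a) * y - np.sin(a) * z)
    cands = np.concatenate([offs, (offs[:-1] + offs[1:]) / 2])
    g = best(cands, lambda g: (np.cos(a) * y - np.sin(a) * z - g >= 0).astype(int))
    return a, g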

6.3 Network Dynamics


The optimum parameters (α̂, γ̂) computed in the previous step are then extended
to the sub-network HU|x(S + ,S − ) = ⟨Wuu , γ̂^u − Wus x(S + , S − ), α̂⟩, where γ̂i^u =
γ̂ for each i ∈ {1, 2, . . . , h}. Then, by running the sub-network HU|x(S + ,S − ) , we
learn the unknown labels of neurons U , preserving the prior information coded
in the labels of neurons in S.
The initial state of the network is set to ui = 0 for each i ∈ {1, 2, . . . , h}.
When the position of the positive half-plane in the maximization problem (5) is
“above” the line, the update rule for node i at time t + 1 is

u_i(t+1) = \begin{cases} \sin\hat{\alpha} & \text{if}\;\; \sum_{j=1}^{i-1} w_{ij}\, u_j(t+1) + \sum_{k=i+1}^{h} w_{ik}\, u_k(t) - \theta_i < 0 \\[4pt] -\cos\hat{\alpha} & \text{if}\;\; \sum_{j=1}^{i-1} w_{ij}\, u_j(t+1) + \sum_{k=i+1}^{h} w_{ik}\, u_k(t) - \theta_i > 0 \end{cases} \qquad (6)

where \theta_i = \hat{\gamma} - \sum_{j \in S} w_{ij}\, x_j(S^+, S^-). When the position of the positive half-plane is
“below” the line, the inequalities in (6) need to be reversed: the first one becomes
“sin α̂ if . . . > 0”, and the second “− cos α̂ if . . . < 0”.
The stable state û reached by this dynamics is used to classify unlabeled
data. If the known state x(S + , S − ), with the parameters found according to the
procedure described in Section 6.2, is a part of a global minimum of the energy of
H, and û is an energy global minimum of HU|x(S + ,S − ) , the sub-network property
(Section 5) guarantees that (û, x(S + , S − )) is an energy global minimum of H.
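A sketch of this restricted dynamics (illustrative; W is a dense symmetric NumPy array, U and S are integer index arrays, and ties at zero net input are broken towards the negative state, a case which Equation 6 leaves unspecified):

import numpy as np

def cosnet_dynamics(W, U, S, x_S, alpha_hat, gamma_hat, max_sweeps=100):
    # Only the neurons in U are updated; the labeled states x_S stay clamped,
    # so the prior knowledge is preserved by construction.
    U, S = np.asarray(U), np.asarray(S)
    x = np.zeros(W.shape[0])                     # unlabeled neurons start at 0
    x[S] = x_S
    theta = gamma_hat - W[np.ix_(U, S)] @ x_S    # thresholds of Section 6.3
    pos, neg = np.sin(alpha_hat), -np.cos(alpha_hat)
    for _ in range(max_sweeps):
        changed = False
        for idx, i in enumerate(U):
            net = W[i, U] @ x[U] - theta[idx]
            new = pos if net < 0 else neg        # "above" orientation of Eq. (6)
            if new != x[i]:
                x[i], changed = new, True
        if not changed:                          # equilibrium state reached
            break
    return x[U]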

6.4 COSNet Covers Hopfield Networks Learning Issues


In this section we analyze the effectiveness of the proposed algorithm w.r.t. the
learning issues described in Section 4.

1. Preservation of the Prior Knowledge. The restriction of the dynamics to
the unlabeled data assures the preservation of the prior knowledge coded in the
connection matrix and in the bipartition of the labeled data. Note that a similar
approach has been proposed in [8], even if in that case the known labels are
simply restored at each iteration of the algorithm, without an actual restriction
of the dynamics.
In addition, the restriction of the dynamics to the unlabeled neurons reduces
the time complexity, since the unlabeled data are often far fewer than the labeled
ones. This is an important advantage when huge and complex graphs, e.g. bio-
logical networks, are analyzed.
2. Limit Attractors Problem. This problem may occur when training data
are characterized by a large unbalance between positive and negative examples,
e.g. when |S + | ≪ |S − |, which is frequent in many real-world problems [18].
In this case the points Δ(k) ≡ (Δ+ (k), Δ− (k)) (Section 6.2) are such that
Δ− (k) ≫ Δ+ (k). Accordingly, a separation angle π/4 ≤ α̂ ≤ π/2 is computed
by the supervised algorithm described in Section 6.2. In our setting, such an
angle determines a value of the positive states greater than the negative ones,
leading the network dynamics to converge towards non-trivial attractors.
3. Incoherence of the Prior Knowledge Coding. We would like to show
that the parameters (α, γ) automatically selected by COSNet can yield a
“more coherent” state w.r.t. the prior knowledge, in the sense that this state
corresponds to a lower energy of the underlying network.
To this end, by considering the data sets used in the experimental validation
tests (Section 7), in which a labeling x ∈ {1, −1}^{|V|} of V is known, we randomly
choose a subset U of V . After hiding the corresponding labels, by applying
COSNet we approximate the optimal parameters (α̂, γ̂). Accordingly, we define
the state x(α̂) by setting xk (α̂) = sin α̂ if xk = 1 and xk (α̂) = − cos α̂ if xk = −1,
for each k ∈ {1 . . . |V |}. We show that the state x(α̂) is “more coherent” with
the prior knowledge than x, by studying whether x(α̂) is “closer” than x to a
global minimum of the energy function E(x).

Table 1. Confidence interval estimation for the probabilities Px(α̂) and Px at a confidence
level 0.95 (data set PPI-VM)

Class           Px(α̂) min  Px(α̂) max  Px min   Px max  | Class        Px(α̂) min  Px(α̂) max  Px min   Px max
“01”            0          0.0030     0        0.0030  | “02”         0          0.0030     0        0.0030
“01.01”         0          0.0030     0        0.0030  | “02.01”      0          0.0030     0.0638   0.0975
“01.01.03”      0.0001     0.0056     0.0433   0.0722  | “02.07”      0          0.0030     0.0011   0.0102
“01.01.06”      0.0001     0.0056     0.0442   0.0733  | “02.10”      0          0.0030     0.0522   0.0833
“01.01.06.05”   0.0210     0.0427     0.0702   0.1051  | “02.11”      0.0002     0.0072     0.0939   0.1332
“01.01.09”      0          0.0030     0.0045   0.0174  | “02.13”      0.0312     0.0565     0.3622   0.4226
“01.02”         0.0001     0.0056     0.0067   0.0212  | “02.13.03”   0.7139     0.7681     0.7740   0.8236
“01.03”         0          0.0030     0.0620   0.0953  | “02.19”      0.0001     0.0056     0.0006   0.0088
“01.03.01”      0.1452     0.1915     0.2232   0.2768  | “02.45”      0.1022     0.1428     0.1815   0.2312
“01.03.01.03”   0          0.0030     0.0145   0.0333  | “11”         0          0.0030     0        0.0030
“01.03.04”      0.5020     0.5637     0.6280   0.6867  | “11.02”      0          0.0030     0        0.0030
“01.03.16”      0.0025     0.0135     0.1189   0.1619  | “11.02.01”   0          0.0030     0.7761   0.8255
“01.03.16.01”   0          0.0030     0.3025   0.3608  | “11.02.02”   0.2184     0.2716     0.8519   0.8931

As a measure of the “closeness” of a given state z to a global minimum of E(x),
we consider the probability Pz that E(x) < E(z), where x = (x1 , x2 , . . . , x|V | )
is a random state generated according to the binomial distribution B(|V |, ρz ),
where ρz is the rate of positive components in z.
To estimate Pz , we independently generate t random states x(1) , x(2) , . . . , x(t)
and we set Y = \sum_{i=1}^{t} \beta\big(E(z) - E(x^{(i)})\big), where β(x) = 1 if x ≥ 0, and 0 otherwise.
The variable Y /t is an estimator of Pz , and in our setting Y ≪ t. To determine
the confidence interval of Pz at a 1 − δ confidence level, we need to consider three
cases:
1. Y = 0. We can directly compute the confidence interval [0, 1 − δ^{1/t}].
2. 1 ≤ Y ≤ 5. Y is approximately distributed according to the Poisson
distribution with expected value λ = Y . Accordingly, the confidence interval
is \left[\tfrac{1}{2t}\,\chi^2_{2Y,\,1-\delta/2},\; \tfrac{1}{2t}\,\chi^2_{2(Y+1),\,\delta/2}\right], where \chi^2_{k,q} is the value of a chi-squared random
variable with k degrees of freedom whose right-tail probability is q.
3. Y > 5. The random variable Y is approximately distributed according
to a normal distribution with expected value Y and variance Y (1 − Y /t). We
adopt the Agresti–Coull interval estimator [28], which is more stable for
proportions close to the boundaries [29]. With \tilde{p} = \frac{Y+2}{t+4}, the resulting confidence
interval is \tilde{p} \pm z_{1-\delta/2}\sqrt{\tilde{p}(1-\tilde{p})/(t+4)}, where z_{1−α} is the 1 − α percentile of
the standard normal distribution.
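A sketch of this Monte Carlo procedure (illustrative; it assumes an energy callable x ↦ E(x), e.g. a closure over W and γ as in the earlier sketch, uses bipolar states ±1 for simplicity instead of {sin α̂, −cos α̂}, and writes the Poisson interval with SciPy's lower-tail quantile function):

import numpy as np
from scipy.stats import chi2, norm

def estimate_Pz(z, energy, t=1000, delta=0.05, seed=0):
    # Confidence interval for P_z = P(E(x) < E(z)), x ~ B(|V|, rho_z)
    rng = np.random.default_rng(seed)
    rho = np.mean(z > 0)                    # rate of positive components of z
    Ez = energy(z)
    Y = sum(energy(np.where(rng.random(len(z)) < rho, 1.0, -1.0)) <= Ez
            for _ in range(t))              # Y/t estimates P_z
    if Y == 0:
        return 0.0, 1.0 - delta ** (1.0 / t)
    if Y <= 5:                              # Poisson approximation
        return (chi2.ppf(delta / 2, 2 * Y) / (2 * t),
                chi2.ppf(1 - delta / 2, 2 * (Y + 1)) / (2 * t))
    p = (Y + 2) / (t + 4)                   # Agresti-Coull for larger Y
    half = norm.ppf(1 - delta / 2) * np.sqrt(p * (1 - p) / (t + 4))
    return max(p - half, 0.0), min(p + half, 1.0)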
By setting δ = 0.05 and t = 1000, we estimated the confidence interval for both
Px(α̂) and Px for the data sets used in the experimental phase and for all the
FunCat classes considered in Section 7. In Table 1 we report the comparison of
the confidence intervals of Px(α̂) and Px in the PPI-VM data set and for some of
the considered FunCat classes. Similar results are obtained also with the other
data sets.

We distinguish two main cases: a) both confidence intervals coincide with
the minimum interval [0, 0.0030], a case coherent with the prior information; b)
both the lower and upper bounds of Px(α̂) are less than the corresponding bounds
of Px . It is worth noting that, in almost all cases, the probability Px(α̂) has an
upper bound smaller than the lower bound of Px . This is particularly evident
for classes “01.03.16.01”, “02.13” and “11.02.01”; in the latter the lower bound
of Px is 0.7761, while the corresponding upper bound of Px(α̂) is ≃ 0.
These results, reproduced with similar trends in other data sets (data not
shown), point out the effectiveness of our method in approaching the problem
of the incoherence of the prior knowledge coding.

7 Results and Discussion


We evaluated the performance of the proposed algorithm on the gene function
prediction problem, a real-world multi-class, multi-label classification problem
characterized by hundreds of functional classes. In this context the multi-label
classification can be decomposed into a set of dichotomic classification problems
by which genes can be assigned or not to a specific functional class. Classes are
usually unbalanced, that is, positive examples are significantly fewer than nega-
tive ones, and different biomolecular data sources, able to capture different features
of genes, can be used to predict their functions.

7.1 Experimental Set-Up


We performed genome-wide predictions of gene functions with the yeast model
organism, using the whole FunCat ontology [22], a taxonomy of functional classes
structured according to a tree forest1 . To this end we used five different biomolec-
ular data sources, previously analyzed in [30]. The main characteristics of the
data can be summarized as follows:
- Pfam-1 data are represented as binary vectors: each feature registers the
presence or absence of 4,950 protein domains obtained from the Pfam (Pro-
tein families) database. This dataset contains 3529 genes.
- Pfam-2 is an enriched representation of Pfam domains by replacing the bi-
nary scoring with log E-values obtained with the HMMER software toolkit [31].
- Expr data contains gene expression measures of 4523 genes relative to two
experiments described in [32] and [33].
- PPI-BG data set contains protein-protein interaction data downloaded
from the BioGRID database [34]. Data are binary: they represent the pres-
ence or absence of protein-protein interactions for 4531 proteins.
- PPI-VM is another data set of protein-protein interactions that collects
binary protein-protein interaction data for 2338 proteins from yeast two-
hybrid assay, mass-spectrometry of purified complexes, correlated mRNA
expression and genetic interactions [35].
1 We used the funcat-2.1 scheme with the annotation data funcat-2.1_data_20070316,
available from: ftp://ftpmips.gsf.de/yeast/catalogues/funcat/funcat-2.1_data_20070316.

For PPI data we adopt the scoring function used by Chua et al. [36], which
assigns to genes i and j the similarity score

S_{ij} = \frac{2|N_i \cap N_j|}{|N_i - N_j| + 2|N_i \cap N_j| + 1} \times \frac{2|N_i \cap N_j|}{|N_j - N_i| + 2|N_i \cap N_j| + 1}

where Nk is the set of the neighbors of gene k (k included). Informally, this
score takes into account the interaction partners shared by the two
genes: when two genes share a high number of neighboring genes, the score is
close to 1, otherwise it is close to 0. When two genes share similar interactions,
it is likely that they also share similar biological functions.
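For illustration, the score can be computed directly from neighbor sets (a sketch, not the authors' code):

def chua_score(Ni, Nj):
    # Similarity of genes i and j from shared interaction partners [36];
    # Ni, Nj are neighbor sets, each including the gene itself.
    shared = 2 * len(Ni & Nj)
    return (shared / (len(Ni - Nj) + shared + 1)) * \
           (shared / (len(Nj - Ni) + shared + 1))

For example, chua_score({1, 2, 3}, {2, 3, 4}) ≈ 0.44, while two genes with identical neighborhoods of size 10 score (20/21)² ≈ 0.91, approaching 1 as the shared neighborhood grows.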
The remaining data sets associate to each gene a feature vector; in these cases,
the score for each gene pair is set to the Pearson’s correlation coefficient of the
corresponding feature vectors. For Expr data we computed the squared corre-
lation coefficient to equally consider positive and negative correlated expression
between genes.
To reduce the complexity of the network and the noise introduced by too
small edge weights, as a pre-processing step we eliminated edges below a given
threshold. In this way we removed very weak similarities between genes, but at
the same time we chose low thresholds to avoid the generation of “singletons”
with no connections with other nodes. In brief, we tuned the threshold for each
dataset so that each vertex has at least one connection: in this way we obtained
a 0.05 threshold for Expr, 0.15 for Pfam-2, 0.0027 for Pfam-1, 0.01 for PPI-VM
and 0.04 for PPI-BG.
Moreover, to avoid training sets with a too small number of positive examples,
according to the protocol followed in [30], for each dataset we selected the classes
with at least 20 positives, thus resulting in about 200 functional classes for each
considered data set.

7.2 Results
We compared COSNet with other semi-supervised label propagation algorithms
and supervised machine learning methods proposed in the literature for the gene
function prediction problem. We considered the classical GAIN algorithm [21],
based on Hopfield networks; LP-Zhu, a semi-supervised learning method based
on label propagation [8]; SVM-l and SVM-g, i.e. respectively linear and gaussian
kernel SVMs with probabilistic output [37]. SVMs had previously been shown to
be among the best algorithms for predicting gene functions in a “flat” setting (that
is without considering the hierarchical relationships between classes) [38, 39].
To estimate the generalization capabilities of the compared methods we
adopted a stratified 10-fold cross validation procedure, by ensuring that each
fold includes at least one positive example for each classification task. Consid-
ering the severe unbalance between positive and negative classes, beyond the
classical accuracy, we computed the F-score for each functional class and for
each considered data set. Indeed in this context the accuracy is only partially

Table 2. Performance comparison between GAIN, COSNet, LP-Zhu, SVM-l and SVM-g

Dataset   Measure    GAIN     COSNet   LP-Zhu   SVM-l    SVM-g
Pfam-1    Accuracy   0.9615   0.9570   0.9613   0.7528   0.7435
          F-score    0.0277   0.3892   0.0120   0.2722   0.2355
Pfam-2    Accuracy   0.9613   0.9020   0.9656   0.7048   0.7515
          F-score    0.0296   0.3233   0.2117   0.1054   0.0270
Expr      Accuracy   0.9655   0.4617   0.9655   0.7496   0.7704
          F-score    0        0.0957   0.0008   0.0531   0.0192
PPI-BG    Accuracy   0.9666   0.9455   0.9704   0.7679   0.7597
          F-score    0.0362   0.3486   0.1758   0.1546   0.1178
PPI-VM    Accuracy   0.9554   0.9363   0.9560   0.7237   0.7222
          F-score    0.1009   0.3844   0.2106   0.1888   0.2351

Fig. 1. Average precision, recall and F-score for each compared method (excluding
GAIN). Left: Pfam-2; Right: PPI-VM.

informative, since a classifier always predicting “negative” could obtain a very
high accuracy. Table 2 shows the average F-score and accuracy across all the
classes and for each data set.
The results show that COSNet achieves the best performances (in terms of
the F-score) w.r.t. all the other methods. The LP-Zhu method is the second best
method in Pfam-2 and PPI-BG data sets, but obtains very low performances
with Pfam-1 and Expr data. These overall results are confirmed by the Wilcoxon
signed-ranks test [40]: we register a significant improvement in favour of
COSNet with respect to all the other methods and for each considered data set
at the α = 10−15 significance level.
In order to understand the reasons why our method works better, we also
compared the overall precision and recall of the methods separately for
each data set; we did not consider GAIN, since this method achieved the worst
results in almost all the data sets. For lack of space, in Figure 1 we show only the
results relative to Pfam-2 and PPI-VM data sets. We can observe that, while
COSNet does not achieve the best precision or recall, it obtains the best F-score

as a result of a good balancing between them. These results are replicated with
the other data sets, even if with Pfam-1 and Expr data COSNet achieves also
the best average precision and recall (data not shown).
We think that these results derive from the COSNet cost-sensitive approach,
which allows us to automatically find the “near-optimal” parameters of the network
with respect to the distribution of positive and negative nodes (Section 6). It is
worth noting that, using only single sources of data, COSNet can obtain a rela-
tively high precision without suffering an excessive decay in recall. This is of
paramount importance in the gene function prediction problem, where “in silico”
positive predictions of unknown genes need to be confirmed by expensive “wet”
biological experimental validation procedures. From this standpoint the experi-
mental results show that our proposed method could be applied to predict the
“unknown” functions of genes, considering also that data fusion techniques could
in principle further improve the reliability and the precision of the results [2, 41].

8 Conclusions
We introduced an effective neural algorithm, COSNet, which exploits Hopfield
networks for semi-supervised learning in graphs. COSNet adopts a cost sensitive
methodology to manage the unbalance between positive and negative labels, and
to preserve and coherently encode the prior information. We applied COSNet
to the genome-wide prediction of gene function in yeast, showing a large im-
provement in prediction performance w.r.t. the compared state-of-the-art
methods.
By noting that the parameter γ of the neural network may assume different
values for each node, our method could be extended by allowing a different ac-
tivation threshold for each neuron. To avoid overfitting due to the increment of
network parameters, this approach should be paired with proper regularization
techniques. Moreover, by exploiting the supervised learning of network param-
eters, COSNet could be also adapted to combine multiple sources of networked
data: indeed the accuracy of the linear classifier on the labeled portion of the net-
work could be used to “weight” the associated source of data, in order to obtain
a “consensus” network, whose edges are the result of a weighted combination of
multiple types of data.
Acknowledgments. The authors gratefully acknowledge partial support by the
PASCAL2 Network of Excellence under EC grant no. 216886. This publication
only reflects the authors’ views.

References
[1] Zheleva, E., Getoor, L., Sarawagi, S.: Higher-order graphical models for classifi-
cation in social and affiliation networks. In: NIPS 2010 Workshop on Networks
Across Disciplines: Theory and Applications, Whistler BC, Canada (2010)
[2] Mostafavi, S., Morris, Q.: Fast integration of heterogeneous data sources for pre-
dicting gene function with limited annotation. Bioinformatics 26(14), 1759–1765
(2010)

[3] Vazquez, A., et al.: Global protein function prediction from protein-protein inter-
action networks. Nature Biotechnology 21, 697–700 (2003)
[4] Leskovec, J., et al.: Statistical properties of community structure in large social
and information networks. In: Proc. 17th Int. Conf. on WWW, pp. 695–704. ACM,
New York (2008)
[5] Bilgic, M., Mihalkova, L., Getoor, L.: Active learning for networked data. In: Proc.
of the 27th ICML, Haifa, Israel (2010)
[6] Marcotte, E., et al.: A combined algorithm for genome-wide prediction of protein
function. Nature 402, 83–86 (1999)
[7] Oliver, S.: Guilt-by-association goes global. Nature 403, 601–603 (2000)
[8] Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning with gaussian
fields and harmonic functions. In: Proc. of the 20th ICML, Washintgton DC,
USA (2003)
[9] Zhou, D., et al.: Learning with local and global consistency. In: Adv. Neural Inf.
Process. Syst., vol. 16, pp. 321–328 (2004)
[10] Szummer, M., Jaakkola, T.: Partially labeled classification with markov random
walks. In: NIPS 2001, Whistler BC, Canada, vol. 14 (2001)
[11] Azran, A.: The rendezvous algorithm: Multi-class semi-supervised learning with
Markov random walks. In: Proc. of the 24th ICML (2007)
[12] Belkin, M., Matveeva, I., Niyogi, P.: Regularization and semi-supervised learning
on large graphs. In: Shawe-Taylor, J., Singer, Y. (eds.) COLT 2004. LNCS (LNAI),
vol. 3120, pp. 624–638. Springer, Heidelberg (2004)
[13] Delalleau, O., Bengio, Y., Le Roux, N.: Efficient non-parametric function induc-
tion in semi-supervised learning. In: Proc. of the Tenth Int. Workshop on Artificial
Intelligence and Statistics (2005)
[14] Belkin, M., Niyogi, P.: Using manifold structure for partially labeled classification.
In: Adv. Neural Inf. Process. Syst., vol. 15 (2003)
[15] Nabieva, E., et al.: Whole-proteome prediction of protein function via graph-
theoretic analysis of interaction maps. Bioinformatics 21(S1), 302–310 (2005)
[16] Deng, M., Chen, T., Sun, F.: An integrated probabilistic model for functional
prediction of proteins. J. Comput. Biol. 11, 463–475 (2004)
[17] Tsuda, K., Shin, H., Scholkopf, B.: Fast protein classification with multiple net-
works. Bioinformatics 21(suppl 2), ii59–ii65 (2005)
[18] Mostafavi, S., et al.: GeneMANIA: a real-time multiple association network inte-
gration algorithm for predicting gene function. Genome Biology 9(S4) (2008)
[19] Hopfield, J.: Neural networks and physical systems with emergent collective com-
putational abilities. Proc. Natl Acad. Sci. USA 79, 2554–2558 (1982)
[20] Bengio, Y., Delalleau, O., Le Roux, N.: Label Propagation and Quadratic Crite-
rion. In: Chapelle, O., Scholkopf, B., Zien, A. (eds.) Semi-Supervised Learning,
pp. 193–216. MIT Press, Cambridge (2006)
[21] Karaoz, U., et al.: Whole-genome annotation by using evidence integration in
functional-linkage networks. Proc. Natl Acad. Sci. USA 101, 2888–2893 (2004)
[22] Ruepp, A., et al.: The FunCat, a functional annotation scheme for systematic
classification of proteins from whole genomes. Nucleic Acids Research 32(18),
5539–5545 (2004)
[23] Wang, D.: Temporal pattern processing. In: The Handbook of Brain Theory and
Neural Networks, pp. 1163–1167 (2003)
[24] Liu, H., Hu, Y.: An application of hopfield neural network in target selection of
mergers and acquisitions. In: International Conference on Business Intelligence
and Financial Engineering, pp. 34–37 (2009)

[25] Zhang, F., Zhang, H.: Applications of a neural network to watermarking capacity
of digital image. Neurocomputing 67, 345–349 (2005)
[26] Tsirukis, A.G., Reklaitis, G.V., Tenorio, M.F.: Nonlinear optimization using gen-
eralized hopfield networks. Neural Comput. 1, 511–521 (1989)
[27] Ashburner, M., et al.: Gene ontology: tool for the unification of biology. the gene
ontology consortium. Nature Genetics 25(1), 25–29 (2000)
[28] Agresti, A., Coull, B.A.: Approximate is better than exact for interval estimation
of binomial proportions. Statistical Science 52(2), 119–126 (1998)
[29] Brown, L.D., Cai, T.T., Dasgupta, A.: Interval estimation for a binomial propor-
tion. Statistical Science 16, 101–133 (2001)
[30] Cesa-Bianchi, N., Valentini, G.: Hierarchical cost-sensitive algorithms for genome-
wide gene function prediction. Journal of Machine Learning Research, W&C Pro-
ceedings, Machine Learning in Systems Biology 8, 14–29 (2010)
[31] Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14(9), 755–763 (1998)
[32] Spellman, P.T., et al.: Comprehensive identification of cell cycle-regulated genes of
the yeast saccharomyces cerevisiae by microarray hybridization. Molecular Biology
of the Cell 9(12), 3273–3297 (1998)
[33] Gasch, P., et al.: Genomic expression programs in the response of yeast cells to
environmental changes. Mol. Biol. Cell 11(12), 4241–4257 (2000)
[34] Stark, C., et al.: Biogrid: a general repository for interaction datasets. Nucleic
Acids Research 34(Database issue), 535–539 (2006)
[35] von Mering, C., et al.: Comparative assessment of large-scale data sets of protein-
protein interactions. Nature 417(6887), 399–403 (2002)
[36] Chua, H., Sung, W., Wong, L.: An efficient strategy for extensive integration
of diverse biological data for protein function prediction. Bioinformatics 23(24),
3364–3373 (2007)
[37] Lin, H.T., Lin, C.J., Weng, R.: A note on platt’s probabilistic outputs for support
vector machines. Machine Learning 68(3), 267–276 (2007)
[38] Brown, M.P.S., et al.: Knowledge-based analysis of microarray gene expression
data by using support vector machines. Proceedings of the National Academy of
Sciences of the United States of America 97(1), 267–276 (2000)
[39] Pavlidis, P., et al.: Learning gene functional classifications from multiple data
types. Journal of Computational Biology 9, 401–411 (2002)
[40] Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin
1(6), 80–83 (1945)
[41] Re, M., Valentini, G.: Simple ensemble methods are competitive with state-of-
the-art data integration methods for gene function prediction. Journal of Machine
Learning Research, W&C Proceedings, Machine Learning in Systems Biology 8,
98–111 (2010)
Regularized Sparse Kernel
Slow Feature Analysis

Wendelin Böhmer1, Steffen Grünewälder2, Hannes Nickisch3, and Klaus Obermayer1

1 Neural Processing Group, Technische Universität Berlin, Germany
{wendelin,oby}@cs.tu-berlin.de
2 Centre for Computational Statistics and Machine Learning,
University College London, United Kingdom
steffen@cs.ucl.ac.uk
3 Philips Research Laboratories, Hamburg, Germany
hannes.nickisch@philips.com

Abstract. This paper develops a kernelized slow feature analysis (SFA)
algorithm. SFA is an unsupervised learning method to extract features
which encode latent variables from time series. Generative relationships
are usually complex, and current algorithms are either not powerful
enough or tend to over-fit. We make use of the kernel trick in combi-
nation with sparsification to provide a powerful function class for large
data sets. Sparsity is achieved by a novel matching pursuit approach
that can be applied to other tasks as well. For small but complex data
sets, however, the kernel SFA approach leads to over-fitting and numeri-
cal instabilities. To enforce a stable solution, we introduce regularization
to the SFA objective. Versatility and performance of our method are
demonstrated on audio and video data sets.

1 Introduction

Slow feature analysis (SFA [23]) is an unsupervised method to extract features
which encode latent variables of time series. SFA aims for temporally coherent
features out of high dimensional and/or delayed sensor measurements. Given
enough training samples, the learned features will become sensitive to slowly
changing latent variables [9, 22]. Although there have been numerous studies
highlighting its resemblance to biological sensor processing [3, 9, 23], the method
has not yet found its way in the engineering community that focuses on the same
problems. One of the reasons is undoubtedly the lack of an easily operated non-
linear extension.
This paper provides such an extension in the form of a kernel SFA algorithm.
Such an approach has previously been made by Bray and Martinez [4] and is
reported to work well with a large image data set. Small and complex sets,
however, lead to numerical instabilities in any kernel SFA algorithm. Our goal
is to provide an algorithm that can be applied to both of the above cases.


Although formulated as a linear algorithm, SFA was originally intended to
be applied on the space of polynomials (e.g. quadratic [23] or cubic [3]). The
polynomial expansion of potentially high dimensional data, however, spans an
impractically large space of coefficients. Hierarchical application of quadratic
SFA has been proposed to solve this problem [23]. Although proven to work in
complex tasks [9], this approach involves a multitude of hyper-parameters and no
easy way to counteract inevitable over-fitting. It appears biologically plausible
but is definitely not easy to operate.
A powerful alternative to polynomial expansions are kernel methods. Here
the considered feature maps y : X → IR are elements of a reproducing kernel
Hilbert space H. The representer theorem [21] ensures the optimal solution for
a given training set exists within the span of kernel functions, parametrized by
training samples. Depending on the kernel, the considered Hilbert space can be
equivalent to the space of continuous functions [17] and a mapping y ∈ H is thus
very powerful.
There are, however, fundamental drawbacks in a kernel approach to SFA.
First, choosing feature mappings from a powerful Hilbert space is naturally prone
to over-fitting. More to the point, kernel SFA shows numerical instabilities due
to its unit variance constraint (see Sections 2 and 4). This tendency has been
analytically shown for the related kernel canonical correlation analysis [10]. We
introduce a regularization term to the SFA objective to enforce a stable solution.
Secondly, kernel SFA is based on a kernel matrix of size O(n2 ), where n is the
number of training samples. This is not feasible for large training sets. Our
approach approximates the optimal solution by projecting into a sparse subset
of the data. The choice of this subset is a crucial decision.
The question how many samples should be selected can only be answered em-
pirically. We compare two state-of-the-art sparse subset selection algorithms that
approach this problem very differently: (1) A fast online algorithm [5] that must
recompute the whole solution to change the subset’s size. (2) A costly matching
pursuit approach to sparse kernel PCA [18] that incrementally augments the se-
lected subset. To obtain a method that is both fast and incremental we derive a
novel matching pursuit approach to the first algorithm.
Bray and Martinez [4] have previously introduced a kernel SFA algorithm
that incorporates a simplistic sparsity scheme. Instead of the well-established
framework of Wiskott and Sejnowski [23], they utilize the cost function of Stone
[19] based on long and short term variances without explicit constraints. Due to
a high level of sparsity, their approach does not require function regularization.
We will show that the same holds for our algorithm if the sparse subset is only
a small fraction of the training data. However, for larger fractions additional
regularization becomes inevitable.
In the following section, we first introduce the general SFA optimization prob-
lem and derive a regularized sparse kernel SFA algorithm. In Section 3 the sparse
subset selection is introduced and a novel matching pursuit algorithm derived.
Section 4 evaluates the algorithms on multiple real-world data sets, followed by
a discussion of the results in Section 5.

2 Slow Feature Analysis


Let {x_t}_{t=1}^n ⊂ X be a sequence of n observations. The goal of slow feature
analysis (SFA) is to find a set of mappings yi : X → IR, i ∈ {1, . . . , p}, such that
the values yi (xt ) change slowly over time [23]. A mapping yi ’s change over time
is measured by the discrete temporal derivative ẏi (xt ) := yi (xt ) − yi (xt−1 ). The
SFA objective (called slowness, Equation 1) is to minimize the squared mean of
this derivative, where Et [·] is the sample mean1 over all available indices t:

\min\; s(y_i) := E_t\big[\dot{y}_i^2(x_t)\big] \qquad \text{(Slowness)} \quad (1)

To avoid trivial solutions as well as to deal with mixed sensor observations in


different scales, the mappings are forced to change uniformly, i.e. to exhibit unit
variance (Equations 2 and 3). Decorrelation ensures every mapping to extract
unique information (Equation 4). The last degree of freedom is eliminated by
demanding order (Equation 5), leading to the following constraints:

Et [yi (xt )] = 0 (Zero Mean) (2)


Et [yi2 (xt )] =1 (Unit Variance) (3)
Et [yi (xt )yj (xt )] = 0, ∀j 
= i (Decorrelation) (4)
∀j > i : s(yi ) ≤ s(yj ) (Order) (5)

The principle of slowness, although not the above definition, has been used
very early in the context of neural networks [2, 8]. Recent variations of SFA
differ either in the objective [4] or the constraints [7, 24]. For some simplified
cases, given an infinite time series and unrestricted function class, it can be
analytically shown that SFA solutions converge to trigonometric polynomials
w.r.t. the underlying latent variables [9, 22]. In reality those conditions are never
met and one requires a function class that can be adjusted to the data set at
hand.
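For reference, in the linear case y_i(x) = w_i^T (x − E_t[x_t]) the problem is solved by two eigen-decompositions; a minimal NumPy sketch under the assumption of a full-rank covariance (illustrative, not the original implementation of [23]):

import numpy as np

def linear_sfa(X, p):
    # Returns W and the mean, with y(x) = W^T (x - mean) the p slowest
    # features satisfying constraints (2)-(5) on the training data X (n x d).
    mean = X.mean(axis=0)
    Xc = X - mean                              # zero mean (Eq. 2)
    n = len(Xc)
    evals, U = np.linalg.eigh(Xc.T @ Xc / n)   # covariance eigensystem
    S = U / np.sqrt(evals)                     # sphering: unit variance and
    Z = Xc @ S                                 # decorrelation (Eqs. 3, 4)
    Zdot = np.diff(Z, axis=0)                  # discrete temporal derivative
    slowness, R = np.linalg.eigh(Zdot.T @ Zdot / (n - 1))
    return S @ R[:, :p], mean                  # slowest first (Eq. 5)

The eigenvalues slowness[:p] are exactly the slownesses s(y_i) of Equation 1 on the training data.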

2.1 Kernel SFA


Let the considered mappings yi : X → IR be elements of a reproducing kernel
Hilbert space (RKHS) H, with corresponding positive semi-definite kernel κ :
X × X → IR. The reproducing property of those kernels allows ∀y ∈ H : y(x) =
⟨y, κ(·, x)⟩_H , in particular ⟨κ(·, x), κ(·, x′)⟩_H = κ(x, x′). The representer theorem
ensures the solution to be in the span of support functions, parametrized by the
training data [21], i.e. y = \sum_{t=1}^{n} a_t\, κ(·, x_t). Together those two relationships form
the basis for the kernel trick (for an introduction see e.g. Shawe-Taylor and
Cristianini [17]).
The zero mean constraint can be achieved by centring all involved support func-
tions {κ(·, xt )}nt=1 in H (for details see Section 2.2). Afterwards, the combined
kernel SFA (K-SFA) optimization problem for all p mappings y(·) ∈ Hp is
1 The samples are not i.i.d. and must be drawn by an ergodic Markov chain in order
for the empirical mean to converge in the limit [15].


\min_{y \in \mathcal{H}^p} \; \sum_{i=1}^{p} E_t\big[\dot{y}_i^2(x_t)\big], \quad \text{s.t.} \;\; E_t\big[y(x_t)\, y(x_t)^T\big] = I. \qquad (6)

Through application of the kernel trick, the problem can be reformulated as

\min_{A \in \mathrm{IR}^{n \times p}} \; \frac{1}{n-1} \operatorname{tr}\big(A^T K D D^T K^T A\big) \quad \text{s.t.} \;\; \frac{1}{n} A^T K K^T A = I, \qquad (7)

where Kij = κ(xi , xj ) is the kernel matrix and D ∈ IR^{n×(n−1)} is the temporal
derivation matrix with all zero entries except ∀t ∈ {1, . . . , n − 1} : Dt,t = −1
and Dt+1,t = 1.

Sparse Kernel SFA. If one assumes the feature mappings within the span of
another set of data {z_i}_{i=1}^m ⊂ X (e.g. a sparse subset of the training data, of-
ten called support vectors), the sparse kernel matrix K ∈ IR^{m×n} is defined as
Kij = κ(zi , xj ) instead. The resulting algorithm will be called sparse kernel
SFA. Note that the representer theorem no longer applies and therefore the so-
lution merely approximates the optimal mappings in H. Both optimization prob-
lems have identical solutions if ∀t ∈ {1, . . . , n} : κ(·, xt ) ∈ span({κ(·, zi )}_{i=1}^m ),
e.g. {z_i}_{i=1}^n = {x_t}_{t=1}^n .

Regularized Sparse Kernel SFA. The Hilbert spaces corresponding to some of the
most popular kernels are equivalent to the infinite dimensional space of continu-
ous functions [17]. One example is the Gaussian kernel κ(x, x′) = \exp(-\frac{1}{2\sigma^2}\|x - x'\|_2^2).
Depending on the hyper-parameter σ and the data distribution, this can obviously
lead to over-fitting. Less obvious, however, is the tendency of kernel SFA to be-
come numerically unstable for large σ, i.e. to violate the unit variance constraint.
Fukumizu et al. [10] have shown this analytically for the related kernel canoni-
cal correlation analysis. Note that both problems do not affect sufficiently sparse
solutions, as sparsity reduces the function complexity and sparse kernel matrices
KK^T are more robust w.r.t. eigenvalue decompositions.
One countermeasure is to introduce a regularization term to stabilize the
sparse kernel SFA algorithm, which thereafter will be called regularized sparse
kernel SFA (RSK-SFA). Our approach penalizes the squared Hilbert norm
\|y_i\|_{\mathcal{H}}^2 of the selected functions by a regularization parameter λ. Analogous to K-SFA,
the kernel trick can be utilized to obtain the new objective:

\min_{A \in \mathrm{IR}^{m \times p}} \; \frac{1}{n-1} \operatorname{tr}\big(A^T K D D^T K^T A\big) + \lambda \operatorname{tr}\big(A^T \bar{K} A\big) \quad \text{s.t.} \;\; \frac{1}{n} A^T K K^T A = I, \qquad (8)

where \bar{K}_{ij} = κ(z_i , z_j) is the kernel matrix of the support vectors.

2.2 The RSK-SFA Algorithm


The RSK-SFA algorithm (Algorithm 1) is closely related to the linear SFA al-
gorithm of Wiskott and Sejnowski [23]. It consists of three phases: (1) fulfilling
zero mean by centring, (2) fulfilling unit variance and decorrelation by sphering
and (3) minimizing the objective by rotation.

Zero Mean. To fulfil the zero mean constraint, one centres the support functions
{g_i}_{i=1}^m ⊂ H w.r.t. the data distribution, i.e. g_i(·) := κ(·, z_i) − E_t[κ(x_t , z_i)] 1_H(·),
where 1_H denotes the constant function in Hilbert space H, i.e. ∀x ∈ X : 1_H(x) =
⟨1_H , κ(·, x)⟩_H := 1. Although ∀y ∈ span({g_i}_{i=1}^m) : E_t[y(x_t)] = 0 already holds, it is of
advantage to centre the support functions as well w.r.t. each other [16], i.e. ĝ_i :=
g_i − E_j[g_j]. The resulting transformation of support functions on the training
data can be applied directly onto the kernel matrices K and K̄:

\hat{K} := (I - \tfrac{1}{m} 1_m 1_m^T)\; K\; (I - \tfrac{1}{n} 1_n 1_n^T) \qquad (9)
\hat{\bar{K}} := (I - \tfrac{1}{m} 1_m 1_m^T)\; \bar{K}\; (I - \tfrac{1}{m} 1_m 1_m^T), \qquad (10)

where 1_m and 1_n are one-vectors of dimensionality m and n, respectively.
where 1m and 1n are one-vectors of dimensionality m and n, respectively.

Unit Variance and Decorrelation. Analogous to linear SFA, we first project into
the normalized eigenspace of \tfrac{1}{n}\hat{K}\hat{K}^T =: U\Lambda U^T. The procedure is called spher-
ing or whitening and fulfils the constraint in Equation 8, invariant to further
rotations R ∈ IR^{m×p} with R^T R = I:

A := U\Lambda^{-\frac{1}{2}} R \;\; \Rightarrow \;\; \tfrac{1}{n}\, A^T \hat{K}\hat{K}^T A = R^T R = I \qquad (11)

Note that an inversion of the diagonal matrix Λ requires the removal of zero
eigenvalues and corresponding eigenvectors first.

Minimization of the Objective. Application of Equation 11 allows us to solve
Equation 8 with a second eigenvalue decomposition:

\min_R \; \operatorname{tr}\big(R^T \Lambda^{-\frac{1}{2}} U^T B\, U \Lambda^{-\frac{1}{2}} R\big) \quad \text{s.t.} \;\; R^T R = I, \quad \text{with} \;\; B := \tfrac{1}{n-1}\hat{K} D D^T \hat{K}^T + \lambda \hat{\bar{K}}. \qquad (12)

Note that R is composed of the eigenvectors belonging to the p smallest eigenvalues
of the above expression.

Solution. After the above calculations the i’th RSK-SFA solution is

y_i(x) = \sum_{j=1}^{m} A_{ji}\, \langle \hat{g}_j, \kappa(\cdot, x) \rangle_{\mathcal{H}} \qquad (13)

Grouping the kernel functions of all support vectors together in a column vector,
i.e. k(x) = [κ(z_1 , x), . . . , κ(z_m , x)]^T, the combined solution y(·) ∈ H^p can be
expressed more compactly:

y(x) = \hat{A}^T k(x) - \hat{c} \qquad (14)

with \hat{A} := (I - \tfrac{1}{m} 1_m 1_m^T)\, A and \hat{c} := \tfrac{1}{n} \hat{A}^T K 1_n.

The computational complexity is O(m^2 n). For illustrative purposes, Algorithm
1 exhibits a memory complexity of O(mn). An online calculation of \hat{K}\hat{K}^T and
\hat{K} D D^T \hat{K}^T reduces this to O(m^2).

Algorithm 1. Regularized Sparse Kernel Slow Feature Analysis (RSK-SFA)

Input: K ∈ IR^{m×n} , K̄ ∈ IR^{m×m} , p ∈ IN, λ ∈ IR+ , D ∈ IR^{n×(n−1)}
  K̂ = (I − (1/m) 1_m 1_m^T) K (I − (1/n) 1_n 1_n^T)           (Eq. 9)
  K̄̂ = (I − (1/m) 1_m 1_m^T) K̄ (I − (1/m) 1_m 1_m^T)          (Eq. 10)
  U Λ U^T = eig( (1/n) K̂ K̂^T )
  (U_r , Λ_r) = remove_zero_eigenvalues(U, Λ)
  B = (1/(n−1)) K̂ D D^T K̂^T + λ K̄̂                            (Eq. 12)
  R Σ R^T = eig( Λ_r^{−1/2} U_r^T B U_r Λ_r^{−1/2} )           (Eq. 12)
  (R_p , Σ_p) = keep_lowest_p_eigenvalues(R, Σ, p)
  Â = (I − (1/m) 1_m 1_m^T) U_r Λ_r^{−1/2} R_p                 (Eq. 11 + 14)
  ĉ = (1/n) Â^T K 1_n                                          (Eq. 14)
Output: Â, ĉ
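A direct NumPy transcription of the algorithm box (a sketch under the stated notation; the zero-eigenvalue cut-off 1e-10 is an illustrative choice, and the dense centring has the O(mn) memory footprint of the box, not of the O(m²) online variant):

import numpy as np

def rsk_sfa(K, Kbar, p, lam):
    # K: (m, n) with K_ij = kappa(z_i, x_j); Kbar: (m, m) support-vector kernel
    m, n = K.shape
    Hm = np.eye(m) - 1.0 / m                   # I - (1/m) 1 1^T
    Hn = np.eye(n) - 1.0 / n
    Kc = Hm @ K @ Hn                           # Eq. 9
    Kbc = Hm @ Kbar @ Hm                       # Eq. 10
    ev, U = np.linalg.eigh(Kc @ Kc.T / n)
    keep = ev > 1e-10 * ev.max()               # remove zero eigenvalues
    W = U[:, keep] / np.sqrt(ev[keep])         # U_r Lambda_r^(-1/2)
    KD = np.diff(Kc, axis=1)                   # equals Kc @ D
    B = KD @ KD.T / (n - 1) + lam * Kbc        # Eq. 12
    sig, R = np.linalg.eigh(W.T @ B @ W)       # ascending: slowest first
    A_hat = Hm @ (W @ R[:, :p])                # Eq. 11 + 14
    c_hat = A_hat.T @ K.mean(axis=1)           # Eq. 14
    return A_hat, c_hat

Features of a new sample x are then y(x) = A_hat.T @ k(x) - c_hat with k(x) = [κ(z_1 , x), . . . , κ(z_m , x)]^T, as in Equation 14.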

3 Sparse Subset Selection

The representer theorem guarantees that the optimal feature maps y_i^* ∈ H for training
set {x_t}_{t=1}^n can be found within span({κ(·, x_t)}_{t=1}^n). For sparse K-SFA, however,
no such guarantee exists. The quality of such a sparse approximation depends
exclusively on the set of support vectors {z_i}_{i=1}^m.
Without restriction on y^*, it is straightforward to select a subset of the
training data, indicated by an index vector2 i ∈ IN^m with {z_j}_{j=1}^m := {x_{i_j}}_{j=1}^m,
that minimizes the approximation error

\epsilon_t^{i} := \min_{\alpha \in \mathrm{IR}^m} \Big\| \kappa(\cdot, x_t) - \sum_{j=1}^{m} \alpha_j\, \kappa(\cdot, x_{i_j}) \Big\|_{\mathcal{H}}^2 = K_{tt} - K_{ti}\, (K_{ii})^{-1} K_{it}
for all training samples xt , where Ktj = κ(xt , xj ) is the full kernel matrix.
Finding an optimal subset is a NP hard combinatorial problem, but there exist
several greedy approximations to it.

Online Maximization of the Affine Hull. A widely used algorithm [5], which we
will call online maximization of the affine hull (online MAH) in the absence of
a generally accepted name, iterates through the data in an online fashion. At
time t, sample x_t is added to the selected subset if its approximation error ε_t^i
exceeds some given threshold η. Exploitation of the matrix inversion lemma (MIL)
allows an online algorithm with computational complexity O(m^2 n) and memory
complexity O(m^2). The downside of this approach is the unpredictable dependence
of the final subset size m on the hyper-parameter η. Downsizing the subset therefore
requires a complete re-computation with a larger η. The resulting subset size is
not predictable, although monotonically dependent on η.
2 Let “:” denote the index vector of all available indices.

Algorithm 2. Matching Pursuit Maximization of the Affine Hull (MP MAH)

Input: {x_t}_{t=1}^n ⊂ X , κ : X × X → IR, m ∈ IN
  K = ∅; K_1^{−1} = ∅
  ∀t ∈ {1, . . . , n} : ε_t^1 = κ(x_t , x_t)                          (Eq. 16)
  i_1 = argmax_t {ε_t^1}_{t=1}^n                                      (Eq. 15)
  for j ∈ {1, . . . , m − 1} do
    α^j = [K_{(j,:)} K_j^{−1} , −1]^T                                 (MIL)
    K_{j+1}^{−1} = [K_j^{−1}, 0; 0^T, 0] + (1/ε_{i_j}^j) α^j (α^j)^T  (MIL)
    for t ∈ {1, . . . , n} do
      K_{(t,j)} = κ(x_t , x_{i_j})
      ε_t^{j+1} = ε_t^j − (1/ε_{i_j}^j) (K_{(t,:)} α^j)^2             (Eq. 16)
    end for
    i_{j+1} = argmax_t {ε_t^{j+1}}_{t=1}^n                            (Eq. 15)
  end for
Output: {i_1 , . . . , i_m}
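An equivalent NumPy sketch (illustrative; it maintains the residual projections directly, i.e. a pivoted incomplete Cholesky factorization of the full kernel matrix, rather than the inverse update via the matrix inversion lemma, but it computes the same errors ε and selects the same samples at the stated O(m²n) time and O(mn) memory cost):

import numpy as np

def mp_mah(X, kernel, m):
    # Greedily pick the worst approximated sample (Eq. 15) and update all
    # approximation errors incrementally (Eq. 16).
    n = len(X)
    err = np.array([kernel(x, x) for x in X])   # errors for the empty subset
    G = np.zeros((n, m))                        # residual projections
    selected = []
    for j in range(m):
        i = int(np.argmax(err))                 # worst approximated sample
        selected.append(i)
        k_i = np.array([kernel(x, X[i]) for x in X])
        g = (k_i - G[:, :j] @ G[i, :j]) / np.sqrt(err[i])
        G[:, j] = g
        err = np.maximum(err - g ** 2, 0.0)     # Eq. 16; clip rounding noise
    return selected

Shrinking the subset afterwards is free: the first m′ < m entries of the returned list are exactly the selection for subset size m′.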

Matching Pursuit for Sparse Kernel PCA. This handicap is addressed by match-
ing pursuit methods [14]. Applied on kernels, some criterion selects the best
fitting sample, followed by an orthogonalization of all remaining candidate sup-
port functions in Hilbert space H. A resulting sequence of m selected samples
therefore contains all sequences up to length m as well. The batch algorithm of
Smola and Schölkopf [18] chooses the sample x_j that minimizes3 E_t[ε_t^{i∪j}]. It was
shown later that this algorithm performs sparse PCA in H [12]. The algorithm,
which we will call in the following matching pursuit for sparse kernel PCA (MP
KPCA), has a computational complexity of O(n^2 m) and a memory complexity
of O(n^2). In practice it is therefore not applicable to large data sets.

3.1 Matching Pursuit for Online MAH


The ability to shrink the size of the selected subset without re-computation is a
powerful property of MP KPCA. Run-time and memory consumption, however,
make this algorithm infeasible for most applications. To extend the desired prop-
erty to the fast online algorithm, we derive a novel matching pursuit for online
MAH algorithm (MP MAH). Online MAH selects samples with approximation
errors that exceed the threshold η and therefore forces the supremum norm L∞
of all samples below η. This is analogous to a successive selection of the worst ap-
proximated sample, until the approximation error of all samples drops below η.
The matching pursuit approach therefore minimizes the supremum norm L∞ of
the approximation error4 . At iteration j, given the current subset i ∈ IN^j ,
3 ε_t^i is non-negative and MP KPCA therefore minimizes the L1 norm of the approximation
error.
4 An exact minimization of the L∞ norm is as expensive as the MP KPCA algorithm.
However, since ε_t^{i∪t} = 0, selecting the worst approximated sample x_t effectively
minimizes the supremum norm L∞ .

i_{j+1} := \arg\min_t \big\| [\epsilon_1^{i \cup t}, \ldots, \epsilon_n^{i \cup t}] \big\|_\infty \;\approx\; \arg\max_t \; \epsilon_t^{i} \qquad (15)

Straightforward re-computation of the approximation error in every iteration
is expensive. Using the matrix inversion lemma (MIL), this computation can be
performed iteratively:

\epsilon_t^{i \cup j} = \epsilon_t^{i} - \frac{1}{\epsilon_j^{i}} \Big( K_{tj} - K_{ti}\, (K_{ii})^{-1} K_{ij} \Big)^2 . \qquad (16)

Algorithm 2 iterates between sample selection (Equation 15) and error update
(Equation 16). The complexity is O(m^2 n) in time and O(mn) in memory (to
avoid re-computations of K_{(:,i)}).

4 Empirical Validation

Slow feature analysis is not restricted to any specific type of time series data.
To give a proper evaluation of our kernel SFA algorithm, we therefore chose
benchmark data from very different domains. The common element, however, is
the existence of a low dimensional underlying cause.
We evaluated all algorithms on two data sets: Audio recordings from a vowel
classification task and a video depicting a random sequence of two hand-signs.
The second task covers high dimensional image data (40 × 30 pixels), which
is a very common setting for SFA. In contrast, mono audio data is one di-
mensional. Multiple time steps have to be grouped into a sample to create
a high dimensional space in which the state space is embedded as a mani-
fold (see Takens’ Theorem [11, 20]). All experiments employ a Gaussian kernel
κ(x, x′) = \exp(-\frac{1}{2\sigma^2}\|x - x'\|_2^2).
The true latent variables of the examined data are not known. To ensure that
any meaningful information is extracted, we measure the test slowness, i.e. the
slowness of the learned feature mappings applied on a previously unseen test
sequence drawn from the same distribution. The variance of a slow feature on
unseen data is not strictly specified. This changes the feature’s slowness and for
comparison we normalized all test outputs to unit variance before measuring the
test slowness.

4.1 Benchmark Data Sets

Audio Data. The “north Texas vowel database”5 contains uncompressed audio
files with English words of the form h...d, spoken multiple times by multiple per-
sons [1]. The natural task is to predict the central vowels of unseen instances. We
selected two data sets: (1) A small set with four training and four test instances
for each of the words “heed” and “head”, spoken by the same person. (2) A large
training set of four speakers with eight instances per person and each of the
5 http://www.utdallas.edu/~assmann/KIDVOW1/North_Texas_vowel_database.html

words “heed” and “head”. The corresponding test set consists of eight instances
of each word spoken by a fifth person. The spoken words are provided as mono
audio streams of varying length at 48kHz, i.e. as a series of amplitude readings
{a_1 , a_2 , . . .}. To obtain an embedding of the latent variables, one groups a num-
ber of amplitude readings into a sample x_t = [a_{δt} , a_{δt+ε} , a_{δt+2ε} , . . . , a_{δt+(l−1)ε}]^T.
We evaluated the parameters δ, ε and l empirically and chose δ = 50, ε = 5 and
l = 500. This provided us with 3719 samples x_t ∈ [−1, 1]^{500} for the small and
25448 samples for the large training set. Although the choice of embedding pa-
rameters changes the resulting slowness in a nontrivial fashion, we want to point
out that this change appears to be smooth and the presented shapes similar over
a wide range of embedding parameters. The output of two RSK-SFA features is
plotted in Figure 2c.
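A sketch of this embedding step (illustrative; the parameter names mirror δ, ε and l):

import numpy as np

def embed_audio(a, delta=50, eps=5, l=500):
    # x_t = [a_{delta*t}, a_{delta*t+eps}, ..., a_{delta*t+(l-1)*eps}]
    # a: 1-d array of amplitude readings scaled to [-1, 1]
    span = (l - 1) * eps + 1                    # readings covered by one sample
    T = (len(a) - span) // delta + 1
    return np.stack([a[t * delta : t * delta + span : eps] for t in range(T)])

At 48kHz each sample thus spans roughly 52ms of audio ((l − 1)·ε / 48000 ≈ 0.052 s), and consecutive samples are shifted by about 1ms.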

Video Data. To obtain a video with a simple underlying cause, we recorded a
hand showing random sequences of the hand-signs “two-finger salute” and “open
palm” with an intermediate “fist” between each sign (Figure 2b). The hand was
well lit and had a good contrast to the mostly dark background. The frames were
scaled down to 40 × 30 gray-scale pixels, i.e. xt ∈ [0, 1]^{1200} ⊂ IR^{1200} . Training
and test set consist of 3600 frames each, recorded at 24Hz and showing roughly
one sign per second.

4.2 Algorithm Performance

Figure 1 shows the test slowness of RSK-SFA features6 on all three data sets for
varying values of the kernel parameter σ and the regularization parameter λ. The
small audio data set (Column a) and the video data set (Column c) use the complete
training set as support vectors, whereas for the large audio data set (Column b) a full
kernel approach is not feasible. Instead we selected a subset of size 2500 (based
on kernel parameter σ = 2) with the MP MAH algorithm before training.
In the absence of significant sparseness (Figure 1a and 1c), unregularized
kernel SFA (λ = 0, equivalent to K-SFA, Equations 6 and 7) shows both over-
fitting and numerical instability. Over-fitting can be seen at small σ, where the
features fulfil the unit variance constraint (lower right plot), but do not reach
the minimal test slowness (lower left plot). The bad performance for larger σ,
on the other hand, must be blamed on numerical instability, as indicated by
a significantly violated unit variance constraint. Both can be counteracted by
proper regularization. Although optimal regularization parameters λ are quite
small and can reach computational precision, there is a wide range of kernel
parameters σ for which the same minimal test slowness is reachable. E.g. in
Figure 1a, a fitting λ can be found between σ = 0.5 and σ = 20.
A more common case, depicted in Figure 1b, is a large training set from which
a small subset is selected by MP MAH. Here no regularization is necessary and
6 Comparison to linear SFA features is omitted due to scale, e.g. test slowness was
slightly above 0.5 for both audio data sets. RSK-SFA can therefore outperform linear
SFA up to a factor of 10, a magnitude we observed in other experiments as well.


Fig. 1. (Regularization) Mean test slowness of 200 RSK-SFA features over varying
kernel parameter σ for different regularization parameters λ: (a) Small audio data set
with all m = n = 3719 training samples as support vectors, (b) large audio data set
with m = 2500 support vectors selected out of n = 25448 training samples by MP MAH
and (c) video data set with all m = n = 3600 available support vectors. The lower left
plots magnify the left side of the above plot. Notice the difference in scale. The lower
right plots show the variance of the training output. Significant deviation from one
violates the unit variance constraint and thus demonstrates numerical instability. The
legend applies to all plots.

unregularized sparse kernel SFA (λ = 0) learns mappings of minimal slowness
in the range from σ = 1 to far beyond σ = 100.

4.3 Sparsity
To evaluate the behaviour of RSK-SFA for sparse subsets of different size, Figure
2a plots the test slowness of audio data for all discussed sparse subset selection
algorithms. As a baseline, we plotted mean and standard deviation of a random
selection scheme. One can observe that all algorithms surpass the random se-
lection significantly but do not differ much w.r.t. each other. As expected, the
MP MAH and Online MAH algorithms perform virtually identical. The novel
MP MAH, however, allows unproblematic and fast fine tuning of the selected
subset’s size.

5 Discussion
To provide a powerful but easily operated algorithm that performs non-linear
slow feature analysis, we derived a kernelized SFA algorithm (RSK-SFA). The


Fig. 2. (Sparseness) (a) Mean test slowness of 200 RSK-SFA features of the small
audio data set (λ = 10−7 , σ = 2) over sparse subset size, selected by different algo-
rithms. For random selection the mean of 10 trials is plotted with standard deviation
error bars. (b) Examples from the video data set depicting the performed hand-signs.
(c) Two RSK-SFA features applied on the large audio test set (λ = 0, σ = 20). Vertical
lines separate eight instances of “heed” and eight instances of “head”.

novel algorithm is capable of handling small data sets by regularization and large
data sets through sparsity. To select a sparse subset for the latter, we developed
a matching pursuit approach to a widely used algorithm.
As suggested by previous works, our experiments show that for large data
sets no explicit regularization is needed. The implicit regularization introduced
by sparseness is sufficient to generate features that generalize well over a wide
range of Gaussian kernels. The performance of sparse kernel SFA depends on
the sparse subset, selected in a pre-processing step. In this setting, the subset
size m takes the place of the regularization parameter λ. It is therefore imperative
to control m with minimal computational overhead.
We compared two state-of-the-art algorithms that select sparse subsets in
polynomial time. Online MAH is well suited to process large data sets, but
selects unpredictably large subsets. A change of m therefore requires a full re-
computation without the ability to target a specific size. Matching pursuit for
sparse kernel PCA (MP KPCA), on the other hand, returns an ordered list of
selected samples. After selection of a sufficiently large subset, lowering m yields
no additional cost. The downside is a quadratic dependency on the training set’s
size, both in time and memory. Both algorithms showed similar performance and
significantly outperformed a random selection scheme.
The subsets selected by the novel matching pursuit for Online MAH (MP MAH)
algorithm yielded virtually the same performance as those selected by Online
MAH. There is no difference in computation time, but the memory complexity
of MP MAH depends linearly on the training set's size. However, reducing
m works just as with MP KPCA, which makes this algorithm the better choice
if one can afford the memory. If not, Online MAH can be applied several times
with slowly decreasing hyper-parameter η. Although a subset of suitable size will
eventually be found, this approach will take much more time than MP MAH.
The major advancement of our approach over the kernel SFA algorithm of
Bray and Martinez [4] is the ability to obtain features that generalize well for
small data sets. If one is forced to use a large proportion of the training set as sup-
port vectors, e.g. for small training sets of complex data, the solution can violate
the unit variance constraint. Fukumizu et al. [10] have shown this analytically
for the related kernel canonical correlation analysis and demonstrated the use
of regularization. In contrast to their approach, we penalize the Hilbert norm of
the selected function rather than regularizing the unit variance constraint. This
leads to a faster learning procedure: first one fulfils the constraints as laid out
in Section 2.1, followed by repeated optimizations of the objective with slowly
increasing λ (starting at 0), until the constraints are no longer violated. The
resulting features are numerically stable, but may still exhibit over-fitting. The
latter can be reduced by raising λ even further or by additional sparsification.
Note that our empirical results in Section 4.2 suggest that for a fitting λ the
RSK-SFA solution remains optimal over large regimes of kernel parameter σ,
rendering an expensive parameter search unnecessary.
Our experimental results on audio data suggest that RSK-SFA is a promis-
ing pre-processing method for audio detection, description, clustering and many
other applications. Applied to large unlabelled natural language databases,
e.g. telephone records or audio books, RSK-SFA in combination with MP MAH
or Online MAH will construct features that are sensitive to speech patterns of
the presented language. If the function class is powerful enough, i.e. provided
enough support vectors, those features will encode vowels, syllables or words,
depending on the embedding parameters. Based on those features, a small la-
belled database might be sufficient to learn the intended task. The amount of
support vectors necessary for a practical task, however, is yet unknown and calls
for further investigation.
Although few practical video applications resemble our benchmark data, pre-
vious studies of SFA on sub-images show its usefulness in principle [3]. Land-
mark recognition in camera-based simultaneous localization and mapping (Vi-
sual SLAM [6]) algorithms is one possible field of application. To evaluate the
potential of RSK-SFA in this area, future works must include comparisons to
state-of-the-art feature extractions, e.g. the scale invariant feature transform
(SIFT [13]).
The presented results show RSK-SFA to be a powerful and reliable SFA al-
gorithm. Together with Online MAH or MP MAH, it combines the advantages
of regularization, fast feature extraction and large training sets with the perfor-
mance of kernel methods.

Acknowledgements. This work has been supported by the Integrated Gradu-
ate Program on Human-Centric Communication at Technische Universität Berlin
and the German Federal Ministry of Education and Research (grant 01GQ0850).

References
[1] Assmann, P.F., Nearey, T.M., Bharadwaj, S.: Analysis and classification of a vowel
database. Canadian Acoustics 36(3), 148–149 (2008)
[2] Becker, S., Hinton, G.E.: A self-organizing neural network that discovers surfaces
in random-dot stereograms. Nature 355(6356), 161–163 (1992)
[3] Berkes, P., Wiskott, L.: Slow feature analysis yields a rich repertoire of complex
cell properties. Journal of Vision 5, 579–602 (2005)
[4] Bray, A., Martinez, D.: Kernel-based extraction of Slow features: Complex cells
learn disparity and translation invariance from natural images. In: Neural Infor-
mation Processing Systems, vol. 15, pp. 253–260 (2002)
[5] Csató, L., Opper, M.: Sparse on-line Gaussian processes. Neural Computa-
tion 14(3), 641–668 (2002)
[6] Davison, A.J.: Real-time simultaneous localisation and mapping with a single
camera. In: IEEE International Conference on Computer Vision, pp. 1403–1410
(2003)
[7] Einhäuser, W., Hipp, J., Eggert, J., Körner, E., König, P.: Learning viewpoint in-
variant object representations using temporal coherence principle. Biological Cy-
bernetics 93(1), 79–90 (2005)
[8] Földiák, P.: Learning invariance from transformation sequences. Neural Compu-
tation 3(2), 194–200 (1991)
[9] Franzius, M., Sprekeler, H., Wiskott, L.: Slowness and sparseness leads to place,
head-direction, and spatial-view cells. PLoS Computational Biology 3(8), e166
(2007)
[10] Fukumizu, K., Bach, F.R., Gretton, A.: Statistical consistency of kernel canonical
correlation analysis. Journal of Machine Learning Research 8, 361–383 (2007)
[11] Huke, J.P.: Embedding nonlinear dynamical systems: A guide to Takens' theorem.
Technical report, University of Manchester (2006)
[12] Hussain, Z., Shawe-Taylor, J.: Theory of matching pursuit. In: Advances in Neural
Information Processing Systems, vol. 21, pp. 721–728 (2008)
[13] Lowe, D.G.: Object recognition from local scale-invariant features. In: Interna-
tional Conference on Computer Vision, pp. 1150–1157 (1999)
[14] Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE
Transactions On Signal Processing 41, 3397–3415 (1993)
[15] Meyn, S.P., Tweedie, R.L.: Markov chains and stochastic stability. Springer, Lon-
don (1993)
[16] Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
[17] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press, Cambridge (2004)
[18] Smola, A.J., Schölkopf, B.: Sparse greedy matrix approximation for machine learn-
ing. In: Proceedings to the 17th International Conference Machine Learning, pp.
911–918 (2000)
[19] Stone, J.V.: Blind source separation using temporal predictability. Neural Com-
putation 13(7), 1559–1574 (2001)
[20] Takens, F.: Detecting strange attractors in turbulence. Dynamical Systems and
Turbulence, 366–381 (1981)
[21] Wahba, G.: Spline Models for Observational Data. Society for Industrial and Ap-
plied Mathematics, Philadelphia (1990)
[22] Wiskott, L.: Slow feature analysis: A theoretical analysis of optimal free responses.
Neural Computation 15(9), 2147–2177 (2003)
[23] Wiskott, L., Sejnowski, T.: Slow feature analysis: Unsupervised learning of invari-
ances. Neural Computation 14(4), 715–770 (2002)
[24] Wyss, R., König, P., Verschure, P.F.M.J.: A model of the ventral visual system
based on temporal stability and local memory. PLoS Biology 4(5), e120 (2006)
A Selecting-the-Best Method for Budgeted
Model Selection

Gianluca Bontempi and Olivier Caelen

Machine Learning Group, Computer Science Department, Faculty of Sciences,
ULB, Université Libre de Bruxelles, Belgium
http://mlg.ulb.ac.be
(Olivier Caelen is now with the Fraud Detection Team, Atos Worldline Belgium.)

Abstract. The paper focuses on budgeted model selection, that is, the
selection between a set of alternative models when the ratio between the
number of model assessments and the number of alternatives, though
bigger than one, is low. We propose an approach based on the notion
of probability of correct selection, a notion borrowed from the domain
of Monte Carlo stochastic approximation. The idea is to estimate from
data the probability that a greedy selection returns the best alternative
and to define a sampling rule which maximises this quantity. Analytical
results in the case of two alternatives are extended to a larger number
of alternatives by using Clark's approximation of the maximum of a
set of random variables. Preliminary results on synthetic and real model
selection tasks show that the technique is competitive with state-of-
the-art algorithms, like the bandit UCB.

1 Introduction

Model assessment and selection [6] is one of the oldest topics in statistical model-
ing, and plenty of parametric and nonparametric approaches have been proposed
in the literature. The standard way of performing model assessment in machine
learning, especially when dealing with datasets of limited size, relies on a repeti-
tion of multiple training and test runs, as in cross-validation. What is emerging
in recent years is the need to use model selection in contexts where the number
of alternatives is huge and sometimes much larger than the number of samples.
This is typically the case in machine learning, where the model designer is con-
fronted with a huge number of model families and paradigms, and in feature
selection tasks (e.g. in bioinformatics or text mining), where the problem of se-
lecting the best subset among a large set of inputs can be cast in the form of a
model selection problem. In such contexts, given the huge number of alternatives,
it is not feasible to carry out an extensive (e.g. leave-one-out) validation procedure
for each alternative. It then becomes more and more important to design
strategies for assessing and comparing large numbers of alternatives using a
limited number of validation runs, thus making model selection affordable in a


budgeted computational time. In a recent paper by Madani et al. [13] this prob-
lem has been denoted as “budgeted active model selection”, where the learner
can use a fixed budget of assessments (e.g. leave-one-out errors) to identify which
of a given set of model alternatives has the highest expected accuracy. Note that
budgeted model selection is an instance of the more general problem of budgeted
learning, in which there are fixed limits on the resources (e.g. number of exam-
ples or amount of computation) that are used to build the prediction model. The
research on budgeted learning is assuming an increasing role in machine learning,
as witnessed by a recent ICML workshop (http://www.dmargineantu.net).
The problem of selecting between alternatives in a stochastic setting and with
limited resources has been discussed within two research topics: multi-armed
bandit and selecting-the-best. The multi-armed bandit problem, first introduced
by [15], is a classical instance of an exploration/exploitation problem in which
a casino player has to decide which arm of a slot machine to pull to maximize
the total reward in a series of rounds. Each of the arms of the slot machine
returns a reward, which is randomly distributed, and unknown to the player.
Several algorithms deal with this problem and their application to model selec-
tion has already been proven to be effective [16]. However, the paradigm of the
bandit problem is not the only and probably not the most adequate manner of
interpreting a model selection problem in a stochastic environment. As stressed
in [13],[2] the main difference between a budgeted model selection problem and
a bandit problem is that in model selection there is a pure exploration phase
(characterised by spending the budget to assess the different alternatives) fol-
lowed by a pure exploitation step (i.e. the selection of the best model). This is
not the case of typical bandit problem where there is an immediate exploitation
(and consequent reward) following each exploration action. This difference led
us to focus on another computational framework extensively dealt with by the
community of stochastic simulation: the “selecting-the-best” problem [12,11].
Researchers of stochastic simulation proposed a set of strategies (hereafter de-
noted as selecting-the-best algorithms) to sample alternative options in order to
maximize the probability of correct selection in a finite number of steps [9]. A
detailed review of the existing approaches to sequential selection as well as a
comparison of selecting-the-best and bandit strategies is discussed in [2].
The paper proposes a selecting-the-best strategy to explore the alternatives in
order to maximise the gain of a greedy selection, which is the selection returning
the alternative with the best sampled mean. The idea is to estimate from data the
probability that a greedy selection returns the best alternative and consequently
to estimate the expectation of the reward once a greedy action is done. The
probability of success of a greedy action is well-known in simulation literature
as the probability of correct selection (PCS) [11,3]. Here we use such notion
to study how the expected reward of a greedy model selection depends on the
number of model assessments. In particular we derive an optimal sampling rule
for maximising the greedy reward in the case of K = 2 normally distributed
configurations. Then we extend this rule to problems with K > 2 configurations
by taking advantage of the Clark method [7] to approximate the maximum of a


set of random variables. The resulting algorithmic selection strategy is assessed
and compared with a bandit approach, an interval estimation and a greedy
strategy in a set of synthetic selection problems and real data feature selection
tasks. Note that this paper addresses tasks where the ratio between the size of
the budget and the number of alternatives, though bigger than one, is small. In
this configuration, state-of-the-art selection techniques like racing [14] are not
feasible, since all the budget would be consumed in a number of steps comparable
to this ratio, without allowing any effective selection.

2 Expected Greedy Reward for K = 2 Alternatives


Let us consider a set {zk} of K independent random² variables zk with mean
μk and standard deviation σk, such that

$$k^* = \arg\max_k \mu_k, \qquad \mu^* = \max_k \mu_k.$$
Let us consider a sequential budgeted setup where at most L assessments are
allowed, L/K > 1 is small and, at each round l, one alternative kl ∈ {1, . . . , K}
is chosen and assessed. We wish to design a sampling strategy kl such that, if
we select the alternative with the highest sampled mean when the L assessments
are done, the probability of selecting k ∗ is maximal.
Let $z_k^i$ denote the ith observation of the kth alternative. Let N(l) =
[n1(l), . . . , nK(l)] be a counting vector whose kth term denotes the number
of times that the kth alternative was selected during the first l rounds, and let

$$Z_k(l) = \left( z_k^1, z_k^2, z_k^3, \ldots, z_k^{n_k(l)} \right)$$

be the vector of identically and independently distributed observed rewards of
the kth alternative up to time l. A greedy selection at the lth step returns the
alternative

$$\hat{k}_l^g = \arg\max_k \hat{\mu}_k^l$$

with the greatest sampled mean, where

$$\hat{\mu}_k^l = \frac{\sum_{i=1}^{n_k(l)} z_k^i}{n_k(l)}.$$
Note that, since the sampled mean is a random variable, the greedy selection
$\hat{k}_l^g$ is random, too. Let us define by

$$p_k^l = \mathrm{Prob}\left\{ \hat{k}_l^g = k \right\}, \qquad k = 1, \ldots, K$$

the probability that the greedy selection returns the kth alternative at the lth
step. It follows that $p_{k^*}^l$ is then the probability of making the correct selection
at the lth step. Since $\sum_{k=1}^K p_k^l = 1$, the expected gain of a greedy selection is

$$\mu_g^l = \sum_{k=1}^K p_k^l \mu_k \qquad (1)$$
² Boldface denotes random variables.
and the expected regret (i.e. the loss due to a greedy selection) is

$$r = \mu^* - \sum_{k=1}^K p_k^l \mu_k. \qquad (2)$$
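As an illustration, the following minimal Python sketch (our naming, not part of the original paper) evaluates (1) and (2) on a toy problem with K = 3 alternatives:

import numpy as np

def expected_greedy_gain(p, mu):
    # Expected gain (1) of a greedy selection: sum_k p_k^l * mu_k.
    return float(np.dot(p, mu))

def expected_regret(p, mu):
    # Expected regret (2): mu^* minus the expected greedy gain.
    return float(np.max(mu)) - expected_greedy_gain(p, mu)

# Toy example: the greedy choice returns the best alternative (mu = 1.0)
# only 60% of the time.
mu = np.array([0.2, 0.7, 1.0])
p = np.array([0.1, 0.3, 0.6])
print(expected_greedy_gain(p, mu))  # 0.83
print(expected_regret(p, mu))       # 0.17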

If we wish to design an exploration strategy which maximises the expected gain
(or equivalently minimizes the regret) of a greedy selection, we need to estimate
both the terms pk and μk. While the plug-in estimation of μk is trivial, deriving
pk from data is much more complex.
The analytical expression of the probability pk for K > 1 in the case of
independent Gaussian distributed alternatives zk ∼ N(μk, σk) is derived in [5]
and is equal to

$$p_k = \mathrm{Prob}\{\hat{\delta}_1 > 0, \ldots, \hat{\delta}_{k-1} > 0, \hat{\delta}_{k+1} > 0, \ldots, \hat{\delta}_K > 0\} \qquad (3)$$

where $(\hat{\delta}_1, \ldots, \hat{\delta}_{k-1}, \hat{\delta}_{k+1}, \ldots, \hat{\delta}_K)^T$ follows a multivariate normal distribution

$$(\hat{\delta}_1, \ldots, \hat{\delta}_{k-1}, \hat{\delta}_{k+1}, \ldots, \hat{\delta}_K)^T \sim N[\Gamma, \Sigma]$$

with mean

$$\Gamma = \left( \mu_k - \mu_1, \; \ldots, \; \mu_k - \mu_{k-1}, \; \mu_k - \mu_{k+1}, \; \ldots, \; \mu_k - \mu_K \right)^T$$

and covariance matrix

$$\Sigma = \begin{pmatrix}
\frac{\sigma_k^2}{n_k} + \frac{\sigma_1^2}{n_1} & \cdots & \frac{\sigma_k^2}{n_k} & \frac{\sigma_k^2}{n_k} & \cdots & \frac{\sigma_k^2}{n_k} \\
\vdots & \ddots & \vdots & \vdots & & \vdots \\
\frac{\sigma_k^2}{n_k} & \cdots & \frac{\sigma_k^2}{n_k} + \frac{\sigma_{k-1}^2}{n_{k-1}} & \frac{\sigma_k^2}{n_k} & \cdots & \frac{\sigma_k^2}{n_k} \\
\frac{\sigma_k^2}{n_k} & \cdots & \frac{\sigma_k^2}{n_k} & \frac{\sigma_k^2}{n_k} + \frac{\sigma_{k+1}^2}{n_{k+1}} & \cdots & \frac{\sigma_k^2}{n_k} \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
\frac{\sigma_k^2}{n_k} & \cdots & \frac{\sigma_k^2}{n_k} & \frac{\sigma_k^2}{n_k} & \cdots & \frac{\sigma_k^2}{n_k} + \frac{\sigma_K^2}{n_K}
\end{pmatrix}$$

where nk is the number of observations of zk. This expression shows that the
computation of (3) requires either a numeric or Monte Carlo procedure. A bandit
strategy based on multivariate Monte Carlo estimation of (3) is presented in [5].
Here we avoid having recourse to a multivariate estimation by approximating the
K-variate problem with a bivariate problem, which will be detailed in Section 3.
In the K = 2 Gaussian case, the analysis of the term pk is indeed much
easier and some interesting properties can be derived analytically.
The first property is that when only two alternatives are in competition,
testing either of them invariably increases the expected gain. This is
formalized in the following theorem.

Theorem 1. Let z1 ∼ N(μ1, σ1) and z2 ∼ N(μ2, σ2) be two independent Gaus-
sian distributed alternatives and $\mu_g^l$ the expected greedy gain at step l. Let $\mu_g^{l+1}(k)$
denote the value of the expected greedy gain at step l + 1 if the kth alternative is
tested. Then

$$\forall k \in \{1, 2\}: \quad \mu_g^{l+1}(k) \ge \mu_g^l.$$

Proof. Without loss of generality let us assume that μ2 > μ1, i.e. k* = 2. Let
us first remark that the statement $\forall k \in \{1,2\}: \mu_g^{l+1}(k) \ge \mu_g^l$ is equivalent to
the statement

$$\forall k \in \{1, 2\}: \quad p_2^{l+1}(k) \ge p_2^l,$$

where $p_2^{l+1}(k)$ is the probability of selecting the best alternative (i.e. the second
one) at the (l+1)th step when the kth alternative is sampled. Since by definition

$$p_{k^*}^l = p_2^l = \mathrm{Prob}\left\{\hat{\mu}_2^l \ge \hat{\mu}_1^l\right\} = \mathrm{Prob}\left\{\hat{\mu}_2^l - \hat{\mu}_1^l \ge 0\right\} \qquad (4)$$

and because of normality and the unbiasedness of sampled means (Figure 1)

$$(\hat{\mu}_2^l - \hat{\mu}_1^l) \sim N\left(\mu_2 - \mu_1, \; \frac{\sigma_2^2}{n_2(l)} + \frac{\sigma_1^2}{n_1(l)}\right),$$

it follows from the relation

$$\mathrm{Prob}\{\mathbf{x} \le x\} = \Phi\left(\frac{x-\mu}{\sigma}\right) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x-\mu}{\sqrt{2\sigma^2}}\right)\right]$$

that equation (4) can be rewritten as

$$p_{k^*}^l = 1 - \mathrm{Prob}\left\{\hat{\mu}_2^l - \hat{\mu}_1^l < 0\right\} = 1 - \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{\mu_1 - \mu_2}{\sqrt{2\left(\frac{\sigma_2^2}{n_2(l)} + \frac{\sigma_1^2}{n_1(l)}\right)}}\right)\right],$$

where Φ is the normal cumulative function and erf() is the Gauss error function,
whose derivative is known to be always positive. Since μ2 > μ1 by definition,
sampling either alternative increases either n2(l) or n1(l) and consequently
decreases the argument of the erf function (given that the numerator is negative).
Since erf is monotonically increasing, this leads to a decrease of the value of the
erf function and consequently to an increase of the probability of making a
correct selection.
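For concreteness, the closed-form expression above translates directly into code. The sketch below (assuming, as in the proof, μ2 > μ1; names are ours) computes the probability of correct selection and illustrates how it grows with the number of observations:

from math import erf, sqrt

def prob_correct_selection(mu1, sigma1, n1, mu2, sigma2, n2):
    # PCS for K = 2 Gaussian alternatives (mu2 > mu1 assumed), following
    # the erf expression derived in the proof of Theorem 1.
    var = sigma2**2 / n2 + sigma1**2 / n1
    return 1.0 - 0.5 * (1.0 + erf((mu1 - mu2) / sqrt(2.0 * var)))

# More observations shrink the variance of the sampled-mean difference,
# so the PCS increases towards 1.
print(prob_correct_selection(0.0, 1.0, 5, 0.5, 1.0, 5))    # ~0.785
print(prob_correct_selection(0.0, 1.0, 50, 0.5, 1.0, 50))  # ~0.994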

The second interesting property is that, though sampling either of the two alter-
natives increases the expected gain, this increase is not identical. An optimal
sampling policy (implemented by Algorithm 1) can be derived from the following
theorem.

Fig. 1. Distribution of $\hat{\mu}_2^l - \hat{\mu}_1^l$. The grey area represents the probability of correct
selection at the lth step if μ2 > μ1.

Theorem 2. Let z1 ∼ N(μ1, σ1), z2 ∼ N(μ2, σ2) be two independent Gaussian
distributed alternatives and n1 = n1(l), n2 = n2(l) the respective numbers of
collected observations at step l. The sampling rule

$$k_l = \begin{cases} 1 & \text{if } N_\Delta < 0 \\ 2 & \text{if } N_\Delta > 0 \\ \text{random} & \text{if } N_\Delta = 0 \end{cases} \qquad (5)$$

where

$$N_\Delta = n_1 (n_1 + 1)(\sigma_2^2 - \sigma_1^2) + \sigma_1^2 (n_2 + n_1 + 1)(n_1 - n_2) \qquad (6)$$

maximises the value of μg at step l + 1.

Proof. Without loss of generality let us assume that μ2 > μ1, i.e. k* = 2. By
using (1) and the equality $p_1^{l+1}(1) + p_2^{l+1}(1) = p_1^{l+1}(2) + p_2^{l+1}(2)$ we obtain

$$\begin{aligned}
\mu_g^{l+1}(1) - \mu_g^{l+1}(2)
&= p_2^{l+1}(1)\mu_2 + p_1^{l+1}(1)\mu_1 - p_2^{l+1}(2)\mu_2 - p_1^{l+1}(2)\mu_1 \\
&= \left(p_2^{l+1}(1) - p_2^{l+1}(2)\right)\mu_2 + \left(p_1^{l+1}(1) - p_1^{l+1}(2)\right)\mu_1 \\
&= \left(p_2^{l+1}(1) - p_2^{l+1}(2)\right)(\mu_2 - \mu_1).
\end{aligned}$$

This intermediary result is intuitive, since it proves that the sign of $\mu_g^{l+1}(1) - \mu_g^{l+1}(2)$
is the same as the sign of $p_2^{l+1}(1) - p_2^{l+1}(2)$, or in other terms, that in
order to increase the expected gain we have to increase the probability of correct
selection (i.e. the probability of selecting 2). Let $V^{l+1}(1)$ and $V^{l+1}(2)$ denote
the variances of $\hat{\mu}_2^{l+1} - \hat{\mu}_1^{l+1}$ at the (l + 1)th step if we sample at step l the
Algorithm 1. SRule(μ1, σ1, n1, μ2, σ2, n2)
1: Input: μ1, σ1, n1: parameters and observation count of the first alternative
          μ2, σ2, n2: parameters and observation count of the second alternative
2: Compute NΔ by Equation (6)
3: if NΔ ≤ 0 then
4:   return 1
5: end if
6: if NΔ > 0 then
7:   return 2
8: end if

alternative 1 or 2, respectively. Since $\hat{\mu}_2^{l+1}$ and $\hat{\mu}_1^{l+1}$ are unbiased estimators,
if we reduce the variance of $\hat{\mu}_2^{l+1} - \hat{\mu}_1^{l+1}$ we increase the probability of correct
selection $\mathrm{Prob}\left\{\hat{\mu}_2^{l+1} > \hat{\mu}_1^{l+1}\right\}$. The best sampling move is then the one which
reduces to the largest extent the variance of $\hat{\mu}_2^{l+1} - \hat{\mu}_1^{l+1}$:

$$\Delta V = V^{l+1}(1) - V^{l+1}(2) = \frac{\sigma_1^2}{n_1+1} + \frac{\sigma_2^2}{n_2} - \frac{\sigma_1^2}{n_1} - \frac{\sigma_2^2}{n_2+1} = \frac{N_\Delta}{D_\Delta}$$

where NΔ is given by (6) and

$$D_\Delta = (n_1 + 1)\, n_2\, n_1\, (n_2 + 1)$$

is always positive. It follows that by applying the sampling rule (5) we are
guaranteed to reduce the variance of $\hat{\mu}_2^{l+1} - \hat{\mu}_1^{l+1}$, and consequently to increase
the probability of correct selection and, finally, the expected gain of a
greedy selection.
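A minimal Python rendition of the sampling rule (5)-(6) follows. Note that, as the theorem shows, the rule depends only on the variances and the observation counts; the means appear in Algorithm 1's signature but play no role in the decision, so this sketch (our naming) omits them:

import random

def srule(sigma1, n1, sigma2, n2):
    # Sampling rule (5): return the index (1 or 2) of the alternative whose
    # assessment most reduces the variance of mu_hat_2 - mu_hat_1.
    # Only the sign of N_Delta (6) matters, D_Delta being always positive.
    n_delta = (n1 * (n1 + 1) * (sigma2**2 - sigma1**2)
               + sigma1**2 * (n2 + n1 + 1) * (n1 - n2))
    if n_delta < 0:
        return 1
    if n_delta > 0:
        return 2
    return random.choice([1, 2])

# With equal variances the rule samples the alternative observed less often.
print(srule(1.0, 10, 1.0, 3))  # 2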

3 Extension to K > 2 Alternatives


The previous theorems support the idea that in the case of only K = 2 Gaussian
distributed alternatives it is possible to control the evolution of the expected gain
by estimating the parameters of the two distributions. This is no longer true
in the case of K > 2 alternatives, where the evolution of μg is much less predictable
[2], [4]. For this reason we propose to transform the problem of
selection among K > 2 alternatives into an approximate problem where only
two configurations are given. The rationale of the approach is the following: if
we have to select the best alternative among K > 2 configurations, the problem
can be reformulated as the problem of choosing between one configuration and
the maximum of all the others. In this way we reduce the problem to
a two-configuration setting where one of the alternatives is assessed against the
maximum of the remaining K − 1 ones. Since we are able to deal optimally with
K = 2 Gaussian configurations by using the sampling rule (5), we propose the
Algorithm 2. SELBEST: Budgeted model selection with Clark approximation
1: Input: K: number of alternatives,
   {Zk}: the observations of K random variables,
   I: number of initialization steps
2: for k = 1 to K do
3:   nk ← 0
4: end for
5: for k = 1 to K do
6:   for i = 1 to I do
7:     Sample zk
8:     nk ← nk + 1
9:   end for
10: end for
11: for l = 1 to L do
12:   for k = 1 to K do
13:     Compute μ̂k, σ̂k
14:   end for
15:   [k] ← decreasing order of μ̂k
16:   for k = 1 to K − 1 do
17:     Compute μ̂m, σ̂m by the Clark approximation of ẑm ≈ max{ẑ[k+1], . . . , ẑ[K]}
18:     s ← SRule(μ̂[k], σ̂[k], n[k], μ̂m, σ̂m, max_{j=k+1,...,K} n[j])
19:     if s = 1 then
20:       kl ← [k]
21:       break
22:     end if
23:     if (s = 2) AND (k = K − 1) then
24:       kl ← [K]
25:     end if
26:   end for
27:   Sample zkl
28:   n[kl] ← n[kl] + 1
29: end for
30: return arg maxk μ̂k

following strategy: in the one-vs-maximum test, if the configuration returned
by (5) is the one standing alone, we stop and the sampling is done; otherwise we
iterate within the set of K − 1 configurations until we have found which configu-
ration to sample. Such an approach requires, however, an estimate of the maximum
of a set of configurations. For this purpose, we adopt the Clark approximation,
which is a fast and elegant way to approximate the maximum of a set of Gaussian
random variables without having recourse to analytical or numerical integration.

3.1 The Clark Approximation

The method proposed by Clark in [7] to approximate the distribution of the max-
imum of K normal variables consists in recursively decomposing a multivariate
maximum into the maximum of two terms, where each term is fit by a normal
distribution. Let us suppose at first that we want to compute the mean μm and the
variance $\sigma_m^2$ of the random variable zm = max{z1, z2}, where $z_1 \sim N(\mu_1, \sigma_1^2)$
and $z_2 \sim N(\mu_2, \sigma_2^2)$. If we define

$$a^2 = \sigma_1^2 + \sigma_2^2 - 2\sigma_1\sigma_2\rho_{12}, \qquad z = \frac{\mu_1 - \mu_2}{a},$$

where ρ12 is the correlation coefficient between z1 and z2, it is possible to show
that

$$\mu_m = \mu_1 \Phi(z) + \mu_2 \Phi(-z) + a\,\phi(z)$$
$$\sigma_m^2 = \left[(\mu_1^2 + \sigma_1^2)\Phi(z) + (\mu_2^2 + \sigma_2^2)\Phi(-z) + (\mu_1 + \mu_2)\,a\,\phi(z)\right] - \mu_m^2$$

where φ is the standard normal density function and Φ is the associated cumu-
lative distribution. Assume now that we have a third variable $z_3 \sim N(\mu_3, \sigma_3^2)$
and that we wish to find the distribution of zM = max{z1, z2, z3}. The Clark
approach solves this problem by making the approximation

$$z_M = \max\{z_1, z_2, z_3\} \approx \max\{z_m, z_3\}$$

and using iteratively the procedure sketched above for two variables.
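The two-variable step and its recursive application are straightforward to implement. The following sketch (names are ours; scipy is assumed available, and ρ = 0 is assumed at each fold, as for independent alternatives):

from math import sqrt
from scipy.stats import norm

def clark_max2(mu1, var1, mu2, var2, rho=0.0):
    # Clark's moment-matching approximation of max{z1, z2}:
    # returns the mean and variance of the (approximately) normal maximum.
    a = sqrt(var1 + var2 - 2.0 * rho * sqrt(var1 * var2))
    z = (mu1 - mu2) / a
    mu_m = mu1 * norm.cdf(z) + mu2 * norm.cdf(-z) + a * norm.pdf(z)
    var_m = ((mu1**2 + var1) * norm.cdf(z) + (mu2**2 + var2) * norm.cdf(-z)
             + (mu1 + mu2) * a * norm.pdf(z)) - mu_m**2
    return mu_m, var_m

def clark_max_many(mus, variances):
    # Recursive fold: max{z1, ..., zK} ~ max{max{z1, ..., z_{K-1}}, zK}.
    mu_m, var_m = mus[0], variances[0]
    for mu, var in zip(mus[1:], variances[1:]):
        mu_m, var_m = clark_max2(mu_m, var_m, mu, var)
    return mu_m, var_m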
The Clark’s technique is then a fast and simple manner to approximate a
set of K − 1 configurations by their own maximum. This allows us to reduce a
K > 2 problem into a series of 2-configuration tasks. The resulting selecting-the-
best algorithm is detailed in Algorithm 2. During the initialisation each model is
assessed at least I times (lines 5-10). The ordering of the alternatives is made in
line 15 where the notation [k] is used to design the rank of the alternative with
the kth largest sampled mean (e.g. [1] = arg maxk μ̂k and [K] = arg mink μ̂k ).
Once the ordering is done the loop in line 16 performs the set of comparisons
between the kth best alternative and the maximum of the configurations ranging
from k + 1 to K. If the kth alternative is the one to be sampled (i.e. s = 1) the
choice is done and the executions gets out of the loop (line 21). Otherwise we
move to the k + 1th configuration until k = K − 1. Hereafter we will refer to this
algorithm as the SELBEST.
Note that in the algorithm as well as in the following experiments the number
L of assessments does not include the number I × K of assessments made during
the initialization.
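Reusing the srule() and clark_max_many() sketches above, one selection round of Algorithm 2 can be written compactly as follows. This is a sketch under the same Gaussian assumptions, with our naming, not the authors' reference implementation; it assumes at least two observations per alternative:

import numpy as np

def selbest_step(samples):
    # One round of the loop in lines 11-28 of Algorithm 2: return the index
    # of the alternative to assess next. samples[k] holds the observations
    # of the k-th alternative.
    mu = np.array([np.mean(s) for s in samples])
    var = np.array([np.var(s, ddof=1) for s in samples])
    n = np.array([len(s) for s in samples])
    order = np.argsort(-mu)                  # [k]: decreasing sampled means
    for pos in range(len(samples) - 1):
        k, rest = order[pos], order[pos + 1:]
        mu_m, var_m = clark_max_many(mu[rest], var[rest])
        s = srule(np.sqrt(var[k]), n[k], np.sqrt(var_m), n[rest].max())
        if s == 1:
            return int(k)                    # sample the pos-th best itself
    return int(order[-1])                    # every one-vs-max test said "2"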

4 Experiments
In order to assess the potential of the proposed technique, we carried out both
synthetic and real selection experiments. For the sake of benchmarking, we com-
pared the SELBEST algorithm with three reference methods:
1. a greedy algorithm, which for each l samples the model with the highest
   estimated performance

$$k_l = \arg\max_k \hat{\mu}_k$$

2. an interval estimation (IE) algorithm, which samples the model with the highest
   upper value of the confidence interval

$$k_l = \arg\max_k \left[ \hat{\mu}_k + t_{0.05,\, n_k(l)-1}\, \hat{\sigma}_k \right]$$

   where $t_{0.05,n}$ is the upper 95% critical point of the Student distribution
   with n degrees of freedom, and
3. the UCB bandit algorithm [1], which implements the following sampling strat-
   egy

$$k_l = \arg\max_k \left[ \hat{\mu}_k + \sqrt{\frac{2 \log l}{n_k(l)}} \right]$$

Note that the initialisation step is the same for all techniques and described by
the lines 5-10 of Algorithm 2.
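For reference, the three baseline sampling indices above can be sketched as follows (names are ours; the Student critical point comes from scipy, which is assumed available):

import numpy as np
from math import log, sqrt
from scipy.stats import t

def greedy_index(mu_hat):
    # Baseline 1: sample the model with the highest estimated performance.
    return int(np.argmax(mu_hat))

def interval_estimation_index(mu_hat, sigma_hat, n):
    # Baseline 2: mu_hat_k + t_{0.05, n_k - 1} * sigma_hat_k (upper 95%
    # critical point of the Student distribution).
    ucb = [m + t.ppf(0.95, nk - 1) * s
           for m, s, nk in zip(mu_hat, sigma_hat, n)]
    return int(np.argmax(ucb))

def ucb_index(mu_hat, n, l):
    # Baseline 3 (UCB [1]): mu_hat_k + sqrt(2 log(l) / n_k(l)).
    return int(np.argmax([m + sqrt(2.0 * log(l) / nk)
                          for m, nk in zip(mu_hat, n)]))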
The synthetic experimental setting consists of three parts. In the first one
we carried out 5000 experiments, where each experiment is characterised by a
number K of independent Gaussian alternatives and K is uniformly sampled in
the interval [10, 200]. The values of the means μk, k = 1, . . . , K and standard
deviations σk, k = 1, . . . , K are obtained by sampling the uniform distributions
U(0, 1) and U(0.5, 1), respectively. Table 1 reports the average regrets of the four
assessed sampling strategies for I = 3 and a number L of assessments taking
values in the set {20, 40, . . . , 200}. The second part is identical to the first one,
with the only difference that the standard deviations σk, k = 1, . . . , K are obtained
by sampling the distribution U(1, 2.5). The results are given in Table 2.
The third part aims to assess the robustness of the approach in a non-Gaussian
configuration, similar to the one often encountered in model selection,
where typically a mean squared error has to be minimized. For this reason the
alternatives, whose number K is uniformly sampled within the interval [10, 200],
have a chi-squared distribution with thirty degrees of freedom, and their means
are uniformly sampled within [1, 2]. The average regrets over 5000 experiments
are presented in Table 3.
The real data experimental setting consists of a set of feature selection prob-
lems applied to 26 UCI regression tasks [8], reported in Table 4. Here we consider
a feature selection task as an instance of model selection. In particular, each al-
ternative corresponds to a different feature set, and the assessment is done by
using a wrapper approach, i.e. by using the performance of a given learner (in
this case a 5-nearest neighbour) as a measure of the performance of the feature
set. Since we want to compare the accuracy of different selection techniques, but
it is not possible, unlike in the simulated case, to measure the expected generaliza-
tion performance from a finite sample, we partition each dataset into two parts.
The first part is used for assessment and the second is used to compute what
we consider a reliable estimate of the generalization accuracy μk, k = 1, . . . , K
of the K alternatives.
The number L of assessments takes values in the set {20, 40, . . . , 200}, I = 20,
and the number of alternatives is uniformly sampled in the range K ∈ [60, 100].
Table 1. Synthetic experiment with Gaussian distributed alternatives and standard
deviations uniformly sampled in U(0.5, 1): average regrets over 5000 repetitions for L
total assessments. Bold notation is used for regrets which are statistically significantly
worse (paired permutation test pv < 0.05) than the SELBEST performance.

L SELBEST GREEDY IE UCB
20 0.125 0.137 0.136 0.12
40 0.115 0.129 0.13 0.115
60 0.108 0.123 0.125 0.111
80 0.102 0.119 0.121 0.108
100 0.097 0.114 0.118 0.105
120 0.093 0.11 0.115 0.102
140 0.09 0.106 0.112 0.1
160 0.087 0.103 0.109 0.098
180 0.084 0.1 0.106 0.096
200 0.082 0.097 0.104 0.094

Table 2. Synthetic experiment with Gaussian distributed alternatives and standard
deviations uniformly sampled in U(1, 2.5): average regrets over 5000 repetitions for L
total assessments. Bold notation is used for regrets which are statistically significantly
worse (paired permutation test pv < 0.05) than the SELBEST performance.

L SELBEST GREEDY IE UCB
20 0.267 0.282 0.284 0.26
40 0.254 0.272 0.275 0.249
60 0.243 0.266 0.269 0.242
80 0.233 0.26 0.264 0.235
100 0.225 0.254 0.258 0.229
120 0.218 0.25 0.254 0.224
140 0.212 0.245 0.25 0.219
160 0.206 0.24 0.246 0.215
180 0.201 0.236 0.242 0.212
200 0.196 0.232 0.238 0.208

Table 3. Synthetic experiment with chi-squared distributed alternatives: average re-
grets over 5000 repetitions for L total assessments. Bold notation is used for regrets
which are statistically significantly worse (paired permutation test pv < 0.05) than the
SELBEST performance.

L SELBEST GREEDY IE UCB
20 0.082 0.092 0.089 0.082
40 0.075 0.086 0.084 0.079
60 0.07 0.082 0.08 0.075
80 0.066 0.078 0.077 0.073
100 0.063 0.075 0.074 0.071
120 0.06 0.072 0.072 0.069
140 0.058 0.069 0.07 0.067
160 0.056 0.067 0.068 0.066
180 0.054 0.065 0.066 0.065
200 0.053 0.063 0.064 0.064
Table 4. Datasets used for feature selection and their number of features

Dataset Number of variables
Abalone 10
Ailerons 40
AutoPrice 16
Bank32fh 32
Bank32fm 32
Bank32nh 32
Bank32nm 32
Bank-8fh 8
Bank-8fm 8
Bank-8nh 8
Bank-8nm 8
Bodyfat 13
Census 137
Elevators 18
Housing 13
Kin32fh 32
Kin32fm 32
Kin32nh 32
Kin32nm 32
Kin8fh 8
Kin8fm 8
Kin8nh 8
Kin8nm 8
Mpg 7
Ozone 8
Pol 48

Table 5. Real data experiment: average relative regrets over 500 repetitions for L total
assessments. Bold notation is used for regrets which are statistically significantly worse
(paired permutation test pv< 0.05) than the SELBEST performance.

L SELBEST GREEDY IE UCB
20 19.07 19.68 19.78 19.59
40 18.56 19.56 19.74 19.56
60 18.17 19.24 19.65 19.42
80 17.94 19.27 19.68 19.27
100 17.55 18.99 19.59 19.12

Note that each alternative corresponds to a different feature set with a num-
ber of inputs equal to five. The goal of the budgeted model selection is to return
among the K alternatives the one which has the highest generalization accuracy
μk in the test set. The computation of the leave-one-out errors (providing the values $z_k^l$)
and of the generalization accuracy (providing the values μk) are done using a
5-nearest neighbour. Table 5 reports, for each technique and for different values of
L, the average relative regret (to be minimized)

$$\frac{\mu^* - \sum_{k=1}^K p_k^l \mu_k}{\mu^*},$$

that is, the regret (2) normalized with respect to the maximal gain. The average is
done over the 26 datasets and 500 repetitions (each characterized by a resampling
of the training errors).
The results indicate that the SELBEST technique is significantly better than
the GREEDY and the IE technique for all ranges of L. As far as the comparison
with UCB is concerned, the results show that while SELBEST and UCB are
comparable for low values of L, the bandit technique is outperformed when L
increases. This is compatible with the fact that the bandit technique has not been
designed for budgeted model selection tasks and as a consequence its effectiveness
is reduced when the exploration phase is sufficiently large to take advantage of
selecting-the-best approaches.

5 Conclusion

Model selection is more and more confronted with tasks characterised by a huge
number of alternatives with respect to the amount of learning data and the
time allowed for assessment. This is the case in bioinformatics, text mining and
large-scale optimization problems (for instance Monte Carlo tree searches in
games [10]). These new challenges call for techniques able to compare stochas-
tic configurations more rapidly and with a smaller budget of observations than
conventional leave-one-out techniques. Bandit techniques are at the moment the
most commonly used approaches, in spite of the fact that they have not been
conceived for problems where assessment and selection are sequential and not
intertwined as in multi-armed problems. This paper explores an alternative
family of approaches which relies on notions of stochastic optimisation and low-
variate approximation of a large number of alternatives. The promising results
open the way to additional validations in real configurations, like feature selec-
tion in datasets with a large feature-to-sample ratio, notably in bioinformatics
and text mining.

References
1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed
bandit problem. Machine Learning 47(2/3), 235–256 (2002)
2. Caelen, O.: Sélection Séquentielle en Environnement Aléatoire Appliquée à
l’Apprentissage Supervisé. PhD thesis, ULB (2009)
3. Caelen, O., Bontempi, G.: Improving the exploration strategy in bandit algorithms.
In: Proceedings of Learning and Intelligent OptimizatioN LION II, pp. 56–68 (2007)
4. Caelen, O., Bontempi, G.: On the evolution of the expected gain of a greedy action
in the bandit problem. Technical report, Département d’Informatique, Université
Libre de Bruxelles, Brussels, Belgium (2008)
5. Caelen, O., Bontempi, G.: A dynamic programming strategy to balance exploration
and exploitation in the bandit problem. In: Annals of Mathematics and Artificial
Intelligence (2010)
6. Claeskens, G., Hjort, N.L.: Model selection and model averaging. Cambridge Uni-
versity Press, Cambridge (2008)
7. Clark, C.E.: The greatest of a finite set of random variables. Operations Research,
145–162 (1961)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
9. Inoue, K., Chick, S.E., Chen, C.-H.: An empirical evaluation of several methods
to select the best system. ACM Transactions on Modeling and Computer Simula-
tion 9(4), 381–407 (1999)
10. Iolis, B., Bontempi, G.: Comparison of selection strategies in Monte Carlo tree
search for computer poker. In: Proceedings of the Annual Machine Learning Con-
ference of Belgium and The Netherlands, BeNeLearn 2010 (2010)
11. Kim, S., Nelson, B.: Selecting the Best System. In: Handbooks in Operations Re-
search and Management Science: Simulation. Elsevier, Amsterdam (2006)
12. Law, A.M., Kelton, W.D.: Simulation Modeling & Analysis, 2nd edn. McGraw-Hill
International, New York (1991)
13. Madani, O., Lizotte, D., Greiner, R.: Active model selection. In: Proceedings of
the Twentieth Annual Conference on Uncertainty in Artificial Intelligence
(UAI 2004), pp. 357–365. AUAI Press (2004)
14. Maron, O., Moore, A.W.: The racing algorithm: Model selection for lazy learners.
Artificial Intelligence Review 11(1-5), 193–225 (1997)
15. Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the
American Mathematical Society 58(5), 527–535 (1952)
16. Schneider, J., Moore, A.: Active learning in discrete input spaces. In: Proceedings
of the 34th Interface Symposium (2002)
A Robust Ranking Methodology Based on
Diverse Calibration of AdaBoost

Róbert Busa-Fekete¹,², Balázs Kégl¹,³, Tamás Éltető³, and György Szarvas²,⁴

¹ Linear Accelerator Laboratory (LAL), University of Paris-Sud,
CNRS Orsay, 91898, France
² Research Group on Artificial Intelligence of the Hungarian Academy of Sciences
and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary
³ Computer Science Laboratory (LRI), University of Paris-Sud,
CNRS and INRIA-Saclay, 91405 Orsay, France
⁴ Ubiquitous Knowledge Processing (UKP) Lab, Computer Science Department,
Technische Universität Darmstadt, D-64289 Darmstadt, Germany

Abstract. In subset ranking, the goal is to learn a ranking function
that approximates a gold standard partial ordering of a set of objects
(in our case, relevance labels of a set of documents retrieved for the
same query). In this paper we introduce a learning to rank approach
to subset ranking based on multi-class classification. Our technique can
be summarized in three major steps. First, a multi-class classification
model (AdaBoost.MH) is trained to predict the relevance label of each
object. Second, the trained model is calibrated using various calibra-
tion techniques to obtain diverse class probability estimates. Finally, the
Bayes-scoring function (which optimizes the popular Information Re-
trieval performance measure NDCG) is approximated through mixing
these estimates into an ultimate scoring function. An important novelty
of our approach is that many different methods are applied to estimate
the same probability distribution, and all these hypotheses are combined
into an improved model. It is well known that mixing different condi-
tional distributions according to a prior is usually more efficient than
selecting one “optimal” distribution. Accordingly, using all the calibra-
tion techniques, our approach does not require the estimation of the best
suited calibration method and is therefore less prone to overfitting. In an
experimental study, our method outperformed many standard ranking
algorithms on the LETOR benchmark datasets, most of which are based
on significantly more complex learning to rank algorithms than ours.

Keywords: Learning-to-rank, AdaBoost, Class Probability Calibration.

1 Introduction

In the past, the result lists in Information Retrieval were ranked by probabilistic
models, such as the BM25 measure [16], based on a small number of attributes
(the frequency of query terms in the document, in the collection, etc.). The
parameters of these models were usually set empirically. As the number of useful


features increases, these manually crafted models become increasingly laborious
to configure. Alternatively, one can use as many (possibly redundant) attributes
as possible, and employ Machine Learning techniques to induce a ranking model.
This approach alleviates the human effort needed to design the ranking function,
and also provides a natural way to directly optimize the retrieval performance
for any particular application and evaluation metric. As a result, Learning to
Rank has gained considerable research interest in the past decade.
Machine Learning based ranking systems are traditionally classified into three
categories. In the simplest pointwise approach, the instances are first assigned
a relevance score using classical regression or classification techniques, and then
ranked by posterior scores obtained using the trained model [12]. In the pairwise
approach, the order of pairs of instances is treated as a binary label and learned
by a classification method [8]. Finally, in the most complex listwise approach,
the fully ranked lists are learned by a tailor-made learning method which aims
to optimize a ranking-specific evaluation metric during the learning process [18].
In web page ranking or subset ranking [7] the training data is given in the
form of query-document-relevance label triplets. The relevance label of a train-
ing instance indicates the usefulness of the document to its corresponding query,
and the ranking for a particular query is usually evaluated using the (normal-
ized) Discounted Cumulative Gain ((N)DCG) or the Expected Reciprocal Rank
(ERR) [5] measures. It is rather difficult to extend classical learning methods
to directly optimize these evaluation metrics. However, since the DCG can be
bounded by the 0 − 1 loss [12], the traditional classification error can be consid-
ered as a surrogate function of DCG to be minimized.
Calibrating the output of a particular learning method, such as Support Vec-
tor Machines or AdaBoost, is crucial in applications with quality measures dif-
ferent from the 0 − 1 error [14]. Our approach is based on the calibration of a
multi-class classification model, AdaBoost [9]. In our setup, the class labels are
assumed to be random variables and the goal is the estimation of the probability
distribution of the class labels given a feature vector. An important novelty in
our approach is that instead of using a single calibration technique, we apply
several methods to estimate the same probability distribution and, in a final
step, we combine these estimates. Both the Bayesian paradigm and the Mini-
mum Description Length principle [15] suggest that it is usually more efficient
to mix different conditional distributions according to a prior than to select one
“optimal” distribution.
We use both regression-based calibration (RBC) and class probability-based
calibration (CPC) to transform the output scores of AdaBoost into relevance
label estimates that are comparable to each other. In the case of RBC, the real-
valued scores are obtained by a regression function fit to the output of AdaBoost
as the independent variable and the relevance labels as the dependent variable.
These scores are then used to rank the objects. In the case of CPC, the poste-
rior probability distribution is used to approximate the so-called Bayes-scoring
function [7], which is shown to optimize the expected DCG in a probabilistic
setup.
Fig. 1. The schematic overview of our approach. In the first level a multi-class method
(AdaBoost.MH) is trained using different hyperparameter settings. Then we calibrate
the multi-class models in many ways to obtain diverse scoring functions. In the last
step we simply aggregate the scoring functions using an exponential weighting.

The proper choice of the prior on the set of conditional distributions obtained
by the calibration of AdaBoost is an important decision in practice. In this paper,
we use an exponential scheme based on the quality of the rankings implied
by the conditional distributions (via their corresponding conditional ranking
functions) which is theoretically more well-founded than the uniformly weighted
aggregation used by McRank [12].
Figure 1 provides a structural overview of our system. Our approach belongs to
the simplest, pointwise category of learning-to-rank models. It is based on a series
of standard techniques: (i) multi-class classification, (ii) output score calibration,
and (iii) an exponentially weighted forecaster to combine the various hypotheses. As
opposed to previous studies, we found our approach to be competitive to the
standard methods (AdaRank, ListNET, rankSVM, and rankBoost) of the
theoretically more complex pairwise and listwise approaches.
We attribute this surprising result to the presence of label noise in the ranking
task and the robustness of our approach to this noise. Label noise is inherent in
web page ranking for multiple reasons. First, the relevance labels depend on hu-
man decisions, and so they are subjective and noisy. Second, the common feature
representations account only for simple keyword matches (such as query term
frequency in the document) or query-independent measures (such as PageRank)
that are unable to capture query-document relations with more complex seman-
tics like the use of synonyms, analogy, etc. Query-document pairs that are not
characterized well by the features can be considered as noise from the perspective
of the learning algorithm. Our method suites especially well to practical prob-
lems with label noise due to the robustness guaranteed by the meta-ensemble
step that combines a wide variety of hypotheses. As our results demonstrate, it
compares favorably to theoretically more complex approaches.
The paper is organized as follows. In Section 2 we provide a brief overview
of the related work. Section 3 describes the formal setup. Section 4 is devoted
to the description of the calibration techniques. The ensemble scheme is presented
in Section 5. In Section 6 we investigate the theoretical properties of our CPC
approach. Our experimental results are presented in Section 7 and we draw
conclusions in Section 8.

2 Related Work
Among the plethora of ranking algorithms, our approach is the closest to the
McRank algorithm [12]. We both use a multi-class classification algorithm at
the core (they use gradient boosting whereas we apply AdaBoost.MH). The
major novelties in our approach are that we use product base classifiers besides
the popular decision tree base classifiers and apply several different calibration
approaches. Both elements add more diversity to our models that we exploit
by a final meta-ensemble technique. In addition, McRank's implementation is
inefficient in the sense that the number of decision trees trained in each boosting
iteration is as large as the number of different classes in the dataset.
Even though McRank is not considered a state-of-the-art method itself, its
importance is unquestionable. It can be viewed as a milestone which proved
the raison d’etre of classification based learning-to-rank methods. It attracted
the attention of researchers working on learning-to-rank to classification-based
ranking algorithms. The most remarkable method motivated by McRank is
LambdaMart [19], which adapts the MART algorithm to the subset ranking
problem. In the Yahoo! Learning-to-rank Challenge this method achieved the
best performance in the first track [4].
In the Yahoo! challenge [4], a general conclusion was that listwise and pair-
wise methods achieved the best scores in general, but tailor-made pointwise
approaches also achieved very competitive results. In particular, the approach
presented here is based on our previous work [1]. The main contributions of this
work are that we evaluate a state of the art multiclass classification based ap-
proach on publicly available benchmark datasets, and that we present a novel
calibration approach, namely sigmoid-based class probability calibration (CPC),
which is theoretically better grounded than regression-based calibration. We
also provide an upper bound on the difference between the DCG value of the
Bayes optimal score function and the DCG value achieved by its estimate using
CPC.
In the LambdaMART [19] paper there is a second interesting contribution,
namely a linear combination scheme for two rankers with an O(n²) algorithm,
where n is the number of documents. This method is simply based on a line
search optimization over the convex combinations of two rankers. This rank-
ing combination is then used for adjusting the weights of weak learners. The
combination method has the appealing property that it gives the optimal convex
combination of two rankers. However, it is not obvious how to extend it to more
than two rankers, so it is not directly applicable to our setting.
3 Definition of the Ranking Problem


In this section we briefly summarize the part of [7] relevant to our approach,
and, at the same time, we introduce the notation that will be used in the rest of
the paper.
Let us assume that we are given a set of query objects Q = {Q1 , . . . , QM }.
For a query object Qk we define the set of feature vectors

$$D^k = \{x_1^k, \ldots, x_j^k, \ldots, x_{m_k}^k\},$$

where the $x_j^k$ are the real-valued feature vectors that encode the set of documents
retrieved for Qk . The upper index will always refer to the query index. When it
is not confusing, we will omit the query index and simply write xj for the jth
document of a given query.
The relevance grade of $x_j^k$ is denoted by $y_j^k$. The set of possible relevance
grades is Y = {γ1, . . . , γK}, usually referred to by relevance labels, i.e. integer
numbers up to a threshold: ℓ = 1, . . . , K. In this case, the relation between the
relevance grades and relevance labels is $y_j^k = 2^{\ell_j^k} + 1$, where $\ell_j^k$ is the relevance
label for response j given a query Qk.
The goal of the ranker is to output a permutation J = [j1, . . . , jm] over the
integers (1, . . . , m). The Discounted Cumulative Gain (DCG) is defined as

$$DCG(J, [y_{j_i}]) = \sum_{i=1}^m c_i\, y_{j_i}, \qquad (1)$$

where ci is the discount factor of the ith document in the permutation. The
most commonly used discount factor is $c_i = \frac{1}{\log(1+i)}$. One can also define the
normalized DCG (NDCG) score by dividing (1) by the DCG score of the best
permutation.
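As a concrete illustration of (1), a small sketch (our naming; 0-based document indices and the natural logarithm, whose base cancels in the NDCG ratio):

from math import log

def dcg(permutation, grades):
    # DCG (1) with discount c_i = 1 / log(1 + i); `permutation` lists
    # document indices in rank order.
    return sum(grades[j] / log(1 + i)
               for i, j in enumerate(permutation, start=1))

def ndcg(permutation, grades):
    # Normalize by the DCG of the best (grade-sorted) permutation.
    best = sorted(range(len(grades)), key=lambda j: grades[j], reverse=True)
    return dcg(permutation, grades) / dcg(best, grades)

grades = [3.0, 7.0, 1.0]
print(ndcg([1, 0, 2], grades))  # 1.0: the highest grade is ranked first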
We will consider $y_j^k$ as a random variable with a discrete probability distribution
$P[y_j^k = \gamma \,|\, x_j^k] = p^*_{y_j^k|x_j^k}(\gamma)$ over the relevance grades for document j and query
Qk. The Bayes-scoring function is defined as

$$v^*(x_j^k) = E_{p^*_{y_j^k|x_j^k}}\left[ y_j^k \right] = \sum_{\gamma \in Y} \gamma\, p^*_{y_j^k|x_j^k}(\gamma).$$

Since $y_j^k$ is a random variable, we can define the expected DCG for any permu-
tation J = [j1, . . . , jmk] as

$$\overline{DCG}(J, [y_{j_i}^k]) = \sum_{i=1}^{m_k} c_i\, E_{p^*_{y_{j_i}^k|x_{j_i}^k}}\left[ y_{j_i}^k \right] = \sum_{i=1}^{m_k} c_i\, v^*(x_{j_i}^k).$$

Let the optimal Bayes permutation $J^{*k} = [j_1^{*k}, \ldots, j_{m_k}^{*k}]$ over the documents of
query Qk be the one which maximizes the expected DCG-value, that is,

$$J^{*k} = \arg\max_J \overline{DCG}(J, [y_{j_i}^k]).$$
According to Theorem 1 of [7], $J^{*k}$ has the property that if $c_i > c_{i'}$ then for the
Bayes-scoring function it holds that $v^*(x_{j_i^{*k}}) > v^*(x_{j_{i'}^{*k}})$. Our goal is to estimate
$p^*_{y_j^k|x_j^k}(\gamma)$ by $p^A_{y_j^k|x_j^k}(\gamma)$, which defines the following scoring function

$$v^A(x_j^k) = E_{p^A_{y_j^k|x_j^k}}\left[ y_j^k \right] = \sum_{\gamma \in Y} \gamma\, p^A_{y_j^k|x_j^k}(\gamma), \qquad (2)$$

where the label A refers to the method that generates the probability esti-
mates.

4 The Calibration of Multi-class Classification Models


Our basic modeling tool is the multi-class AdaBoost.MH introduced by [17]. The in-
put training set is the set of feature vectors $X = \{x_1^1, \ldots, x_{m_1}^1, \ldots, x_1^M, \ldots, x_{m_M}^M\}$
and a set of labels $Z = \{z_1^1, \ldots, z_{m_1}^1, \ldots, z_1^M, \ldots, z_{m_M}^M\}$. Each feature vector
$x_j^k \in \mathbb{R}^d$ encodes a (query, document) pair. Each label vector $z_j^k \in \{+1, -1\}^K$
encodes the relevance label using a one-out-of-K scheme, that is, $z_{j,\ell}^k = 1$ if
$\ell_j^k = \ell$ and $-1$ otherwise. We used two well-boostable base learners, i.e. decision
trees and decision products [11]. Instead of using uniform weighting for the training
instances, we up-weighted relevant instances exponentially proportionally to their
relevance; so, for example, an instance $x_j^k$ with relevance label $\ell_j^k = 3$ was twice
as important in the global training cost as an instance with relevance label
$\ell_j^k = 2$, and four times as important as an instance with relevance label $\ell_j^k = 1$.
Formally, the initial (unnormalized) weight of the ℓth label of the instance $x_j^k$ is

$$w_{j,\ell}^k = \begin{cases} 2^{\ell_j^k} & \text{if } \ell_j^k = \ell, \\ 2^{\ell_j^k}/(K-1) & \text{otherwise.} \end{cases}$$

The weights are then normalized to sum to 1. This weighting scheme was moti-
vated by the evaluation metric: the weight of an instance in the NDCG score is
exponentially proportional to the relevance label of the instance itself.
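The weight initialization can be sketched in a few lines (a sketch with our naming, not the authors' implementation; relevance labels are assumed to run from 1 to K):

import numpy as np

def initial_weights(labels, K):
    # w[j, l] = 2**label_j if l is the correct label of instance j,
    # and 2**label_j / (K - 1) otherwise; normalized to sum to 1.
    n = len(labels)
    w = np.empty((n, K))
    for j, lab in enumerate(labels):
        w[j, :] = 2.0**lab / (K - 1)
        w[j, lab - 1] = 2.0**lab
    return w / w.sum()

print(initial_weights([1, 3, 2], K=3).round(3))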
AdaBoost.MH outputs a strong classifier $f^{(T)}(x) = \sum_{t=1}^T \alpha^{(t)} h^{(t)}(x)$, where
$h^{(t)}(x)$ is a $\{-1, +1\}^K$-valued base classifier (K is the number of relevance label
values), and $\alpha^{(t)}$ is its weight. In multi-class classification the elements of $f^{(T)}(x)$
are treated as posterior scores corresponding to the labels, and the predicted
label is $\hat{\ell}_j^k = \arg\max_{\ell=1,\ldots,K} f_\ell^{(T)}(x_j^k)$. When posterior probability estimates are
required, in the simplest case the output vector can be shifted into $[0, 1]^K$ using

$$\tilde{f}_\ell^{(T)}(x) = \frac{1}{2}\left(1 + \frac{f_\ell^{(T)}(x)}{\sum_{t=1}^T \alpha^{(t)}}\right),$$

and then the posterior probabilities can be obtained by simple normalization

$$p^{\mathrm{standard}}_{y_j^k|x_j^k}(\gamma_\ell) = \frac{\tilde{f}_\ell^{(T)}(x_j^k)}{\sum_{\ell'=1}^K \tilde{f}_{\ell'}^{(T)}(x_j^k)}. \qquad (3)$$
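In code, the shift-and-normalize step (3) reads, for a single document (a minimal sketch with our naming):

import numpy as np

def standard_posteriors(f_scores, alpha_sum):
    # Shift the K strong-classifier outputs into [0, 1] using the total
    # base-classifier weight alpha_sum = sum_t alpha^(t), then normalize.
    shifted = 0.5 * (1.0 + np.asarray(f_scores) / alpha_sum)
    return shifted / shifted.sum()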
4.1 Regression Based Pointwise Calibration

An obvious way to calibrate the AdaBoost output is to re-learn the relevance
grades by applying a regression method. Using the raw K-dimensional output
of AdaBoost we can obtain relevance grade estimates in the form
$y_j^k \approx g\left(f^{(T)}(x_j^k)\right)$, where $g : \mathbb{R}^K \to \mathbb{R}$ is the regression function. We shall refer to
this scheme as regression-based calibration (RBC).
In our experiments we used five different regression methods: Gaussian process
regression, logistic regression, linear regression, neural network regression, and
polynomial regression of degree between 2 and 5. From a practical point of view,
the relevance grade estimates provided by the different regression methods for
an unseen test document are on a different scale, so these values cannot be
aggregated in a direct way. In Section 5 we describe how we normalized these
values.

4.2 Class Probability Calibration and Its Implementation

Let us recall that the output of AdaBoost.MH is $f(x_i^k) = \left(f_\ell(x_i^k)\right)_{\ell=1,\ldots,K}$.
Then, the class-probability-based sigmoidal calibration for label ℓ is

$$p^{s_\Theta}_{y_i^k|x_i^k}(\gamma_\ell) = \frac{s_\Theta\left(f_\ell(x_i^k)\right)}{\sum_{\ell'=1}^K s_\Theta\left(f_{\ell'}(x_i^k)\right)} \qquad (4)$$

where the sigmoid function can be written as $s_{\Theta=\{a,b\}}(x) = \frac{1}{1+\exp(-a(x-b))}$. The
parameters of the sigmoid function can be tuned by minimizing a so-called tar-
get calibration function (TCF) LA (Θ, f ). Generally speaking, LA (Θ, f ) is a loss
function calculated using the relevance label probability distribution estimates
defined in (4). $L_A$ is parametrized by the parameter $\Theta$ of the sigmoid function and the multi-class classifier $f$ output by AdaBoost. $L_A(\Theta, f)$ is also naturally a function of the validation data set (which is not necessarily the same as the training set), but we will omit this dependency to ease the notation.
Given a TCF $L_A$ and a multi-class classifier $f$, our goal is to find the optimal calibration parameters

$$\Theta_{A,f} = \arg\min_\Theta L_A(\Theta, f).$$
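The paper does not fix a particular optimizer for $\Theta = \{a, b\}$; as one simple possibility, the sketch below evaluates the calibrated probabilities of (4) and minimizes a given TCF over a parameter grid (all names are ours):

    import numpy as np

    def cpc_probs(F, a, b):
        """Calibrated class probabilities of equation (4) for an
        (n, K) matrix F of AdaBoost output scores."""
        S = 1.0 / (1.0 + np.exp(-a * (F - b)))     # sigmoid s_Theta
        return S / S.sum(axis=1, keepdims=True)    # normalize per row

    def fit_cpc(F, labels, tcf, a_grid, b_grid):
        """Grid-search minimization of a target calibration function.
        `tcf(P, labels)` maps the (n, K) probability matrix on the
        validation set to a loss; e.g., with 0-based integer labels,
        the log-sigmoid TCF below is
        tcf = lambda P, l: -np.log(P[np.arange(len(l)), l]).sum()."""
        best = None
        for a in a_grid:
            for b in b_grid:
                loss = tcf(cpc_probs(F, a, b), labels)
                if best is None or loss < best[0]:
                    best = (loss, a, b)
        return best[1], best[2]                    # the minimizing (a, b)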

The output of this calibration step is a probability distribution $p^{A,f}_{y_i^k|x_i^k}(\cdot)$ on the relevance grades for each document in each query and a Bayes-scoring function $v^{A,f}(\cdot)$ defined in (2). We will refer to this scheme as class-probability-based calibration (CPC). The upper index $A$ refers to the type of the particular TCF, so the ensemble of probability distributions is indexed by the type $A$ and the multi-class classifier $f$ output by AdaBoost.

Given the ensemble of the probability distributions $p^{A,f}_{y_i^k|x_i^k}(\cdot)$ and an appropriately chosen prior $\pi(A, f)$, we follow a Bayesian approach and calculate a posterior conditional distribution by

$$p^{\mathrm{posterior}}_{y_i^k|x_i^k}(\cdot) = \sum_{A,f} \pi(A, f)\, p^{A,f}_{y_i^k|x_i^k}(\cdot).$$

Then we obtain a "posterior" estimate for the Bayes-scoring function

$$v^{\mathrm{posterior}}(x_i^k) = \sum_{\ell=1}^{K} \gamma_\ell\, p^{\mathrm{posterior}}_{y_i^k|x_i^k}(\gamma_\ell) = \sum_{\ell=1}^{K} \gamma_\ell \sum_{A,f} \pi(A, f)\, p^{A,f}_{y_i^k|x_i^k}(\gamma_\ell),$$

which can be written as

$$v^{\mathrm{posterior}}(x_i^k) = \sum_{A,f} \pi(A, f) \sum_{\ell=1}^{K} \gamma_\ell\, p^{A,f}_{y_i^k|x_i^k}(\gamma_\ell) = \sum_{A,f} \pi(A, f)\, v^{A,f}(x_i^k). \qquad (5)$$

The proper selection of the prior π(·, ·) can further increase the quality of the
posterior estimation. In Section 5 we will describe a reasonable prior definition
borrowed from the theory of experts.
In the simplest case, the TCF can be

$$L^{\mathrm{LS}}(\Theta, f) = -\sum_{k=1}^{M} \sum_{i=1}^{m_k} \log \frac{s_\Theta\big(f_{\ell_i^k}(x_i^k)\big)}{\sum_{\ell'=1}^{K} s_\Theta\big(f_{\ell'}(x_i^k)\big)}.$$

We refer to this function as the log-sigmoid TCF. The motivation of the log-
sigmoid TCF is that the resulting probability distribution minimizes the relative
entropy

$$D(p^*\|p) = \sum_{k=1}^{M} \sum_{i=1}^{m_k} \sum_{\ell} \Big[ -p^*_{y_i^k|x_i^k}(\gamma_\ell) \log p_{y_i^k|x_i^k}(\gamma_\ell) + p^*_{y_i^k|x_i^k}(\gamma_\ell) \log p^*_{y_i^k|x_i^k}(\gamma_\ell) \Big]$$

between the Bayes optimal probability distribution $p^*$ and $p$. In practice, distributions that are less (or more) uniform over the labels might be preferred. This preference can be expressed by introducing the entropy-weighted version of the log-sigmoid TCF, which can be written as


$$L_C^{\mathrm{EWLS}}(\Theta) = -\sum_{k=1}^{M} \sum_{i=1}^{m_k} \log \frac{s_\Theta\big(f_{\ell_i^k}(x_i^k)\big)}{\sum_{\ell'=1}^{K} s_\Theta\big(f_{\ell'}(x_i^k)\big)} \times H_M\!\left( \frac{s_\Theta\big(f_1(x_i^k)\big)}{\sum_{\ell'=1}^{K} s_\Theta\big(f_{\ell'}(x_i^k)\big)}, \ldots, \frac{s_\Theta\big(f_K(x_i^k)\big)}{\sum_{\ell'=1}^{K} s_\Theta\big(f_{\ell'}(x_i^k)\big)} \right)^{\!C},$$

where $H_M(p_1, \ldots, p_K) = \sum_{\ell=1}^{K} \big[ p_\ell(-\log p_\ell) \big]$, and $C$ is a hyperparameter. The
minimization in LS and EWLS TCF can be considered as an attempt to minimize
a cost function, the sum of the negative logarithms of the class probabilities.
Usually, there is a cost function for misclassification associated to the learning
task. This cost function can be used for defining the expected loss TCF

$$L^{\mathrm{EL}}(\Theta) = \sum_{k=1}^{M} \sum_{i=1}^{m_k} \sum_{\ell=1}^{K} \frac{L(\ell, \ell_i^k)\, s_\Theta\big(f_\ell(x_i^k)\big)}{\sum_{\ell'=1}^{K} s_\Theta\big(f_{\ell'}(x_i^k)\big)},$$

where $\ell_i^k$ is the correct label of $x_i^k$, and $L(\ell, \ell_i^k)$ expresses the loss if $\ell$ is predicted instead of the correct label. We used the standard square loss ($L_2$) setup, so $L(\ell, \ell') = (\ell - \ell')^2$.
If the labels have some structure, e.g., they are ordinal as in our case, it is
possible to calculate an expected label based on a CPC distribution. In this case
we can define the expected label loss TCF
$$L^{\mathrm{ELL}}(\Theta) = \sum_{k=1}^{M} \sum_{i=1}^{m_k} L\!\left( \sum_{\ell=1}^{K} \ell\, \frac{s_\Theta\big(f_\ell(x_i^k)\big)}{\sum_{\ell'=1}^{K} s_\Theta\big(f_{\ell'}(x_i^k)\big)},\ \ell_i^k \right).$$

Here, the goal is to minimize the incurred loss between the expected label and the correct label $\ell_i^k$. Here, too, we used $L_2$ as the loss function. Note that the definition of $L(\cdot,\cdot)$ might need to be updated if a weighted average of labels is to be calculated, as the weighted average might not be a label at all.
Finally, we can apply the idea of SmoothGrad [6] to obtain a TCF. In
SmoothGrad a smooth surrogate function is used to optimize the NDCG
metric. In particular, the soft indicator variable can be written as

$$h_{\Theta,\sigma}(x_i^k, x_{i'}^k) = \frac{\exp\!\Big(-\big(v^{s_\Theta}(x_{i'}^k) - v^{s_\Theta}(x_i^k)\big)^2 / \sigma\Big)}{\sum_{j=1}^{m_k} \exp\!\Big(-\big(v^{s_\Theta}(x_j^k) - v^{s_\Theta}(x_i^k)\big)^2 / \sigma\Big)}.$$

Then the TCF can be written in the form

$$L_\sigma^{\mathrm{SN}}(\Theta) = -\sum_{k=1}^{M} \sum_{i=1}^{m_k} \sum_{t=1}^{m_k} y_i^k\, c_t\, h_{\Theta,\sigma}(x_i^k, x_{j_t}^k),$$

where $J = [j_1, \ldots, j_m]$ is the permutation based on the scoring function $v^{s_\Theta}(\cdot)$.
The parameter $\sigma$ controls the smoothness of $L_\sigma^{\mathrm{SN}}$: the higher $\sigma$, the smoother the function, but the bigger the difference between the NDCG value and the value of the surrogate function. If $\sigma \to 0$ then $L_\sigma^{\mathrm{SN}}$ tends to the NDCG value but, at the same time, the surrogate function becomes harder to optimize. We refer to this TCF as the SN target calibration function.
Note that in our experience, the more diverse the set of TCFs, the better the performance of the ultimate scoring function; we could use other, even fundamentally different TCFs in our system.

5 Ensemble of Ensembles

The output of calibration is a set of relevance predictions $v^{A,f}(x, S)$ for each TCF type $A$ and AdaBoost output $f$. Each relevance prediction can be used as a scoring function to rank the query-document pairs represented by the vectors $x_i^k$. Until this point it is a pure pointwise approach, except that the smoothed version of NDCG was optimized in SN. To fine-tune the algorithm and to make use of
the diversity of our models, we combine them using an exponentially weighted forecaster [3]. The reason for using this particular weighting scheme is twofold. First, it is simple and computationally efficient to tune, which is important when we have a large number of models. Second, the theoretical guarantees on the cumulative regret of a mixture of experts on individual (model-less) sequences [3] make the technique robust against overfitting the validation set.

The weights of the models are tuned on the NDCG score of the ranking, giving a slight listwise touch to our approach. Formally, the final scoring function is obtained by using $\pi(A, f) = \exp(c\,\omega^{A,f})$. Plugging this into (5),

$$v^{\mathrm{posterior}}(x) = \sum_{A,f} \exp(c\,\omega^{A,f})\, v^{A,f}(x), \qquad (6)$$
where $\omega^{A,f}$ is the NDCG$_{10}$ score of the ranking obtained by using $v^{A,f}(x)$. The parameter $c$ controls the dependence of the weights on the NDCG$_{10}$ values.
A similar ensemble method can be applied to outputs calibrated by regres-
sion methods. A major difference between the two types of calibration is that
the regression-based scores have to be normalized/rescaled before the exponen-
tially weighted ensemble scheme is applied. We simply rescaled the output of
the regression models into [0, 1] before using them in the exponential ensemble
scheme.
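A minimal sketch of this combination step (equation (6)) for one query, assuming the model scores have already been rescaled where necessary; names are ours:

    import numpy as np

    def ensemble_scores(V, ndcg10, c):
        """Exponentially weighted mixture of calibrated scoring
        functions.  V is an (n_models, n_docs) array whose rows hold
        the scores v^{A,f}(x) of one query's documents, `ndcg10` the
        per-model validation NDCG_10 values omega^{A,f}."""
        weights = np.exp(c * np.asarray(ndcg10))   # pi(A, f) = exp(c omega)
        return weights @ V                         # v_posterior per document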

6 DCG Bound for Class Probability Estimation

In (2) we described a way to obtain a scoring function $v$ based on the estimate of the probability distribution of relevance grades. Based on the estimated scoring function $v$, it is straightforward to obtain a ranking and the associated permutation on the set of documents $D$.¹ More formally, let $J^v = [j_1^v, \ldots, j_m^v]$ be the permutation that sorts the documents in decreasing order of score, that is, $v(x_{j_1^v}) \ge v(x_{j_2^v}) \ge \cdots \ge v(x_{j_m^v})$.

The following proposition gives an upper bound on the difference between the DCG value of the Bayes optimal scoring function and the DCG value achieved by its estimate, in terms of the quality of the relevance probability estimate.

Proposition: Let $p, q \in [1, \infty]$ and $1/p + 1/q = 1$. Then

$$\mathrm{DCG}(J^*, [y_{j_i^*}]) - \mathrm{DCG}(J^v, [y_{j_i^v}]) \le \left( \sum_{i=1}^{m} \sum_{\gamma \in Y} \big| (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\,\gamma \big|^p \right)^{1/p} \left( \sum_{i=1}^{m} \sum_{\gamma \in Y} \big| p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma) \big|^q \right)^{1/q},$$

where $\hat{j}_i^v$ and $\hat{j}_i^*$ are the inverse permutations of $j_i^v$ and $j_i^*$. The relation between $p_{y_i|x_i}(\cdot)$ and $v(\cdot)$ is defined in (2).
¹ In this section we omit the indexing over the queries.

Proof: Following the lines of Theorem 2 of [7],

$$\begin{aligned}
\mathrm{DCG}(J^v, [y_{j_i^v}]) &= \sum_{i=1}^{m} c_i v^*(x_{j_i^v}) = \sum_{i=1}^{m} c_i v(x_{j_i^v}) + \sum_{i=1}^{m} c_i \big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big)\\
&\ge \sum_{i=1}^{m} c_i v(x_{j_i^*}) + \sum_{i=1}^{m} c_i \big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big)\\
&= \sum_{i=1}^{m} c_i v^*(x_{j_i^*}) + \sum_{i=1}^{m} c_i \big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big) + \sum_{i=1}^{m} c_i \big(v(x_{j_i^*}) - v^*(x_{j_i^*})\big)\\
&= \mathrm{DCG}(J^*, [y_{j_i^*}]) + \sum_{i=1}^{m} c_i \big(v^*(x_{j_i^v}) - v(x_{j_i^v})\big) + \sum_{i=1}^{m} c_i \big(v(x_{j_i^*}) - v^*(x_{j_i^*})\big).
\end{aligned}$$

Here $\sum_{i=1}^{m} c_i v(x_{j_i^v}) \ge \sum_{i=1}^{m} c_i v(x_{j_i^*})$, because $J^v$ is an optimal permutation for the scoring function $v$. Then,

$$\mathrm{DCG}(J^*, [y_{j_i^*}]) - \mathrm{DCG}(J^v, [y_{j_i^v}]) \le \sum_{i=1}^{m} (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\big(v(x_i) - v^*(x_i)\big) = \sum_{i=1}^{m} \sum_{\gamma \in Y} (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\,\gamma\,\big(p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma)\big),$$

where $\hat{j}_i^v$ and $\hat{j}_i^*$ are the inverse permutations of $j_i^v$ and $j_i^*$. Then the Hölder inequality implies that

$$\sum_{i=1}^{m} \Big| \sum_{\gamma \in Y} (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\,\gamma\,\big(p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma)\big) \Big| \le \left( \sum_{i=1}^{m} \sum_{\gamma \in Y} \big| (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\,\gamma \big|^p \right)^{1/p} \left( \sum_{i=1}^{m} \sum_{\gamma \in Y} \big| p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma) \big|^q \right)^{1/q}. \qquad \square$$

Corollary:

$$\mathrm{DCG}(J^*, [y_{j_i^*}]) - \mathrm{DCG}(J^v, [y_{j_i^v}]) \le C \cdot \left( \sum_{i=1}^{m} \sum_{\gamma \in Y} \big| p_{y_i|x_i}(\gamma) - p^*_{y_i|x_i}(\gamma) \big|^q \right)^{1/q},$$

where

$$C = \max_{\hat{j}^v, \hat{j}^*} \left( \sum_{i=1}^{m} \sum_{\gamma \in Y} \big| (c_{\hat{j}_i^v} - c_{\hat{j}_i^*})\,\gamma \big|^p \right)^{1/p},$$

and $\hat{j}^v$ and $\hat{j}^*$ are permutations of $1, \ldots, m$.
The Corollary shows that as the distance between the “exact” and the esti-
mated conditional distributions over the relevance labels tends to 0, the difference
in the DCG values also tends to 0.

Table 1. The statistics of the datasets we used in our experiments

                        Number of    Number of   Number of   Docs. per
                        documents    queries     features    query
  LETOR 3.0/Ohsumed     16140        106         45          152
  LETOR 4.0/MQ2007      69623        1692        46          41
  LETOR 4.0/MQ2008      15211        784         46          19

7 Experiments
In our experiments we used the Ohsumed dataset taken from LETOR 3.0 and
both datasets of LETOR 4.0.² We are only interested in datasets that contain more than two levels of relevance. On the one hand, this has a technical reason: calibration for binary relevance labels does not make much sense. On the other hand, we believe that in this case the difference between various learning algorithms is more significant. All LETOR datasets we used contain 3 levels of
relevance. We summarize their main statistics in Table 1.
For each LETOR dataset there is a 5-fold train/valid/test split given. We used
this split except that we divided the official train set by a random 80% − 20%
split into training and calibration sets which were used to adjust the parameters
of the different calibration methods. We did not apply any feature engineering
or preprocessing to the official feature set. The NDCG values we report in this
section have been calculated using the provided evaluation tools.
We compared our algorithm to five state-of-the-art ranking methods whose
outputs are available at the LETOR website for each dataset we used:

1. AdaRank-MAP [20]: a listwise boosting approach aiming to optimize MAP.


2. AdaRank-NDCG [20]: a listwise boosting approach with NDCG as the objective function, optimized by the AdaBoost mechanism.
3. ListNet [2]: a probabilistic listwise method which employs cross entropy
loss as the listwise loss function in gradient descent.
4. RankBoost [8]: a pairwise approach which casts the ranking problem into
a binary classification task. In each boosting iteration the weak classifier is
chosen based on NDCG instead of error rate.
5. RankSVM [10]: a pairwise method based on SVM which, like RankBoost, casts the ranking problem into a binary classification task.

To train AdaBoost.MH, we used our open source implementation available


at multiboost.org. We did not validate the hyperparameters of the weak learners. Instead, we calibrated and used all the trained models with various numbers of tree leaves and product terms. The number of tree leaves ranged from 5 to 45
uniformly with a step size of 5, and the number of product terms from 2 to
10. The training was performed on a grid3 which allowed us to fully parallelize
the training process, and thus it took less than one day to obtain all strong
classifiers.
² http://research.microsoft.com/en-us/um/beijing/projects/letor/
³ http://www.egi.eu

We only tuned the number of iterations T and the base parameter c in the ex-
ponential weighting scheme (6) on the validation set. In the exponential weighting
combination (6) we set the weights using the NDCG10 performances of the cali-
brated models and c and T were selected based on the performance of v posterior (·)
in terms of NDCG10 . The hyperparameter optimization was performed using a
simple grid search where c ranged from 0 (corresponding to uniform weighting)
to 200 and for T from 10 to 10000. Interestingly, the best number of iterations is
very low compared to the ones reported by [11] for classification tasks. For LETOR
3.0 the best number of iterations is T = 100 and for both LETOR 4.0 datasets
T = 50. The best base parameter is c = 100 for all databases. This value is rela-
tively high considering that it is used in the exponent, but the performances of the
best models were relatively close to each other. We used the fixed parameters $C = 2$ in the TCF $L_C^{\mathrm{EWLS}}$, and $\sigma = 0.01$ in $L_\sigma^{\mathrm{SN}}$.

7.1 Comparison to Standard Learning to Rank Methods


[Fig. 2. NDCG$_k$ values on the LETOR datasets, plotted against position $k$; the NDCG$_{10}$ values are magnified to show the differences. Panels: (a) LETOR 3.0/Ohsumed, (b) LETOR 4.0/MQ2007, (c) LETOR 4.0/MQ2008. Curves: AdaRank-MAP, AdaRank-NDCG, ListNet, RankBoost, RankSVM, and the exponentially weighted ensemble.]

Figure 2 shows the NDCG$_k$ values for different truncation levels $k$. Our approach consistently outperforms the baseline methods for almost every truncation level on the LETOR 3.0 and MQ2008 datasets. In the case of LETOR 3.0, our method is noticeably better for high truncation levels, whereas it outperforms only slightly

Table 2. The NDCG values for various ranking algorithms. In the middle three lines the results of our method are shown using only CPC, only RBC, and both.

  Method                          LETOR 3.0    LETOR 4.0    LETOR 4.0
  Database                        Ohsumed      MQ2007       MQ2008
  Eval. metric                    NDCG10       Avg. NDCG    Avg. NDCG
  AdaRank-MAP                     0.4429       0.4891       0.4915
  AdaRank-NDCG                    0.4496       0.4914       0.4950
  ListNet                         0.4410       0.4988       0.4914
  RankBoost                       0.4302       0.5003       0.4850
  RankSVM                         0.4140       0.4966       0.4832
  Exp. w. ensemble with CPC       0.4621       0.4975       0.4998
  Exp. w. ensemble with RBC       0.4493       0.4976       0.5004
  Exp. w. ensemble with CPC+RBC   0.4561       0.4974       0.5006
  AdaBoost.MH+Decision Tree       0.4164       0.4868       0.4843
  AdaBoost.MH+Decision Product    0.4162       0.4785       0.4768

but consistently the baseline methods on MQ2008. The picture is not so clear for
MQ2007 because our approach shows some improvement only for low truncation
levels.4
Table 2 shows the average NDCG values for LETOR 4.0 along with NDCG$_{10}$ for LETOR 3.0 (the evaluation tool for LETOR 3.0 does not output the average NDCG). We also calculate the performance of our approach when we put only RBC, only CPC, and both calibration ensembles into the pool of score functions used in the aggregation step (the three rows in the middle of Table 2). Our approach consistently achieves the best performance among all methods.

We also evaluate the original AdaBoost.MH with decision trees and decision products, i.e., without our calibration and ensemble setup (last two rows in Table 2). The posteriors were calculated according to (3), and here we validated (i) the iteration number and (ii) the hyperparameters of the base learners (the number of leaves and the number of terms) in order to select a single best setting. Thus, these runs correspond to a standard classification approach using AdaBoost.MH. These results show that our ensemble learning scheme, in which we calibrate the individual classifiers, improves significantly on the standard AdaBoost.MH ranking setup.

7.2 The Diversity of CPC Outputs


To investigate how diverse the score values of the different class probability calibrated models are, we compared the scores obtained by the five CPC methods described in Section 4.2 using a t-test. We obtained 5 p-values for each CPC pair. Then we applied Fisher's method to get one overall p-value, assuming that these 5 p-values come from independent statistical tests. Here, we used the output of boosted trees only, with the number of tree leaves set to 30.
⁴ The official evaluation tool of the LETOR datasets returns a zero NDCG$_k$ value for a given query if $k$ is bigger than the number of relevant documents. This results in poor performance of the ranking methods for $k \ge 9$ in the case of MQ2008. In our ensemble scheme, we used the NDCG$_{10}$ values calculated according to (1) with a truncation level of 10.

Fig. 3. The p-values for different calibrations obtained by Fisher's method on foldwise p-values of the t-test, LETOR 4.0/MQ2007. The calibration is calculated according to (3).

The results in Figure 3 indicate that for a subset of TCFs, the estimated
probability distributions were quite close to each other. Although the TCFs are
rather different, it seems that they approximate a similar distribution with just
small differences. We believe that one reason for the experienced efficiency of
the proposed method is that these small differences within the cluster are due
to the estimation noise, so by mixing them, the level of the noise decreases.

8 Conclusions
In this paper we presented a simple learning-to-rank approach based on multi-
class classification, model calibration, and the aggregation of scoring functions.
We showed that this approach is competitive with more complex methods such as
RankSVM or ListNet on three benchmark datasets. We suggested the use of a
sigmoid-based class probability calibration which is theoretically better grounded
than regression-based calibration, and thus we expected it to yield better results.
Interestingly, this expectation was confirmed only for the Ohsumed dataset which
is the most balanced set in terms of containing a relatively high number of
highly relevant documents. This suggests that CPC has an advantage over RBC
when all relevance levels are well represented in the data. Nevertheless, the CPC
method was strongly competitive on the other two datasets as well, and it also
has the advantage of coming with an upper bound on the NDCG measure.
Finally, we found AdaBoost to overfit the NDCG score for a low number of iterations during the validation process. This fact indicates the presence of label noise in the learning-to-rank datasets, according to experiments conducted by [13] using artificial data. We note here that noise might come either from real noise in the labeling, or from the deficiency of the overly simplistic feature representation, which is unable to capture nontrivial semantics between a query and a document. As future work, we plan to investigate the robustness of our method to label noise using synthetic data, since this is an important issue in a learning-to-rank application: while noise due to labeling might be reduced simply by improving the consistency of the data, it is less trivial to obtain significantly

more complex feature representations. That said, the development of learning-to-rank approaches that are proven to be robust to label noise is of great practical importance.

Acknowledgments. This work was supported by the ANR-2010-COSI-002 grant of the French National Research Agency.

References
1. Busa-Fekete, R., Kégl, B., Éltető, T., Szarvas, G.: Ranking by calibrated AdaBoost.
In: JMLR W&CP, vol. 14, pp. 37–48 (2011)
2. Cao, Z., Qin, T., Liu, T., Tsai, M., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th International Conference on Machine Learning, pp. 129–136 (2007)
3. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge Uni-
versity Press, New York (2006)
4. Chapelle, O., Chang, Y.: Yahoo! Learning to Rank Challenge Overview. In: Yahoo
Learning to Rank Challenge (JMLR W&CP), Haifa, Israel, vol. 14, pp. 1–24 (2010)
5. Chapelle, O., Metlzer, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for
graded relevance. In: Proceeding of the 18th ACM Conference on Information and
Knowledge Management, pp. 621–630. ACM, New York (2009)
6. Chapelle, O., Wu, M.: Gradient descent optimization of smoothed information
retrieval metrics. Information Retrievel 13(3), 216–235 (2010)
7. Cossock, D., Zhang, T.: Statistical analysis of Bayes optimal subset ranking. IEEE
Transactions on Information Theory 54(11), 5140–5154 (2008)
8. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for
combining preferences. Journal of Machine Learning Research 4, 933–969 (2003)
9. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences 55,
119–139 (1997)
10. Herbrich, R., Graepel, T., Obermayer, K.: Large margin rank boundaries for ordinal regression. In: Smola, A.J., Schölkopf, B. (eds.) Advances in Large Margin Classifiers, pp. 115–132. MIT Press, Cambridge (2000)
11. Kégl, B., Busa-Fekete, R.: Boosting products of base classifiers. In: International
Conference on Machine Learning, Montreal, Canada, vol. 26, pp. 497–504 (2009)
12. Li, P., Burges, C., Wu, Q.: McRank: Learning to rank using multiple classification
and gradient boosting. In: Advances in Neural Information Processing Systems,
vol. 19, pp. 897–904. The MIT Press, Cambridge (2007)
13. Mease, D., Wyner, A.: Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research 9, 131–156 (2008)
14. Niculescu-Mizil, A., Caruana, R.: Obtaining calibrated probabilities from boosting.
In: Proceedings of the 21st International Conference on Uncertainty in Artificial
Intelligence, pp. 413–420 (2005)
15. Rissanen, J.: A universal prior for integers and estimation by minimum description
length. Annals of Statistics 11, 416–431 (1983)
16. Robertson, S., Zaragoza, H.: The probabilistic relevance framework: BM25 and
beyond. Found. Trends Inf. Retr. 3, 333–389 (2009)

17. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated
predictions. Machine Learning 37(3), 297–336 (1999)
18. Valizadegan, H., Jin, R., Zhang, R., Mao, J.: Learning to rank by optimizing NDCG
measure. In: Advances in Neural Information Processing Systems, vol. 22, pp. 1883–
1891 (2009)
19. Wu, Q., Burges, C.J.C., Svore, K.M., Gao, J.: Adapting boosting for information
retrieval measures. Inf. Retr. 13(3), 254–270 (2010)
20. Xu, J., Li, H.: AdaRank: a boosting algorithm for information retrieval. In: SIGIR
2007: Proceedings of the 30th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 391–398. ACM, New York
(2007)
Active Learning of Model Parameters for Influence
Maximization

Tianyu Cao¹, Xindong Wu¹, Tony Xiaohua Hu², and Song Wang¹

¹ Department of Computer Science, University of Vermont, USA
² College of Information Science and Technology, Drexel University, USA

Abstract. Previous research efforts on the influence maximization problem assume that the network model parameters are known beforehand. However, this is rarely true in real world networks. This paper deals with the situation when the network information diffusion parameters are unknown. To this end, we firstly
network information diffusion parameters are unknown. To this end, we firstly
examine the parameter sensitivity of a popular diffusion model in influence maxi-
mization, i.e., the linear threshold model, to motivate the necessity of learning the
unknown model parameters. Experiments show that the influence maximization
problem is sensitive to the model parameters under the linear threshold model. In
the sequel, we formally define the problem of finding the model parameters for
influence maximization as an active learning problem under the linear threshold
model. We then propose a weighted sampling algorithm to solve this active learn-
ing problem. Extensive experimental evaluations on five popular network datasets
demonstrate that the proposed weighted sampling algorithm outperforms pure
random sampling in terms of both model accuracy and the proposed objective
function.

Keywords: Influence maximization, Social network analysis, Active Learning.

1 Introduction
Social networks have become a hot research topic recently. Popular social networks
such as Facebook and Twitter are widely used. An important application based on so-
cial networks is the so-called “viral marketing”, the core part of which is the influence
maximization problem [5, 11, 9].
A social network is modeled as a graph G = (V, E), where V is the set of users
(nodes) in the network, and E is the set of edges between nodes, representing the con-
nectivity and relationship of users in that network. Under this model, the influence max-
imization problem in a social network is defined as extracting a set of k nodes to target
for initial activation such that these k nodes yield the largest expected spread of in-
fluence, or interchangeably, the largest diffusion size (i.e., the largest number of nodes
activated), where k is a pre-specified positive integer. Two information diffusion mod-
els, i.e., the independent cascade model (IC model) and the linear threshold model (LT
model) are usually used as the underlying information diffusion models. The influence
maximization problem has been investigated extensively recently [3, 2, 4, 1, 14].
To the best of our knowledge, all previous algorithms on influence maximization
assume that the model parameters (i.e., diffusion probabilities in the IC model and


thresholds in the LT model) are given. However, this is rarely true in real world so-
cial networks. In this paper we relax this constraint for the LT model, assuming that
the model parameters are unknown beforehand. Instead, we propose a framework of
active learning to obtain those parameters. In this work, we focus on learning the model
parameters under the LT model since it is relatively simple, and we will investigate the
same problem under the IC model in the future.
Learning the information diffusion models has been studied in [7, 13, 12]. However,
there is a problem with the methods from [7, 13, 12], as these methods assume that a
certain amount of information diffusion data (propagation logs in [7]) on the network
is available. This data is usually held by the social network site and not immediately
available to outsiders. In some cases, it is not available at all due to privacy considera-
tions. Consider a scenario in which we wish to use "viral marketing" techniques to market some products on Facebook: most likely we cannot get any information diffusion data on Facebook due to privacy reasons.
Therefore, we need to actively construct the data of information diffusion in or-
der to learn the information diffusion model parameters. This naturally falls into the
framework of active learning. There are two advantages to using the active learning approach. Firstly, we are no longer restricted by the social network sites' privacy terms.
Secondly we have the additional advantage that we can explicitly control what social
influence to measure. For example, we can restrict the influence scope to either “mu-
sic” or “computer devices”. Based on the scope, we can learn the diffusion model with
a finer granularity. Therefore we can do influence maximization on the “music” prod-
ucts and the “computer device” products separately. Intuitively, the influential nodes of
“music” products and the “computer devices” should be different.
A simple way to construct the information diffusion data is to send free products
to some users and see how their social neighbors react. All social neighbors’ reaction
constitutes the information diffusion data. This is our basic idea to acquire the diffu-
sion data. Now we rephrase this process in the context of information diffusion. We set
some nodes in a network to be active. Then we observe which nodes become active at
the following time steps. The observed activation sequences can be used as the infor-
mation diffusion data. We can then make inference of the model parameters based on
this observed activation sequences.
In this context, we would naturally want to achieve the following two goals: (1) we
would like to send as few free products as possible to learn the information diffusion
model as accurately as possible; (2) we would like to make sure the learned diffusion
model is useful for influence maximization. Ultimately, we would like to make sure
that the set of influential nodes found by using a greedy algorithm on a learned model is
more influential than that found by the greedy algorithm on a randomly guessed model.
Motivated by these two objectives, in this paper we firstly empirically show that the
influence maximization problem is sensitive to model parameters under the LT model.
We define the problem of active learning of the LT model for influence maximization.
In the sequel, we propose a weighted sampling algorithm to solve this active learn-
ing problem. Extensive experiments are conducted to evaluate our proposed algorithm
on five networks. Results show that the proposed algorithm outperforms pure random
sampling under the linear threshold model.

The rest of the paper is organized as follows. Section 2 reviews the preliminaries,
including the LT information diffusion model and the study of parameter sensitivity un-
der the LT model. We also define the problem of finding model parameters as an active
learning problem in this section. Section 3 details our weighted sampling algorithm to
learn the model parameters. Experimental results are shown in section 4. Finally, we
review related work in section 5 and conclude in section 6.

2 Preliminaries and Motivation

In this section, we first introduce the LT information diffusion model. We then detail
our study on the sensitivity of model parameters for the LT model, which motivates
the necessity to learn the model parameters when they are unknown. We then formally
define the problem of finding the model parameters for influence maximization as an
active learning problem.

2.1 The Linear Threshold Model

In [9], because the threshold of each node is unknown, the influence maximization
problem is defined as finding the set of nodes that can activate the largest expected
number of nodes under all possible threshold distributions. However, in our research, we assume that each node $n_i$ in the network has a fixed threshold $\theta_i$, which we intend to learn.
The linear threshold model [8, 9] assumes that a node $u$ will be activated if the fraction of its activated neighbors is larger than a certain threshold $\theta_u$. In the more general case, each neighbor $v$ may have a different weight $w(u, v)$ in node $u$'s decision. In this case, a node $u$ becomes active if the sum of the weights of its activated neighbors is greater than $\theta_u$. The requirement for a node $u$ to become active can be described by the following equation:

$$\sum_{v} w(u, v) \ge \theta_u,$$

where $v$ ranges over the activated neighbors of $u$. In our research, we focus on finding the set of $\theta_u$'s under the LT model.

For the convenience of presentation, we define the influence spread of a given set of nodes $S$ as the number of nodes that are activated by the set $S$ when the diffusion process terminates. In this paper we focus on the influence maximization problem on the LT model with fixed, static, unknown thresholds.
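To make the activation rule concrete, the following is a minimal sketch of LT diffusion with fixed thresholds (the names are ours; it assumes the incoming weights of each node are scaled so that thresholds in [0, 1] are meaningful, e.g., normalized to sum to 1 per node; the size of the returned set is the influence spread):

    def lt_diffusion(neighbors, weight, theta, seeds):
        """neighbors[u]: list of u's neighbors; weight[(u, v)]: the
        weight of neighbor v in u's activation rule; theta[u]: u's
        fixed threshold.  Returns the final set of active nodes."""
        active = set(seeds)
        changed = True
        while changed:
            changed = False
            for u in neighbors:
                if u in active:
                    continue
                s = sum(weight[(u, v)] for v in neighbors[u] if v in active)
                if s >= theta[u]:      # sum of active-neighbor weights
                    active.add(u)      # reaches theta_u: u activates
                    changed = True
        return active                  # influence spread = len(active)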

2.2 Sensitivity of Model Parameters

In this section we will empirically check whether the influence maximization problem
is sensitive to the model parameters under the LT model.
To check the sensitivity of model parameters, we assume that there is a true model
with the true parameters. We also have an estimated model. We use a greedy algorithm
on the estimated model and get a set of influential nodes. Denote this set as Sestimate .
We perform the greedy algorithm on the true model, get another set of influential nodes,

[Fig. 1. Model Parameter Sensitivity Test on the GEOM network: influence spread vs. number of initial seeds (0-50), for the true model and the guessed model.]

and denote this set as Strue . We check the influence spread of Sestimate and Strue on
the true model. The sensitivity of models is then defined as follows: if the difference
between Strue and Sestimate is smaller than a given small number, we can infer that the
influence maximization problem is not very sensitive to model parameters; otherwise it
is sensitive to model parameters.
To test the sensitivity of the LT model, we assume that the thresholds of the true
model are drawn from a truncated normal distribution with mean 0.5 and standard de-
viation of 0.25. Suppose all the thresholds in the estimated model are 0.5. Figure 1
shows that the influence spread of Sestimate is significantly lower than that of the set
$S_{true}$. We have conducted similar experiments on other collaboration networks and citation networks, and a similar pattern can be found in those networks. These observations motivate us to find the parameters in the LT model for influence maximization.
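For reference, one simple way to draw such fixed thresholds (this distribution is used again in the experiments of Section 4) is rejection sampling from the truncated normal; a sketch with names of our own:

    import random

    def sample_thresholds(nodes, mu=0.5, sigma=0.25):
        """Draw a true threshold in (0, 1) for every node from a
        normal(mu, sigma) distribution truncated by resampling."""
        theta = {}
        for u in nodes:
            x = random.gauss(mu, sigma)
            while not 0.0 < x < 1.0:   # reject values outside (0, 1)
                x = random.gauss(mu, sigma)
            theta[u] = x
        return theta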

2.3 Active Model Parameter Learning for Influence Maximization

Since the influence spread under the LT model is quite sensitive to model parameters,
we now present a formal definition of active model parameter learning for the LT model
for influence maximization. Notations used in our problem definition are presented in
Table 1.
We assume that there is a true fixed threshold θi of each node ni in the social network
G(V, E). Our goal is to learn θ as accurately as possible. In order to make the problem
definition easier, we will actively construct the information diffusion data D over mul-
tiple iterations. In each iteration, we can use at most κ nodes to activate other nodes
in the network. After we acquire the activation sequences in each iteration, the model
parameters can be inferred. More specifically, we can infer the lower bound θlowbd and
the upper bound θupbd of the thresholds of some nodes according to the activation se-
quences D. The details of the inference will be introduced in Section 3. With more and
more iterations, we can get the thresholds θ more tightly bounded or even hit the ac-
curate threshold value. The activation sequences of different iterations are assumed to
be independent. That means at the beginning of each iteration, none of the nodes are
activated (influenced). The above process can be summarized into the following three
functions.

Table 1. Notations

  Symbol                   Meaning
  G(V, E)                  the social network
  κ                        the budget for learning in each iteration
  θ                        the true thresholds of all nodes
  θ̂                        the estimated thresholds of all nodes
  D                        the activation sequences
  M(θ)                     the true model
  M(θ̂)                     the estimated model
  S_true                   the set of influential nodes found by using the true model M(θ)
  S_estimate               the set of influential nodes found by using the estimated model M(θ̂)
  f(S_true, M(θ))          the influence spread of the set S_true on M(θ)
  f(S_estimate, M(θ))      the influence spread of the set S_estimate on M(θ)
  f(S_estimate, M(θ̂))      the influence spread of the set S_estimate on M(θ̂)

$$f_1: (G, \hat\theta, \theta_{lowbd}, \theta_{upbd}) \mapsto S \quad \text{s.t. } |S| = \kappa \qquad (1)$$

$$f_2: (G, M(\theta), S) \mapsto D \qquad (2)$$

$$f_3: (G, D, \hat\theta, \theta_{lowbd}, \theta_{upbd}) \mapsto \{\hat\theta', \theta'_{lowbd}, \theta'_{upbd}\} \qquad (3)$$

Function (1) is the process of finding which set of nodes to target in each iteration.
Function (2) is the process of acquiring the activation sequences D. Function (3) is the
process of threshold inference based on the activation sequences and the old threshold
estimate. In each iteration these three functions are performed in sequence.
In this setting, there are two questions to ask: (1) How do we select the set $S$ in each iteration so that the parameters learned are the most accurate? (2) When will the learned model parameters $\hat\theta$ be good enough to be useful for the purpose of influence maximization? More specifically, when will the influential nodes found on the estimated model provide a significantly higher influence spread than those found on a randomly guessed model? Our solution is guided by these two questions (or, interchangeably, objectives). However, it is difficult to combine these two questions into one objective function. Since our final goal is to conduct influence maximization, we rephrase these two objectives in the context of influence maximization as follows.
The first goal is that the influence spread of a set of nodes on the estimated model is
close to the influence spread of the same set of nodes on the true model. This goal is in
essence a prediction error. If they are close, it implies that the two models are close. The
second goal is that the influential nodes found by using the estimated model will give

an influence spread very close to that found by using the true model. The second goal measures the quality of the estimated model in the context of influence maximization. We combine these two goals in the following equation:

$$\text{Minimize } |f(S_{true}, M(\theta)) - f(S_{estimate}, M(\theta))| + |f(S_{estimate}, M(\theta)) - f(S_{estimate}, M(\hat\theta))| \qquad (4)$$
$$\text{s.t. iterations} = t.$$

$|f(S_{true}, M(\theta)) - f(S_{estimate}, M(\theta))|$ measures whether the set of influential nodes determined by using the estimated model can give an influence spread close to the influence spread of the set of influential nodes determined by using the true model. $|f(S_{estimate}, M(\theta)) - f(S_{estimate}, M(\hat\theta))|$ measures the difference in influence spreads between the true model and the estimated model. It is an approximation used to measure the "model distance", which we will define in Section 4.
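Evaluating (4) therefore only requires three influence-spread numbers; a trivial helper (names ours):

    def objective(spread_true_on_true, spread_est_on_true, spread_est_on_est):
        """Objective (4): combines f(S_true, M(theta)),
        f(S_estimate, M(theta)) and f(S_estimate, M(theta_hat))
        into a single loss value."""
        return (abs(spread_true_on_true - spread_est_on_true)
                + abs(spread_est_on_true - spread_est_on_est))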

3 The Weighted Sampling Algorithm

In this section we will firstly show the difficulty of the active learning problem and then
present our algorithmic solution: the Weighted Sampling algorithm.
The difficulty of the above active learning problem is two-fold. The first difficulty is
that even learning the exact threshold of a single node is quite expensive if the edges of
a network are weighted.
Assume that for each edge in a social network $G(V, E)$ there is an associated weight $w$. For the simplicity of analysis, we assume $\forall w,\ w \in \mathbb{Z}^+$. An edge $e = \{u, v\}$ is active if either $u$ or $v$ is active. What we can observe from the diffusion process is then a sequence of node activations $(n_i, t_i)$. In this setting, suppose that at time $t$ the sum of weights of the active edges of an inactive node $n_i$ is $c_t$. At some future time $t_k$, the node $n_i$ becomes active, and the sum of the weights of the active edges at that time is $c_{t_k}$. We use $w_i$ to denote the sum of weights of all edges that connect to node $n_i$. We can infer that the threshold of node $n_i$ lies in $[c_t/w_i,\ c_{t_k}/w_i]$. More specifically, if $c_{t_k} = c_t + 1$, the threshold is exactly $c_{t_k}/w_i$; if this is the case, a binary search method can be used to determine the threshold of a node $n_i$ deterministically, which is detailed as follows.
Assume that the set of edges connected to a node $n_i$ is $E_i$. There is a weight $w$ associated with each edge in $E_i$, and $S$ is the set of weights associated with $E_i$. Because $w \in \mathbb{Z}^+$, $S$ is a set of integers. There is a response function $F: T \mapsto \{0, 1\}$ based on a threshold $\theta'$, where $T \subseteq S$. Here $\theta' = \theta \cdot \Sigma(S)$, where $\Sigma(S)$ denotes the sum of the elements of set $S$:

$$F(T) = \begin{cases} 1 & \Sigma(T) \ge \theta' \\ 0 & \Sigma(T) < \theta' \end{cases} \qquad (5)$$

Given this response function, we can define $\theta'$ in equation (6):

$$\theta' = \min \Sigma(T) \quad \text{s.t. } F(T) = 1. \qquad (6)$$

Therefore the actual threshold $\theta$ is defined in equation (7):

$$\theta = \min \Sigma(T) \,/\, \Sigma(S) \quad \text{s.t. } F(T) = 1. \qquad (7)$$

Now we analyze the time and space complexity of this deterministic binary search. Assume that the set of possible values of $\Sigma(T)$ is $\mathbb{T}$. To find $\theta'$, we can first sort $\mathbb{T}$ and then perform a binary search on the sorted list. The binary search itself takes $O(\log|\mathbb{T}|)$ steps; since $|\mathbb{T}|$ is $O(2^{|S|})$, this is $O(\log 2^{|S|}) = O(|S|)$. However, sorting the set $\mathbb{T}$ takes $O(|\mathbb{T}| \log |\mathbb{T}|)$ steps, so the overall time complexity is $O(2^{|S|} \cdot |S|)$, and the space requirement is $O(2^{|S|})$. In short, a deterministic binary search algorithm is expensive even for learning the threshold of a single node; it is infeasible to extend this approach to a large scale network with a large number of nodes.
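For completeness, a sketch of this deterministic procedure for a single node, which makes the exponential cost explicit (the names are ours; `respond(T)` stands for probing the node with active-edge set T, i.e., the response function F of (5), and the node is assumed to activate when all of its edges are active):

    from itertools import combinations

    def exact_threshold(weights, respond):
        """weights: the integer weight multiset S of one node's edges.
        Returns theta = min{Sigma(T) : F(T) = 1} / Sigma(S), as in (7)."""
        by_sum = {}                        # one subset per achievable sum
        for r in range(len(weights) + 1):  # enumerate all 2^|S| subsets
            for T in combinations(range(len(weights)), r):
                s = sum(weights[i] for i in T)
                by_sum.setdefault(s, T)
        sums = sorted(by_sum)              # O(|T| log |T|) sorting cost
        lo, hi = 0, len(sums) - 1
        while lo < hi:                     # O(|S|) probes: F is monotone
            mid = (lo + hi) // 2           # in Sigma(T)
            if respond([weights[i] for i in by_sum[sums[mid]]]):
                hi = mid
            else:
                lo = mid + 1
        return sums[lo] / sum(weights)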
Next we introduce the second difficulty, which comes from the perspective of active learning algorithm design. We define the following function:

$$\Gamma: (S, G, M(\hat\theta), M(\theta)) \mapsto E(Red) \qquad (8)$$

This function maps the target set $S$, the graph $G$, the estimated model $M(\hat\theta)$ and the true model $M(\theta)$ to the expected reduction in threshold uncertainty $E(Red)$ if we set $S$ as the initial active nodes. $\Gamma(S, G, M(\hat\theta), M(\theta))$ measures the gain if we select $S$ as the initial target nodes. Since we do not know the true model parameters, we cannot possibly know the activation sequence of a target set $S$ under the true model; it is therefore impossible to know the exact value of $\Gamma(S, G, M(\hat\theta), M(\theta))$. Moreover, $\Gamma(S, G, M(\hat\theta), M(\theta))$ is not a monotonically non-decreasing function with respect to the set $S$, which means that even if we knew its value, a deterministic greedy algorithm would not be a good solution. However, we still want to choose the set $S$ that maximizes $\Gamma(S, G, M(\hat\theta), M(\theta))$ in each learning iteration. We use weighted sampling to approximate this goal. In each iteration we sample a set of $\kappa$ nodes according to one of the following three probabilities:
$$p_i \propto \sum_j I(i, j) \qquad (9)$$

$$p_i \propto \sum_j I(i, j) \cdot w(i, j) \qquad (10)$$

$$p_i \propto \sum_j I(i, j) \cdot \big(\theta(j)_{upbd} - \theta(j)_{lowbd}\big) \qquad (11)$$

$I(i, j)$ is the indicator function: it is equal to 1 if there is an edge between $i$ and $j$ and the threshold of node $j$ is unknown, and 0 otherwise. Essentially, we are trying to sample $\kappa$ nodes that connect to the largest number of nodes with the most uncertain thresholds. There are different ways to measure the uncertainty of the threshold of a node. Formula

(11) measures the uncertainty by how tight the bounds of the threshold are. In formula (9) the uncertainty value is 1 if the threshold is unknown and 0 otherwise. Formula (10) differs from formula (9) in that the weights of the edges $w(i, j)$ are added. We perform weighted sampling on the nodes without replacement. The hope is that the sampled set $S$ yields a high value of $\Gamma(S, G, M(\hat\theta), M(\theta))$. The pseudocode of our weighted sampling algorithm is summarized in Algorithm 1.

Algorithm 1. Active Learning based on Weighted Sampling

Input: A social network G, the budget κ of each iteration, and the number of learning iterations t.
Output: The estimated threshold θ̂
Method:
 1: Let p_i = the number of nodes with unknown thresholds that node i connects to.
 2: Normalize p to 1.
 3: Set θ̂ to be 0.5 for all nodes.
 4: Set θ_lowbd to be 1 for all nodes.
 5: Set θ_upbd to be the number of edges that each node connects to.
 6: for i = 1 to t do
 7:   Set all nodes in G to be inactive.
 8:   Let S = sample κ nodes according to p without replacement.
 9:   Set S as the initial nodes and start the diffusion process on the true model M(θ).
10:   Let (n_i, t_i) be the observed activation sequence.
11:   Update θ̂, θ_lowbd, and θ_upbd according to the activation sequence (n_i, t_i).
12:   Update the sampling probability p and normalize p.
13: end for
14: return θ̂

Steps 5 to 13 show the learning process, sketched in runnable form below. We sample κ nodes in each iteration according to the sampling probability p. We then set the κ nodes as the initial active nodes and simulate the diffusion process on the true model M(θ). After that we can observe a series of activation sequences (n_i, t_i) and update the threshold estimate θ̂, θ_lowbd and θ_upbd accordingly. Finally, we update the sampling probability and the iteration ends.
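The following is a runnable, self-contained sketch of this loop under simplifying assumptions of our own: uniform edge weights w(u, v) = 1/deg(u), synchronous diffusion, κ at most the number of nodes, every node having at least one neighbor, and the weighting of formula (9) with the width of a node's threshold interval standing in for "unknown". It also inlines the bound inference of step 11: every activation yields an upper bound on that node's threshold, and every non-activation a lower bound.

    import random

    def run_active_learning(neighbors, theta, kappa, t):
        """neighbors: adjacency dict; theta: the hidden true thresholds,
        used only through the diffusion traces they generate.  Returns
        per-node [lo, hi] threshold bounds; theta_hat can be taken as
        the interval midpoints."""
        lo = {u: 0.0 for u in neighbors}            # steps 3-5
        hi = {u: 1.0 for u in neighbors}
        for _ in range(t):                          # steps 6-13
            pool = list(neighbors)
            # step 8: weight each node by how many of its neighbors
            # still have a wide threshold interval (cf. formula (9))
            w = [sum(hi[v] - lo[v] > 1e-9 for v in neighbors[u]) + 1e-12
                 for u in pool]
            seeds = set()
            while len(seeds) < kappa:
                # rejection-based weighted sampling without replacement
                seeds.add(random.choices(pool, weights=w)[0])
            active = set(seeds)                     # steps 7, 9
            while True:                             # synchronous LT rounds
                frac = {u: sum(v in active for v in neighbors[u])
                           / len(neighbors[u]) for u in neighbors}
                newly = {u for u in neighbors
                         if u not in active and frac[u] >= theta[u]}
                for u in newly:                     # step 11: activation
                    hi[u] = min(hi[u], frac[u])     # tightens upper bound
                for u in set(neighbors) - active - newly:
                    lo[u] = max(lo[u], frac[u])     # non-activation tightens
                if not newly:                       # the lower bound
                    break
                active |= newly
        return lo, hi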

4 Experimental Evaluation
In this section, we present the network datasets, the experimental setup and experimen-
tal results.

4.1 Network Datasets

Table 2 lists the datasets that we use. NetHEPT is from [3]. NETSCI, GEOM, LEDERB
and ZEWAIL are from Pajek’s network collections.

Table 2. Network datasets

  Dataset   Name                                                   n (# nodes)   m (# edges)
  NETHEPT   the High Energy Physics Theory Collaboration network   15233         58891
  GEOM      the Computational Geometry Collaboration network       7343          11898
  NETSCI    the Network Science Collaboration network              1589          2742
  ZEWAIL    the Zewail Citation network                            6752          54253
  LEDERB    the Lederberg Citation network                         8843          41609

4.2 Experimental Setting


Even though [13] deals with a very similar problem to the problem in this paper, the
methods in [13] is not directly usable in our problem setting. [13] assumes the diffusion
data is available and the diffusion model is the IC model. In this paper we actively select
seeds to obtain the diffusion data and we focus on the LT model. Methods in [13] can be
used after we obtained the diffusion data. However that is not the focus here. Therefore
we do not compare our methods to methods in [13].
For the simplicity of experiments, we treat all the networks as undirected. In the
collaboration networks, the weight of an edge is set to be the number of times that two
authors collaborated. For the citation networks, the weights of all edges are randomly chosen from 1 up to 5. In the experiments we draw the thresholds of all nodes from a
truncated normal distribution with mean 0.5 and standard deviation of 0.25, where all
the thresholds are between 0 and 1.0. The number of iterations of learning is set to be
3000. The budget for each iteration is set to be 50, which means in each iteration 50
nodes are set as the initial seeds. In order to evaluate the objective function, the budget
for influence maximization is also set to be 50 nodes.
Results of two versions of the weighted sampling are reported. The first version,
denoted as mctut, uses the weighting scheme in formula (9). The second version, de-
noted as metut, uses the weighting scheme in formula (10). Experiments were also
conducted with the weighting scheme of formula (11); the results were slightly worse than with the former two weighting schemes. The reason why the weighting scheme of formula (11) is worse than (9) and (10) is that formula (11) is biased towards learning the thresholds of the high degree nodes. The thresholds of the high degree nodes are more difficult to learn: it costs more iterations to infer them. However, in the distance calculation, the high degree nodes are treated the same as the low degree nodes. Therefore, given the same number of iterations, the weighting schemes of formulas (9) and (10) infer more thresholds than the weighting scheme of formula (11). In addition to these two schemes, we use un-weighted random sampling as a baseline, and denote it as random.

4.3 Experimental Results


To evaluate the effectiveness of our algorithm, we use two performance metrics: the objective function (equation (4)) and model accuracy. First we define the model distance as the average threshold difference between the true model and the estimated model. Model accuracy is essentially the inverse of model distance: if the model distance is low, model accuracy is high, and vice versa.
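A one-line helper makes the metric precise (names ours):

    def model_distance(theta, theta_hat):
        """Average absolute threshold difference between the true and
        the estimated model; model accuracy is its inverse."""
        return sum(abs(theta[u] - theta_hat[u]) for u in theta) / len(theta)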

Figures 2(a), 2(b), 2(c), 2(d), and 2(e) show that both metut and mctut are better than random on all the networks in terms of model accuracy. There is little difference between metut and mctut in all these figures. The difference between metut and random is more noticeable in Figures 2(a), 2(d) and 2(e). In Figure 2(b), the difference becomes noticeable when the number of learning iterations approaches 3000. In Figure 2(c) the difference is the largest between iterations 200 and 1000; after that, the difference between metut and random decreases over the iterations and becomes quite small when the number of learning iterations approaches 3000. To sum up, we can see that both metut and mctut perform well: they are better than the baseline method, and the model distance decreases with the iterations.

Figures 3(a), 3(b), 3(c), 3(d) and 3(e) show that metut beats random on all the networks in terms of the objective function. mctut outperforms random in Figures 3(a), 3(b), 3(c) and 3(d). In Figure 3(e), mctut is worse than random at iteration 500, but after that mctut is better than random. metut is more stable than mctut in this sense. The absolute value of the difference in the objective function is actually very large; in Figure 3(d) we can see that the largest difference between metut and random is about 500. So far we can conclude that metut is the best of the three in terms of both the objective function and model distance.
Finally, we devise a way to measure the quality of the estimated model: if the influence spread of the solution found on the estimated model is very close to that found on the true model, we say the estimated model has good quality; otherwise the estimated model has low quality. Figure 4(a) shows the quality of the estimated model found by metut at learning iterations 0 and 3000. We can see that the initial guessed model at iteration 0 has extremely low quality: the influence spread of the influential nodes obtained by using the greedy algorithm on the estimated model at iteration 0 is noticeably less than that obtained by using the greedy algorithm on the true model. However, after 3000 iterations of learning, this gap is narrowed sharply, and the influence spread of the influential nodes obtained by using the greedy algorithm on the estimated model at iteration 3000 is very close to that obtained by using the greedy algorithm on the true model. This shows that the proposed algorithm can indeed narrow the gap between the influence spread of the influential nodes on the estimated model and on the true model. Figures 4(b) and 4(c) show the quality of the estimated models found by mctut and random at learning iterations 0 and 3000.

From Figures 4(a), 4(b) and 4(c) we can also observe that the estimated models found by metut and mctut have higher quality than the estimated model found by random at iteration 3000: the influence spread of the solution found by using the estimated models of metut and mctut is larger than the influence spread of the solution found by using the estimated model of random at iteration 3000. This indirectly shows that metut and mctut learn models more accurately than random. Similar patterns can be observed in Figures 4(d), 4(e), 4(f), 4(j), 5(a), 5(b), 5(c), 5(d) and 5(e), although the differences between metut, mctut and random are not as obvious as in the case of Figures 4(a), 4(b) and 4(c). Figures 4(g), 4(h) and 4(i) are exceptions: they show that the learned models of random, metut and mctut have almost identical quality. This is probably because the NETSCI dataset is a small dataset.

[Fig. 2. Comparison of metut, mctut and random in terms of model distance (the lower the better), plotted against learning iterations (0-3000), on the different datasets. Panels: (a) GEOM, (b) NetHEPT, (c) NETSCI, (d) LEDERB, (e) ZEWAIL.]

To this end, we have shown two points. Firstly, the learning process can indeed help narrow the gap between an estimated model and the true model. Secondly, metut performs better than random, since the solution found by using metut's estimated model produces a larger influence spread than the solution found by using random's estimated model.

[Fig. 3. Comparison of metut, mctut and random in terms of the objective function (the lower the better), plotted against learning iterations (0-3000), on the different datasets. Panels: (a) GEOM, (b) NetHEPT, (c) NETSCI, (d) LEDERB, (e) ZEWAIL.]

5 Related Work
We review related works on influence maximization and learning information diffusion
models in this section.

[Fig. 4. Comparison of metut, mctut and random on the improvement of influence spread over a guessed model on different networks: influence spread vs. number of initial seeds (0-50), with curves for the true model and for the estimated model at 0 and 3000 learning iterations. Panels: (a) metut on GEOM, (b) mctut on GEOM, (c) random on GEOM, (d) metut on NetHEPT, (e) mctut on NetHEPT, (f) random on NetHEPT, (g) metut on NETSCI, (h) mctut on NETSCI, (i) random on NETSCI, (j) metut on LEDERB.]

[Fig. 5. Comparison of metut, mctut and random on the improvement of influence spread over a guessed model on different networks (same setup as Fig. 4). Panels: (a) mctut on LEDERB, (b) random on LEDERB, (c) metut on ZEWAIL, (d) mctut on ZEWAIL, (e) random on ZEWAIL.]

First we review works on influence maximization. [9] proved that the influence maximization problem is NP-hard under both the IC model and the LT model. [9] defined an influence function f (A), which maps a set of active nodes A to the expected number of nodes that A activates, and proved that f (A) is submodular under both models. Based on the submodularity of f (A), they proposed a greedy algorithm which iteratively selects the node with the maximal marginal gain in f (A). The greedy algorithm gives a good approximation to the optimal solution in terms of diffusion size. Follow-up works mostly focus on either improving the running time of the greedy algorithm [10, 3] or providing faster heuristics whose influence spread is close to that of the greedy algorithm [3, 2, 4, 1, 14]. [3] used a Cost-Effective Lazy Forward method to reduce the running time of the greedy algorithm. [3] also proposed

degree discount heuristics that reduce computational time significantly. [4] and [2] proposed heuristics based on a most likely propagation path; the heuristics in these two papers are tunable with respect to influence spread and computational time. [1, 14] used community structure to partition the network into different communities, find the influential nodes in each community, and combine them together. In [9], the problem of calculating the influence function f (A) was left open. Later [2] and [4] proved that it is #P-hard to calculate the influence function f (A) under both the IC and LT models. [10] performed a bond percolation process on graphs and used strongly connected component decomposition to speed up the calculation of the influence function f (A). All these works assume that the model parameters are given.
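To make the greedy scheme described above concrete, the following is a minimal sketch; the Monte-Carlo spread estimator, the helper names, and the simulate diffusion routine are illustrative assumptions, not any of the cited implementations.

import random  # a diffusion simulator would typically use randomness internally

def estimate_spread(graph, seeds, simulate, n_sims=1000):
    # Monte-Carlo estimate of f(A): the expected number of nodes activated
    # by the seed set, averaged over n_sims random diffusion runs.
    # simulate(graph, seeds) runs one diffusion and returns the activated set.
    return sum(len(simulate(graph, seeds)) for _ in range(n_sims)) / n_sims

def greedy_influence_max(graph, simulate, k):
    # Greedy hill-climbing: repeatedly add the node with the largest
    # estimated marginal gain in f(A), exploiting submodularity.
    # graph is assumed iterable over its nodes.
    seeds = set()
    for _ in range(k):
        base = estimate_spread(graph, seeds, simulate)
        gains = {v: estimate_spread(graph, seeds | {v}, simulate) - base
                 for v in graph if v not in seeds}
        seeds.add(max(gains, key=gains.get))
    return seeds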
Then we review research efforts on learning information diffusion models. [7] proposed both static and time-dependent models for capturing influence from propagation logs. [13] used the EM algorithm to learn the diffusion probabilities of the IC model. [12] proposed asynchronous time-delayed IC and LT models, then used maximum likelihood estimation to learn the models and evaluated them on real-world blog networks. Interestingly, [6] focused on a different problem, namely learning the network topology on which information diffusion relies. [6] used maximum likelihood estimation to approximate the most probable network topology.

6 Conclusions
In this paper, we have studied the influence maximization problem under unknown
model parameters, specifically, under the linear threshold model. To this end, we first
showed that the influence maximization problem is sensitive to model parameters un-
der the LT model. Then we defined the problem of finding the model parameters as
an active learning problem for influence maximization. We showed that a deterministic
algorithm is costly for model parameter learning. We then proposed a weighted sam-
pling algorithm to solve the active learning problem. We conducted experiments on five
datasets and compared the weighted sampling algorithm with a naive solution: pure
random sampling. Experimental results showed that the weighted sampling achieves
better results than the naive method in terms of both the objective function and model
accuracy we defined. Finally we showed that by using the learned model parameters
from the weighted sampling algorithm, we can find the influential nodes that give an
influence spread very close to the influence spread of influential nodes found on the true
model, which further justifies the effectiveness of our proposed approach. In the future,
we will investigate how to learn the model parameters under the IC model.

Acknowledgments. The research is supported by the US National Science Foundation (NSF) under grants CCF-0905337 and CCF-0905291.

References
[1] Cao, T., Wu, X., Wang, S., Hu, X.: Oasnet: an optimal allocation approach to influence
maximization in modular social networks. In: SAC, pp. 1088–1094 (2010)
[2] Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral market-
ing in large-scale social networks. In: KDD, pp. 1029–1038 (2010)

[3] Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks. In:
KDD, pp. 199–208 (2009)
[4] Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under
the linear threshold model. In: ICDM, pp. 88–97 (2010)
[5] Domingos, P., Richardson, M.: Mining the network value of customers. In: KDD, pp. 57–66
(2001)
[6] Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influ-
ence. In: KDD, pp. 1019–1028 (2010)
[7] Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social net-
works. In: WSDM, pp. 241–250 (2010)
[8] Granovetter, M.: Threshold models of collective behavior. The American Journal of Soci-
ology 83(6), 1420–1443 (1978)
[9] Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence through a
social network. In: KDD, pp. 137–146 (2003)
[10] Kimura, M., Saito, K., Nakano, R.: Extracting influential nodes for information diffusion
on a social network. In: AAAI, pp. 1371–1376 (2007)
[11] Richardson, M., Domingos, P.: Mining knowledge-sharing sites for viral marketing. In:
KDD, pp. 61–70 (2002)
[12] Saito, K., Kimura, M., Ohara, K., Motoda, H.: Selecting information diffusion models over
social networks for behavioral analysis. In: ECML/PKDD (3), pp. 180–195 (2010)
[13] Saito, K., Nakano, R., Kimura, M.: Prediction of information diffusion probabilities for
independent cascade model. In: KES (3), pp. 67–75 (2008)
[14] Wang, Y., Cong, G., Song, G., Xie, K.: Community-based greedy algorithm for mining
top-k influential nodes in mobile social networks. In: KDD, pp. 1039–1048 (2010)
Sampling Table Configurations for the
Hierarchical Poisson-Dirichlet Process

Changyou Chen1,2, Lan Du1,2, and Wray Buntine1,2

1 Research School of Computer Science, The Australian National University, Canberra, ACT, Australia
2 National ICT, Canberra, ACT, Australia
{Changyou.Chen,Lan.Du,Wray.Buntine}@nicta.com.au

Abstract. Hierarchical modeling and reasoning are fundamental in machine intelligence, and for this the two-parameter Poisson-Dirichlet Process (PDP) plays an important role. The most popular MCMC sampling algorithm for the hierarchical PDP and the hierarchical Dirichlet Process is to conduct incremental sampling based on the Chinese restaurant metaphor, which originates from the Chinese restaurant process (CRP). In this paper, with the same metaphor, we propose a new table representation for hierarchical PDPs by introducing an auxiliary latent variable, called the table indicator, which records which customer takes responsibility for starting a new table. In this way, the new representation allows full exchangeability, an essential condition for a correct Gibbs sampling algorithm. Based on this representation, we develop a block Gibbs sampling algorithm, which can jointly sample a data item and its table contribution. We test this out on the hierarchical Dirichlet process variant of latent Dirichlet allocation (HDP-LDA) developed by Teh, Jordan, Beal and Blei. Experimental results show that the proposed algorithm outperforms their "posterior sampling by direct assignment" algorithm in both out-of-sample perplexity and convergence speed. The representation can be used with many other hierarchical PDP models.

Keywords: Hierarchical Poisson-Dirichlet Processes, Dirichlet Processes, HDP-LDA, block Gibbs sampler.

1 Introduction

In general machine intelligence domains such as image and text modeling, hier-
archical reasoning is fundamental. Bayesian hierarchical modeling of problems
is now widely used with applications including n-gram modeling and smoothing
[1–3], dependency models for grammar [4, 5], data compression [6], clustering in
arbitrary dimensions [7], topic modeling over time [8], and relational modeling
[9]. Bayesian hierarchical n-gram models correspond well to versions of Kneser-Ney smoothing [1], the state-of-the-art method in applications, and result in competitive string compression algorithms [6]. These hierarchical Bayesian models


are intriguing from the probability perspective, as well as sometimes being competitive with performance-based approaches. Newer methods and applications
are reviewed in [10].
The two-parameter Poisson-Dirichlet process (PDP), also referred to as the
Pitman-Yor process (named so in [11]), is an extension of the Dirichlet process
(DP). Related is a particular interpretation of a marginalized version of the
model known as the Chinese restaurant process (CRP). The CRP gives an ele-
gant analogy of incremental sampling for these models. These provide the basis
of many Bayesian hierarchical modeling techniques. One particular use of the
PDP/DP is in the area of topic models where hierarchical PDPs and hierar-
chical DPs provide elegant machinery for improving the standard simple topic
model [12, 13], for instance, with flexible selection of the number of topics using
HDP-LDA [14], and allowing document structure to be incorporated into the
modeling [15, 16].
This paper proposes a new sampler for the hierarchical PDP based on a new
table representation and is organized as follows: Section 2 gives a brief review
of the hierarchical Poisson-Dirichlet process. Section 3 then reviews methods for
sampling the hierarchical PDP. We present the new table representation for the
HPDP in Section 4. A block Gibbs sampler is developed in Section 5, where we
also apply our block Gibbs sampler to the HDP-LDA model. Experiment results
are reported in Section 6.

2 The Hierarchical Poisson-Dirichlet Process


The basic Poisson-Dirichlet Process is a device for introducing infinite mixture
models and for hierarchical modeling of discrete distributions. The basic form
has as input a base probability distribution H(·) on a measurable space X , and
yields a discrete distribution on a finite or countably infinite subset of X .


$$\sum_{k=1}^{\infty} p_k \, \delta_{X_k^*}(\cdot) \tag{1}$$

where $p = (p_1, p_2, \ldots)$ is a probability vector, so $0 \le p_k \le 1$ and $\sum_{k=1}^{\infty} p_k = 1$.
Also, $\delta_{X_k^*}(\cdot)$ is a discrete measure concentrated at $X_k^*$. We assume the values $X_k^* \in \mathcal{X}$ are independently and identically distributed according to H(·), which is referred to as the base distribution. We also assume the base distribution is discrete, so H(X) > 0 for all samples X ∼ H(·), although this is not generally the case for the PDP. The probability vector p follows a two-parameter Poisson-Dirichlet distribution [17]. A common definition for it is the "stick-breaking" model, as follows:

Definition 1 (Poisson-Dirichlet distribution). For $0 \le a < 1$ and $b > -a$, suppose that a probability $P_{a,b}$ governs independent random variables $V_k$ such that $V_k$ has a $Beta(1-a,\, b+ka)$ distribution. Let

$$p_1 = V_1, \qquad p_k = (1 - V_1) \cdots (1 - V_{k-1})\, V_k \quad k \ge 2, \tag{2}$$


[Figure: a hierarchy of probability vectors with root p0; for the j2-th node branching off node j1, pj2 ∼ PDP(aj1, bj1, pj1). This depicts, for instance, that vectors p1 to pK should be similar to p0. The root node p0 could be Dirichlet distributed if it is finite, or could have a PDD distribution if infinite.]

Fig. 1. Probability vector hierarchy

yielding p = (p1, p2, ...). Define the Poisson-Dirichlet distribution with parameters a, b, abbreviated PDD(a, b), to be the Pa,b distribution of p.

Note this does assume a particular ordering of the entries in p. Here our a parameter is usually called the discount parameter in the literature, and b is called the concentration parameter. The DP is the special case where a = 0, and has some quite distinct properties, such as slower convergence of the sum $\sum_{k=1}^{\infty} p_k$ to one. General results for the discrete case of the PDP are reviewed in [18].
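As a quick illustration of Definition 1, the following sketch draws the first K stick-breaking weights of PDD(a, b); the function name is an illustrative assumption, and the truncation at K is for illustration only, since the full weight vector is infinite.

import numpy as np

def sample_pdd_weights(a, b, K, rng=None):
    # Stick-breaking per Eq. (2): V_k ~ Beta(1 - a, b + k*a),
    # p_k = V_k * prod_{i<k} (1 - V_i); truncated after K sticks.
    assert 0 <= a < 1 and b > -a
    rng = rng or np.random.default_rng()
    v = rng.beta(1 - a, b + a * np.arange(1, K + 1))
    p = v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))
    return p  # sums to < 1; the leftover mass stays in the unbroken stick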
A suitable definition of a Poisson-Dirichlet process is that it extends the
Poisson-Dirichlet distribution using Formula (1), referred to as PDP(a, b, H(·)).
Thus the PDP is a functional on distributions: it takes as input a base distribu-
tion and yields as output a discrete distribution with a finite or countable set of
possible values on the same domain.
The output distribution of a PDP can subsequently be used as a base dis-
tribution for another PDP, and so forth, to create a hierarchy of distributions.
This situation is depicted in the graphical model of Figure 1 where the distri-
bution is over vectors p indexed by their positions in the hierarchy. Each vector
represents a discrete probability distribution so is a suitable base distribution
for the next level of PDPs. The hierarchical case for the DP is presented in [14],
and the hierarchical and discrete case of the PDP in [19, 20]. The hierarchical
PDP (HPDP) thus depends on the discrete distribution at the root node, and
on the hyper-parameters a, b used at each node. This hierarchical occurrence of
probability vectors could be a model in itself, as is the case for n-gram models
of strings, or it could occur as part of some larger model, as is the case for the
HDP-LDA model.
Intuitively, the HPDP structure can be well explained using the nested CRP
mechanism, which has been widely used as a component of different topic models
[21]. It goes as follows: a Chinese restaurant has an infinite number of tables,
each of which has infinite seating capacity. Each table serves a dish, and multiple
tables can serve the same dish. In the nested CRP, each restaurant is also linked
to its parent restaurant and child restaurants in a tree-like structure. A newly

[Figure: three stacked restaurants j−1, j and j+1 drawn as rectangles, each containing circles (tables) with table counts t occupied by customers c; a new table opened in a restaurant reappears as a proxy customer in its parent restaurant.]

Fig. 2. Nested CRP representation of the HPDP, where rectangles correspond to restaurants, circles correspond to tables, and c denotes customers

arrived customer can choose to sit at an active table (i.e., a table which has at least one customer) or choose a new table. If a new table is chosen (i.e., activated),
this table will be sent as a new customer to the corresponding parent restaurant,
which means a table in any given restaurant reappears as a proxy customer [3]
in its parent restaurant. This procedure is illustrated in Figure 2.

3 Related Methods

For the hierarchical PDP, the most popular MCMC algorithm is the Gibbs
sampling method based on the Chinese restaurant representation [10, 14]; examples are the samplers proposed in [14], namely the Chinese restaurant franchise sampler, the augmented Chinese restaurant franchise sampler, and the sampler for direct assignment. In the CRP representation, each restaurant is represented by
a seating arrangement that contains the total number of customers, the total
number of occupied tables, the customer-table association, the customer-dish
association, and the table-dish association.
With the global probability measure marginalized out, the Chinese restaurant
franchise sampler keeps track of the customer-table association (i.e., recording the table assignments of all customers), which results in extra storage space requirements. Its extension to the HPDP is a Gibbs sampler, the so-called "sampling for seating arrangements" of Teh [19]. Another sampler for the HDP, termed "posterior
sampling with an augmented representation”, introduces an auxiliary variable
to construct the global measure for its children DPs so that these DPs can be
decoupled. A further extension of the augmented sampler gives the sampler for

“direct assignment”, in which each data point is directly assigned to one com-
ponent, instead of sampling at which table it sits. This is the sampler used in
Teh’s implementation of HDP-LDA[22].
Recently, a new collapsed Gibbs sampling algorithm for the HPDP was proposed in [15, 18]. It sums out all the seating arrangements by introducing a constrained
latent variable, called the table count that represents the number of tables serv-
ing the same dish, similar to the representation in the direct assignment sampler.
Du et al. [16] have applied it to a first order Markov chain.
Besides the sampling-based algorithms, there also exist variational inference algorithms for the HPDP. For example, Wang et al. proposed a vari-
ational algorithm for the nested Chinese restaurant process [23], and recently
proposed an online variational inference for the HDP [24]. As a compromise,
Teh et al. developed a sampling-variational hybrid algorithm for HDP in [25].
In this paper, we propose a new table representation for the HPDP by introducing another auxiliary latent variable, called the table indicator, to track up to which level in the hierarchy a datum (or customer) has contributed a table count (i.e., the creation of a new table). The aforementioned table count variable
can be easily constructed from the table indicator variables by summation, which
indicates the exchangeability of the proposed representation. To apply the new
representation, we develop a block Gibbs sampling algorithm to jointly sample
the dish and table indicator for each customer.

4 New Table Representation of the HPDP


In this section, we introduce a new table representation for the HPDP on top of the nested CRP configuration, as shown in Figure 2. Indeed, this new representation can be generalized to a framework for doing inference on HPDP-based models. We prove, via the joint posterior distribution, that the new representation allows full exchangeability, thus guaranteeing that a correct Gibbs sampler can be developed. Although in this work we only study how the representation works on tree-structured graphical models, in which data items can be attached to any node in the tree, it can also be easily extended to arbitrary structures [20].
Note that, in the following elaboration, nodes and data items also correspond
to restaurants and customers in the CRP, respectively; and dishes, as compo-
nents in mixture models, correspond to, for instance, topics in probabilistic topic
modeling1 . Before further proceeding to the details of the new table representa-
tion, we first give three definitions that are used in the following sections.

Definition 2 (Node index j). In the tree structure, each node (i.e., a restaurant) is indexed with an integer starting from 0 in a top-down, left-to-right manner. The root of the tree is indicated by j = 0. Based on this definition, we define a mapping $d : \mathbb{Z}^+ \to \mathbb{Z}^+$ which maps the node j to its level d(j) in the tree; here $\mathbb{Z}^+$ denotes the non-negative integers.

1 We will use these terminologies interchangeably in this paper.

Definition 3 (Multiplicity tk). For the Chinese restaurant version of the PDP, assume the base distribution H(·) is discrete, which means the same dish can be served by multiple tables. The multiplicity tk is thus defined as the number of tables serving the same dish k.

Definition 4 (Table indicator ul ). The table indicator ul for each data item
l (i.e., a customer) is an auxiliary latent variable which indicates up to which
level in the tree l has contributed a table count (i.e. activated a new table). If
ul = 0, it means the data item l takes the responsibility of creating a new table
in each node between the root and the current node.

Here, what do we mean by "table contribution" in Definition 4? In the nested CRP, as discussed in Section 2, if a newly arrived customer, denoted by l, chooses to sit at a new table, the table count in the corresponding restaurant, indexed by j, will be increased by one. Herein, opening up a new table is defined as the table contribution of this customer, and ul is set to d(j). Then, the new table is sent as a proxy customer to its parent restaurant, pa(j). If the proxy customer again chooses to sit at a new table to contribute a table count, ul will be set to d(pa(j)). This procedure proceeds towards the root in a recursive way until there is no more table contribution. The introduction of table indicator variables is the core merit of the new representation.
In the tree structure, for each node j, every data item l in j is associated with
a sampled value zl , which can take K distinct values, i.e. K distinct dishes in
the context of CRP; and a table indicator variable ul is attached to l to trace
its table contribution up towards the root. Thereby, taking advantage of table
indicator variables, some statistics can be represented as follows:

$$n^0_{jk} = \begin{cases} \sum_{l \in D(j)} \delta_{z_l = k}, & \text{for } D(j) \neq \emptyset \\ 0, & \text{otherwise} \end{cases}, \qquad t_{jk} = \sum_{j' \in \mathcal{T}(j)} \sum_{l \in D(j')} \delta_{z_l = k}\, \delta_{u_l \le d(j)}$$

$$n_{jk} = n^0_{jk} + \sum_{j' \in C(j)} t_{j'k}, \qquad T_j = \sum_k t_{jk}, \qquad N_j = \sum_k n_{jk} \tag{3}$$

where $n^0_{jk}$ is the number of actual data points in j with $z_l = k$ ($k \in \{1, \ldots, K\}$), $t_{jk}$ is the number of tables serving k, $n_{jk}$ is the total number of data points (including those sent by the child nodes of j) with $z_l = k$, $T_j$ is the total number of tables, and $N_j$ is the total number of data points; $D(j)$ is the set of data points attached to j, $C(j)$ is the set of child nodes of j, and $\mathcal{T}(j)$ is its closure, the set of nodes in the sub-tree rooted at j. Obviously, the multiplicity for each distinct value, i.e., each dish, can be constructed from the table indicator variables.
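To illustrate how the Eq. (3) statistics follow from the table indicators by summation, here is a small sketch; the container names (children, depth, data) are illustrative assumptions about how the tree might be stored, not part of the paper.

from collections import defaultdict

def table_stats(root, children, depth, data):
    # data[j]: list of (z_l, u_l) pairs attached to node j;
    # children[j]: child nodes of j; depth[j] = d(j).
    def subtree(j):                        # T(j): nodes of the sub-tree rooted at j
        yield j
        for c in children.get(j, []):
            yield from subtree(c)

    n0 = defaultdict(lambda: defaultdict(int))
    t = defaultdict(lambda: defaultdict(int))
    n = defaultdict(lambda: defaultdict(int))
    for j in subtree(root):
        for z, u in data.get(j, []):
            n0[j][z] += 1                  # n0_jk: actual data at j with z_l = k
        for j2 in subtree(j):              # t_jk: indicators reaching level d(j) or above
            for z, u in data.get(j2, []):
                if u <= depth[j]:
                    t[j][z] += 1
    for j in subtree(root):                # n_jk = n0_jk + sum of children's t_j'k
        for k, cnt0 in n0[j].items():
            n[j][k] += cnt0
        for c in children.get(j, []):
            for k, cnt in t[c].items():
                n[j][k] += cnt
    T = {j: sum(t[j].values()) for j in subtree(root)}   # T_j
    N = {j: sum(n[j].values()) for j in subtree(root)}   # N_j
    return n0, t, n, T, N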

Lemma 1. Given finite samples $z_1, z_2, \cdots, z_N$ from a Poisson-Dirichlet Process (i.e., $PDP(a, b, H)$), the joint probability of the samples and their multiplicities [18] $t_1, t_2, \cdots, t_K$ is given by:

$$\Pr(z_1, z_2, \cdots, z_N, t_1, \cdots, t_K) = \frac{(b|a)_T}{(b)_N} \prod_{k=1}^{K} H(z_k^*)^{t_k}\, S^{n_k}_{t_k, a} \tag{4}$$

where $S^N_{M,a}$ is the generalized Stirling number2, $(x|y)_N$ denotes the Pochhammer symbol with increment y, and $T = \sum_{i=1}^{K} t_i$.

Making use of Lemma 1 by [18], we can derive, based on the new table repre-
sentation of the HPDP, the joint probability of samples z 1:J and table indicator
variables u1:J as

Theorem 1. Given the base distribution $H_0$ for the root node, the joint posterior distribution of $z_{1:J}$ and $u_{1:J}$ for the HPDP in a tree structure is3:

$$\Pr(z_{1:J}, u_{1:J} \mid H_0) = \prod_{j \ge 0} \frac{(b_j|a_j)_{T_j}}{(b_j)_{N_j}} \prod_k S^{n_{jk}}_{t_{jk}, a_j}\, \frac{t_{jk}!\,(n_{jk} - t_{jk})!}{n_{jk}!}. \tag{5}$$

Proof (sketch). Let us consider just one restaurant (i.e., one node in the tree) at a time. The set $t = \{t_1, \cdots, t_K\}$ denotes the table configuration, i.e., the number of tables serving each dish; and the set $u = \{u_1, \cdots, u_N\}$ denotes the table indicator configuration, i.e., the table indicators attached to each data item. Clearly, only one table configuration can be reconstructed from a table indicator configuration, as shown by Eq. (3). But given one table configuration, one can yield $\prod_k \frac{n_k!}{t_k!\,(n_k - t_k)!}$ possible table indicator configurations. Thereby, the joint posterior distribution of z and t is computed as

$$\Pr(z, t) = \prod_k \frac{n_k!}{t_k!\,(n_k - t_k)!} \Pr(z, u) \tag{6}$$

where $z = (z_1, z_2, \cdots, z_N)$, $t_k$ is the number of tables serving dish k, and $n_k$ is the number of customers eating dish k in the CRP metaphor. Note that the binomial term says we can equally likely choose just $t_k$ of the $n_k$ data items to be those that contribute tables. This formula lets us convert from the (z, t) representation of the PDP to the (z, u) representation at a given node, assuming all its lower-level nodes have already been converted; thus we apply it recursively from the leaf nodes up. Combining this with Eq. (4) in Lemma 1, we can write down the joint probability of z and u for the tree structure as in Eq. (5).

Corollary 1. The joint posterior distribution of z 1:J and u1:J is exchangeable


in the pairs (zj , uj ).

Proof (sketch). Follows by inspection of the posterior and that the statistics used
are all sums over data items.
2 A generalized Stirling number is given by the linear recursion [15, 16, 18, 19] $S^{N+1}_{M,a} = S^{N}_{M-1,a} + (N - Ma)\,S^{N}_{M,a}$ for $M \le N$; it is 0 otherwise, and $S^{N}_{0,a} = \delta_{N,0}$. These numbers rapidly become very large, so computation needs to be done in log space using logarithmic addition.
3 Note that $t_{0k} \le 1$ since this node is the PDD defined in Definition 1.
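For concreteness, the recursion of footnote 2 can be tabulated in log space as in the following sketch; the function name is illustrative, and NumPy's logaddexp supplies the logarithmic addition mentioned there.

import numpy as np

def log_stirling_table(N_max, a):
    # Tabulate log S^N_{M,a} for 0 <= M <= N <= N_max via the recursion
    # S^{N+1}_{M,a} = S^N_{M-1,a} + (N - M*a) * S^N_{M,a}, with S^0_{0,a} = 1
    # and S^N_{M,a} = 0 for M > N, all computed in log space.
    logS = np.full((N_max + 1, N_max + 1), -np.inf)
    logS[0, 0] = 0.0
    for n in range(N_max):
        for m in range(1, n + 2):
            grow = logS[n, m] + np.log(n - m * a) if m <= n else -np.inf
            logS[n + 1, m] = np.logaddexp(logS[n, m - 1], grow)
    return logS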

5 Block Gibbs Sampling Algorithm

In this section, we elaborate on our block Gibbs sampling algorithm for the new
table representation, and show, as an example, how it can be applied to the
HDP-LDA model.

5.1 Block Gibbs Sampler


In Gibbs sampling, the new state is reached by sequentially sampling all variables from their distribution conditioned on the previous state, i.e., the current values of all other variables and the data. Unlike the collapsed Gibbs sampling algorithms proposed in [15, 16, 20], all of which divide the Gibbs sampler into two parts, i.e., sampling z and t separately, we develop for the new table representation a block Gibbs sampling algorithm that jointly samples the variable z and its table indicator variable u, based on Theorem 1. The full joint conditional distribution $\Pr(z_l, u_l \mid z_{1:J} - z_l, u_{1:J} - u_l)$ can be obtained by a probabilistic argument or by cancellation of terms in Eq. (5).
While doing the joint sampling, the old value of the table indicator ul first needs to be removed along with the old value of zl in the probabilistic argument. The challenge here is to deal with the coupling between ul and zl, and that between ul and the data points. Note that in some cases the tables created by data item l up to level ul cannot be removed, since they are being shared with other data items, as discussed in Subsection 5.2. In such a case, we need to skip the table operation for the datum l and proceed to other data items until the tables linked to l can be removed. More specifically, the sampling procedure goes as follows:
First, define path(l) as the set of nodes on the path from the root to the leaf to which the data item has table contribution; $T_j, N_j, t_{jk}, n_{jk}$ are defined in Section 4, $T_j', N_j', t_{jk}', n_{jk}'$ are the corresponding values after removing the datum l, and $T_j'', N_j'', t_{jk}'', n_{jk}''$ are those after adding l back again. To sample $(z_l, u_l)$, we need to consider all nodes $j \in path(l)$, because removing l may change the table configurations of these nodes. It is easy to show that these variables are subject to the following relations:
For all nodes $j \in path(l)$, after removing l (note l only belongs to one node in the path), we have

$$T_j' = \begin{cases} T_j, & \text{if } u_l > d(j) \\ T_j - 1, & \text{if } u_l \le d(j) \end{cases}, \qquad t_{jk}' = \begin{cases} t_{jk}, & \text{if } u_l > d(j) \\ t_{jk} - 1, & \text{if } u_l \le d(j) \ \&\ z_l = k \end{cases}$$

$$N_j' = \begin{cases} N_j, & \text{if } l \notin D(j) \ \&\ u_l > d(j) + 1 \\ N_j - 1, & \text{if } l \notin D(j) \ \&\ u_l \le d(j) + 1 \text{ or } l \in D(j) \end{cases}$$

$$n_{jk}' = \begin{cases} n_{jk}, & \text{if } l \notin D(j) \ \&\ u_l > d(j) + 1 \\ n_{jk} - 1, & \text{if } (l \notin D(j) \ \&\ u_l \le d(j) + 1 \text{ or } l \in D(j)) \ \&\ z_l = k \end{cases} \tag{7}$$

After adding l back with its new value $u_l$:

$$T_j'' = \begin{cases} T_j', & \text{if } u_l > d(j) \\ T_j' + 1, & \text{if } u_l \le d(j) \end{cases}, \qquad t_{jk}'' = \begin{cases} t_{jk}', & \text{if } u_l > d(j) \\ t_{jk}' + 1, & \text{if } u_l \le d(j) \ \&\ z_l = k \end{cases}$$

$$N_j'' = \begin{cases} N_j', & \text{if } l \notin D(j) \ \&\ u_l > d(j) + 1 \\ N_j' + 1, & \text{if } l \notin D(j) \ \&\ u_l \le d(j) + 1 \text{ or } l \in D(j) \end{cases}$$

$$n_{jk}'' = \begin{cases} n_{jk}', & \text{if } l \notin D(j) \ \&\ u_l > d(j) + 1 \\ n_{jk}' + 1, & \text{if } (l \notin D(j) \ \&\ u_l \le d(j) + 1 \text{ or } l \in D(j)) \ \&\ z_l = k \end{cases} \tag{8}$$
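To make the bookkeeping behind Eqs. (7) and (8) concrete, here is a sketch of the two updates along path(l); the stats container and its field names are illustrative assumptions, and the removal is assumed to be permitted by the constraints of Subsection 5.2.

def remove_datum(stats, path, depth, l_node, z, u):
    # Eq. (7): down-date counts along path(l) for a datum with sampled
    # value z and table indicator u, attached to node l_node.
    for j in path:
        if u <= depth[j]:                     # l contributed a table at level d(j)
            stats.T[j] -= 1
            stats.t[j][z] -= 1
        if j == l_node or u <= depth[j] + 1:  # l (or its proxy) is counted at j
            stats.N[j] -= 1
            stats.n[j][z] -= 1

def add_datum(stats, path, depth, l_node, z, u):
    # Eq. (8): mirror of remove_datum, using the newly sampled (z_l, u_l).
    for j in path:
        if u <= depth[j]:
            stats.T[j] += 1
            stats.t[j][z] += 1
        if j == l_node or u <= depth[j] + 1:
            stats.N[j] += 1
            stats.n[j][z] += 1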

With respect to the above analysis, the full joint conditional probability for $z_l$ and $u_l$ (writing k for the sampled value of $z_l$) is:

$$\Pr(z_l, u_l \mid z_{1:J} - z_l, u_{1:J} - u_l, H_0) = \prod_{j \in path(l)} \frac{(b_j + a_j T_j')^{\delta(T_j'' \neq T_j')}}{(b_j + N_j')^{\delta(N_j'' \neq N_j')}} \left( \frac{S^{n_{jk}''}_{t_{jk}'', a_j}}{S^{n_{jk}'}_{t_{jk}', a_j}} \right)^{\delta(n_{jk}'' \neq n_{jk}' \,\vee\, t_{jk}'' \neq t_{jk}')} \frac{(t_{jk}'')^{\delta(t_{jk}'' \neq t_{jk}')}\,(n_{jk}'' - t_{jk}'')^{\delta(n_{jk}'' - t_{jk}'' \neq n_{jk}' - t_{jk}')}}{(n_{jk}'')^{\delta(n_{jk}'' \neq n_{jk}')}}. \tag{9}$$

5.2 Constraint Analysis

As discussed in the previous section, when we add or remove a data item from
the tree, the statistics associated with this data item cannot be changed unless
some conditions are satisfied. For removing a data item l from the tree, the
related statistics can be changed if one of the following conditions holds:

1. Data item l has no table contribution, i.e., $u_l = L$, or
2. No other data points share the tables created by data item l, i.e., $\forall j \in path(l)\ (d(j) \ge u_l \Rightarrow n_{jk} = 1)$, or
3. There are other tables, created by other data points, serving the same dishes as those created by data item l in the same nodes, i.e., $\forall j \in path(l)\ (d(j) \ge u_l \Rightarrow t_{jk} > 1)$.

When l is added back to the tree, its table indicator cannot always take an arbitrary value from 0 to L; it should be constrained to some range. That is, if l is added to menu item k with current table count $t_k = 0$, l will create a new table and contribute one table count to the current node (i.e., $t_k + 1$), and $u_l$ will be set to d(j) such that $u_l < L$. Thus, the value of $u_l$ should lie in the following interval:

$$u_l \in [u_l^{min},\, u_l^{max}]$$

where $u_l^{min}$ denotes the minimum value of the table indicator for l, and $u_l^{max}$ the maximum value. These can be shown to be

$$u_l^{max} = \begin{cases} \min\{d(j) : j \in path(l),\ t_{jk} = 0\}, & \text{if } \exists j,\ t_{jk} = 0 \\ L, & \text{if } \forall j,\ t_{jk} > 0 \end{cases}$$

$$u_l^{min} = \begin{cases} 0, & \text{if } t_{0k} = 0 \\ 1, & \text{otherwise} \end{cases} \tag{10}$$

5.3 Application to the HDP-LDA Model


In this section, we apply our proposed table representation and block Gibbs
sampler to the hierarchical Dirichlet process variant of latent Dirichlet allocation
(HDP-LDA)[14] in topic modeling. In the HDP-LDA model, each document
corresponds to a restaurant in the Chinese restaurant representation, or a DP in
the HDP. All the restaurants share the same parent, that is to say all the DPs
share the same global probability measure drawn from another DP.
To develop the sampling formulas for the HDP-LDA model, we need to in-
corporate the prior distribution for the topics in the joint distribution shown by
Eq. (5). The prior distribution of the topic-word matrix adopted in our work is
the Dirichlet distribution. Thus, given K topics, W words in vocabulary, and a
topic-word matrix ΦK×W which has a Dirichlet distributed prior γ, the joint
distribution of z 1:J , u1:J can be derived by incorporating the prior in Eq. (5)
and integrating out all φk as
$$\Pr(z_{1:J}, u_{1:J}) = \left( \prod_{j \ge 0} \frac{(b_j|a_j)_{T_j}}{(b_j)_{N_j}} \prod_k S^{n_{jk}}_{t_{jk}, a_j}\, \frac{t_{jk}!\,(n_{jk} - t_{jk})!}{n_{jk}!} \right) \left( \prod_k \frac{Beta_W(\gamma + M_k)}{Beta_W(\gamma)} \right) \tag{11}$$

where $M_k$ denotes the vector of counts of words attached to topic k in the document collection, and $Beta_W(\gamma)$ is the W-dimensional beta function that normalizes the Dirichlet. Specifically, in the case of HDP-LDA, we have $a_j = 0$ for all j, and documents are indexed by $j \in [1, D]$.
Now, beginning with the joint distribution, Eq. (11), and using the chain rule, we obtain the full joint conditional distribution according to the statistics before/after removing l and after adding l as follows:

1. If $\forall j',\ t_{j'k} = 0$,

$$\Pr(z_l = k_{new}, u_l = u \mid z_{1:J} - z_l, u_{1:J} - u_l) \propto \frac{b_0\, b_1}{b_0 + \sum_k Tt[k]} \cdot \frac{\gamma_l + M_{kl}}{\sum_l (\gamma_l + M_{kl})} \tag{12}$$

2. If $t_{jk} \neq 0$, $t_{0k} \neq 0$,

$$\Pr(z_l = k, u_l = u \mid z_{1:J} - z_l, u_{1:J} - u_l) \propto \frac{S^{n_{jk}''}_{t_{jk}'', 0}}{S^{n_{jk}'}_{t_{jk}', 0}} \cdot \frac{(t_{jk}'')^{\delta(t_{jk}'' \neq t_{jk}')}\,(n_{jk}'' - t_{jk}'')^{\delta(n_{jk}'' - t_{jk}'' \neq n_{jk}' - t_{jk}')}}{(n_{jk}'')^{\delta(n_{jk}'' \neq n_{jk}')}} \cdot \frac{\gamma_l + M_{kl}}{\sum_l (\gamma_l + M_{kl})} \tag{13}$$

3. If $t_{jk} = 0$, $t_{0k} \neq 0$,

$$\Pr(z_l = k, u_l = u \mid z_{1:J} - z_l, u_{1:J} - u_l) \propto \frac{b_1\, Tt[k]^2}{(Tt[k] + 1)\,(\sum_k Tt[k] + b_0)} \cdot \frac{\gamma_l + M_{kl}}{\sum_l (\gamma_l + M_{kl})} \tag{14}$$

where $Tt[k]$ denotes the number of tables serving dish k (i.e., topic k), and $M_{kl}$ indicates the total number of occurrences of word l assigned to k in the document collection.
Note that implementing the block sampler requires keeping track of the table indicators for each data item. This can in fact be avoided in the same way that the Sampling for Seating Arrangements approach [19, 26] works: the values can be randomly reconstructed bottom-up as needed. A single $u_l$ can be sampled this way, and the remaining $u_{1:J} - u_l$ do not need to be sampled, since only their statistics are needed, and these can be reconstructed from t and $u_l$. Thus our implementation records the t but not $u_{1:J}$.

6 Experiments
We compared the proposed algorithm with Teh et al.'s [14] "posterior sampling by direct assignment" sampler4 as well as Buntine and Hutter's collapsed sampler [18] on five datasets, namely the Health, Person, Obama, NIPS and Enron datasets. All three algorithms are implemented in C and run on a desktop with an Intel(R) Core(TM) Quad CPU (2.4GHz), although our code is not multi-threaded.
The Obama dataset came from a collection of 7M Blogs (from ICWSM 2009,
posts from Aug-Sep 2008) by issuing the query “obama debate” under Lucene.
We performed fairly standard tokenization, created a vocabulary of terms that
occurred at least five times after excluding stopwords, and then built the bag
of words. The Health data set is similar except the source is 1M News articles
(LDC Gigaword) using the query “health medicine insurance” and words needed
to occur at least twice. The Person dataset has the source of 805k News articles
(Reuters RCV1) using the query “person” and using words that occurred at least
four times. The Enron and NIPS datasets have been obtained as preprocessed
bagged data from UCI where the data has been used in several papers. We list
some statistics of these datasets in Table 1.

Table 1. Statistics of the five datasets

Health Person Obama NIPS Enron


# words 1,119,678 1,656,574 1,382,667 1,932,365 6,412,172
# documents 1,655 8,616 9,295 1,500 39,861
vocabulary size 12,863 32,946 18,138 12,419 28,102

4 The "posterior sampling by direct assignment" sampler is preferred over the other two samplers in [14], due to its straightforward bookkeeping, as suggested by Teh et al.

6.1 Experiment Setup and Evaluation Criteria

The algorithms tested are: our proposed block Gibbs sampler for HDP-LDA, Sampling by Table Configurations, denoted STC; Teh et al.'s "Sampling by Direct Assignment" algorithm [14], denoted SDA; the Collapsed Gibbs Table Sampler by Buntine et al. [18], denoted CTS; and finally, a variant of the proposed STC that initializes word topics with SDA and samples tables for each document using STC, denoted SDA+STC. The reason for using the fourth algorithm is to isolate the impact of the new sampler.
Coming to the evaluation criteria for topic models, there are many different evaluation methods, such as importance sampling methods, the harmonic mean method, the "left-to-right" algorithm, etc.; see [27, 28] for a complete survey. In this paper, we adopt the "left-to-right" algorithm to calculate the test perplexities because it is unbiased [27]. This algorithm calculates the perplexity over words following a "left-to-right" manner for each document, which is defined as [27, 28]:

$$\Pr(w \mid \Phi, \alpha m) = \prod_n \Pr(w_n \mid w_{<n}, \Phi, \alpha m) \tag{15}$$

where w is all the words in the documents, Φ is the topic distribution matrix, α is
the concentration parameter of the Dirichlet prior over topics, and m is its base
measure. The perplexity is computed in the log space. Since table counts are
also latent variables in the formulation of perplexity, we force all table counts
to be less than or equal to one so that the unbiased method can be applied
directly. This condition has been observed to hold generally when the PDP
hyperparameters are well optimized/fit.
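As an illustration of Eq. (15), the following is a single-particle sketch of the left-to-right computation for one document; the full algorithm of [27, 28] averages over several particles, and the function and variable names here are illustrative assumptions.

import numpy as np

def left_to_right_log2_perplexity(doc, phi, alpha_m, rng=None):
    # doc: sequence of word ids; phi: K x W topic-word matrix;
    # alpha_m: length-K Dirichlet prior vector (alpha * m).
    rng = rng or np.random.default_rng()
    K = phi.shape[0]
    counts = np.zeros(K)                  # topic counts of the words seen so far
    log2p = 0.0
    for pos, w in enumerate(doc):
        theta = (counts + alpha_m) / (pos + alpha_m.sum())
        log2p += np.log2(theta @ phi[:, w])   # Pr(w_n | w_<n, Phi, alpha*m)
        post = theta * phi[:, w]              # particle update: sample z_n
        counts[rng.choice(K, p=post / post.sum())] += 1
    return -log2p / len(doc)                  # per-word log2 perplexity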

6.2 Parameter Setting


While there are many parameters in the proposed model, e.g., the two PDP parameters $a_i, b_i$ for each document, for simplicity we set all the $b_i$'s in the same level of the tree structure to the same value and optimize it by sampling; meanwhile, we set all $a_i$'s to 0 since we are testing the HDP model. More specifically, we follow Teh et al. [1] and sample $b_i$ by introducing Beta-distributed auxiliary variables for each document.
For the other parameters, we need to set the number of test documents for each dataset. In the experiments, we set these to approximately 5% of the total size of each dataset. Specifically, the number of test documents is 250/1655 for the Health dataset, 500/8616 for the Person dataset, 500/9295 for the Obama dataset, 50/1500 for the NIPS dataset, and 500/39861 for the Enron dataset. Here, x/y means x out of y total documents are used for testing. Moreover, we set the symmetric Dirichlet prior of the word distributions over topics (the Dirichlet prior for the topic matrix Φ in the LDA [13] case) to 0.01, a default value used in most LDA experiments.

Table 2. Test log2(perplexities) on the five datasets

Dataset   Health                 Person                 Obama
          I = 100    I = 200     I = 1000   I = 2000    I = 1000   I = 2000
SDA       11.628281  11.619546   11.930657  11.904425   11.144188  11.134732
CTS       11.655493  11.636743   11.940532  11.947740   11.191377  11.174327
SDA+STC   11.582969  11.573457   11.844319  11.829628   11.094079  11.090389
STC       11.547999  11.551453   11.858719  11.852253   11.210295  11.201241

Dataset   Enron                  NIPS
          I = 500    I = 1000    I = 1000   I = 2000
SDA       10.847454  10.768568   10.564221  10.558330
SDA+STC   10.768568  10.659724   10.534148  10.518792
STC       10.899923  10.810127   10.474467  10.425393

[Figure: testing perplexity vs. training time (s) for SDA, STC, SDA+STC and, where available, CTS; panels (a) Health data with I = 100, (b) Health data with I = 200, (c) Person data with I = 1000, (d) Person data with I = 2000, (e) Obama data with I = 1000, (f) Obama data with I = 2000, (g) Enron data with I = 1000, (h) NIPS data with I = 1000, (i) NIPS data with I = 2000]

Fig. 3. Test log2(perplexities) evolved with training time; I means the initial number of topics

6.3 Perplexities

We first give the experimental results for the four algorithms on the five datasets in terms of testing perplexity, using the "left-to-right" algorithm [27, 28]. We also need to set the initial number of topics for each of these algorithms, though the final values can be sampled by the algorithms. Generally, to accelerate convergence, we initialized large datasets with more topics than small datasets. Specifically, we initialized the Health dataset with 100 and 200 topics, the Enron dataset with 500 and 1000 topics5, and the other three datasets with 1000 and 2000 topics, respectively. Furthermore, we used 2000 major cycles for burn-in6. Table 2 summarizes the test perplexities for these algorithms.
From Table 2, we can see that the proposed fully exchangeable block Gibbs sampler STC obtains significantly better results than the "sampling by direct assignment" sampler in most cases7, while SDA+STC, which uses SDA to burn in and then samples tables with the proposed sampler, obtains consistently better results than SDA. One interesting observation is that the collapsed Gibbs sampler does not perform as well as expected; we expect this is due to poor mixing of the Gibbs sampler, since the table counts are updated separately from the topics.

6.4 Convergence Speed


Although our proposed sampler looks much more complex, it converges almost as fast as Teh et al.'s sampler and actually converges to better values. To verify this, we set up another set of experiments. We started all the algorithms with a clock to record the total training time and randomly initialized the samplers. After each fixed amount of training time, say 1 hour for large datasets and 40 minutes for small datasets, we calculated the testing perplexities on the corresponding testing datasets using the current statistics, and recorded the perplexities. In order to do a fair comparison, all codes for these algorithms have been carefully implemented to reach their optimal speed. We observed that although the running time for each Gibbs cycle of our algorithm is slightly longer than that of Teh et al.'s algorithm, our algorithm can still converge very fast in most cases. Figure 3 plots the trends of the testing perplexities on each dataset against the training time. We can see from the figures that the proposed algorithm converges well except on the largest dataset, Enron. Subsequent experiments show this could be overcome with better initialization.

7 Conclusion
In this paper, we have proposed a new table representation that inherits the full
exchangeability from the two-parameter Poisson-Dirichlet processes or Dirichlet
Processes to their hierarchical extensions, and developed a block Gibbs sam-
pling algorithm for doing inference on this representation. Meanwhile, we have
5 This dataset is too large to initialize with a large number of initial topics.
6 For some large datasets, a 2000-cycle burn-in is impractical; thus we set a maximum burn-in time of 24 hours.
7 There is one case, the Obama dataset, where STC is not as good as SDA; this may be due to the initialization.

applied the proposed algorithm to the HDP-LDA model, and used the block
Gibbs sampler to jointly sample the topic and table indicator configuration for
each word.
Experimental results showed that, with proper initialization and sampling only tables, SDA+STC yields consistently better results than the "sampling by direct assignment" algorithm SDA in terms of testing perplexity, and the block Gibbs sampler STC with full exchangeability can always outperform SDA. Furthermore, we have demonstrated that though the proposed model is more complicated than the original one, its convergence speed is faster than the "sampling by direct assignment" algorithm's. Interestingly, an earlier alternative collapsed Gibbs sampler performed poorly; we expect this is because STC allows better mixing of the HPDP/HDP parts of the model with the other parts of the model.
We claim that our new representation and the performance improvements should extend to many of the other Gibbs algorithms now in use for different models embedding the HPDP or HDP. Thus our methods can be applied quite broadly within the non-parametric Bayesian community. The advantages of our method over the various CRP-based approaches mentioned in Section 3 and over our earlier collapsed sampler are:
– The introduction of the table indicator variable guarantees full exchangeability. This is important to eliminate sequential effects from the Gibbs sampler.
– Tracking the table contribution can reduce the information loss that may result from summing out all the seating arrangements. This makes the Gibbs sampler mix more rapidly.

Acknowledgments. NICTA is funded by the Australian Government as rep-


resented by the Department of Broadband, Communications and the Digital
Economy and the Australian Research Council through the ICT Center of Ex-
cellence program.

References
1. Teh, Y.W.: A hierarchical Bayesian language model based on Pitman-Yor pro-
cesses. In: ACL 2006, pp. 985–992 (2006)
2. Goldwater, S., Griffiths, T., Johnson, M.: Interpolating between types and tokens
by estimating power-law generators. In: NIPS 2006, pp. 459–466 (2006)
3. Mochihashi, D., Sumita, E.: The infinite Markov model. In: NIPS 2008, pp. 1017–
1024 (2008)
4. Johnson, M., Griffiths, T., Goldwater, S.: Adaptor grammars: A framework for
specifying compositional nonparametric Bayesian models. In: NIPS 2007, pp. 641–
648 (2007)
5. Wallach, H., Sutton, C., McCallum, A.: Bayesian modeling of dependency trees
using hierarchical Pitman-Yor priors. In: Proceedings of the Workshop on Prior
Knowledge for Text and Language (in Conjunction with ICML/UAI/COLT), pp.
15–20 (2008)
6. Wood, F., Archambeau, C., Gasthaus, J., James, L., Teh, Y.: A stochastic memo-
izer for sequence data. In: ICML 2009, pp. 1129–1136 (2009)

7. Rasmussen, C.: The infinite Gaussian mixture model. In: NIPS 2000, pp. 554–560
(2000)
8. Pruteanu-Malinici, I., Ren, L., Paisley, J., Wang, E., Carin, L.: Hierarchical
Bayesian modeling of topics in time-stamped documents. TPAMI 32, 996–1011
(2010)
9. Xu, Z., Tresp, V., Yu, K., Kriegel, H.P.: Infinite hidden relational models. In: UAI
2006, pp. 544–551 (2006)
10. Teh, Y.W., Jordan, M.I.: Hierarchical Bayesian nonparametric models with appli-
cations. In: Bayesian Nonparametrics: Principles and Practice (2010)
11. Ishwaran, H., James, L.: Gibbs sampling methods for stick-breaking priors. Journal
of ASA 96, 161–173 (2001)
12. Buntine, W., Jakulin, A.: Discrete components analysis. In: Subspace, Latent
Structure and Feature Selection Techniques (2006)
13. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
14. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes.
Journal of the ASA 101, 1566–1581 (2006)
15. Du, L., Buntine, W., Jin, H.: A segmented topic model based on the two-parameter
Poisson-Dirichlet process. Mach. Learn. 81, 5–19 (2010)
16. Du, L., Buntine, W., Jin, H.: Sequential latent Dirichlet allocation: Discover un-
derlying topic structures within a document. In: ICDM 2010, pp. 148–157 (2010)
17. Pitman, J., Yor, M.: The two-parameter Poisson-Diriclet distribution derived from
a stable subordinator. Annals Prob. 25, 855–900 (1997)
18. Buntine, W., Hutter, M.: A Bayesian review of the Poisson-Dirichlet process. Tech-
nical Report arXiv:1007.0296, NICTA and ANU, Australia (2010)
19. Teh, Y.: A Bayesian interpretation of interpolated Kneser-Ney. Technical Report
TRA2/06, School of Computing, National University of Singapore (2006)
20. Buntine, W., Du, L., Nurmi, P.: Bayesian networks on Dirichlet distributed vectors.
In: PGM 2010, pp. 33–40 (2010)
21. Blei, D.M., Griffiths, T.L., Jordan, M.I.: The nested Chinese restaurant process and
Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 1–30 (2010)
22. Teh, Y.: Nonparametric Bayesian mixture models - release 2.1. Technical Re-
port University College London (2004), http://www.gatsby.ucl.ac.uk/~ywteh/
research/software.html
23. Wang, C., Blei, D.: Variational inference for the nested Chinese restaurant process.
In: NIPS 2009, pp. 1990–1998 (2009)
24. Wang, C., Paisley, J., Blei, D.: Online variational inference for the hierarchical
Dirichlet process. In: AISTATS 2011 (2011)
25. Teh, Y., Kurihara, K., Welling, M.: Collapsed variational inference for HDP. In:
NIPS 2007 (2007)
26. Blunsom, P., Cohn, T., Goldwater, S., Johnson, M.: A note on the implementation
of hierarchical Dirichlet processes. In: ACL 2009, pp. 337–340 (2009)
27. Buntine, W.: Estimating likelihoods for topic models. In: Zhou, Z.-H., Washio, T.
(eds.) ACML 2009. LNCS, vol. 5828, pp. 51–64. Springer, Heidelberg (2009)
28. Wallach, H., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for
topic models. In: ICML 2009, pp. 672–679 (2009)
Preference-Based Policy Iteration: Leveraging
Preference Learning for Reinforcement Learning

Weiwei Cheng1, Johannes Fürnkranz2, Eyke Hüllermeier1, and Sang-Hyeun Park2

1 Department of Mathematics and Computer Science, Marburg University
{cheng,eyke}@mathematik.uni-marburg.de
2 Department of Computer Science, TU Darmstadt
{juffi,park}@ke.tu-darmstadt.de

Abstract. This paper makes a first step toward the integration of two
subfields of machine learning, namely preference learning and reinforce-
ment learning (RL). An important motivation for a “preference-based”
approach to reinforcement learning is a possible extension of the type of
feedback an agent may learn from. In particular, while conventional RL
methods are essentially confined to deal with numerical rewards, there
are many applications in which this type of information is not naturally
available, and in which only qualitative reward signals are provided in-
stead. Therefore, building on novel methods for preference learning, our
general goal is to equip the RL agent with qualitative policy models, such
as ranking functions that allow for sorting its available actions from most
to least promising, as well as algorithms for learning such models from
qualitative feedback. Concretely, in this paper, we build on an existing
method for approximate policy iteration based on roll-outs. While this
approach is based on the use of classification methods for generalization
and policy learning, we make use of a specific type of preference learning
method called label ranking. Advantages of our preference-based policy
iteration method are illustrated by means of two case studies.

1 Introduction

Standard methods for reinforcement learning (RL) assume feedback to be spec-


ified in the form of real-valued rewards. While such rewards are naturally gen-
erated in some applications, there are many cases in which precise numerical
information is difficult to extract from the environment, or in which the spec-
ification of such information is largely arbitrary—as a striking though telling
example, to which we shall return in Section 5, consider assigning a negative
reward of −60 to the death of the patient in a medical treatment [17]. The quest
for numerical information, even if accomplishable in principle, may also compro-
mise efficiency in an unnecessary way. In a game playing context, for example, a
short look-ahead from the current state may reveal that an action a is most likely superior to an action a′; however, the precise numerical gains are only known at
the end of the game. Moreover, external feedback, which is not produced by the


environment itself but, say, by a human expert (e.g., "In this situation, action a would have been better than a′"), is typically of a qualitative nature, too.
In order to make RL more amenable to qualitative feedback, we build upon
formal concepts and methods from the rapidly growing field of preference learn-
ing [5]. Roughly speaking, we consider the RL task as a problem of learning
the agent’s preferences for actions in each possible state, that is, as a problem
of contextualized preference learning (with the context given by the state). In
contrast to the standard approach to RL, the agent’s preferences are not nec-
essarily expressed in terms of a utility function. Instead, more general types of
preference models, as recently studied in preference learning, can be envisioned,
such as total and partial order relations.
Interestingly, this approach is in a sense in-between the two extremes that
have been studied in RL so far, namely learning numerical utility functions for
all actions (as in Q-learning [15]) and, on the other hand, directly learning a
policy which predicts a single best action in each state [11]. One may argue that
the former approach is unnecessarily complex, since precise utility degrees are
actually not necessary for taking optimal actions, whereas the latter approach
is not fully effectual, since a prediction in the form of a single action does nei-
ther suggest alternative actions nor offer any means for a proper exploration.
An order relation on the set of actions seems to provide a reasonable compro-
mise, as it supports the exploration of acquired knowledge, i.e., the selection of
(presumably) optimal actions, as well as the exploration of alternatives, i.e., the
selection of suboptimal but still promising actions.
In this paper, we make a first step toward the integration of preference learn-
ing and reinforcement learning. We build upon a policy learning approach called
approximate policy iteration, which will be detailed in Section 2, and propose
a preference-based variant of this algorithm (Section 3). While the original ap-
proach is based on the use of classification methods for generalization and pol-
icy learning, we employ label ranking algorithms for incorporating preference
information. Advantages of our preference-based policy iteration method are
illustrated by means of two case studies presented in Sections 4 and 5.

2 Approximate Policy Iteration

Conventional reinforcement learning assumes a scenario in which an agent moves


through a (finite) state space S by repeatedly selecting actions from a set of
actions A = {a1 , . . . , ak }. A Markovian state transition function δ : S × A →
P(S), where P(S) denotes the set of probability distributions over S, randomly
takes the agent to a new state, depending on the current state and the chosen
action. Occasionally, the agent receives feedback about its actions in the form
of a reward signal r : S × A → R, where r(s, a) is the reward the agent receives
for performing action a in state s. The goal of the agent is to choose its actions
so as to maximize its expected total reward.
The most common task is to learn a policy π : S → A that prescribes the
agent how to act optimally in each situation (state). More specifically, the goal

is often defined as maximizing the expected sum of rewards (given the initial
state s), with future rewards being discounted by a factor γ ∈ [0, 1]:
$$V^\pi(s) = \mathbf{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \;\middle|\; s_0 = s\right] \tag{1}$$

where (s0 , s1 , s2 , . . .) is a trajectory of π through the state space. With V ∗ (s)


the best possible value that can be achieved for (1), a policy is called optimal
if it achieves the best value in each state s. Thus, one possibility to learn an
optimal policy is to learn an evaluation of states in the form of a value function
[12], or to learn a so-called Q-function which returns the expected reward for a
given state-action pair [15]:

$$Q^\pi(s, a) = r(s, a) + \gamma \cdot V^\pi(\delta(s, a))$$

Instead of determining optimal actions indirectly through learning the value


function or the Q-function, one may try to learn a policy directly in the form of
a mapping from states to actions. Approaches following this line include actor-
critic algorithms, which learn both the value function (the critic) and an explicit
policy (the actor) simultaneously [1,10,2], and policy gradient methods, which
search for a good parameter setting in a space of parametrized policies [16,9,13].
A particularly interesting approach is approximate policy iteration with roll-
outs [11,3]. The key idea of this approach is to use a generative model of the
underlying process to perform simulations that in turn allow for approximating
the value of an action in a given state (Algorithm 1). To this end, the action is
performed, resulting in a state s1 = δ(s, a). The value of this state is estimated
by performing so-called roll-outs, i.e., by repeatedly selecting actions following a
policy π for at most T steps, and finally accumulating the observed rewards. This
is repeated K times and the average reward over these K roll-outs is returned
as an approximate Q-value Q̃π (s0 , a) for taking action a in state s0 (leading to
s1 ) and following policy π thereafter.
These roll-outs are then used in a policy iteration loop (Algorithm 2), which
iterates through each state, simulates all actions in this state, and determines
the action a∗ that promises the highest Q-value. If a∗ is significantly better than
all alternative actions in this state (indicated with the symbol >T in line 10), a
training example (s, a∗ ) is added to a training set T . Eventually, T is used to
directly learn a mapping from states to actions, which forms the new policy π  .
This process is repeated several times, until some stopping criterion is met (e.g.,
if the policy does not improve from one iteration to the next).
We should note some minor differences between the version presented in
Algorithm 2 and the original formulation [11]. Most notably, the training set
here is formed as a multi-class training set, whereas in [11] it was formed as a
binary training set, learning a binary policy predicate π̂ : S × A → {0, 1}. We
chose the more general multi-class representation because, as we will see in the
following, it lends itself to an immediate generalization to a ranking scenario.
Algorithm 1. Rollout(E, s1, γ, π, K, T): Estimation of state-action values

Require: generative environment model E, sample state s1, discount factor γ, policy π, number of trajectories/roll-outs K, max. length/horizon of each trajectory T
  for k = 1 to K do
    s ← s1, Q̃k ← 0, t ← 1
    while t < T and ¬TerminalState(s) do
      (s′, r) ← Simulate(E, s, π(s))
      Q̃k ← Q̃k + γ^t r
      s ← s′, t ← t + 1
    end while
  end for
  Q̃ = (1/K) ∑_{k=1}^{K} Q̃k
  return Q̃
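For concreteness, the roll-out estimation can be sketched in Python as follows. The functions simulate (one environment step of the generative model E) and is_terminal are placeholders assumed by this sketch, not part of the original implementation.

def rollout(simulate, is_terminal, s1, gamma, policy, K, T):
    # Estimate the value of state s1 by averaging the discounted return of
    # K trajectories that follow `policy` for at most T steps (Algorithm 1).
    q_sum = 0.0
    for _ in range(K):
        s, q_k, t = s1, 0.0, 1
        while t < T and not is_terminal(s):
            s_next, r = simulate(s, policy(s))  # one simulated transition
            q_k += gamma ** t * r               # accumulate discounted reward
            s, t = s_next, t + 1
        q_sum += q_k
    return q_sum / K                            # average over the K roll-outs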

3 Preference-Based Reinforcement Learning


The key idea of our approach is to replace the (quantitative) evaluation of in-
dividual actions by the (qualitative) comparison between pairs of actions. Com-
parisons of that kind are in principle enough to make optimal decisions. Besides,
they are often more natural and less difficult to acquire, especially in applica-
tions where the environment does not provide numerical rewards in a natural
way. As will be seen later on, comparing pairs instead of evaluating individual actions also has a number of advantages from a learning point of view.
The basic piece of information we consider is a pairwise preference of the form ai ≻s aj or, more specifically, ai ≻s^π aj, suggesting that in state s, taking action ai (and following policy π afterward) is better than taking action aj.
A preference of this kind can be interpreted in different ways. For example,
assuming the existence of an underlying (though not necessarily known) reward
function, it may simply mean that Qπ (s, ai ) > Qπ (s, aj ).
Evaluating a trajectory t = (s0, s1, s2, . . .) in terms of its (expected) total reward reduces the comparison of trajectories to the comparison of real numbers; thus, comparability is enforced and a total order on trajectories is induced. More generally, and arguably more in line with the idea of qualitative feedback, one may assume a partial order relation ≻ on trajectories, which means that trajectories t and t′ can also be incomparable. A contextual preference can then be defined as follows:

ai ≻s^π aj ⇔ P(t(ai) ≻ t(aj)) > P(t(aj) ≻ t(ai)) ,

where t(ai) denotes the (random) trajectory produced by taking action ai in state s0 and following π thereafter, and P(t ≻ t′) is the probability that trajectory t is preferred to t′. An in-depth discussion of this topic is beyond the scope of this paper; however, an example in which ≻ is a Pareto-dominance relation will be presented in Section 5.
Algorithm 2. Multi-class variant of Approx. Policy Iteration with Roll-Outs [11]

Require: generative environment model E, sample states S, discount factor γ, initial (random) policy π0, number of trajectories/roll-outs K, max. length/horizon of each trajectory T, max. number of policy iterations p
 1: π′ ← π0
 2: repeat
 3:   π ← π′, T ← ∅
 4:   for each s ∈ S do
 5:     for each a ∈ A do
 6:       (s′, r) ← Simulate(E, s, a)  # do (possibly off-policy) action a
 7:       Q̃π(s, a) ← Rollout(E, s′, γ, π, K, T) + r  # estimate state-action value
 8:     end for
 9:     a* ← arg max_{a∈A} Q̃π(s, a)
10:     if Q̃π(s, a*) >T Q̃π(s, a) for all a ∈ A, a ≠ a* then
11:       T ← T ∪ {(s, a*)}
12:     end if
13:   end for
14:   π′ ← learn(T)
15: until StoppingCriterion(E, π, π′, p)
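The iteration loop admits a similarly compact rendering in Python. The sketch below reuses the rollout function sketched above; significantly_better stands in for the statistical test >T and learn for the multi-class base learner, both placeholders of this sketch.

def approximate_policy_iteration(simulate, is_terminal, states, actions,
                                 gamma, policy, K, T, max_iterations,
                                 significantly_better, learn):
    # Multi-class approximate policy iteration with roll-outs (cf. Algorithm 2).
    for _ in range(max_iterations):
        train = []
        for s in states:
            # estimate Q(s, a) for every action: one simulated step plus roll-outs
            q = {}
            for a in actions:
                s_next, r = simulate(s, a)
                q[a] = rollout(simulate, is_terminal, s_next,
                               gamma, policy, K, T) + r
            best = max(q, key=q.get)
            # keep the state only if the best action beats all alternatives
            if all(significantly_better(q[best], q[a])
                   for a in actions if a != best):
                train.append((s, best))
        policy = learn(train)  # the learned classifier becomes the new policy
    return policy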

In order to realize our idea of preference-based approximate policy iteration, to be detailed in Section 3.2 below, we need a learning method that induces a
suitable preference model on the basis of training information in the form of
pairwise preferences of the above kind. Ideally, given a state, the model allows
one to rank the possible actions from (presumably) most to least desirable. A
setting nicely matching these requirements is the setting of label ranking.

3.1 Label Ranking


Like in the conventional setting of supervised learning (classification), assume
to be given an instance space X and a finite set of labels Y = {y1 , y2 , . . . , yk }.
In label ranking, the goal is to learn a “label ranker” in the form of an X → SY
mapping, where the output space SY is given by the set of all total orders
(permutations) of the set of labels Y . Thus, label ranking can be seen as a
generalization of conventional classification, where a complete ranking

y_{τx^{-1}(1)} ≻x y_{τx^{-1}(2)} ≻x . . . ≻x y_{τx^{-1}(k)}

is associated with an instance x instead of only a single class label. Here, τx is a permutation of {1, 2, . . . , k} such that τx(i) is the position of label yi in the ranking associated with x.
The training data T used to induce a label ranker typically consists of a set
of pairwise preferences of the form yi x yj , suggesting that, for instance x, yi
is preferred to yj . In other words, a single “observation” consists of an instance
x together with an ordered pair of labels (yi , yj ).
Several methods for label ranking have already been proposed in the literature;
we refer to [14] for a comprehensive survey. The idea of learning by pairwise
comparison (LPC) [8] is to train a separate model Mi,j for each pair of labels
(yi , yj ) ∈ Y × Y , 1 ≤ i < j ≤ k; thus, a total number of k(k − 1)/2 models is
needed. At classification time, a query x is submitted to all models, and each
prediction Mi,j(x) is interpreted as a vote for a label. More specifically, assuming scoring classifiers that produce normalized scores fi,j = Mi,j(x) ∈ [0, 1], the weighted voting technique interprets fi,j and fj,i = 1 − fi,j as weighted votes for classes yi and yj, respectively, and predicts the class y* with the highest sum of weighted votes, i.e., y* = arg max_i ∑_{j≠i} fi,j. We refer to [8] for a more detailed
description of LPC in general and a theoretical justification of the weighted
voting procedure in particular.
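For illustration, the weighted voting step can be written in a few lines of Python; models is assumed to hold one trained scoring classifier per label pair (i, j), i < j, returning a normalized score in [0, 1].

def lpc_rank(models, x, k):
    # Rank k labels by weighted voting over the k(k-1)/2 pairwise models.
    votes = [0.0] * k
    for i in range(k):
        for j in range(i + 1, k):
            f_ij = models[(i, j)](x)  # score of label i against label j
            votes[i] += f_ij          # weighted vote for label i
            votes[j] += 1.0 - f_ij    # complementary vote for label j
    # labels ordered from (presumably) most to least preferred
    return sorted(range(k), key=lambda i: votes[i], reverse=True)

The first element of the returned ranking is exactly the prediction y* of the weighted voting procedure.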

3.2 Preference-Based Approximate Policy Iteration


Recall the scenario described at the end of Section 2, where the agent has access
to a generative model E, which takes a state s and an action a as input and
returns a successor state s′ and the reward r(s, a). As in [11], this scenario is
used for generating training examples via roll-outs, i.e., by using the generative
model and the current policy π for generating a training set T , which is used for
training a multi-class classifier that can be used as a policy.
Following our idea of preference-based RL, we train a label ranker instead of a classifier: Using the notation from Section 3.1 above, the instance space X
is given by the state space S, and the set of labels Y corresponds to the set of
actions A. Thus, the goal is to learn a mapping S → SA , which maps a given
state to a total order (permutation) of the available actions. In other words,
the task of the learner is to learn a function that is able to rank all available
actions in a state. The training information is provided in the form of binary action preferences of the form (s, ak ≻ aj), indicating that in state s, action ak is preferred to action aj.
From a training point of view, a key advantage of this approach is that pairwise
preferences are much easier to elicit than examples for unique optimal actions.
Our experiments in Sections 4 and 5 utilize this in different ways.
Section 4 demonstrates that a comparison of only two actions is less difficult
than “proving” the optimality of one among a possibly large set of actions, and
that, as a result, our preference-based approach better exploits the gathered
training information. Indeed, the procedure proposed in [11] for forming train-
ing examples is very wasteful with this information. An example (s, a∗ ) is only
generated if a∗ is “provably” the best action among all candidates, namely if it
is (significantly) better than all other actions in the given state. Otherwise, if
this superiority is not confirmed by a statistical hypothesis test, all information
about this state is ignored. In particular, no training examples would be gen-
erated in states where multiple actions are optimal, even if they are clearly better
than all remaining actions.1 For the preference-based approach, on the other
hand, it suffices if only two possible actions yield a clear preference in order to
obtain (partial) training information about that state. Note that a corresponding
comparison may provide useful information even if both actions are suboptimal.
In Section 5, an example will be shown in which actions are not necessarily
comparable, since the agent seeks to optimize multiple criteria at the same time
(and is not willing to aggregate them into a one-dimensional target). In general,
this means that, while at least some of the actions will still be comparable in a
pairwise manner, a unique optimal action does not exist.
Regarding the type of prediction produced, it was already mentioned earlier
that a ranking-based reinforcement learner can be seen as a reasonable compro-
mise between the estimation of a numerical utility function (like in Q-learning)
and a classification-based approach which provides only information about the
optimal action in each state: The agent has enough information to determine
the optimal action, but can also rely on the ranking in order to look for alter-
natives, for example to steer the exploration towards actions that are ranked
higher. We will briefly return to this topic at the end of the next section. Before
that, we will discuss the experimental setting in which we evaluate the utility of
the additional ranking-based information.

4 Case Study I: Exploiting Action Preferences

In this section, we compare three variants of approximate policy iteration following Algorithm 2. They only differ in the way in which they use the information
gathered from the performed roll-outs.

Approximate Policy Iteration (API) generates one training example (s, a*) if a* is the best available action in s, i.e., if Q̃π(s, a*) >T Q̃π(s, a) for all a ≠ a*. If there is no action that is better than all alternatives, no training example is generated for this state.
Pairwise Approximate Policy Iteration (PAPI) works in the same way as API, but the underlying base learning algorithm is replaced with a label ranker. This means that each training example (s, a*) of API is transformed into |A| − 1 training examples of the form (s, a* ≻ a) for all a ≠ a*.
Preference-Based Policy Iteration (PBPI) is trained on all available pairwise preferences, not only those involving the best action. Thus, whenever Q̃π(s, ak) >T Q̃π(s, al) holds for a pair of actions (ak, al), PBPI generates a corresponding training example (s, ak ≻ al). Note that, contrary to PAPI, ak does not need to be the best action. In particular, it is not necessary that there is a clear best action in order to generate training examples. Thus, from the same roll-outs, PBPI will typically generate more training information than PAPI or API; a sketch contrasting these variants is given below.
1 In the original formulation as a binary problem, it is still possible to produce negative examples, which indicate that the given action is certainly not the best action (because it was significantly worse than the best action).
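The difference between the variants boils down to how the roll-out estimates gathered for one state are turned into training information. The following Python sketch contrasts API with PBPI; significantly_better again stands in for the test >T and is an assumption of the sketch.

from itertools import combinations

def training_info(q, significantly_better, variant):
    # q maps each action to its roll-out estimate for one state s.
    if variant == "API":
        best = max(q, key=q.get)
        if all(significantly_better(q[best], q[a]) for a in q if a != best):
            return [best]          # one multi-class example (s, a*)
        return []                  # otherwise the state is discarded
    # PBPI: every significantly ordered action pair yields a preference
    prefs = []
    for a, b in combinations(q, 2):
        if significantly_better(q[a], q[b]):
            prefs.append((a, b))   # preference a > b in state s
        elif significantly_better(q[b], q[a]):
            prefs.append((b, a))   # preference b > a in state s
    return prefs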
4.1 Application Domains

Following [3], we evaluated these variants on two well-known problems, inverted pendulum and mountain car. We will briefly recapitulate these tasks, which were used in their default setting, unless stated otherwise.
In the inverted pendulum problem, the task is to push or pull a cart so that
it balances an upright pendulum. The available actions are to apply a force of
fixed strength of 50 Newton to the left (-1), to the right (+1) or to apply no
force at all (0). The mass of the pole is 2 kg and of the cart 9 kg. The pole
has a length of 0.5 m and each time step is set to 0.1 seconds. Following [3], we
describe the state of the pendulum using only the angle and angular velocity
of the pole, ignoring the position and the velocity of cart. For each time step,
where the pendulum is above the horizontal line, a reward of 1 was given, else 0.
A policy was considered sufficient, if it is able to balance the pendulum longer
than 1000 steps (100 sec). The random samples in this setting were generated
by simulating a uniform random number (max 100) of uniform random actions
from the initial state (pole straight up, no velocity for cart and pole). If the
pendulum fell within this sequence, the procedure was repeated.
In the mountain car problem, the task is to drive a car out of a steep valley.
To do so, it has to repeatedly go up on each side of the hill, gaining momentum
by going down and up to the other side, so that eventually it can get out. Again,
the available actions are (−1) for a fixed level of throttle to the left (backward), (+1) for a fixed level of throttle to the right (forward), and (0) for no throttle. The states or feature vectors consist of the horizontal position and the current velocity of the car. Here, the agent received a reward of −1 in each step until the goal was reached. A policy which needed fewer than 75 steps to reach the goal was considered sufficient.

4.2 Experimental Setup

In addition to these conventional formulations using three actions in each state, we also used versions of these problems with 5, 9, and 17 actions, because in these cases it becomes less and less likely that a unique best action can be found, and the benefit from being able to utilize information from states where no clear winner emerges increases. The range of the original action set {−1, 0, 1} was partitioned equidistantly into the given number of actions; e.g., using 5 actions, the set of action signals is {−1, −0.5, 0, 0.5, 1}. Also, a uniform noise term
in [−0.2, 0.2] was added to the action signal, such that all state transitions are
non-deterministic. For training the label ranker we use LPC (cf. Section 3.1) with
simple multi-layer perceptrons (as implemented in the Weka machine learning
library [7] with its default parameters) as base classifiers. The discount factor
for both settings was set to 1 and the maximal length of the trajectory for the
inverted pendulum task was set to 1500 steps and 1000 for the mountain car task.
The policy iteration algorithms terminated if the learned policy was sufficient or
if the policy performance decreased or if the number of policy iterations reached
10. For the evaluation of the policy performance, 100 simulations beginning from
the corresponding initial states were utilized.
For each task and method, we tried five numbers of state samples s ∈ {10, 20,
50, 100, 200}, five maximum numbers of roll-outs r ∈ {10, 20, 50, 100, 200}, and
three levels of significance c ∈ {0.025, 0.05, 0.1}. Each of the 5 × 5 × 3 = 75
parameter combinations was evaluated ten times, such that the total number of
experiments per learning task was 750. We tested both domains, mountain car
and inverted pendulum, with a ∈ {3, 5, 9, 17} different actions each.
Our prime evaluation measure is the success rate (SR), i.e., the percentage
of learned sufficient policies. Following [3], we plot a cumulative distribution of
the success rates of all different parameter settings over a measure of learning
complexity, where each point (x, y) indicates the minimum complexity x needed
to reach a success rate of y. However, while [3] simply use the number of roll-outs
(i.e., the number of sampled states) as a measure of learning complexity, we use
the number of performed actions over all roll-outs, which is a more fine-grained
complexity measure. The two would coincide if all roll-outs are performed a con-
stant number of times. However, this is typically not the case, as some roll-outs
may stop earlier than others. Thus, we generated graphs by sorting all successful
runs over all parameter settings (i.e., runs which yielded a sufficient policy) in
increasing order regarding the number of applied actions and by plotting these
runs along the x-axis with a y-value corresponding to its cumulative success rate.
This visualization can be interpreted roughly as the development of the success
rate in dependence of the applied learning complexity.

4.3 Complete State Evaluations


Figure 1 shows the results for the inverted pendulum and the mountain car tasks.
One can clearly see that for an increasing number of actions, PBPI reaches a
significantly higher success rate than the two alternative approaches, and it typ-
ically also has a much faster learning curve, i.e., it needs to take fewer actions to
reach a given success rate. Another interesting point is that the maximum success
level decreases with an increasing number of actions for API and PAPI, but it
remains essentially constant for PBPI. Overall, these results clearly demonstrate
that the additional information about comparisons of lower-ranked action pairs,
which is ignored in API and PAPI, can be put to effective use when approximate
policy iteration is extended to use a label ranker instead of a mere classifier.

4.4 Partial State Evaluations


So far, based on the API strategy, we always evaluated all possible actions at
each state, and generated preferences from their pairwise comparisons. A possible
advantage of the preference-based approach is that it does not need to evaluate
all options at a given state. In fact, one could imagine selecting only two actions
for a state and compare them via roll-outs. While such a partial state evaluation
will, in general, not be sufficient for generating a training example for API, it
suffices to generate a training preference for PBPI. Thus, such a partial PBPI
strategy also allows for considering a far greater number of states, using the
same number of roll-outs, at the expense that not all actions of each state will
[Figure 1: eight panels plotting the cumulative success rate against the number of actions taken, for API, PAPI, and PBPI.]

Fig. 1. Comparison of API, PAPI and PBPI for the inverted pendulum task (left) and the mountain car task (right). The number of actions is increasing from top to bottom.
[Figure 2: two panels plotting the success rate against the number of actions (left) and the number of preferences (right).]

Fig. 2. Comparison of complete state evaluation (PBPI) with partial state evaluation in three variants (PBPI-1, PBPI-2, PBPI-3)

be explored. Such an approach may thus be considered to be orthogonal to recent approaches for roll-out allocation strategies [3,6].
In order to investigate this effect, we also experimented with three partial
variants of PBPI, which only differ in the number of states that they are allowed
to visit. The first (PBPI-1) allows the partial PBPI variant to visit only the
same total number of states as PBPI. The second (PBPI-2) adjusts the number
of visited sample states by multiplying it by k/2, to account for the fact that the partial variant performs only 2 action roll-outs in each state, as opposed to k action roll-outs for PBPI. Thus, the total number of action roll-outs in PBPI and PBPI-2 is constant. Finally, for the third variant (PBPI-3), we assume that the number of preferences that are generated from each state is constant. While PBPI generates up to k(k−1)/2 preferences from each visited state, partial PBPI generates only one preference per state, and is thus allowed to visit k(k−1)/2 times as many states.
Figure 2 shows the results for the inverted pendulum with five different actions
(the results for the other problems are quite similar). The left graph shows the
success rate over the total number of taken actions, whereas the right graph
shows the success rate over the total number of training preferences. From the
right graph, no clear differences can be seen. In particular, the curves for PBPI-
3 and PBPI almost coincide. This is not surprising, because both generate the
same number of preference samples, albeit for different random states. However,
the left graph clearly shows that the exploration policies that do not generate all
action roll-outs for each state are more wasteful with respect to the total number
of actions that have to be taken in the roll-outs. Again, this is not surprising,
because evaluating all five actions in a state may generate up to 10 preferences
for a single state, or, in the case of PBPI-2, only a total of 5 preferences if 2
actions are compared in each of 5 states.
Nevertheless, the results demonstrate that partial state evaluation is feasible.
This may form the basis of novel algorithms for exploring the state space. For
example, it was suggested that a policy-based generation of states may be prefer-
able to a random selection [4]. While this may clearly lead to faster convergence
in some domains, it may also fail to find optimal solutions in other cases [11].
Selecting a pair of actions and following the better one may be a simple but
effective way of trading off exploration and exploitation for state sampling. We
are currently working on a more elaborate investigation of this issue.

5 Case Study II: Learning from Qualitative Feedback


In a second experiment, we applied preference-based reinforcement learning to
a simulation of optimal therapy design in cancer treatment, using a model that
was recently proposed in [17]. In this domain, it is arguably more natural to
define preferences that induce a partial order between states than to define an
artificial numerical reward function that induces a total order between states.

5.1 Cancer Clinical Trials Domain


The model proposed in [17] captures a number of essential factors in cancer treat-
ment: (i) the tumor growth during the treatment; (ii) the patient’s (negative)
wellness, measured in terms of the level of toxicity in response to chemother-
apy; (iii) the effect of the drug in terms of its capability to reduce the tumor
size while increasing toxicity; (iv) the interaction between the tumor growth and
patient’s wellness. The two state variables, the tumor size S and the toxicity
X, are modeled using a system of difference equations: St+1 = St + ΔSt and
Xt+1 = Xt + ΔXt , where the time variable t denotes the number of months after
the start of the treatment and assumes values t = 0, 1, . . . , 6. The terms ΔS and
ΔX indicate the increments of the state variables that depend on the action,
namely the dosage level D:

ΔSt = [a1 · max(Xt, X0) − b1 · (Dt − d1)] × 1(St > 0)
ΔXt = a2 · max(St, S0) + b2 · (Dt − d2)        (2)
These changing rates produce a piecewise linear model over time. We fix the
parameter values following the recommendation of [17]: a1 = 0.15, a2 = 0.1, b1 =
b2 = 1.2 and d1 = d2 = 0.5. By using the indicator term 1(St > 0), the model
assumes that once the patient has been cured, namely the tumor size is reduced
to 0, there is no recurrence. Note that this system does not reflect a specific
cancer but rather models the generic development of the chemotherapy process.
The possible death of a patient in the course of a treatment is modeled by
means of a hazard rate model. For each time interval (t − 1, t], this rate is
defined as a function of tumor size and toxicity: λ(t) = exp (c0 + c1 St + c2 Xt ),
where c0 , c1 , c2 are cancer-dependent constants. Again following [17], we let c0 =
−4, c1 = c2 = 1. By setting c1 = c2 , the tumor size and the toxicity have an
equally important influence on patient’s survival. The probability of the patient’s
death during the time interval (t − 1, t] is calculated as

Pdeath = 1 − exp( − ∫_{t−1}^{t} λ(x) dx ) .
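With the parameter values given above, the patient model can be simulated in a few lines of Python. The sketch assumes that the hazard rate is constant on each unit interval, so the integral reduces to λ(t), and clamps the tumor size at 0 to implement the no-recurrence assumption; all names are illustrative.

import math, random

# parameter values recommended in [17]
a1, a2, b1, b2, d1, d2 = 0.15, 0.1, 1.2, 1.2, 0.5, 0.5
c0, c1, c2 = -4.0, 1.0, 1.0

def simulate_patient(S0, X0, doses):
    # Simulate tumor size S and toxicity X over the treatment months for a
    # given dosage sequence; returns the trajectory and a death flag.
    S, X = S0, X0
    trajectory = [(S, X)]
    for D in doses:
        dS = (a1 * max(X, X0) - b1 * (D - d1)) * (1 if S > 0 else 0)
        dX = a2 * max(S, S0) + b2 * (D - d2)
        S, X = max(S + dS, 0.0), X + dX   # a cured patient (S = 0) stays cured
        trajectory.append((S, X))
        p_death = 1.0 - math.exp(-math.exp(c0 + c1 * S + c2 * X))
        if random.random() < p_death:     # patient dies in this interval
            return trajectory, True
    return trajectory, False

For instance, simulate_patient(1.5, 0.5, [1.0, 0.7, 0.1, 0.7, 1.0, 0.7]) reproduces the setting of Fig. 3 below.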
[Figure 3: plot of tumor size and toxicity over months 0–6, with the randomly selected dosage levels 1.0, 0.7, 0.1, 0.7, 1.0, 0.7.]

Fig. 3. Illustration of the simulation model showing the patient’s status during the treatment. The initial tumor size is 1.5 and the initial toxicity is 0.5. On the x-axis is the month with the corresponding dosage level the patient receives. The dosage levels are selected randomly.

5.2 A Preference-Based Approach


The problem is to learn an optimal treatment policy π mapping states (S, X)
to actions in the form of a dosage level D, where the dosage level is a num-
ber between 0 and 1 (minimum and maximum dosage, respectively). In [17],
the authors tackle this problem by means of RL, and indeed obtained interest-
ing results. However, using standard RL techniques, there is a need to define a
numerical reward function depending on the tumor size, wellness, and possibly
the death of a patient. More specifically, four threshold values and eight util-
ity scores are needed, and the authors themselves notice that these quantities
strongly influence the results.
We consider this as a key disadvantage of the approach, since in a medical
context, a numerical function of that kind is extremely hard to specify and will
always be subject to debate. Just to give a striking example, the authors defined
a negative reward of −60 for the death of a patient, which, of course, is a rather
arbitrary number. As an interesting alternative, we tackle the problem using a
more qualitative approach.
To this end, we treat the criteria (tumor size, wellness, death) independently
of each other, without the need to aggregate them in a mathematical way; in fact,
the question of how to “compensate” or trade off one criterion against another
one is always difficult, especially in fields like medicine. Instead, we compare
two policies π and π′ as follows: π ≻ π′ if the patient survives under π but not under π′, and both policies are incomparable if the patient survives neither under π nor under π′. Otherwise, if the patient survives under both policies, let CX denote the maximal toxicity during the 6 months of treatment under π and, correspondingly, C′X under treatment π′. Likewise, let CS and C′S denote the respective size of the tumor at the end of the therapy. Then, we define preference via Pareto dominance as

π ⪰ π′ ⇔ (CX ≤ C′X) and (CS ≤ C′S)        (3)
It is important to remark that ⪰ thus defined, as well as the induced strict order ≻, are only partial order relations. In other words, it is thoroughly possible that two policies are incomparable. For our preference learning framework, this means that fewer pairwise comparisons may be generated as training examples. However,
in contrast to standard RL methods as well as the classification approach of [11],
this is not a conceptual problem. In fact, since these approaches are based on a
numerical reward function and, therefore, implicitly assume a total order among
policies (and actions in a state), they are actually not applicable in the case of
a partial order.

5.3 Experimental Setup and Results


For training, we generate 1000 patients at random. That is, we simulate 1000
patients experiencing the treatment based on model (2). The initial state variables of each patient, S0 and X0, are generated independently and uniformly from (0, 2).
Then, for the following 6 months, the patient receives a monthly chemotherapy
with a dosage level taken from one of four different values (actions) 0.1 (low), 0.4
(medium), 0.7 (high) and 1.0 (extreme), where 1.0 corresponds to the maximum
acceptable dose.2 As an illustration, Fig. 3 shows the treatment process of one
patient according to model (2) under a randomly selected chemotherapy policy.
The patient’s status is clearly sensitive to the amount of received drug. When the dosage level is too low, the tumor size grows towards a dangerous level, while with a very high dosage level, the toxicity level will strongly affect the patient’s
wellness. The preferences are generated via the Pareto dominance relation (3) using roll-outs. We use LPC and choose a linear classifier, logistic regression, as the base learner (again using the Weka implementation). The policy iteration stops when (i) the difference between two consecutive learned policies is smaller than a pre-defined threshold, or (ii) the number of policy iterations reaches 10.
For testing, we further generate 200 virtual patients. In Fig. 4, the average
values of the two criteria (CX , CS ) are shown as points for the constant policies
low, medium, high, extreme (i.e., the policies prescribing a constant dosage re-
gardless of the state). As can be seen, all four policies are Pareto-optimal, which
is hardly surprising in light of the fact that toxicity and tumor size are conflicting
criteria: A reduction of the former tends to increase the latter, and vice versa.
The figure also shows the convex hull of the Pareto-optimal policies.
Finally, we add the results for two other policies, namely the policy learned by
our preference-based approach and a random policy, which, in each state, picks a
dose level at random. Although these two policies are again both Pareto-optimal,
it is interesting to note that our policy is outside the convex hull of the constant
policies, whereas the random policy falls inside. Recalling the interpretation of
the convex hull in terms of randomized strategies, this means that the random
policy can be outperformed by a randomization of the constant policies, whereas
our policy cannot.
2 We exclude the value 0, as it is common practice to let the patient keep receiving a certain level of the chemotherapy agent during the treatment in order to prevent the tumor from relapsing.
[Figure 4: scatter plot of toxicity against tumor size for the six policies.]

Fig. 4. Illustration of patients’ status under different treatment policies. On the x-axis is the tumor size after 6 months. On the y-axis is the highest toxicity during the 6 months. From top to bottom: Extreme dose level (1.0), high dose level (0.7), random dose level, learned dose level, medium dose level (0.4), low dose level (0.1). The values are averaged from 200 patients.

6 Conclusions
The goal of this work is to make first steps towards lifting conventional rein-
forcement learning into a qualitative setting, where reward is not available on
an absolute, numerical scale, but where comparative reward functions can be
used to decide which of two actions is preferable in a given state. To cope with
this type of training information, we proposed a preference-based extension of
approximate policy iteration. Whereas the original approach essentially reduces
reinforcement learning to classification, we tackle the problem by means of a
preference learning method called label ranking. In this setting, a policy is rep-
resented by a ranking function that maps states to total orders of all available
actions.
To demonstrate the feasibility of this approach, we performed two case studies.
In the first study, we showed that additional training information about lower-
ranked actions can be successfully used for improving the learned policies. The
second case study demonstrated one of the key advantages of a qualitative policy
iteration approach, namely that a comparison of pairs of actions is often more
feasible than the quantitative evaluation of single actions.
The work reported in this paper provides a point of departure for extensions
along several lines. For example, while the setting we assumed is not uncommon
in the literature, the existence of a generative model is a strong assumption.
In future work, we will therefore focus on generalizing our approach toward an
on-line learning setting with on-policy updates.

Acknowledgments. We would like to thank the Frankfurt Center for Scientific


Computing for providing computational resources. This research was supported
by the German Science Foundation (DFG).
References
1. Barto, A.G., Sutton, R.S., Anderson, C.: Neuron-like elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics 13, 835–846 (1983)
2. Bhatnagar, S., Sutton, R.S., Ghavamzadeh, M., Lee, M.: Natural actor-critic algorithms. Automatica 45(11), 2471–2482 (2009)
3. Dimitrakakis, C., Lagoudakis, M.G.: Rollout sampling approximate policy iteration. Machine Learning 72(3), 157–171 (2008)
4. Fern, A., Yoon, S.W., Givan, R.: Approximate policy iteration with a policy language bias: Solving relational Markov decision processes. Journal of Artificial Intelligence Research 25, 75–118 (2006)
5. Fürnkranz, J., Hüllermeier, E. (eds.): Preference Learning. Springer, Heidelberg (2010)
6. Gabillon, V., Lazaric, A., Ghavamzadeh, M.: Rollout allocation strategies for classification-based policy iteration. In: Auer, P., Kaski, S., Szepesvàri, C. (eds.) Proceedings of the ICML 2010 Workshop on Reinforcement Learning and Search in Very Large Spaces (2010)
7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. SIGKDD Explorations 11(1), 10–18 (2009)
8. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise preferences. Artificial Intelligence 172, 1897–1916 (2008)
9. Kersting, K., Driessens, K.: Non-parametric policy gradients: A unified treatment of propositional and relational domains. In: Cohen, W.W., McCallum, A., Roweis, S.T. (eds.) Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pp. 456–463. ACM, Helsinki (2008)
10. Konda, V.R., Tsitsiklis, J.N.: On actor-critic algorithms. SIAM Journal of Control and Optimization 42(4), 1143–1166 (2003)
11. Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: Leveraging modern classifiers. In: Fawcett, T.E., Mishra, N. (eds.) Proceedings of the 20th International Conference on Machine Learning (ICML 2003), pp. 424–431. AAAI Press, Washington, DC (2003)
12. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44 (1988)
13. Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: Solla, S.A., Leen, T.K., Müller, K.-R. (eds.) Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 1057–1063. MIT Press, Denver (1999)
14. Vembu, S., Gärtner, T.: Label ranking algorithms: A survey. In: Fürnkranz and Hüllermeier [5], pp. 45–64
15. Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
16. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256 (1992)
17. Zhao, Y., Kosorok, M., Zeng, D.: Reinforcement learning design for cancer clinical trials. Statistics in Medicine 28, 3295–3315 (2009)
Learning Recommendations in Social Media Systems by
Weighting Multiple Relations

Boris Chidlovskii

Xerox Research Centre Europe


6, chemin de Maupertuis, F–38240 Meylan, France

Abstract. We address the problem of item recommendation in social media sharing systems. We adopt a multi-relational framework capable of integrating the different entity types available in the social media system and the relations between these entities. We then model different recommendation tasks as weighted random walks
in the relational graph. The main contribution of the paper is a novel method
for learning the optimal contribution of each relation to a given recommenda-
tion task, by minimizing a loss function on the training dataset. We report results
of the relation weight learning for two common tasks on the Flickr dataset, tag
recommendation for images and contact recommendation for users.

1 Introduction
Social media sharing sites like Flickr and YouTube contain billions of images and videos uploaded and annotated by millions of users. Tagging the media objects has proven to be a powerful mechanism capable of improving media sharing and search facilities [19].
Tags play the role of metadata; however, they come in a free form reflecting the individ-
ual user’s choice. Despite this freedom of tag choice, some common usage topics can
emerge when people agree on the semantic description of a group of objects.
The wealth of annotated and tagged objects on the social media sharing sites can
form a solid base for reliable tag recommendation [10]. There are two frequent modes of using the recommendation systems on media sharing sites. In the bootstrap
mode, the recommender system suggests the most relevant tags for newly uploaded
media objects by observing the characteristics of the objects as well as their social
context. In the query mode, an object is annotated with one or more tags and the system
attempts to suggest to the user how to extend the tag set. In both modes, the system assists the user in the annotation and helps expand the coverage of tags on objects.
Modern social media sharing sites attract users’ attention by offering a multitude of
valuable services. They organize users’ activities around entities of different types, such as images, tags, users and groups, and relations between them. The tag recommendation for images or videos is one possible scenario in such a relational world; other scenarios may concern user contact recommendation, group recommendation, etc. This paper addresses the recommendation tasks in the multi-relational setting, and our target is to determine the optimal combination of the available relations for a given recommendation task.
In the multi-relational setting, entities of different types get connected and form two
types of relations. The first type is the relations between entities of the same type,
such as a user-to-user relation of friendship or a tag-to-tag relation of co-occurrence. The second type of relations connects entities of two different types, such as the user-to-image relation of ownership. All available relations can be represented through the notion of a relational graph.
In such a representation, the recommendation task corresponds to a prediction query
on an unfolding of the relational graph. In the following, we will use Flickr as the example of a social media sharing system. We will address the two most popular recommendation tasks on the Flickr dataset, tag recommendation for images and contact
recommendation for users.
We show that a recommendation task can be modelled as a random walk on the
relational graph. To provide a better response to a recommendation query, we compose
a weighted relational graph where the contribution of each relation is expressed by
a non-negative weight. We then show how to efficiently learn the optimal weights of
relational random walks from the training dataset.
We adopt the Markov chain model for walking on the relational graph. Any relation
can be projected on the recommendation task and thus contribute to the random walk
with its proper non-negative weight. The relational random walk is therefore expressed
as an additive function on relation weights.
We then develop a technique for learning the weights of relations contributing to
the weighted random walk. Stated as a supervised learning problem, it attempts to minimize a loss function over a training set of examples, where the loss function is a metric between the estimated and true probability distributions on the recommendation objects.
We use the gradient descent methods to solve the optimization problem and to effec-
tively estimate the optimal relation weights. The main advantage is that the gradient and
Hessian matrix of loss function can be directly integrated in the random walk algorithm.
The remainder of the paper is organized as follows. Section 2 gives a short review of the prior art on tag recommendation systems and random walks. Section 3 presents the relational graph for entity types and their relations, recalls the Markov chain model, and presents a weighted extension of random walks on relational graphs. The core contribution, the weight learning of the random walks from available observations, is presented in Section 4. Section 5 reports the evaluation results of two recommendation tasks on the Flickr dataset. Section 6 concludes the paper.

2 Prior Art
Existing methods for tag recommendation often target the social network as a source
of collective knowledge. Recommender systems based on collective social knowledge
have been proven to provide relevant suggestions [9,15,18]. Some of these systems ag-
gregate the annotations used in a large collection of media objects independently of the
users that annotate them [18]. In other systems, recommendations can be personalized
by using the annotations for the images of a single user [9]. Both approaches come with
their advantages and drawbacks. When the recommendations are based on collective
knowledge the system can make good recommendations on a broad range of topics,
but is likely to miss some recommendations that are particularly relevant in a personal
context.
Random walks on weighted graphs are a well established model [8,13]. Random walks have been used for ranking Flickr tags in [14], where the walks are executed on one relation only, such as an image-to-tag graph for the tag recommendation. Similar techniques have been used in [2] for analysing the user-video graph on the YouTube site and for providing personalized suggestions.
Many recommendation algorithms are inspired by the analysis of the MovieLens
collection and Netflix competition; they focus only on using ratings information, while
disregarding information about the context of the recommendation process. With the
growth of social sites, several methods have been proposed for predicting tags and
adding contextual information from social networks to improve the performance of
recommendation systems.
One common approach is to address the problem as multi-label classification in a (multi-)relational graph. A technique for label propagation on the relational graph is proposed in [17]. It develops an iterative algorithm for the inference and learning for
the multi-label and multi-relational classification. Inference is performed iteratively by
propagating scores according to the multi-relational structure of the data. The method
extends the techniques of collective classification [11] in order to handle multiple rela-
tions and to perform multi-label classification in multi-graphs.
The concept of relational graph for integrating contextual information is used in [3].
It makes it straightforward to include different types of contextual information. The
recommendation algorithm in [3] models the browsing process of a user on a movie
database website by taking non-weighted random walks over the relational graph.
The approach closest to ours is [7]; it tries to automatically learn a ranking function for searching in typed (entity-relation) graphs. User input is in the form of a partial preference order between pairs of nodes, associated with a query. The node pairs are instances in learning the ranking function. For each pair the method assigns a label representing the relative relevance. It then trains a classification model with the labelled data and makes use of existing classification methodologies like SVM, Boosting, etc. [5].
Pairwise approaches to learning to rank are known to have several important limitations [6]. First, the objective of learning is formalized as minimizing errors in classifi-
cation of node pairs, rather than minimizing errors in node ranking. Second, the training
process is computationally costly, as the number of document pairs is very large. Third,
the number of generated node pairs varies largely from query to query; this results in
training a model biased toward queries with more node pairs.
All three issues are particularly critical in the social networking environment. One
alternative to the pairwise approach is based on the listwise approach, where node lists
and not node pairs are used as instances in learning [6].
What we propose in this paper is a probabilistic method to calculate the listwise loss function for the recommendation tasks. We transform both the scores of nodes in the relational graph assigned by a ranking function and the node annotations by humans into probability distributions. We can then utilize any metric between probability distributions as the loss function. In other words, our approach represents a listwise alternative to the pairwise one in [7] and shows its application to recommendation tasks on social media sites.
Finally, the weight learning for random walks in social networks has recently been addressed in [1]. In the link prediction setting, the authors try to infer which interactions
among existing network members are likely to occur in the near future. They develop
an algorithm based on supervised random walks that combines the information from the
network structure with node and edge level attributes. To guide a random walk on the
graph, they formulate a supervised learning task where the goal is to learn a function
that assigns strengths to edges in the network such that a random walker is more likely
to visit the nodes to which new links will be created in the future.

3 Relational Graph
The relational graph aims at representing all available entity types and relations between
them in one uniform way. The graph is given by G = (E, R), where an entity type
ek ∈ E is represented as a node and a relation rkl ∈ R between entities of types ek and el is represented as an edge between the two nodes.
The relational graph for (a part of) the Flickr web site1 is sketched in Figure 1. Nodes represent five entity types, E = {image, user, tag, group, comment}. Edges represent relations between entities of the same or different types. One example is the relation tagged with between image and tag entities, indicating which images are tagged with which tags. Another example is the relation contact between user entities, which encodes the list of user contacts. The Flickr example reports one relation between entities of ek
and el . In the general case, the model can accommodate any number of relations be-
tween any two entity types. Another advantage of the relational graph is its capacity to
add any new type of contextual information.
The relational setting reflects the variety of user activities and services on Flickr and
similar sites. Users can upload their images and share them with other users, partici-
pate in different interest groups, browse images of other users, comment and tag them,
navigate through the image collection by tags, groups, etc.
Each individual relation rkl ∈ R is expected to be internally homogeneous. In other
words, bigger values in the relation tend to have a higher importance. Instead, different
relations may have different importance for a given recommendation task. For example,
the relation annotated with(image, tag) is expected to be more important for the tag rec-
ommendation task than the relation member(user, group). On the other side, for the
user contact recommendation, the importance of these two relations may be opposite.
In the following, we model the importance of a relation toward a given recommendation
task with a non-negative weight.
Every relation rkl ∈ R is unfolded (instantiated) in the form of a matrix Akl = {aij^kl}, i = 1, . . . , |ek|, j = 1, . . . , |el|, where aij^kl indicates the relation between entity i ∈ ek and entity j ∈ el. Values are binary (for example, in the tagged with relation, aij = 1 if image i is tagged with tag j, 0 otherwise). In the general case, the aij are non-negative real values. Any matrix Akl is thus non-negative. For the needs of random walks, we assume Akl is a probability transition matrix, which can be obtained by row normalization.

1
http://www.flickr.com
Fig. 1. Relational graph for (a part of) Flickr web site

Random Walks

We combine the relations that induce a probability distribution over entities by learn-
ing a Markov chain model, such that its stationary distribution is a good model for
a specific prediction task. Constructing Markov chains whose stationary distributions
are informative has been used in multiple applications, including the Google PageRank
algorithm [16] and HITS-like algorithms [4].
A Markov chain over a set of states S is specified by an initial distribution P0 over
S, and a set of state transition probabilities P (St |St−1 ). A Markov chain defines a
distribution over sequences of states, via a generative process in which the initial state
S0 is first sampled according to distribution P0 , and then states St (for t = 1, 2, . . .)
are sampled according to the transition probabilities. The stationary distribution of the
Markov chain is given by π(s) = lim_{t→∞} P(St = s), if the limit exists.
To ensure that the Markov chain has a unique stationary distribution, the process can
be reset with a probability α > 0 according to the initial state distribution P0 . In practice
this prevents the chain from getting stuck in nodes having no transitions and small loops.
Having the Markov chain S0 , S1 , . . . with the initial state S0 distributed according to P0 ,
state transitions given by P and resetting probability α, it is straightforward to express
the stationary distribution π as follows:

π = α ∑_{t=0}^{∞} (1 − α)^t P0 P^t .        (1)

Equation (1) can be used to efficiently compute π. Because terms corresponding to large
t have very little weight (1 − α)t , when computing π, this sequence may be truncated
after the first few (on the order 1/α) terms without incurring significant error.
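A minimal numpy sketch of this truncated computation is given below; P0 is a row vector, A a row-stochastic transition matrix, and the names are illustrative.

import numpy as np

def stationary_distribution(P0, A, alpha, n_terms=None):
    # Approximate pi = alpha * sum_t (1 - alpha)^t P0 A^t of equation (1)
    # by truncating the series; a few multiples of 1/alpha terms suffice.
    if n_terms is None:
        n_terms = int(5.0 / alpha)
    pi = np.zeros_like(P0)
    term = alpha * P0                      # t = 0 term of the series
    for _ in range(n_terms):
        pi += term
        term = (1.0 - alpha) * (term @ A)  # next term of the series
    return pi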

Weighted Relational Random Walks

In this section we extend the Markov chain model to the unfolded relational graph. Assume the relational graph includes b entity types, e1, . . . , eb. The total number of entities is denoted N = ∑_{k=1}^{b} |ek|. The unfolded relational graph is composed of b² blocks, one block for each (ek, el) pair, k, l = 1, . . . , b. Available relations fill up some blocks; other blocks can be left empty or filled up with composed relations using
the relation transitivity rule Akl = Akm Aml , where Akm and Aml are basic or other
composed relations. Note that there might exist several ways to compose a relation; a
particular choice often depends on the recommendation task.
In the Flickr relational graph (Figure 1), there are seven basic relations, which fill
up the corresponding blocks and can be used to compose other relations. The tag co-
occurrence relation is an example of a composed relation. If matrix AIT describes the relation tagged with (image, tag), the tag co-occurrence matrix can be obtained by ATT = AIT^T AIT (with AIT^T the transpose of AIT). Higher values in ATT indicate that more images are tagged with a given tag pair.
When a random walk moves through relation rkl ∈ R between entities of types ek and el, the contribution of rkl to the walk is expressed by a non-negative weight wkl. The random walk is therefore performed over the matrix A which is a weighted sum over relations, A = ∑_{kl} wkl Akl. Matrix A is ensured to be a probability transition matrix if ∑_l wkl = 1, k = 1, . . . , b. Also, πj denotes the projection of the stationary distribution π on entity type j.
To initiate the random walk, the initial distribution P0 is composed of b vectors δj, j = 1, . . . , b, with all elements relevant to the query. For the tag recommendation task, we compose three vectors δI, δU, and δT, for images, users and tags. When recommending tags for image i, the i-th element in the image vector is 1, with all other elements δIj, j ≠ i, set to 0. Similarly, in the user vector δU only the owner u of image i is set to 1. The default choice for the tag vector δT is the all-ones vector. We however prefer to add a bias toward the user’s tag vocabulary and preferences. We set the t-th element of vector δT to the log value of the frequency of using tag t by user u in the collection, δTt = log(AUT(u, t) + 1). Then the initial distribution P0 is defined as the normalization of the composed vector (δ1, δ2, . . . , δb).
If weights wkl are known or recommended by an expert, equation (1) can be used
for estimating the stationary distribution π and its projection πj . If the weights are
unknown a priori, we propose a method which determines such values for weights wkl
which minimize a loss function on the training set.
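As an illustration, the weighted walk matrix and the initial distribution for a tag recommendation query can be assembled as follows; the sketch assumes that each relation block is already embedded into the full N × N layout, which is a simplification.

import numpy as np

def weighted_matrix(blocks, weights):
    # A = sum_kl w_kl A_kl; blocks maps a type pair (k, l) to a full-size,
    # row-normalized relation matrix, weights to its non-negative weight.
    A = None
    for key, A_kl in blocks.items():
        A = weights[key] * A_kl if A is None else A + weights[key] * A_kl
    return A

def initial_distribution(i, u, A_UT, n_images, n_users):
    # P0 for recommending tags for image i owned by user u: indicator vectors
    # for the image and its owner, plus a log-frequency bias on the user's tags.
    delta_I = np.zeros(n_images)
    delta_I[i] = 1.0
    delta_U = np.zeros(n_users)
    delta_U[u] = 1.0
    delta_T = np.log(A_UT[u, :] + 1.0)   # log tag-usage frequencies of user u
    P0 = np.concatenate([delta_I, delta_U, delta_T])
    return P0 / P0.sum()                 # normalize the composed vector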

4 Weight Learning

To learn relation weights in the random walk, we approximate the stationary distribution π with the truncated version and look for such an instantiation of the Markov model whose weights minimize the prediction error on a training set T. We express the optimization problem on weights wkl as the minimization of a loss function on the training set.
The weighted random walk defined by a Markov chain query produces a probabil-
ity distribution. Nodes having more links (with higher weights) with query nodes will
accumulate more probability than nodes having fewer links and lower weights.

Probability Estimation Loss

We define a scoring function H that assigns a [0, 1] value to an entity of type ej.


The task is to learn the function H from a set of known relations between entities.
The function H estimates the probability p for a given object i. Let y denote the true
probability of i and let p denote its estimate by H. The price we pay when predicting p in
place of y is defined as a loss function l(y, p). We use the square loss2 between y and p
in the following form:

lsq(y, p) = y(1 − p)² + (1 − y)p² .        (2)



Note that its first and second partial derivatives in p are ∂lsq(y, p)/∂p = 2(p − y) and ∂²lsq(y, p)/∂p² = 2, respectively.

Multi-label Square Loss

Without loss of generality, in the following sections we assume to cope with the tag recommendation task. Assume the tag set includes L tags. For a given image, let YB denote a binary vector YB = (y1, . . . , yL) where yi is 1 if the image is tagged with tag i, 0 otherwise, i = 1, . . . , L. The probability distribution over the tag set is Y = (y1, . . . , yL) where yi is 0 or 1/|YB|, i = 1, . . . , L.
Let P denote an estimated tag probability distribution, P = (p1, . . . , pL), where ∑_{i=1}^{L} pi = 1. To measure the loss of using the estimated distribution P in the place of the true distribution Y, we use a loss function which is symmetric in y and p and equals 0 only if y = p. The best candidate is the multi-label square loss defined as follows:

Lsq(Y, P) = ∑_{i=1}^{L} lsq(yi, pi) .        (3)

For the square loss function Lsq, we take its gradient as follows:

∇Lsq(Y, P) = ( ∂lsq(yi, pi)/∂pi )_{i=1,...,L} = ( 2(pi − yi) )_{i=1,...,L} ,

and similarly we have

(∂/∂P) ∇Lsq(Y, P) = ( ∂²lsq(yi, pi)/∂pi² )_{i=1,...,L} = 2·1 ,

where 1 denotes the all-ones vector.
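In numpy, this loss and its gradient with respect to P take only a few lines; the vectorized form is a choice of this sketch, consistent with (2) and the derivatives above.

import numpy as np

def multilabel_square_loss(Y, P):
    # Square loss summed over the label set, with its gradient in P.
    loss = np.sum(Y * (1.0 - P) ** 2 + (1.0 - Y) * P ** 2)
    grad = 2.0 * (P - Y)   # component-wise derivative of l_sq in p
    return loss, grad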

If we dispose of a training set T of images with the tag probability distributions Y, we try to define a scoring function H which minimizes the empirical loss over T. It is defined as follows:

Loss(H) = (1/|T|) ∑_{j∈T} Lsq(Yj, Pj) ,        (4)

where Yj is the true probability vector for image j and Pj is the predicted probability distribution.
The weighted sum A = ∑_{kl} wkl Akl is composed over the relations between the b distinct entity types. Bigger values of wkl indicate a higher importance of the relation between entities ek and el to the task. We assume that the matrix Akl for relation rkl is normalized, with each row forming a state transition distribution. The mixture matrix A satisfies the same condition if the
2
Other loss functions will be equally tested in the evaluation section.

The mixture matrix A satisfies the same condition if the constraint Σ_l wkl = 1, wkl ≥ 0 holds. The matrix A is, however, not required to be symmetric, so wkl ≠ wlk in the general case. Thus we obtain the following optimization problem:
min_{wkl} Loss(H)
s.t. 0 ≤ wkl ≤ 1,
     Σ_l wkl = 1, k = 1, . . . , b.   (5)

The constrained optimization problem (5) is transformed into an unconstrained one by introducing variables vkl, k, l = 1, . . . , b, and representing wkl = e^{vkl} / Σ_m e^{vkm}. The problem constrained on wkl becomes unconstrained on vkl.³
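A minimal sketch of this softmax reparameterization (NumPy; the function name is ours):

```python
import numpy as np

def weights_from_v(V):
    # Row-wise softmax: w_kl = exp(v_kl) / sum_m exp(v_km), which guarantees
    # 0 <= w_kl <= 1 and sum_l w_kl = 1 for every row k.
    E = np.exp(V - V.max(axis=1, keepdims=True))  # shift rows for numerical stability
    return E / E.sum(axis=1, keepdims=True)
```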
We use the L-BFGS method to solve the problem numerically. The L-BFGS algorithm is a member of the broad family of quasi-Newton optimization methods, which approximate the well-known Newton's method, a class of hill-climbing optimization techniques that seek a stationary point of a (twice continuously differentiable) function. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is one of the most popular quasi-Newton methods.
L-BFGS uses a limited-memory variation of BFGS to approximate the inverse Hessian matrix. Unlike the original BFGS method, which stores a dense n × n approximation, L-BFGS stores only a few vectors that represent the approximation implicitly. We use the open source implementation of the L-BFGS routine available in Python via the SciPy library. It performs an iterative scheme along all wkl dimensions, using the gradient ∇Loss(H) and the inverse of the Hessian matrix H of Loss(H) at W, where W = {wkl}, k, l = 1, . . . , b. This produces an iterative sequence of approximate solutions W0, W1, . . . which converges to a (generally local) optimum point.
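A hedged sketch of the optimization driver follows. Here loss_and_grad_w stands in for an implementation of Algorithm 1 below (replaced by a toy quadratic so the snippet runs as is), weights_from_v is the softmax of the previous sketch, and the chain rule through the softmax converts the gradient in w into a gradient in v:

```python
import numpy as np
from scipy.optimize import minimize

b = 3                                      # number of entity types (example value)
W_target = np.full((b, b), 1.0 / b)

def loss_and_grad_w(W):
    # Toy stand-in for Algorithm 1: returns Loss(H) and dLoss/dW.
    return 0.5 * np.sum((W - W_target) ** 2), W - W_target

def loss_and_grad_v(v_flat):
    V = v_flat.reshape(b, b)
    W = weights_from_v(V)                  # softmax reparameterization (see above)
    loss, grad_w = loss_and_grad_w(W)
    grad_v = np.empty_like(V)
    for k in range(b):
        # Jacobian of a softmax row: dw_kl / dv_km = w_kl (1[l = m] - w_km)
        J = np.diag(W[k]) - np.outer(W[k], W[k])
        grad_v[k] = J @ grad_w[k]
    return loss, grad_v.ravel()

res = minimize(loss_and_grad_v, x0=np.zeros(b * b), jac=True, method='L-BFGS-B')
W_opt = weights_from_v(res.x.reshape(b, b))
```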
In order to deploy the quasi-Newton methods for solving the optimization problem (5), we first obtain the derivatives of the loss function with respect to the variables wkl:

∂Loss(H)/∂wkl = (1/|T|) Σ_{j∈T} ∇Lsq(Yj, Pj) · ∂Pj/∂wkl,   (6)

where Pj = α Σ_{t=1}^{k} (1 − α)^t P0^j A^t and P0^j is the initial probability distribution for image j.
The power series A^t, t = 1, 2, . . . are the only terms in Pj depending on wkl. To compute the derivative of a composite function, we use the chain rule for matrices. We obtain a recursive formula for the first derivatives at step t:

∂A^t/∂wkl = ∂(A^{t−1}A)/∂wkl = (∂A^{t−1}/∂wkl) A + A^{t−1} Akl.   (7)

Algorithm 1 presents meta-code for the loss function Loss(H) and its gradient ∇Loss(H), needed for solving the optimization problem (5) with a quasi-Newton method.

³ A regularization term on vkl can be added to the objective function.

Algorithm 1. Loss function and its gradient

Require: Training dataset T, the restarting probability α
Require: Relation matrices Akl, weights (wkl), k, l = 1, . . . , b
Ensure: Loss function value Loss(H) and the gradient ∇Loss(H)
1: A = Σ_{kl} wkl Akl
2: for j = 1 to |T| do
3:   Set the initial distribution P0 for object j
4:   for t = 1 until convergence do
5:     Pj^t = αP0 + (1 − α)Pj^{t−1} A
6:     for all wkl do
7:       Update ∂Pj^t/∂wkl using (7)
8:     end for
9:   end for
10:  Set L(Yj, Pj) using (4)
11:  Loss(H) = Loss(H) + L(Yj, Pj)
12:  for all wkl do
13:    Set ∂L(Yj, Pj)/∂wkl using (6)
14:    ∂Loss(H)/∂wkl = ∂Loss(H)/∂wkl + ∂L(Yj, Pj)/∂wkl
15:  end for
16: end for
17: Return Loss(H) and the gradient ∇Loss(H) at (wkl), k, l = 1, . . . , b.
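A possible NumPy transcription of Algorithm 1 (dense matrices for readability; in practice the Akl are sparse, as discussed below; all names are ours):

```python
import numpy as np

def loss_and_gradient(A_blocks, W, P0_list, Y_list, alpha=0.05, n_steps=10):
    # A_blocks: dict (k, l) -> relation matrix A_kl embedded in the full n x n graph
    # W: b x b weight matrix (w_kl); P0_list, Y_list: per-object initial and true
    # distributions, each a length-n vector.
    b = W.shape[0]
    A = sum(W[k, l] * A_blocks[(k, l)] for k in range(b) for l in range(b))  # line 1
    loss, grad = 0.0, np.zeros_like(W)
    for P0, Y in zip(P0_list, Y_list):                       # lines 2-16
        P = P0.copy()                                        # line 3
        dP = {kl: np.zeros_like(P0) for kl in A_blocks}      # dP_j / dw_kl
        for _ in range(n_steps):                             # lines 4-9
            for kl in A_blocks:
                # rule (7), applied to P_{t-1} before it is overwritten (line 7)
                dP[kl] = (1 - alpha) * (dP[kl] @ A + P @ A_blocks[kl])
            P = alpha * P0 + (1 - alpha) * P @ A             # line 5
        loss += np.sum(Y * (1 - P) ** 2 + (1 - Y) * P ** 2)  # lines 10-11, eq. (4)
        g = 2.0 * (P - Y)                                    # gradient of L_sq in P
        for kl in A_blocks:                                  # lines 12-15, eq. (6)
            grad[kl] += g @ dP[kl]
    return loss / len(P0_list), grad / len(P0_list)
```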

Hessian Matrix

The Hessian matrix H of the loss function may help the quasi-Newton method converge faster to an optimum point. The matrix requires the mixed derivatives with respect to the weights wkl and wk′l′:

∂²Loss(H)/(∂wkl ∂wk′l′) = (1/|T|) Σ_{j∈T} (∂∇Lsq(Yj, Pj)/∂Pj) · ∂²Pj/(∂wkl ∂wk′l′).   (8)

The second derivatives in the Hessian matrix H can be developed by using the chain rule for matrices, similarly to (7). We obtain a recursive formula where the values at iteration t depend on the function values and its gradient at the previous step t − 1 of the random walk:

∂²A^t/(∂wkl ∂wk′l′) = ∂/∂wk′l′ [ (∂A^{t−1}/∂wkl) A + A^{t−1} Akl ]
  = (∂²A^{t−1}/(∂wkl ∂wk′l′)) A + (∂A^{t−1}/∂wkl) Ak′l′ + (∂A^{t−1}/∂wk′l′) Akl.   (9)

Algorithm 1 can be extended with the evaluation of the Hessian matrix in a straightforward way. The extension requires expanding lines 6 to 8 of Algorithm 1 with the evaluation of the second derivatives ∂²Pj/(∂wkl ∂wk′l′) for a given object j using rule (9). Then lines 12 to 15 should be expanded to accumulate the Hessian matrix for the entire set T using formula (8).

The power iteration for the gradient and the Hessian can be prohibitive for large full matrices. Luckily, the matrices Akl are all sparse, and the cost of a matrix product is proportional to the number of non-zero elements in the matrix. This ensures reasonable speed when performing truncated random walks on the relational graph.

5 Evaluation
In this section we describe the real dataset used in the experiments, the evaluation setting and the results of experiments for two different recommendation tasks.
Flickr dataset. In all evaluations, we use a set of Flickr data downloaded from the Flickr site with the help of social-network connectors (Flickr API) [12]. The API gives access to entities and relational data, including users, groups of interest, and images with associated comments and tags.
We test the method of relation weight learning for the random walks described in Section 4 on three entity types, E = {image, tag, user}. The three core relations associated with these types are the image-to-tag relation R_IT = tagged_with(image, tag), the user-to-image relation R_UI = owner(user, image) and the user-to-user relation R_UU = contact(user, user).
We use a fragment of the Flickr dataset with 100,000 images; these images are owned by 1,951 users and annotated with 127,182 different tags (113,426 tags after normalization). The matrices of the three core relations are sparse; their elements follow a power law distribution, which is very common in social networks. In the image-to-tag matrix, an image has between 1 and 132 tags, with an average of 5.65 tags per image. The user-to-image matrix contains 1 to 384 images per user, with an average of 27.32 images. The number of contacts in the user-to-user matrix is between 0 and 43, with an average of 1.24 contacts.⁴
We run a series of experiments on the dataset with two tasks: tag recommendation for images and contact recommendation for users. In either task, we use the core relations to compose other relations. The way the composed relations are generated depends on the recommendation task (a sparse-matrix sketch of the composition follows the list):

Tag recommendation: the image-to-image matrix is composed as A_II = A_IT A_IT^T. Other composed relations are the tag-to-tag matrix A_TT = A_IT^T A_IT and the user-to-tag matrix A_UT = A_UI A_IT, and their inversions.

User contact recommendation: the image-to-image matrix is composed as A_II = A_UI^T A_UI and the user-to-tag matrix is given by A_UT = A_UI A_IT.
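A sketch of this composition with scipy.sparse (the transposes are our reading of the superscripts lost in extraction, and row_normalize is our helper enforcing the transition-matrix condition):

```python
import numpy as np
import scipy.sparse as sp

def row_normalize(M):
    # Scale every non-empty row to sum to 1 (state transition distribution).
    d = np.asarray(M.sum(axis=1)).ravel()
    d[d == 0] = 1.0
    return sp.diags(1.0 / d) @ M

# Toy core relations: A_IT is images x tags, A_UI is users x images.
A_IT = sp.random(1000, 500, density=0.01, format='csr')
A_UI = sp.random(200, 1000, density=0.02, format='csr')

A_II = row_normalize(A_IT @ A_IT.T)    # image-to-image, via shared tags
A_TT = row_normalize(A_IT.T @ A_IT)    # tag-to-tag, via shared images
A_UT = row_normalize(A_UI @ A_IT)      # user-to-tag, via owned images
```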
In all cases, the matrix A is block-wise; the optimization problem (5) is solved for b² weights wkl.
The image tag recommendation runs either in the bootstrap or in the query mode. In bootstrap mode, the task is to predict tags for a newly uploaded image. In query mode, an image may have some tags and the task is to extend them. In both modes, we measure the performance of predicting the top 5 and the top |size| tags, where the number |size| of tags varies from image to image but is known in advance (and equals the size of the test tag set). The user contact recommendation task has been tested in the query mode only.
⁴ The multi-relational Flickr dataset is available from the authors upon request.

To evaluate the correctness of the different methods, we use the conventional precision, recall and F1 evaluation metrics adapted to the multi-label classification case. Let Yj and Pj denote the true and recommended tag sets for image j in the test set, respectively. Then the precision and recall metrics for the test set are defined as Pr = Σ_j |Yj ∩ Pj| / Σ_j |Pj| and Re = Σ_j |Yj ∩ Pj| / Σ_j |Yj|, respectively. The F1 score is then defined as 2·Pr·Re/(Pr + Re).
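Reading the garbled formulas as the standard micro-averaged multi-label metrics (set intersections in the numerators), a sketch in Python:

```python
def multilabel_pr_re_f1(true_tags, pred_tags):
    # true_tags, pred_tags: lists of sets of tags, one pair per test image.
    inter = sum(len(y & p) for y, p in zip(true_tags, pred_tags))
    n_true = sum(len(y) for y in true_tags)
    n_pred = sum(len(p) for p in pred_tags)
    pr = inter / n_pred if n_pred else 0.0   # fraction of recommended tags that are true
    re = inter / n_true if n_true else 0.0   # fraction of true tags that are recommended
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1
```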

Fig. 2. Image tag recommendation in the bootstrap mode for top 5 tags

Due to the unavailability of [7], we compare the performance of relation weight learning to baseline methods, which correspond to an unweighted combination of relations. We test two instantiations of the unweighted combinations. In one instance, we simulate the mono-relation setting, where only the core relation is used and the weights of all other relations are set to 0. For the image tag recommendation, the core relation is given by the image-to-tag matrix A_IT. For the user contact recommendation, the core relation is the user-to-user matrix A_UU. In the other instance, all available relations (both core and composed ones) are combined with equal weights wkl = 1/b. In the following, we report the best of the two baseline methods and observe the gain of the weight learning method over the unweighted ones.
In all experiments, the average values are reported over 5 independent runs, the resetting probability α is 0.05, and the loss function (2) is used. We have also tried other loss functions satisfying the symmetry condition. We have tested the three following versions: the absolute loss labs(y, p) = |y − p|, the exponential loss lexp(y, p) = e^{|y−p|} − 1, and the Huber loss

huber(y, p) = |y − p|²/2 if |y − p| ≤ 0.5, and |y − p| − 1/2 otherwise.

All three are differentiable but their derivatives are not continuous. This makes them a poor alternative to the square loss, which is twice continuously differentiable. Tests with these three losses reveal their under-performance: only the Huber loss comes close to the baseline method.

Fig. 3. Recall and precision values in the query mode: a) Top 5 tags; b) Top |size| tags

Image Tag Recommendation


Bootstrap mode. Figure 2 reports the recall and precision values for the image tag recommendation in the bootstrap mode, where the number of images varies from 1,000 to 100,000. The test is performed in 5 folds, with the top 5 predicted tags being compared to the true tag set. In each run, 80% of the images are randomly selected for training and the remaining 20% are used for testing. Both recall and precision decrease with the growth of the dataset, due to the growing number of tags to predict. Learning the relation weights yields gains of 4-17% in recall and 5-11% in precision over the best unweighted scheme. We note that the gain persists as the dataset grows.
Query mode. Figure 3.a reports the recall and precision values for the query tag recommendation, where the size of the image set varies from 1K to 100K. In this evaluation, 50% of the tags are randomly selected to form the query for a given image, and the remaining 50% are used for testing, where the 5 top predicted tags are compared to the true tag set. The gains in precision and recall over the best baseline method are 17% and 4%, respectively.
Figure 3.b reports precision/recall values for the same setting, where the number of predicted tags is not 5 but equal to the size of the test tag set, i.e., 50% of the tags for any image.

User Contact Recommendation


The second task for the method of relation weight learning is the user contact recommendation. Similarly to the previous runs, 50% of a user's contacts are randomly selected to form a query, and the remaining 50% are used for testing. Figure 4 reports precision and recall values for the top 5 recommended contacts; the number of users varies from 100 to 1,900. One can observe that the weight learning yields a gain of up to 10% in precision, while the gain in recall remains limited to 2%.

Fig. 4. User contact recommendation in the query mode: precision and recall for the top 5 contacts

Resetting Coefficient
Figure 5 shows the impact of the resetting coefficient α on the performance of the weight learning. It reports the relative precision values for the tag recommendation in the bootstrap mode. Three cases, for 1K, 50K and 100K images, are presented. As the figure shows, there exists a convenient range of values of α between 0.05 and 0.2 where the maximum precision is achieved in all cases.

Fig. 5. Tag recommendation for different values of the resetting coefficient

Number of Iterations
Another evaluation item concerns the convergence of the random walks in Equation (1). Figure 6 shows the impact of truncating after a varying number of iterations on the performance of weight learning. It reports the precision and recall values for the tag recommendation in the bootstrap mode. Two cases, of 1,000 and 50,000 images, are presented, where the random walk is truncated after 1, 2, ..., 15 iterations. As the figure suggests, both precision and recall achieve their top values after 5 to 7 iterations in the first case and after 10 to 12 iterations in the second case.

Fig. 6. Query-based prediction of user contacts
Finally, we tracked the evaluation of the Hessian matrix (9) in addition to the gradient in Algorithm 1. For small datasets (fewer than 5,000 images), using the Hessian helps converge faster to the local optimum, with a saving of 25% for the evaluation on 1,000 images. The situation changes drastically for large datasets, where the evaluation of Hessian matrices becomes an important handicap. For this reason, Hessians were excluded from the evaluation for 10,000 images and more.

6 Conclusion
We presented the relational graph representation for the multiple entity types and relations available in a social media system. We implemented the Markov chain model and presented its weighted extension to the relational graph. We have shown how to learn the relation weights by minimizing the loss between predicted and observed probability distributions. We reported the evaluation results of the image tag and user contact recommendation tasks on the Flickr dataset. The results of the experiments confirm that the relation weights learned from the training set provide a consistent gain over the unweighted methods.

References
1. Backstrom, L., Leskovec, J.: Supervised random walks: Predicting and recommending links in social networks. In: Proc. ACM WSDM (2011)
2. Baluja, S., Seth, R., Sivakumar, D., Jing, Y., Yagnik, J., Kumar, S., Ravichandran, D., Aly, M.: Video suggestion and discovery for YouTube: Taking random walks through the view graph. In: Proc. WWW 2008, pp. 895–904 (2008)
3. Bogers, T.: Movie recommendation using random walks over the contextual graph. In: Proc.
2nd Workshop on Context-Aware Recommender Systems, CARS (2010)

4. Borodin, A., Roberts, G.O., Rosenthal, J.S., Tsaparas, P.: Link analysis ranking: algorithms,
theory, and experiments. ACM Trans. Internet Technol. 5(1), 231–297 (2005)
5. Burges, C.J.C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender,
G.N.: Learning to rank using gradient descent. In: Proc. ICML, pp. 89–96 (2005)
6. Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., Li, H.: Learning to rank: from pairwise approach to
listwise approach. In: Proc. ICML, pp. 129–136 (2007)
7. Chakrabarti, S., Agarwal, A.: Learning parameters in entity relationship graphs from ranking
preferences. In: Proc. PKDD, pp. 91–102 (2006)
8. Coppersmith, D., Doyle, P., Raghavan, P., Snir, M.: Random walks on weighted graphs and
applications to on-line algorithms. J. ACM 40(3), 421–453 (1993)
9. Garg, N., Weber, I.: Personalized, interactive tag recommendation for flickr. In: Proc. ACM
RecSys 2008, pp. 67–74 (2008)
10. Gupta, M., Li, R., Yin, Z., Han, J.: Survey on social tagging techniques. ACM SIGKDD
Explorations Newsletter 12, 58–72 (2010)
11. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational classifi-
cation. In: Proc. ACM KDD 2004, pp. 593–598 (2004)
12. Ko, M.N., Cheek, G.P., Shehab, M., Sandhu, R.: Social-networks connect services. IEEE
Computer 43(8), 37–43 (2010)
13. Toutanova, K., Manning, C.D., Ng, A.Y.: Learning random walk models for inducing word
dependency distributions. In: Proc. ICML (2004)
14. Liu, D., Hua, X.-S., Yang, L., Wang, M., Zhang, H.-J.: Tag ranking. In: Proc. WWW 2009,
pp. 351–360 (2009)
15. Overell, S., Sigurbjörnsson, B., van Zwol, R.: Classifying tags using open content resources.
In: Proc. ACM WSDM 2009, pp. 64–73 (2009)
16. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order
to the Web (1998)
17. Peters, S., Denoyer, L., Gallinari, P.: Iterative annotation of multi-relational social networks.
In: Proc. ASONAM 2010, pp. 96–103 (2010)
18. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective knowledge.
In: Proc. WWW 2008, pp. 327–336 (2008)
19. Tian, Y., Srivastava, J., Huang, T., Contractor, N.: Social multimedia computing. IEEE Com-
puter 43, 27–36 (2010)
Clustering Rankings in the Fourier Domain

Stéphan Clémençon, Romaric Gaudel, and Jérémie Jakubowicz

LTCI, Telecom Paristech (TSI) - UMR Institut Telecom/CNRS No. 5141


{stephan.clemencon,romaric.gaudel,jeremie.jakubowicz}@telecom-paristech.fr

Abstract. It is the purpose of this paper to introduce a novel approach to clustering rank data on a set of possibly large cardinality n ∈ N*, relying upon the Fourier representation of functions defined on the symmetric group Sn. In the present setup, covering a wide variety of practical situations, rank data are viewed as distributions on Sn. Cluster analysis aims at segmenting data into homogeneous subgroups, hopefully very dissimilar in a certain sense. Whereas considering dissimilarity measures/distances between distributions on the non-commutative group Sn in a coordinate-wise manner, by viewing them as embedded in the set [0, 1]^{n!} for instance, hardly yields interpretable results and raises obvious computational issues, evaluating the closeness of groups of permutations in the Fourier domain may be much easier in contrast. Indeed, in a wide variety of situations, a few well-chosen Fourier (matrix) coefficients may suffice to approximate two distributions on Sn efficiently, as well as their degree of dissimilarity, while describing global properties in an interpretable fashion. Following in the footsteps of recent advances in automatic feature selection in the context of unsupervised learning, we propose to cast the task of clustering rankings in terms of the optimization of a criterion that can be expressed in the Fourier domain in a simple manner. The effectiveness of the proposed method is illustrated by numerical experiments based on artificial and real data.

Keywords: clustering, rank data, non-commutative harmonic analysis, feature selection.

1 Introduction

In a wide variety of applications, ranging from customer relationship management (CRM) to economics through the design of recommendation engines for instance, data are often available in the form of rankings, i.e. partially ordered lists of objects expressing preferences (purchasing, investment, movies, etc.). Due to their global nature (modifying the rank of an object may affect that of many other objects), rank data generally deserve special treatment in regard to statistical analysis. The latter have indeed been the subject of a good deal of attention in the machine-learning literature these last few years. Whereas numerous supervised learning algorithms have recently been proposed in order to predict rankings accurately, see [FISS03], [PTA+07], [CV09] for instance, novel


techniques have also been developed to handle input data that are themselves in the form of rankings, for a broad range of purposes: computation of centrality measures such as consensus or median rankings (see [MPPB07] or [CJ10] for instance), modelling/estimation/simulation of distributions on sets of rankings (see [Mal57], [FV86], [LL03], [MM09] and [LM08] among others), and ranking based on preference data (refer to [HFCB08], [CS01] or [dEW06]).
It is the main goal of this paper to consider the issue of clustering rank data from a novel perspective, taking the specific nature of the observations into account. We point out that the method promoted in this paper is by no means the sole applicable technique. The most widely used approach to this problem consists in viewing rank data (and, more generally, ordinal data), when renormalized in an appropriate manner, as standard numerical data and applying state-of-the-art clustering techniques, see Chapter 14 in [HTF09]. Alternative procedures of a probabilistic type, relying on mixture modeling of probability distributions on a set of (partial) rankings, can also be considered, following in the footsteps of [FV88]. Rank data are here considered as probability distributions on the symmetric group Sn, n ≥ 1 denoting the number of objects to be (tentatively) ranked, and our approach crucially relies on the Fourier transform on the set of mappings f : Sn → R. Continuing the seminal contribution of [Dia89], spectral analysis of rank data has recently been considered in the machine-learning literature for a variety of purposes, with very promising results, see [KB10], [HGG09] or [HG09]. This paper pursues this line of research. It aims at showing that, in the manner of spectral analysis in signal processing, the Fourier representation is good at describing properties of distributions on Sn in a sparse manner, "sparse" meaning here that a small number (compared to n!) of Fourier coefficients carry most of the significant information, from the perspective of clustering especially. As shown in [Dia88], the main appeal of spectral analysis in this context lies in the fact that Fourier coefficients encode structural properties of ranking distributions in a very interpretable fashion. Here we propose to use these coefficients as features for defining clusters. More precisely, we shall embrace the approach developed in [WT10] (see also [FM04]), in order to find clusters on an adaptively-chosen subset of features in the Fourier domain.
The article is organized as follows. Section 2 describes the statistical framework, sets out the main notations and recalls the key notions of Fourier representation in the context of distributions on the symmetric group that will be used in the subsequent analysis. Preliminary arguments assessing the efficiency of the Fourier representation for discrimination purposes are next sketched in Section 3. In particular, examples illustrating the capacity of parsimonious truncated Fourier expansions to approximate efficiently a wide variety of distributions on Sn are exhibited. In Section 4, the rank data clustering algorithm we propose, based on subsets of spectral features, is described at length. Numerical results based on artificial and real data are finally displayed in Section 5. Technical details are deferred to the Appendix.

2 Background and Preliminaries


In this section, we recall key notions on the spectral analysis of functions defined on the symmetric group Sn and sketch an approximation framework based on this Fourier-type representation, which will underlie our proposed clustering algorithm.

2.1 Setup and First Notations


Here and throughout, n ≥ 1 denotes the number of objects to be ranked, indexed by i = 1, . . . , n. For simplicity's sake, it is assumed that no tie can occur in the present analysis; rankings are thus viewed as permutations of the list of objects {1, . . . , n} and coincide with the elements of the symmetric group Sn of order n. Extension of the concepts developed here to more general situations (including partial rankings and/or bucket orders) will be tackled in a forthcoming article. The set of mappings f : Sn → C is denoted by C[Sn]. For any σ ∈ Sn, the function on Sn that assigns 1 to σ and 0 to all τ ≠ σ is denoted by δσ. The linear space C[Sn] is equipped with the usual inner product ⟨f, g⟩ = Σ_{σ∈Sn} f(σ)g̅(σ), for any (f, g) ∈ C[Sn]². The related hilbertian norm is denoted by ||·||. Incidentally, notice that {δσ : σ ∈ Sn} corresponds to the canonical basis of this Hilbert space. The indicator function of any event E is denoted by I{E}, the trace of any square matrix A with complex entries by tr(A), its conjugate transpose by A*, and the cardinality of any finite set S by #S. Finally, for any m ≥ 1, the matrix space M_{m×m}(C) is equipped with the scalar product ⟨A, B⟩_m = tr(A*B) and the Hilbert-Schmidt norm ||A||_{HS(m)} = ⟨A, A⟩_m^{1/2}.

Rank Data as Distributions on Sn. The framework we consider stipulates that the observations are in the form of probability distributions on Sn, i.e. elements f of C[Sn] taking their values in [0, 1] and such that Σ_{σ∈Sn} f(σ) = 1. This models a variety of situations encountered in practice: indeed, lists of preferences are rarely observed exhaustively, and it is uncommon that rank data in the form of full permutations are available (σ^{-1}(i) indicating the label of the i-th most preferred object, i = 1, . . . , n). As shown by the following examples, a natural way of accounting for the remaining uncertainty is to model the observations as probability distributions. Precisely, let E1, . . . , EK be a partition of Sn, with 1 ≤ K ≤ n!. When one observes which of these events is realized, i.e. in which of these subsets the permutation lies, the ensemble of possible observations is in one-to-one correspondence with the set of distributions {fk : 1 ≤ k ≤ K}, where fk denotes the conditional distribution of S given S ∈ Ek, S being uniformly distributed on Sn, i.e. fk = (1/#Ek) · Σ_{σ∈Ek} δσ. In all these situations, the number of events K, through which preferences can be observed, may be very large.
Example 1. (Top-k lists.) In certain situations, only a possibly random number m of objects are ranked, those corresponding to the most preferred objects. In this case, the observations are related to events of the type E = {σ ∈ Sn : (σ(i1), . . . , σ(im)) = (1, . . . , m)}, where m ∈ {1, . . . , n} and (i1, . . . , im) is an m-tuple of the set of objects {1, . . . , n}.

Example 2. (Preference data.) One may also consider the case where a collection of objects, drawn at random, are ranked by degree of preference. The events observed are then of the form E = {σ ∈ Sn : σ(i1) < . . . < σ(im)}, with m ∈ {1, . . . , n} and (i1, . . . , im) an m-tuple of the set of objects {1, . . . , n}.

Example 3. (Bucket orders.) The top-k list model can be extended in the following way, in order to account for situations where preferences are aggregated. One observes a random partition B1, . . . , BJ of the set of instances for which: for all 1 ≤ j < l ≤ J and for any (i, i′) ∈ Bj × Bl, σ(i) < σ(i′).

Hence, our objective here is to partition a set of N probability distributions f1, . . . , fN on Sn into subgroups C1, . . . , CM, so that the distributions in any given subgroup are closer to each other (in a sense that will be specified) than to those of other subgroups. When equipped with a dissimilarity measure D between distributions on Sn, the clustering task is then classically cast in terms of the minimization of the empirical criterion

W(C) = Σ_{m=1}^{M} Σ_{1≤i<j≤N} D(fi, fj) · I{(fi, fj) ∈ Cm²},   (1)

over the set of all possible partitions C = {Cm : 1 ≤ m ≤ M} with M ≥ 1 cells, the choice of M then requiring the use of model selection techniques, see [TWH01] and the references therein. Beyond the fact that this corresponds to an NP-hard optimization problem, for which acceptably good solutions can be computed in a reasonable lapse of time through a variety of (meta-)heuristics, the choice of a distance or dissimilarity measure between pairs of distributions on Sn is crucial. We emphasize that, depending on the measure D chosen, one may either enhance specific patterns in the rank data or else make them disappear. This strongly advocates for the application of recent adaptive procedures that permit great flexibility in measuring dissimilarities, see [WT10] (refer also to [FM04]), selecting automatically a subset of attributes in order to emphasize differences between the cluster candidates. In this article, an attempt is made to implement this approach when the attributes/features are those output by spectral analysis on C[Sn].

2.2 The Fourier Transform on Sn


Fourier analysis is an extraordinarily powerful tool, whose usefulness and ubiquity are unquestionable and well documented, see [Kör89, KLR95]: in signal and image processing it is used for the purpose of building filters and extracting information; in probability theory it permits the characterization of probability distributions and serves as a key tool for proving limit theorems; in time-series analysis it allows the analysis of second-order stationary sequences; in analysis, it serves for solving partial differential equations; etc. In fact, it is hard to think of a tool in (applied) mathematics that would be more widespread than the Fourier transform. The most common framework deals with functions f : G → C where G denotes the

group R of real numbers, the group Z of integers or the group Z/NZ of integers modulo N. The elements of the abelian group G act on functions f : G → C by translation. Recall that, for any g ∈ G, the translation by g is defined by Tg(f) : x ∈ G ↦ f(x − g). A crucial property of the Fourier transform F is that it diagonalizes all translation operators simultaneously: ∀g ∈ G,

F(Tg(f))(ξ) = χg(ξ) · Ff(ξ),

where χg(ξ) = exp(2iπgξ) and ξ belongs to the dual group Ĝ (being R, R/Z and Z/NZ when G is R, Z and Z/NZ respectively). Consequently, the Fourier transform provides a sparse representation of all operators that are spanned by the collection of translations, such as convolution operators.
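This diagonalization property is easy to check numerically on G = Z/NZ with NumPy's FFT (whose sign convention gives χg(ξ) = exp(−2iπgξ/N)):

```python
import numpy as np

N, g = 8, 3
f = np.random.randn(N)
Tg_f = np.roll(f, g)                       # (T_g f)(x) = f(x - g) on Z/NZ
chi_g = np.exp(-2j * np.pi * g * np.arange(N) / N)
assert np.allclose(np.fft.fft(Tg_f), chi_g * np.fft.fft(f))
```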
The diagonalization view on Fourier analysis extends to the case of non-
commutative groups such as Sn . However, in this case, the related eigenspaces
are not necessarily of dimension 1 anymore. In brief, in this context Fourier
transform only "block-diagonalizes" translations, as shall be seen below.
The Group Algebra C[Sn]. The set C[Sn] is a linear space on which Sn acts linearly as a group of translations Tσ : f ∈ C[Sn] ↦ Tσ(f) = Σ_{ν∈Sn} f(ν ∘ σ^{-1}) δν. For clarity's sake, we recall the following notion.
Definition 1. (Convolution Product) Let (f, g) ∈ C[Sn]². The convolution product of g with f (with respect to the counting measure on Sn) is the function defined by f ∗ g : σ ∈ Sn ↦ Σ_{ν∈Sn} f(ν) g(ν^{-1} ∘ σ) ∈ C.
Remark 1. Notice that one may also write, for any (f, g) ∈ C[Sn]² and all σ ∈ Sn, (f ∗ g)(σ) = Σ_{ν∈Sn} f(σ ∘ ν^{-1}) g(ν). The convolution product f ∗ δσ : τ ∈ Sn ↦ f(τ ∘ σ^{-1}) reduces to the right translation of f by σ, namely Tσ f. Observe in addition that, for n > 2, the convolution product is not commutative. For instance: δσ ∗ δτ = δ_{σ∘τ} ≠ δ_{τ∘σ} = δτ ∗ δσ when σ ∘ τ ≠ τ ∘ σ.
The set C[Sn ] equipped with the pointwise addition and the convolution product
(see Definition 1 above) is referred to as the group algebra of Sn .
Canonical Decomposition. In the group algebra formalism introduced above, a function f is an eigenvector for all the right translations (simultaneously) whenever ∀σ ∈ Sn, δσ ∗ f = χσ f, where χσ ∈ C for all σ ∈ Sn. For instance, the function f = Σ_{σ∈Sn} δσ ≡ 1 can easily be seen to be such an eigenvector, with χσ ≡ 1. In addition, denoting by εσ the signature of any permutation σ ∈ Sn (recall that it is equal to (−1)^{I(σ)}, where I(σ) is the number of inversions of σ, i.e. the number of pairs (i, j) in {1, . . . , n}² such that i < j and σ(i) > σ(j)), the function f = Σ_{σ∈Sn} εσ δσ is also an eigenvector for all the right translations, with χσ = εσ. If one could possibly find n! such linearly independent eigenvectors, one would be able to define a notion of Fourier transform with properties very similar to those of the Fourier transform of functions defined on Z/NZ. Unfortunately, due to the lack of commutativity of Sn, the functions mentioned above are the only eigenvectors common to all right translation operators, up to a multiplicative constant. Switching from the notion of eigenvectors to that of irreducible subspaces permits to define the Fourier transform, see [Ser88].

Definition 2. (Irreducible Subspaces) A non trivial vector subspace V of


C[Sn ] (i.e. different from {0} and C[Sn ]) is said to be irreducible when it is stable
under all right translations, i.e. for all (σ, f ) ∈ Sn × V , δσ ∗ f ∈ V and contains
no such stable subspace except {0} and itself. Two irreducible subspaces V1 and
V2 are said to be isomorphic when there exists a bijective linear map T : V1 → V2
that commutes with translations: ∀f ∈ V1 , ∀σ ∈ Sn , T (δσ ∗ f ) = δσ ∗ T (f ).
It follows from a well-known result in Group Theory (in the compact case, see the Peter-Weyl theorem in [Ser88] for instance) that C[Sn] decomposes into a direct sum of orthogonal irreducible subspaces. Spectral/harmonic analysis of an element f ∈ C[Sn] then consists in projecting the latter onto these subspaces and determining in particular which components contribute most to its "energy" ||f||². In the case of the symmetric group Sn, the elements of the irreducible representation cannot be indexed by "scalar frequencies", the Fourier components being actually indexed by the set Rn of all integer partitions of n, namely:

Rn = { ξ = (n1, . . . , nk) ∈ N^{∗k} : n1 ≥ · · · ≥ nk, Σ_{i=1}^{k} ni = n, 1 ≤ k ≤ n }.

Remark 2. (Young tableaux) We point out that each element (n1, . . . , nk) of the set Rn can be visually represented as a Young tableau, with k rows and ni cells at row i ∈ {1, . . . , k}. For instance, the partition of n = 9 given by ξ = (4, 2, 2, 1) and its conjugate ξ′ = (4, 3, 1, 1) can be respectively encoded by the diagrams:

■ ■ ■ ■        ■ ■ ■ ■
■ ■            ■ ■ ■
■ ■            ■
■              ■
Hence, the Fourier transform of any function f ∈ C[Sn] is of the form: ∀ξ ∈ Rn,

Ff(ξ) = Σ_{σ∈Sn} f(σ) ρξ(σ),

where ρξ is a function on Sn that takes its values in the set of unitary matrices with complex entries of dimension dξ × dξ. Note that Σ_ξ dξ² = n!. For clarity, we recall the following result, which summarizes some crucial properties of the spectral representation on Sn, analogous to those of the standard Fourier transform. We refer to [Dia88] for a nice account of the linear representation theory of the symmetric group Sn as well as some of its statistical applications.

Proposition 1. (Main properties of F) The following properties hold true.

(i) (Plancherel formula) ∀(f, g) ∈ C[Sn]², ⟨f, g⟩ = ⟨Ff, Fg⟩, where ⟨Ff, Fg⟩ = (1/n!) Σ_{ξ∈Rn} dξ ⟨Ff(ξ), Fg(ξ)⟩_{dξ}.
(ii) (Inversion formula) ∀f ∈ C[Sn], f = (1/n!) Σ_{ξ∈Rn} dξ ⟨ρξ(·), Ff(ξ)⟩_{dξ}.
(iii) (Parseval formula) ∀f ∈ C[Sn], ||f||² = (1/n!) Σ_{ξ∈Rn} dξ ||Ff(ξ)||²_{HS(dξ)}.

Example 4. For illustration purposes, Fig. 1 below displays the spectral analysis of the Mallows distribution (cf [Mal57]) when n = 5, given for all σ ∈ Sn by

f_{σ0,γ}(σ) = { Π_{j=1}^{n} (1 − exp{−γ})/(1 − exp{−jγ}) } · exp{−γ · dτ(σ, σ0)},

denoting by dτ(σ, ν) = Σ_{1≤i<j≤n} I{σ ∘ ν^{-1}(i) > σ ∘ ν^{-1}(j)} the Kendall τ distance, for several choices of the location and concentration parameters σ0 ∈ Sn and γ ∈ R*₊. Precisely, the cases γ = 0.1 and γ = 1 have been considered. As shown by the plots of the coefficients, the more spread the distribution (i.e. the smaller γ), the more concentrated the Fourier coefficients.
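For small n, the Mallows distribution can be tabulated directly by enumerating Sn; normalizing the exponential weights empirically reproduces the closed-form product constant above (a sketch, function names ours):

```python
import numpy as np
from itertools import permutations

def kendall_tau(sigma, nu):
    # Number of pairs (i, j), i < j, that sigma and nu rank in opposite orders.
    n = len(sigma)
    return sum((sigma[i] - sigma[j]) * (nu[i] - nu[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def mallows(sigma0, gamma):
    # Tabulate f_{sigma0, gamma} over all of S_n (only feasible for small n).
    perms = list(permutations(range(len(sigma0))))
    w = np.exp([-gamma * kendall_tau(p, sigma0) for p in perms])
    return perms, w / w.sum()

perms, f = mallows((0, 1, 2, 3, 4), gamma=1.0)   # a distribution on S_5
```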
[Figure 1: four panels, (a) Law on S5, γ = 1; (b) Fourier Transform, γ = 1; (c) Law on S5, γ = 0.1; (d) Fourier Transform, γ = 0.1; each panel plots the 120 coefficients for the three centers [32415], [35421] and [12453].]

Fig. 1. Coefficients of the Mallows distribution and of its Fourier transform for several
choices of the pair of parameters (σ, γ)

Remark 3. (Computational issues) We point out that rank data generally exhibit a built-in high dimensionality. In many applications, the number of objects n to be possibly ranked is indeed very large. Widespread use of methods such as the one presented here is conditioned upon advances in the development of practical fast algorithms for computing Fourier transforms. All the displays presented in this paper have been computed thanks to the C++ library Snob, see [Kon06], which implements Clausen's fast Fourier transform for Sn.

3 Sparse Fourier Representations of Rank Data


In this section, an attempt is made to exhibit preliminary theoretical and empirical results which show that spectral analysis can provide sparse and/or denoised representations of rank data in certain situations, which are useful for discrimination purposes.

Sparse Linear Reconstruction. Let F be a random element of C[Sn] and consider the following approximation scheme, which truncates the Fourier representation by retaining only the coefficients with highest norm in the reconstruction. It can be implemented in three steps, as follows. Let K ∈ {1, . . . , n!} be fixed.
1. Perform the Fourier transform, yielding the matrix coefficients {FF(ξ)}_{ξ∈Rn}.
2. Sort the frequencies by decreasing order of magnitude of the expected weighted norm of the corresponding matrix coefficients: ξ(1), . . . , ξ(M), where

E[ d_{ξ(1)} ||FF(ξ(1))||²_{HS(d_{ξ(1)})} ] ≥ . . . ≥ E[ d_{ξ(M)} ||FF(ξ(M))||²_{HS(d_{ξ(M)})} ],

with M = #Rn, d_{ξ(m)} denoting the dimension of the irreducible space related to the frequency ξ(m).
3. Keeping the K coefficients with largest second-order moment, invert the Fourier transform, producing the approximant (a sketch of this selection step is given below):

FK(σ) = (1/n!) Σ_{k=1}^{K} d_{ξ(k)} ⟨ρ_{ξ(k)}(σ), FF(ξ(k))⟩_{d_{ξ(k)}}.   (2)
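Given the Fourier matrices of a realization, steps 2-3 amount to ranking frequencies by weighted energy and zeroing out the rest; a sketch follows (the scheme above ranks by expected energy over random F, which this single-sample version estimates crudely; names are ours):

```python
import numpy as np

def truncate_spectrum(coeffs, dims, K):
    # coeffs: dict mapping frequency xi -> matrix Ff(xi); dims: dict xi -> d_xi.
    # Keep the K frequencies with largest d_xi * ||Ff(xi)||_HS^2, zero the others.
    energy = {xi: dims[xi] * np.sum(np.abs(M) ** 2) for xi, M in coeffs.items()}
    kept = set(sorted(energy, key=energy.get, reverse=True)[:K])
    return {xi: (M if xi in kept else np.zeros_like(M))
            for xi, M in coeffs.items()}
```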

The capacity of Fourier analysis to provide a sharp reconstruction of F can be evaluated through the expected distortion rate, namely ε_K(F) = E[||F − FK||²]/E[||F||²]. Fig. 2 shows the decay of a Monte-Carlo estimate (based on 50 runs) of the distortion rate ε_K(F) as the number K of retained coefficients grows, when F is drawn at random in a set of L Mallows distributions. As expected, the more spread the realizations of F (i.e. the smaller γ and/or the larger L), the faster the decrease (i.e. the more efficient the Fourier-based compression).
Remark 4. (On stationarity.) For simplicity, suppose that E[F(σ)] is constant, equal to 0 say. The expected distortion rate can then easily be expressed in terms of the covariance Φ(σ, σ′) = E[F(σ)F(σ′)], and in the case when F is "stationary", i.e. when Φ(σ, σ′) = φ(σσ′^{-1}), it may be shown that the Fourier representation has optimal properties in regards to linear approximation/compression, since the covariance operator is then diagonalized by the Fourier basis, exactly as in the well-known L² situation. Notice incidentally that such kernels φ(σσ′^{-1}) have already been considered in [Kon06] for a different purpose. Developing a full approximation theory based on the Fourier representation of C[Sn], and defining in particular specific subspaces of C[Sn] that would be analogous to Sobolev classes and sparsely represented in the Fourier domain, is beyond the scope of this paper and will be tackled in a forthcoming article.
[Figure 2: three panels, (a) L = 1, (b) L = 10, (c) L = 100, each plotting the distortion against the number of used coefficients, for γ = 10, γ = 1 and γ = 0.1.]

Fig. 2. Monte-Carlo estimate of ε_K(F) as a function of K for a r.v. drawn at random in a set of L Mallows distributions with concentration parameter γ

An Uncertainty Principle. The following proposition states an uncertainty principle in the context of the symmetric group. It extends the result established in [DS89] in the abelian setup to a non-commutative situation and can be viewed

as a specific case of the inequality proved in [MS73] through operator-theoretic


arguments. A simpler proof, directly inspired from the technique used in [DS89]
in the commutative case, is given in the Appendix.

Proposition 2. (Uncertainty principle) Let f ∈ C[Sn]. Denote by supp(f) = {σ ∈ Sn : f(σ) ≠ 0} and by supp(Ff) = {ξ ∈ Rn : Ff(ξ) ≠ 0} the support of f and that of its Fourier transform, respectively. Then, we have:

#supp(f) · Σ_{ξ∈supp(Ff)} dξ² ≥ n!.   (3)

Roughly speaking, inequality (3) above says in particular that, if the Fourier representation is sparse, i.e. Ff(ξ) is zero at many frequencies ξ, then f is not.
De-noising in C[Sn]. The following example shows how to achieve noise suppression through the Fourier representation in some specific cases. Let A ≠ ∅ be a subset of Sn and consider the uniform distribution on it: fA = (1/#A) · Σ_{σ∈A} δσ. We suppose that it is observed with noise and shall study the noise effect in the Fourier domain. The noise is modeled as follows. Let T denote a transposition drawn at random in the set T of all transpositions in Sn (P{T = τ} = 2/(n(n − 1)) for all τ ∈ T); the noisy observation is f = fA ∗ δT. Notice that the operator modeling the noise here is considered in [RKJ07] for a different purpose.
We first recall the following result, proved by means of the Murnaghan-Nakayama rule ([Mur38]), which provides a closed analytic form for the expected Fourier transform of the random distribution f.

Proposition 3. For all ξ ∈ Rn, we have:

E[Ff(ξ)] = aξ · FfA(ξ),   (4)

where aξ = ( Σ_{i=1}^{r(ξ)} ni(ξ)² − Σ_{i=1}^{r(ξ′)} ni(ξ′)² ) / (n(n − 1)), denoting by ξ′ the conjugate of ξ, by r(ξ) the number of rows in the Young diagram representation of any integer partition ξ of n, and by ni(ξ) the number of cells at row i.

The proposition above deserves some comments. Notice first that the map ξ ∈ Rn ↦ aξ is antisymmetric (i.e. a_{ξ′} = −aξ for any ξ in Rn), and one can show that aξ is a decreasing function for the natural partial order on Young diagrams [Dia88]. As shown in [RS08], |aξ| = O(n^{-1/2}) for diagrams ξ satisfying r(ξ), c(ξ) = O(√n). For the two lowest frequencies, namely (n) and (n − 1, 1), the proposition shows that a_{(n)} = 1 and a_{(n−1,1)} = (n − 3)/(n − 1). Roughly speaking, this means that the noise leaves the highest and lowest frequencies almost untouched (up to a change of sign), while attenuating moderate frequencies. Looking at extreme frequencies hopefully allows to recover/identify A.

Remark 5. (A more general noise model) The model above can be extended in several manners, by considering, for instance, noisy observations of the form f = fA ∗ δ_{Sm}, where Sm is picked at random among permutations that can be decomposed as a composition of m ≥ 1 transpositions (and no less). One may then show that E[Ff(ξ)] = a_ξ^{(m)} · FfA(ξ) for all ξ ∈ Rn, where the a_ξ^{(m)}'s satisfy the following property: for all frequencies ξ ∈ Rn whose row number and column number are both less than c1√n for some constant c1 < ∞, |a_ξ^{(m)}| ≤ c2 · n^{−m/2}, where c2 denotes some finite constant. See [RS08] for further details.

4 Spectral Feature Selection and Sparse Clustering

If one chooses to measure dissimilarity in C[Sn] through the square hilbertian norm, the task of clustering a collection of N distributions f1, . . . , fN on the symmetric group Sn into L << n! groups can then be formulated as the problem of minimizing, over all partitions C = {C1, . . . , CL} of the dataset, the criterion:

M(C) = Σ_{l=1}^{L} Σ_{1≤i,j≤N} ||fi − fj||² · I{(fi, fj) ∈ Cl²}
     = (1/n!) Σ_{ξ∈Rn} Σ_{l=1}^{L} Σ_{1≤i,j≤N : (fi,fj)∈Cl²} dξ ||Ffi(ξ) − Ffj(ξ)||²_{HS(dξ)},

switching to the Fourier domain by using the Parseval relation (see Proposition 1).
switching to the Fourier domain by using Parseval relation (see Proposition 1).
Given the high dimensionality of rank data (cf Remark 3), such a rigid fash-
ion of measuring dissimilarity may prevent the optimization procedure from
identifying the clusters, the main features that are possibly responsible for the
differences being buried in the criterion. As shown in the previous section, in
certain situations only a few well-chosen spectral features may permit to exhibit
similarities or dissimilarities between distributions on Sn . Following in the foot-
steps of [WT10], we propose to achieve a sparse clustering of the rank data by
considering the optimization problem: λ > 0 being a tuning parameter, minimize

 dξ 
L 
ω (C) =
M ωξ ||Ffi (ξ) − Ffj (ξ)||2HS(dξ ) (5)
n!
ξ∈Rn l=1 1≤i, j≤N : (fi ,fj )∈Cl2

subject to ω = (ωξ)_{ξ∈Rn} ∈ R₊^{#Rn}, ||ω||²_{l2} ≤ 1 and ||ω||_{l1} ≤ λ,   (6)

where ||ω||²_{l2} := Σ_{ξ∈Rn} ωξ² and ||ω||_{l1} := Σ_{ξ∈Rn} |ωξ|. The coefficient ωξ must be viewed as the weight of the spectral feature indexed by ξ. Observe incidentally that (1/√#Rn) · M(C) corresponds to the situation where ωξ = 1/√#Rn for all ξ ∈ Rn. Following the lasso paradigm, the l1-penalty guarantees the sparsity of the solution ω for small values of the threshold λ, while the l2 penalty prevents the weight vector from concentrating on a single frequency, the one that most exhibits clustering.
Remark 6. (Further feature selection) Recall that the Fourier coefficients are matrices of variable sizes. Hence, denoting (abusively) by {Ff(ξ)_m : 1 ≤ m ≤ dξ} the coefficients of the projection of any f ∈ C[Sn] onto the irreducible subspace Vξ, in a given orthonormal basis of Vξ, one may consider a refined sparse clustering, namely by considering the objective

Σ_{ξ∈Rn} Σ_{m=1}^{dξ} Σ_{l=1}^{L} Σ_{1≤i,j≤N : (fi,fj)∈Cl²} ω_{ξ,m} (Ffi(ξ)_m − Ffj(ξ)_m)².

Though it may lead to better results in practice, the interpretation of large values of the weights as significant contributions to the clustering by the corresponding features becomes harder, since the choice of a basis of Vξ is arbitrary.
As proposed in [WT10], the optimization problem (5) under the constraints (6) is solved iteratively, in two stages, as follows. Starting with weights ω = (1/√#Rn, . . . , 1/√#Rn), iterate the two steps until convergence:

(Step 1.) fixing the weight vector ω, minimize M_ω(C) over the partition C;
(Step 2.) fixing the partition C, minimize M_ω(C) over ω.

A wide variety of clustering procedures have been proposed in the literature to perform Step 1 approximately (see [WX08] for an overview of off-the-shelf algorithms). As regards Step 2, it has been pointed out in [WT10] (see Proposition 1 therein) that a closed analytic form is available for the solution ω, the current data partition C being held fixed: for all ξ ∈ Rn,

ωξ = S(Z(C, ξ)₊, Δ) / ||S(Z(C, ·)₊, Δ)||_{l2},

where x₊ denotes the positive part of any real number x, Z(C, ξ) = Σ_{l≠l′} Σ_{(i,j) : (fi,fj)∈Cl×Cl′} ||Ffi(ξ) − Ffj(ξ)||²_{HS(dξ)}, and S is the soft-thresholding function S(x, Δ) = sign(x)(|x| − Δ)₊, with Δ = 0 if the l1 constraint is then fulfilled, and Δ chosen positive so as to enforce ||ω||_{l1} = λ otherwise.
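A sketch of this Step 2 update, assuming Z holds the nonnegative between-cluster dissimilarities (Z(C, ξ))_ξ as a vector; Δ is located by bisection, as in [WT10] (names are ours):

```python
import numpy as np

def soft(x, delta):
    # Soft-thresholding S(x, delta) = sign(x)(|x| - delta)_+
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

def update_weights(Z, lam, iters=60):
    Zp = np.maximum(Z, 0.0)
    w = Zp / np.linalg.norm(Zp)
    if np.abs(w).sum() <= lam:           # Delta = 0 already meets the l1 constraint
        return w
    lo, hi = 0.0, Zp.max()
    for _ in range(iters):               # bisection on Delta so that ||w||_l1 ~ lam
        delta = 0.5 * (lo + hi)
        s = soft(Zp, delta)
        n2 = np.linalg.norm(s)
        if n2 > 0 and np.abs(s / n2).sum() > lam:
            lo = delta                    # still too dense: raise the threshold
        else:
            hi = delta                    # too sparse (or all zero): lower it
    s = soft(Zp, lo)
    return s / np.linalg.norm(s)
```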

5 Numerical Experiments
This section presents experiments on artificial and real data which demonstrate that sparse clustering on the Fourier representation recovers clustering information on rank data. For each of the three studied datasets, a hierarchical clustering is learned using the approach of [WT10], as implemented in the sparcl R library. The clustering phase starts with each example in a separate cluster and then merges clusters two by two. The parameter s, controlling the l1 norm of ω (the threshold λ in (6)), is determined following the approach introduced in [WT10], which is related to the gap statistic used by [TWH01].
Hierarchical Clustering on Mallows Distributions. The first experiment considers an artificial dataset composed of N = 10 examples, where each example is a probability distribution on Sn. Formally, each example fi corresponds to the Mallows distribution of center σi and spreading parameter γ. Centers σ1 to σ5 are of the form σ^{(0)}τi, where σ^{(0)} belongs to Sn and τi is a transposition uniformly drawn in the set of all transpositions. The remaining centers follow a similar form σ^{(1)}τi. Therefore the examples should be clustered in two groups: one with the five first examples and a second group with the remaining examples. Figure 3 gives the dendrograms learned either from the distribution fi or from its Fourier transform Ffi. The cluster information is fully recovered with the Fourier representation (except for one example), whereas the larger γ, the further the dendrogram learned from the probability distributions is from the target clustering. Hence, the Fourier representation is more appropriate than the [0, 1]^{n!} representation to cluster these rank data.
[Figure 3: six dendrograms over the ten Mallows centers, (a) Law on S7, γ = 0.1; (b) Law on S7, γ = 1; (c) Law on S7, γ = 10; (d) Fourier, γ = 0.1; (e) Fourier, γ = 1; (f) Fourier, γ = 10.]

Fig. 3. Dendrograms learned on N = 10 Mallows models on S7 with spreading parameter γ. Examples which should be in the same cluster are plotted with the same color.

Regarding the number of coefficients selected to construct these dendrograms (cf Table 1), sparcl focuses on fewer coefficients when run on the probability distributions. Still, with both representations, fewer than 260 coefficients are selected, which remains small compared to the total number of coefficients (5,040). Unsurprisingly, in the Fourier domain, the selected coefficients depend on γ.

Table 1. Number of selected coefficients, with N = 10 Mallows models on S7 and spreading parameter γ

Representation            Total   Selected (γ = 0.1)   Selected (γ = 1)   Selected (γ = 10)
Probability distribution  5,040   8                    3                  1
Fourier transform         5,040   257                  54                 6
[Figure 4: dendrogram over the 60 top-4 lists, with leaves labeled by their ordered top products, e.g. 2 < 1 < 3 < 6 < ...]
Fig. 4. Dendrogram learned on top-k lists. Red lines (respectively green lines) correspond to examples Ei = {σ ∈ S8 : (σi(i1), . . . , σi(i4)) = (1, . . . , 4)} with {i1, i2, i3} = {1, 2, 3} (resp. with {i1, i2, i3} = {6, 7, 8}).

When γ is small, the approach selects low-frequency coefficients, whereas, when γ is large, the selected coefficients are those corresponding to high frequencies.

Hierarchical Clustering on Top-k Lists. This second experiment is conducted on artificial data considering rankings of eight products. Each example i of the data corresponds to a top-4 list Ei = {σ ∈ S8 : (σi(i1), . . . , σi(i4)) = (1, . . . , 4)}, where (i1, i2, i3) is either a permutation of the products {1, 2, 3} or a permutation of the products {6, 7, 8}, and i4 is one of the remaining products. All in all, this dataset contains 60 examples, which should be clustered in two groups or twelve groups. The 60 functions fi have disjoint supports. Therefore, the examples cannot be clustered based on their "temporal" representation. On the contrary, the Fourier representation leads to a dendrogram which groups together examples whose top-3 products come from the same set of three products (cf Figure 4). Furthermore, this dendrogram is obtained from only seven small-frequency coefficients, which is extremely sparse compared to the total of 40,320 coefficients.

Hierarchical Clustering on an E-commerce Dataset. The proposed approach is also used to cluster the real dataset introduced by [RBEV10]. These data come from an E-commerce website. The only available information is the history of purchases for each user, and the ultimate goal is to predict future purchases.

A first step towards this goal is to group users with similar behavior, that is, to group users based on the top-k rankings associated with their past purchases.
We consider the 149 users who have purchased at least 5 products among the 8 most purchased products. The sparse hierarchical clustering approach receives as input the 6,996 smallest frequency coefficients and selects 5 of them. The corresponding dendrogram (cf Figure 5) clearly shows 4 clusters among the users. On 7 independent splits of the dataset into two parts of equal sizes, the criterion optimized by sparcl varies from 46.7 to 51.3, with a mean value of 49.1 and a standard deviation of 1.4. The stability of the criterion increases the confidence in this clustering of examples.
Fig. 5. Dendrogram learned on the E-commerce database

6 Conclusion

In this paper, a novel approach to rank data clustering is introduced. Modeling rank data as probability distributions on Sn, our approach relies on the ability of the Fourier transform on Sn to provide sparse representations in a broad variety of situations. Several preliminary empirical and theoretical results are presented to support this claim. The approach to sparse clustering proposed in [WT10] is adapted to this setup: a lasso-type penalty is used so as to select adaptively a subset of spectral features in order to define the clustering. Numerical examples are provided, illustrating the advantages of this new technique. A better understanding of the class of distributions on Sn that can be efficiently described in the Fourier domain, and of the set of operators that are almost diagonal in the Fourier basis, would permit delineating the scope of this approach. It is our intention to develop this line of research in the future.

Acknowledgements. The authors gratefully thank DIGITÉO (BÉMOL project), which partially supported this work.

Appendix - Proof of Proposition 2


Combining the definition of Ff, the triangle inequality and the fact that the ρξ's take unitary values (in particular, ||ρξ(σ)||_{HS(dξ)} = √dξ), we have: ∀ξ ∈ Rn,

||Ff(ξ)||_{HS(dξ)} = || Σ_{σ∈Sn} f(σ) ρξ(σ) ||_{HS(dξ)} ≤ √dξ ||f||_1
  ≤ √dξ (#supp(f))^{1/2} ||f||
  ≤ √dξ (#supp(f))^{1/2} ( (1/n!) Σ_{ξ′∈supp(Ff)} d_{ξ′} ||Ff(ξ′)||²_{HS(d_{ξ′})} )^{1/2},

where the two last bounds result from the Cauchy-Schwarz inequality and the Plancherel relation, respectively, with ||f||_1 := Σ_{σ∈Sn} |f(σ)|. Squaring this inequality, multiplying by dξ and summing over ξ ∈ supp(Ff) cancels the common factor Σ_{ξ∈supp(Ff)} dξ ||Ff(ξ)||²_{HS(dξ)} on both sides, and the desired bound (3) immediately follows.

References
[CJ10] Clémençon, S., Jakubowicz, J.: Kantorovich distances between rankings
with applications to rank aggregation. In: Proceedings of ECML 2010
(2010)
[CS01] Crammer, K., Singer, Y.: Pranking with ranking. In: NIPS (2001)
[CV09] Clémençon, S., Vayatis, N.: Tree-based ranking methods. IEEE Transac-
tions on Information Theory 55(9), 4316–4336 (2009)
[dEW06] desJardins, M., Eaton, E., Wagstaff, K.: Learning user preferences for sets
of objects. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg, A.,
Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503, pp. 273–280.
Springer, Heidelberg (2007)
[Dia88] Diaconis, P.: Group representations in probability and statistics. Institute
of Mathematical Statistics, Hayward (1988)
[Dia89] Diaconis, P.: A generalization of spectral analysis with application to
ranked data. The Annals of Statistics 17(3), 949–979 (1989)
[DS89] Donoho, D., Stark, P.: Uncertainty principles and signal recovery. SIAM
J. Appl. Math. 49(3), 906–931 (1989)
[FISS03] Freund, Y., Iyer, R.D., Schapire, R.E., Singer, Y.: An efficient boosting
algorithm for combining preferences. JMLR 4, 933–969 (2003)
[FM04] Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes.
JRSS 66(4), 815–849 (2004)
[FV86] Fligner, M.A., Verducci, J.S.: Distance based ranking models. JRSS Series
B (Methodological) 48(3), 359–369 (1986)
[FV88] Fligner, M.A., Verducci, J.S.: Multistage ranking models. JASA 83(403),
892–901 (1988)
[HFCB08] Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by
learning pairwise preferences. Artificial Intelligence 172, 1897–1917 (2008)
[HG09] Huang, J., Guestrin, C.: Riffled independence for ranked data. In: Pro-
ceedings of NIPS 2009 (2009)
[HGG09] Huang, J., Guestrin, C., Guibas, L.: Fourier theoretic probabilistic infer-
ence over permutations. JMLR 10, 997–1070 (2009)
[HTF09] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learn-
ing, 2nd edn., pp. 520–528. Springer, Heidelberg (2009)
[Kör89] Körner, T.: Fourier Analysis. Cambridge University Press, Cambridge
(1989)

[KB10] Kondor, R., Barbosa, M.: Ranking with kernels in Fourier space. In: Pro-
ceedings of COLT 2010 (2010)
[KLR95] Kahane, J.P., Lemarié-Rieusset, P.G.: Fourier series and wavelets. Rout-
ledge, New York (1995)
[Kon06] Kondor, R.: Snob: a C++ library for fast Fourier transforms on the sym-
metric group (2006), http://www.its.caltech.edu/~risi/Snob/
[LL03] Lebanon, G., Lafferty, J.: Conditional models on the ranking poset. In:
Proceedings of NIPS 2003 (2003)
[LM08] Lebanon, G., Mao, Y.: Non-parametric modeling of partially ranked data.
JMLR 9, 2401–2429 (2008)
[Mal57] Mallows, C.L.: Non-null ranking models. Biometrika 44(1-2), 114–130
(1957)
[MM09] Mandhani, B., Meila, M.: Tractable search for learning exponential models
of rankings. In: Proceedings of AISTATS 2009 (2009)
[MPPB07] Meila, M., Phadnis, K., Patterson, A., Bilmes, J.: Consensus ranking under
the exponential model. Proceedings of UAI 2007, 729–734 (2007)
[MS73] Matolcsi, T., Szücs, J.: Intersection des mesures spectrales conjuguées. C. R.
Acad. Sci. Sér. I Math. (277), 841–843 (1973)
[Mur38] Murnaghan, F.D.: The Theory of Group Representations. The Johns Hop-
kins Press, Baltimore (1938)
[PTA+ 07] Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., Salakoski, T.:
Learning to rank with pairwise regularized least-squares. In: Proceedings
of SIGIR 2007, pp. 27–33 (2007)
[RBEV10] Richard, E., Baskiotis, N., Evgeniou, T., Vayatis, N.: Link discovery using
graph feature tracking. In: NIPS 2010, pp. 1966–1974 (2010)
[RKJ07] Howard, A., Kondor, R., Jebara, T.: Multi-object tracking with represen-
tations of the symmetric group. In: Proceedings of ICML 2007 (2007)
[RS08] Rattan, A., Sniady, P.: Upper bound on the characters of the symmetric
groups for balanced Young diagrams and a generalized Frobenius formula.
Adv. in Math. 218(3), 673–695 (2008)
[Ser88] Serre, J.P.: Algebraic groups and class fields. Springer, Heidelberg (1988)
[TWH01] Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters
in a data set via the gap statistic. J. Royal Stat. Soc. 63(2), 411–423 (2001)
[WT10] Witten, D.M., Tibshirani, R.: A framework for feature selection in clus-
tering. JASA 105(490), 713–726 (2010)
[WX08] Wünsch, D., Xu, R.: Clustering. IEEE Press, Wiley (2008)
PerTurbo: A New Classification Algorithm
Based on the Spectrum Perturbations of the
Laplace-Beltrami Operator

Nicolas Courty1, Thomas Burger2, and Johann Laurent2

1 Université de Bretagne Sud, Université Européenne de Bretagne, Valoria
nicolas.courty@univ-ubs.fr
http://www-valoria.univ-ubs.fr/Nicolas.Courty/
2 Université de Bretagne Sud, Université Européenne de Bretagne, CNRS, Lab-STICC
thomas.burger@univ-ubs.fr
http://www-labsticc.univ-ubs.fr/~burger/

Abstract. PerTurbo, an original, non-parametric and efficient classifi-


cation method is presented here. In our framework, the manifold of each
class is characterized by its Laplace-Beltrami operator, which is evalu-
ated with classical methods involving the graph Laplacian. The classifi-
cation criterion is established thanks to a measure of the magnitude of
the spectrum perturbation of this operator. The first experiments show
good performances against classical algorithms of the state-of-the-art.
Moreover, from this measure is derived an efficient policy to design sam-
pling queries in a context of active learning. Performances collected over
toy examples and real world datasets assess the qualities of this strategy.

1 Introduction

Let us consider a vector space X (classically, a subspace of Rn , n ∈ N), called


the input space, and a set of m labels U = {u_1, . . . , u_ℓ, . . . , u_m}. Moreover,
we assume that there is an unknown function h : X → U, which maps any item
x ∈ X onto a label u = h(x) ∈ U which identifies a class. The purpose of training
a classifier on X × U is to find a “good” estimation ĥ of h. To do so, we consider
a training set T = {(x1 , h(x1 )), . . . , (xN , h(xN ))}, made of N items, as well as
their corresponding labels. Any such (xi , h(xi )) ∈ X × U is called a training
example. Then, ĥ is derived so that it provides a mapping which is as correct
as possible on T , and following the Empirical Risk Minimization principle, it is
expected that ĥ will correctly map onto U other items x̃ of X for which h(x̃) is
unknown (called test samples).
Most of the time, T is assumed to have some coherence from a geometric point
of view in X and one seeks to characterize where items of T live [1]. Classically,
one defines for each class ℓ a latent space Y_ℓ spanned by T_ℓ, the set of all the
training examples with label u_ℓ. Then, when a new test sample x̃ is considered,
it is projected onto Y_ℓ, ∀ℓ, in order to estimate the similarity between x̃ and the


Fig. 1. The training examples (points in red) are used to evaluate a low-dimensional
space where the data live (in light blue). Then, classically, once a new sample (in green)
has to be labelled, it can be projected in this embedding, where an embedded metric
is used (a). In the case of PerTurbo (b), the deformation of the manifold is evaluated
thanks to the Laplace-Beltrami operator.

T_ℓ, as is illustrated in Figure 1(a). Finally, x̃ is classified according to the class


to which it is the most similar.
In the literature [1], there are many such classification algorithms. Nonethe-
less, the main criticism is that these methods are parametric, or that strong
assumptions are made, such as Gaussianity and linearity (PCA + Mahalanobis
distance), or Gaussianity and countability (Hidden Markov Models). In this pa-
per, we propose a non-parametric algorithm, efficient on datasets which may be
embedded in non-linear manifolds. It is based on the following principle:
1- Each T_ℓ is embedded in a dedicated manifold M_ℓ, which is characterized
by means of tools derived from Riemannian geometry: an approximation of the
Laplace-Beltrami operator. The latter is used to define the Y_ℓ.
2- The test sample x̃ is considered. Its similarity with each class u_ℓ is computed
as follows: we consider the manifold M̃_ℓ which embeds {T_ℓ, x̃}, i.e. the original
training set T_ℓ to which the test sample x̃ has been added, and we quantify
the differences between the spectra of M̃_ℓ and M_ℓ, as is illustrated in Figure
1(b). In other words, we quantify the perturbation of the manifold M_ℓ when x̃
is added to class u_ℓ.
3- x̃ is classified into the class the manifold of which was the least perturbed
by its adjunction.
The characterization of a Riemannian manifold by means of the Laplace-
Beltrami operator has been known for a long time [2], and has been exploited
thoroughly in the machine learning community [3,4]. In the context of computer
vision and graphics, this theoretical framework has been used for 3D mesh
manipulation [5,6] or, as an example, for large graph matching of 3D meshes [7].
Among others, it has recently been proposed to quantify the interest of a sample
in a 3D mesh resampling task by use of matrix perturbation theory [8]. The main
contributions of this paper are the following:

1- The adaptation of the perturbation measure used in mesh resampling to


define a dissimilarity measure between a manifold and a sample, so that a new
classification algorithm, called PerTurbo, can be derived;
2- Links between these computer graphics tools and classical methods in ma-
chine learning are provided;
3- Finally, we show the interest of PerTurbo for active learning problems.
The paper is organized as follows: Section 2 goes over the state-of-the-art. Section
3 is the core of the paper, in which the PerTurbo algorithm is derived and
formal justifications are given. Finally, experimental assessments are described
in Section 4.

2 State-of-the-Art
Let us consider a Riemannian manifold M (i.e., differentiable everywhere). Its
geometric structure can be expressed by defining the Laplace-Beltrami opera-
tor Δ(.), which can be seen as a generalization of the Laplacian operator over
Riemannian manifolds. It completely defines the manifold up to an isometry [5].
Performing an eigenanalysis of this operator makes it possible to conduct various
tasks on the manifold. Let u_i(x) be the eigenfunctions and λ_i their corresponding
eigenvalues. They are the solutions to the equation:

$$\Delta(M) \cdot u_i(x) = \lambda_i \cdot u_i(x). \qquad (1)$$

Unfortunately, finding an analytic expression of this operator is generally not
possible. However, it is established in [9] that the kernel function in the integral
transform which models the propagation of heat on a graph along time,
noted H_t(.), and called the Heat Kernel, shares some properties with the Laplace-
Beltrami operator. Notably, we have $H_t(M) = e^{-t\Delta(M)}$. For a short propagation
time period, the exponential converges to its first order terms and we have [9]:

$$\Delta(M) = \lim_{t \to 0} \frac{I - H_t(M)}{t}. \qquad (2)$$

Hence the Heat Kernel can be used as a good approximation for the Laplace-
Beltrami operator Δ(M). Unfortunately, the Heat Kernel may also not be known
on M. However, the latter can be robustly approximated by the Gaussian kernel
applied to a sample T of M. Finally, in [8,9], the characterization of a manifold
M by means of the spectrum of its Laplace-Beltrami operator is approximated
by the spectrum of K(T), the Gram matrix of T, where the Gaussian kernel
k(., .) is used as a dot product. Hence, the (i-th, j-th) term of K(T) is:

$$K_{ij}(T) = k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \qquad (3)$$

For the sake of compact notation, we simply write K instead of K(T), and we
consider the spectrum of K as an approximation of the spectrum of Δ(M) that
should be used for the characterization of M.
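To make this concrete, here is a minimal numpy sketch (function names are ours) of this approximation: the Gaussian Gram matrix of Eq. (3) is built from a sample of the manifold, and its spectrum stands in for that of Δ(M).

# Minimal sketch: approximate the Laplace-Beltrami spectrum of a sampled
# manifold by the spectrum of the Gaussian Gram matrix K of Eq. (3).
import numpy as np

def gaussian_gram(T, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((T[:, None, :] - T[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

T = np.random.randn(50, 3)           # 50 samples of a manifold embedded in R^3
K = gaussian_gram(T, sigma=1.0)
spectrum = np.linalg.eigvalsh(K)     # K is symmetric: real eigenvalues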

Besides, as explained in the spectral graph theory framework [10,4,3,11], the
Laplace-Beltrami operator shares a lot of properties with operators defined on
the weighted graph of a manifold (called graph matrices). Notably, it is well-
known [12] that Δ shares a common spectrum with the generalized graph Lapla-
cian L. Let us note G the fully connected graph (T, E), the vertices T of which
are sampled elements of M, and the edges E of which are fitted with weights
given by a symmetric matrix W. The latter encodes a similarity metric on M,
and W_ij ≥ 0. Then, L is defined such that $L_{ij} = W_{ij}$ if $i \neq j$, and $L_{ii} = -\sum_{j \neq i} W_{ij}$
otherwise. The spectrum of L is particularly interesting to efficiently capture the
geometry of G: the multiplicity of eigenvalue 0 provides the number of con-
nected components of G, and the corresponding eigenfunctions are (1) indicator
functions of these connected components, and (2) invariant measures of random
walks evolving on these connected components.
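As a small illustration of this property (a sketch under the sign convention above, with helper names of our own): for a graph made of two disconnected cliques, the eigenvalue 0 of L appears with multiplicity 2.

# The multiplicity of eigenvalue 0 of the graph Laplacian L equals the
# number of connected components of G (two 3-node cliques -> multiplicity 2).
import numpy as np

def graph_laplacian(W):
    """L_ij = W_ij for i != j, and L_ii = -sum_{j != i} W_ij."""
    L = W.copy()
    np.fill_diagonal(L, 0.0)              # ignore any self-weights
    np.fill_diagonal(L, -L.sum(axis=1))   # diagonal = minus the row sums
    return L

W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)                  # two disconnected 3-node cliques
eigvals = np.linalg.eigvalsh(graph_laplacian(W))
n_components = int(np.sum(np.abs(eigvals) < 1e-9))   # -> 2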
If there is no eigenfunction with eigenvalue 0, the k eigenfunctions with the
smallest eigenvalues are indicator functions of a clustering of G into k clusters
[13]. This result is derived from the previous one: the edges of G for which the
smallest perturbation would lead to a 0 weight are highlighted by the procedure,
so that separated connected components appear. This perturbation of the edges
can also be seen as a perturbation of the spectrum of L: we look for the graph
with k connected components, the graph Laplacian spectrum of which is the
closest to that of G.
More generally, the eigenfunctions with the smallest eigenvalues capture the
main part of the geometry of G. In other words, if the largest eigenvalues are
dropped, G is simplified (in the sense that fewer eigenfunctions are necessary
to describe it) in a manner which best preserves its geometry. If the edges of
G only encode local neighborhoods, only the local geometry is preserved, while
more global patterns are regularized, such as in Graph Laplacian Eigenmap [14]
(GLE).
GLE and similar techniques based on simplifying the geometry of a graph
thanks to spectral analysis are really useful to estimate a "smooth" function on
the graph, i.e. a function the behavior of which is rather homogeneous with re-
spect to the simplified geometry, such as, for instance, the function h which maps
the training examples and test samples to the labels. This is why they are rather
similar to matrix regularization methods [15]. On the other hand, they can also
be seen as dimensionality reduction methods [16]. More specifically, it appears
that GLE is a particular case of Kernel-PCA [17], as is established in [10].
The core idea of Kernel PCA [17] is to perform a principal component analysis
(PCA) in a feature space Z rather than in the input space X. If one wants to
embed a set of points T in a space Y of smaller dimensionality, the classical
PCA corresponds to finding the eigenvectors of the dim(Y) highest eigenvalues
of the following covariance matrix:

$$\mathrm{Cov}(\mathcal{T}) = \sum_{x_i \in \mathcal{T}} x_i \cdot x_i^{T}, \qquad (4)$$

whereas Kernel PCA considers the Gram matrix G(T), whose (i-th, j-th) el-
ement is $q(x_i, x_j)$, where q is a kernel function. If the kernel is the Euclidean
dot product, i.e. if $q_{i,j} = \langle x_i, x_j \rangle$, then G(T) turns out to have the same eigen-
vectors and eigenvalues as Cov(T) (up to a square root). Finally, the projection
$\tilde{x}_Y$ of any test sample x̃ in the latent space Y is given by the following linear
combination:

$$\tilde{x}_Y = \sum_{\ell=1}^{\dim(Y)} \left(\sum_{i=1}^{|\mathcal{T}|} \alpha_i^{\ell} \cdot q(\tilde{x}, x_i)\right) \cdot e_{\ell}, \qquad (5)$$

where $\alpha^{\ell} = \{\alpha_1^{\ell}, \ldots, \alpha_{|\mathcal{T}|}^{\ell}\}$ is the eigenvector of G(T) associated to the ℓ-th
eigenvalue, and $e_{\ell}$ the unit vector collinear to $\alpha^{\ell}$.
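For concreteness, the projection of Eq. (5) can be sketched as follows (function names are ours; numpy's eigh returns orthonormal eigenvectors sorted by ascending eigenvalue, hence the reversal):

# Sketch of the kernel-PCA projection of Eq. (5): coordinates of a test
# sample along the dim(Y) leading eigenvectors of the Gram matrix G(T).
import numpy as np

def kpca_project(T, x_new, kernel, dim_Y):
    G = np.array([[kernel(a, b) for b in T] for a in T])
    _, eigvecs = np.linalg.eigh(G)            # ascending eigenvalue order
    alphas = eigvecs[:, ::-1][:, :dim_Y]      # dim(Y) leading (unit) eigenvectors
    k_new = np.array([kernel(x_new, xi) for xi in T])
    return alphas.T @ k_new                   # one coordinate per e_l

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
T = np.random.randn(30, 4)
coords = kpca_project(T, np.random.randn(4), rbf, dim_Y=2)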
In [10], it is established that GLE [14] can be re-interpreted as a particular
case of Kernel PCA, with a data-designed kernel. Let us consider the previous
graph G, but fit its edges with the weights $W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$,
i.e. W = K. Roughly, L can be interpreted as the transition rate matrix of a
random walk (a continuous time Markov chain) on G. Then, from L, let us define
another matrix T such that $T_{ij}$ corresponds to the expectation of the commutation
time of the random walk between $x_i$ and $x_j$. We have

$$T = \frac{1}{2}\int_0^{\infty} \left(e^{Lt} - \frac{\mathbb{1}}{|\mathcal{T}|^2}\right) dt, \qquad (6)$$

where $\mathbb{1}_{ij} = 1$. As can be seen in [10], finding the space spanned by the eigen-
vectors of the dim(Y) smallest eigenvalues of L is equivalent to finding the space
spanned by the eigenvectors of the dim(Y) largest eigenvalues of T. Then, per-
forming a Kernel PCA with kernel $q(x_i, x_j) = T_{ij}$ is equivalent to applying GLE.
The main interest of GLE with respect to generic kernel PCA is to provide
a more interpretable feature space. On the other hand, as pointed out in [10],
the projection of a test sample onto Y is not possible, as there is no analytical
form for the kernel: the consideration of a new sample would mean redefining
G as well as q, and re-performing the kernel PCA. Thus, GLE cannot be used
for classification.
In [8], the manifold M practically corresponds to the 3D surface of an ob-
ject that one wants to reconstruct from a mesh T which samples M . To make
sure that the reconstruction is the most accurate possible while minimizing the
number of points in the mesh, it is necessary to add or remove samples to T .
To evaluate the interest of a point in the mesh, the authors propose to estimate
the modification of the Laplace-Beltrami spectrum induced by the adjunction of
this point. This work is based on the interpretation of this perturbation estima-
tion in a kernel machine learning setting. In this paper, we propose to interpret
the perturbation measure from [8] as a dissimilarity criterion. We show that the
latter allows us to conduct classification according to the dimensionality reduction
performed in GLE. We also establish that this measure provides an efficient tool
to design queries in an active learning scheme.

3 A New Classification Method

3.1 A Kernel Machine View on the Perturbation Measure

The perturbation measure proposed in [8] is easily understandable in the frame-
work of the kernel trick [18]. First, as the Gram matrix of the Gaussian kernel
can be used as a good approximation for the Laplace-Beltrami operator, let
us consider the feature space corresponding to the Gaussian kernel. Let φ be
the mapping from the input space X to the feature space Z. In [8], it is es-
tablished that the perturbation involved by the projection r(x̃) of φ(x̃) on Y
(the subspace spanned by φ(T)) can be neglected, and that the perturbation is
mainly the result of o(x̃), the component of φ(x̃) which is orthogonal to Y. As
a consequence, a fair and normalized measure of the perturbation is given by
$\frac{\|o(\tilde{x})\|^2}{\|\phi(\tilde{x})\|^2} = 1 - \frac{\|r(\tilde{x})\|^2}{\|\phi(\tilde{x})\|^2}$. The norm of r(x̃) remains to be computed. The projec-
tion over the space spanned by φ(T) can be written as the following operator:
$\Phi(\Phi^T\Phi)^{-1}\Phi^T$ [19], if Φ is the matrix whose columns are the elements of φ(T).
We get:

$$\begin{aligned} \|r(\tilde{x})\|^2 &= \|\Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x})\|^2 = (\Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x}))^T \cdot \Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x}) \\ &= \phi(\tilde{x})^T\Phi((\Phi^T\Phi)^{-1})^T\Phi^T\Phi(\Phi^T\Phi)^{-1}\Phi^T\phi(\tilde{x}) \\ &= (\phi(\tilde{x})^T\Phi)\,((\Phi^T\Phi)^T)^{-1}\,(\Phi^T\phi(\tilde{x})) \qquad (7) \end{aligned}$$

which, with the kernel notation $k_{\tilde{x}} = \Phi^T\phi(\tilde{x})$ and noting that $K = \Phi^T\Phi$, gives
$\|r(\tilde{x})\|^2 = k_{\tilde{x}}^T K^{-1} k_{\tilde{x}}$. Finally, remarking that for a Gaussian kernel $\|\phi(\tilde{x})\|^2 = k(\tilde{x}, \tilde{x}) = 1$, the final perturbation measure reads [8]:

$$\tau(\tilde{x}, M) = 1 - k_{\tilde{x}}^T K^{-1} k_{\tilde{x}}. \qquad (8)$$
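Eq. (8) transcribes directly into code; the following sketch (names are ours) uses a linear solve instead of forming K⁻¹ explicitly, assuming K is invertible:

# Sketch of the perturbation measure of Eq. (8): tau = 1 - k^T K^{-1} k.
import numpy as np

def perturbation(x_new, T, sigma=1.0):
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))
    K = np.array([[k(a, b) for b in T] for a in T])
    k_x = np.array([k(x_new, xi) for xi in T])
    return 1.0 - k_x @ np.linalg.solve(K, k_x)   # in [0, 1] for a Gaussian kernel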

Fig. 2. Illustration of the measure on a spiral toy dataset: (a-b) The measure τ is
shown with respectively two different sigma parameters σ = 0.1 and σ = 1. (c) The
regularized τ measure according to equation (11) with σ = 1 and α = 0.1.

Practically, this measure belongs to [0, 1]. If τ(x̃, M) = 0, then the new
point does not modify the manifold, whereas if τ(x̃, M) = 1, the modification is
maximum, as is illustrated in Figure 3. From this measure, a natural class-wise
measure arises, where a dissimilarity to each class is derived from the perturba-
tion of the manifold $M_\ell$ associated to class ℓ:

$$\tau_\ell(\tilde{x}, M_\ell) = 1 - k_{\tilde{x}}^T K_\ell^{-1} k_{\tilde{x}}. \qquad (9)$$

Then, each new test sample x̃ is associated to the class with the least induced
perturbation, which reads as:

$$\underset{\ell}{\operatorname{argmin}}\ \tau_\ell(\tilde{x}, M_\ell). \qquad (10)$$
Finally, the PerTurbo algorithm simply relies on:

(1) Training step: the computation of the inverse of $K_\ell$, ∀ℓ ≤ m. This cor-
responds to a computational complexity of $o(m \times N_m^3)$, $N_m$ being the mean
number of training examples per class.
(2) Testing step: the application of Eqs. 9 and 10, which only involves basic
vector-matrix multiplications to evaluate the perturbation measure m times.

Figures 3(a) and 3(d) illustrate the behavior of the estimated classification
function (ĥ, according to the notation defined in the introduction) on different
toy examples.
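The two steps translate almost line for line into code. The sketch below (class and helper names are ours, with no regularization, so each K_ℓ is assumed invertible) follows Eqs. (9) and (10):

# Minimal PerTurbo sketch: invert K_l per class at training time (step 1),
# then label a test sample by the least-perturbed class manifold (step 2).
import numpy as np

class PerTurboSketch:
    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def _gram(self, A, B):
        d = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d / (2 * self.sigma ** 2))

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.T_ = {c: X[y == c] for c in self.classes_}
        self.K_inv_ = {c: np.linalg.inv(self._gram(T, T))   # training step (1)
                       for c, T in self.T_.items()}
        return self

    def predict(self, X):
        taus = []
        for c in self.classes_:
            Kx = self._gram(X, self.T_[c])   # one k_x per test sample (rows)
            taus.append(1.0 - np.einsum("ij,jk,ik->i", Kx, self.K_inv_[c], Kx))  # Eq. (9)
        return self.classes_[np.argmin(np.stack(taus, axis=1), axis=1)]          # Eq. (10)

X = np.random.randn(60, 2)
y = (X[:, 0] > 0).astype(int)
preds = PerTurboSketch(sigma=0.5).fit(X, y).predict(X)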

3.2 Perturbation Measure and Regularization Techniques


PerTurbo works as long as the perturbation measure $\tau_\ell$ is defined for all the
classes. Hence, PerTurbo is always defined as long as $K_\ell$ is invertible ∀ℓ ≤ m.
If any of these matrices is not invertible, then it is always possible to consider
the pseudo-inverse: roughly, it corresponds to only inverting the eigenvalues
which are strictly greater than zero, and dropping the others. This logic can
be extended to any cut in the spectrum of $K_\ell^{-1}$, by dropping some eigenvalues
of too small norm, as is explained in Section 2. Hence, if we consider only the
greatest eigenvalues of $K_\ell^{-1}$ (i.e. only the smallest nonzero eigenvalues of $K_\ell$),
then we practically consider for each class ℓ the subspace $Y_\ell$ corresponding to
the application of the GLE algorithm for dimensionality reduction to each set
of examples $T_\ell$.
It is also possible to consider regularization techniques to find a close invertible
matrix: for instance, in the case of Tikhonov regularization, one considers

$$\tilde{K} = K + \alpha I, \qquad (11)$$

where I is the identity matrix, and $\alpha \in \mathbb{R}_+^*$. In such a case, it appears that the
perturbation measure τ(.) shares important similarities with the kernel Maha-
lanobis distance derived from the regularized covariance operator [20]. The main
difference derives from the fact that the measure τ is normalized and thus al-
lows for comparisons between several classes. Hence, despite the lack of centered
data, the perturbation measure can be pictured as a Mahalanobis distance in
the feature space. Let us note that, as indicated in [20], the kernel Mahalanobis
distance is not affected by kernel centering.

Fig. 3. Illustration of the classification procedure on two different datasets: (a-b-c) two
imbricated spirals and (d-e-f) a mixture of 3 Gaussians. In both cases, the lefthand
images (a-d) present the classification with the original PerTurbo measure, whereas the
center images (b-e) present the spectrum cut (95% of the eigenvalues were kept) and the
righthand images (c-f) present the classification with the regularized PerTurbo measure.
In the two examples, we have σ = 1 and α = 0.1.

Hence, beyond being alternatives to define an approximation of $K_\ell^{-1}$ when $K_\ell$ is
not invertible, these strategies are interesting to regularize the graph and, conse-
quently, the function that has to be trained on it. Finally, they can be used even
in the case of invertible matrices, in order to improve the efficiency of the classifi-
cation, by an adapted dimensionality reduction strategy, as illustrated in
Figure 3, which exhibits the classification behavior on two toy examples under
different regularization constraints. This issue will be quantitatively discussed in
the experimental section.
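Both strategies fit in a few lines; the sketch below (helper names of our own) implements the GLE-inspired spectrum cut used in the experiments of Section 4 and the Tikhonov regularization of Eq. (11), assuming K is symmetric positive definite:

# Two approximations of K^{-1}: a spectrum cut keeping the greatest
# eigenvalues of K^{-1} until they sum to 95% of its trace, and Eq. (11).
import numpy as np

def inv_spectrum_cut(K, keep=0.95):
    vals, vecs = np.linalg.eigh(K)   # ascending: smallest eigenvalues of K first
    inv_vals = 1.0 / vals            # hence largest eigenvalues of K^{-1} first
    frac = np.cumsum(inv_vals) / inv_vals.sum()
    n = int(np.searchsorted(frac, keep)) + 1
    return vecs[:, :n] @ np.diag(inv_vals[:n]) @ vecs[:, :n].T

def inv_tikhonov(K, alpha=0.1):
    return np.linalg.inv(K + alpha * np.eye(len(K)))   # Eq. (11)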

3.3 Active Learning


In most problems, unlabeled data (test samples) are numerous and free, whereas
labelled data (training examples) are usually not, as the labeling task is expen-
sive. Thus, the purpose of active learning is to improve the training by extracting
some knowledge from the huge amount of test samples available. To do so, the
training algorithm is allowed to query an oracle (classically, a human operator)
to get the label of particular test samples, the knowledge of which would greatly
improve the capability of the classifier. Thus, the key issue in active learning is
to derive methods to detect these particular test examples [21]. To do so, rather
intuitive and efficient strategies are based on defining particular regions of in-
terest in X, and on querying the test samples which live in them. More precisely,
it is particularly efficient to query near the expected boundaries of the classes.
Following the same idea, several other strategies can be defined, as described
in [21].
Depending on the classification algorithm, defining a region of interest around
the expected borders of the classes may be rather complicated. In this respect,
PerTurbo is particularly interesting, as such a region of interest can easily be
defined according to the level sets of the implicit surfaces derived from the po-
tentials of the $\tau_\ell(\tilde{x})$, ∀ℓ. In the case where there are only two classes, the border
corresponds to the zeros of $|\tau_1(.) - \tau_2(.)|$. Then, the region of the query should
be defined around the border:

$$B = \{x \in \mathcal{X} \ \big| \ |\tau_1(x) - \tau_2(x)| < \gamma\} \qquad (12)$$

and it can easily be represented, as depicted in Fig. 4. Similarly, in the case of
more than two classes, the region to query corresponds to the neighborhood of
the border between any pair of classes. Let us consider the two classes which
are the least perturbed by the sample x, and let us consider the correspond-
ing perturbation functions $\tau_{r(1)}(x)$ (r(1) being the least perturbed class) and
$\tau_{r(2)}(x)$ (r(2) being the second least perturbed class):

$$B = \{x \in \mathcal{X} \ \big| \ |\tau_{r(1)}(x) - \tau_{r(2)}(x)| < \gamma\} \qquad (13)$$

As pictured in Fig. 4, this region corresponds to the neighborhood of the expected
borders of the classes. In the experimental section, we illustrate the efficiency of
this query strategy for active learning with PerTurbo.
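Given the matrix of class-wise perturbation measures (e.g., as returned per class by a PerTurbo-style classifier), the rule of Eq. (13) can be sketched as follows; instead of a hard threshold γ, the sketch follows the protocol of Section 4.2 and returns the samples with the smallest margins:

# Sketch of the query strategy of Eq. (13): query unlabeled samples whose
# two least perturbation values are the closest, i.e. near a class border.
import numpy as np

def select_queries(taus, n_queries):
    """taus: (n_samples, n_classes) matrix of perturbation measures."""
    two_smallest = np.sort(taus, axis=1)[:, :2]        # tau_r(1)(x), tau_r(2)(x)
    margin = two_smallest[:, 1] - two_smallest[:, 0]   # |tau_r(1)(x) - tau_r(2)(x)|
    return np.argsort(margin)[:n_queries]              # smallest margins first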

4 Experimental Assessment
In this part, we provide experimental comparisons between PerTurbo and the
state-of-the-art. First, we only consider off-line classification, without any
active learning method. Second, we focus on the relevance of the query strategies.

4.1 Classification Performances


In this section, we apply PerTurbo to various datasets, and we compare the
resulting performances to the state-of-the-art.
We consider two kinds of datasets. First, there are three synthetic datasets
that can be easily reproduced. Second, there are ten real datasets that are pub-
licly available on the UCI Machine Learning Repository [22]. They are described
in Table 1. In the case of simulated datasets, we use a Gaussian Mixture Model
(GMM). For most of the datasets, no predefined training and test sets exist, so
we randomly define them, so that 20% of the datasets are used for training, and
the classification process is repeated ten times so that means and variances of

Fig. 4. Illustration of the estimation of the location of the border (top), and selection
of the region to query (bottom), for toy examples: A two spiral case (left) and a three
Gaussian case (right).

the performances can be computed. On the other hand, for datasets with pre-
defined training and testing set (such as Hill-valley), we keep those provided so
that the results are reproducible.
Three versions of PerTurbo are considered. In PerTurbo (full), we do not
reduce the dimensionality of $K_\ell^{-1}$. In PerTurbo (gle), on the contrary, we consider
only the highest eigenvalues of $K_\ell^{-1}$, so that they sum up to 95% of its trace, in
a GLE-inspired way. Although it is possible, here we do not try to optimize the
dimensionality reduction according to the dataset. Finally, in PerTurbo (reg),
we apply a regularization to $K_\ell$. Cases where $K_\ell$ is not invertible are mentioned
in the experiments; theoretically this should not happen, unless two samples are
very close to each other, or due to numerical precision issues. Finally, the tuning
of the σ parameter (from the Gaussian kernel) and of the α parameter, when
needed, is obtained by a classical cross-validation procedure operated on the
learning set. Those parameters were set once for all the classes in a given dataset,
although one value per class could have been obtained.

Table 1. Description of the GMM simulated datasets and of the datasets from UCI

Datasets #Training #Tests #Classes #Variables Comments


SimData-1 200 800 10 19 64 components
SimData-2 200 800 10 26 64 components
SimData-3 200 800 10 31 75 components
Ionosphere 71 280 2 34
diabets 154 614 2 8 missing values
Blood-transfusion 150 598 2 4
Ecoli 67 269 8 7 too small for CV
Glasses 43 171 6 9
Wines 36 142 3 13
Parkinsons 39 156 2 22
Letter-reco 4000 16000 26 16
Hill-valley1 606 606 2 100 50% unlabeled
Hill-valley2 606 606 2 100 50% unlabeled

For comparisons, we consider Support Vector Machines (SVM) and k Nearest


Neighbors (k-NN). SVM provides some of the highest classification performances
in the state-of-the-art. Moreover, numerous libraries implement it with several
additional tools which help to tune the numerous parameters required in the
use of SVM, or better, provide an automatic procedure to select the optimum
parameters by means of cross-validation. This guarantees that the reference per-
formances are optimized. On the other hand, k-NN is completely intuitive and
requires the tuning of a single parameter (k the number of considered neigh-
bors), so that it provides a naive and easy-to-interpret lower bound for any
non-parametric model. Moreover, the difference of the performances between
k-NN and SVM provides an indication of the sensitivity of the performances to
the choice of the algorithm.
For SVM, we use the R toolbox kernlab [23]. To choose the hyperparameters
of the algorithm, we simply test all the available kernels and choose the most
efficient one, which appears to be the Gaussian kernel. As a matter of fact, on all
the datasets but one, the Gaussian kernel provides the best results, sometimes
with ties, sometimes with a strong difference. On the remaining one, the Laplace
kernel performed very slightly better. Thus, the choice of the Gaussian kernel for
all the datasets is reasonable. Then, the tuning of the parameters is automatically
achieved thanks to a 5-fold cross-validation. For problems with more than two
classes, a 1-versus-1 combination scheme is considered.
Concerning k-NN, the algorithm is applied for all the possible values of k ∈ I,
and only the highest performance is considered. Practically, the best value for k
is sought in $I = \left[\!\left[1,\ 2 \cdot \lfloor N_m \rfloor\right]\!\right]$, where $N_m$ is the mean number of training examples per
class and $\lfloor . \rfloor$ represents the integer part. Depending on the number of classes
and on k, ties are possible in the k-NN procedure; they are then randomly
solved. In such a case, the performances vary, and we provide their expectation
by computing the mean over 10 repeated classifications.

Table 2. Comparison of the accuracy rates (percentages) with the various classifica-
tion algorithms. For simulated datasets, the mean and standard deviation over ten-fold
repetition is given. N.I. stands for "not invertible", and refers to the inversion of K.

Datasets PerTurbo (full) PerTurbo (gle) PerTurbo (reg) SVM k-NN


SimData-1 78.7(1.9) 79.0 (1.8) 78.7 (1.9) 76.7 (1.3) 75.7 (1.6)
SimData-2 54.0 (2.4) 54.7 (2.5) 54.0 (2.4) 45.0 (1.8) 40.5 (1.9)
SimData-3 19.5 (1.2) 19.7 (1.2) 19.5 (1.2) 16.2 (1.3) 16.7 (1.7)
Ionosphere 91.9 (2.5) 91.5 (2.0) 92.1 (1.6) 92.5 (1.8) 84.2 (1.3)
diabets 71.0 (3.0) 71.6 (3.2) 72.6 (2.2) 74.0 (1.7) 71.2 (1.9)
Blood-transfusion N.I. 74.5 (2.3) 76.9 (1.0) 77.6 (1.5) 75.1 (0.9)
Ecoli 82.4 (2.8) 82.7 (2.45) 83.7 (2.5) 83.2 (2.2) 82.5 (1.8)
Glass 65.4 (2.9) 64.5 (3.9) 65.4 (2.9) 60.6 (4.4) 63.5 (2.4)
Wines 70.9 (3.3) 72.60 (1.3) 70.5 (2.8) 96.1 (1.7) 70.5 (2.1)
Parkinsons 82.8 (1.9) 82.8 (1.8) 83.8 (1.3) 85.0 (3.3) 82.5 (1.1)
Letter-reco N.I. 92.7 (0.2) 92.5 (0.3) 91.9 (0.3) 90.0 (0.4)
Hill-valley1 60.5 53.8 60.7 56.4 (2.4) 62.0
Hill-valley2 59.6 54.1 59.9 55.3 (1.0) 56.3

Accuracy rates are given in Table 2. Let us note that, in this table, k-NN ac-
curacies are over-estimated, as the k parameter is not estimated during a cross-
validation, but directly on the test set. This explains why its performances are
sometimes higher than with SVM. On the contrary, SVM performances are opti-
mized with cross-validation, as SVM is much too sensitive to (hyper-)parameter
tuning. Finally, PerTurbo evaluations are loosely optimized, and a common tuning
for each class is used. Hence, the comparison is sharp with respect to PerTurbo,
and loose with respect to the reference methods. From the accuracy rates, it ap-
pears that, except for the Wine dataset, where SVM completely outperforms
both PerTurbo and k-NN, there is no strict and major dominance of one method
over the others. Moreover, the relative interest of each method strongly depends
on the dataset, as suggested by the No Free Lunch theorem [24]. There
are numerous datasets promoting PerTurbo (SimData-1, SimData-2, SimData-
3, Glass, Hill-valley1, Hill-valley2), whereas some others promote SVM (Diabetes,
Blood-transfusion, Wine, Parkinsons), while the remaining ones are not significant
to determine a best method (Ionosphere, Ecoli, Letter-recognition), as the perfor-
mances are similar. Hence, even if, from these tests, it is impossible to assess the
superiority of PerTurbo, it appears that this framework is able to provide classi-
fication methods comparable to those of the state-of-the-art, and we are confident
that, in the future, more elaborate and tuned versions of PerTurbo will provide
even more accurate results.
Concerning the variations of PerTurbo (full, gle, reg), it appears that the
best method is not always the same, and that accuracies strongly depend on
the datasets. Hence, further investigations are required to find the most adapted
variation, as well as a methodology to tune the corresponding parameters.

Beyond these tests, we have observed that, in general, PerTurbo is less ro-
bust/accurate than SVM when:
(1) values of some variables are missing or when there are integer-valued
variables (binary variables being completely unadapted),
(2) the number of examples is very small, as the manifold is difficult to charac-
terize (however, the cross-validation for the tuning of the numerous parameters
of SVM is also not possible with too few examples).
On the other hand, we have observed that when the number of classes is
important, PerTurbo is more efficient than SVM. Moreover, in the presence of nu-
merous noisy variables, dimensionality reduction or regularization techniques
provide a fundamental advantage to PerTurbo with respect to SVM. Finally,
several other elements promote PerTurbo: first, its extreme algorithmic sim-
plicity, comparable to that of a PCA; second, its simplicity of use and the re-
stricted number of parameters; third, its high computational efficiency, for both
training and testing. As a matter of fact, the training is so efficient that active
learning procedures requiring several iterations of the training are easily
manageable.

4.2 Active Learning Evaluation


Now, we focus on the active part of the training algorithm, and more specifically,
we validate that the perturbation measure used in PerTurbo is adapted to derive
and implement one of the most classical strategies to define queries (i.e. where
queries are performed around the borders of the classes).
We consider both PerTurbo and k-NN. For these two algorithms, we construct
the queries according to the perturbation measure, as described in Section
3.3. Hence, we show that the validity of the latter for active learning is inde-
pendent of the classification algorithm, although it naturally fits best with PerTurbo.
Practically, between each training, the training set is enriched with a single sam-
ple, chosen randomly amongst the samples which correspond to the 10% of
the test set for which the criterion of Eq. 13 is the smallest. These
results are compared to a reference strategy, where the queries are randomly cho-
sen from the test dataset, according to a uniform distribution. Such a reference
strategy is equivalent to computing the off-line performances of the classification
algorithm for various sizes of the training dataset. Due to the random selection
of the samples, the evolution of the accuracy of the classifiers may be rather
noisy. Thus, the process is repeated 10 times, and we only provide means and
variances. The results are given in Fig. 5 for three different datasets: Ecoli,
Wine, and SimData-1. They clearly illustrate the efficiency of the algorithm, as the
active learning strategy always outperforms the reference one: at the end of the
process, the differences of performances correspond to 14.2, 27.8 and 12.9 points
with k-NN (respectively for Ecoli, Wine and SimData-1), and to 12.9, 20.9
and 10.5 with PerTurbo.

Fig. 5. Comparison of the active learning strategy (red, upper trajectory) with the
reference one (blue lower trajectory), for SimData-1 (Top), Ecoli (middle) and Wine
(bottom) datasets. The classification algorithm is either PerTurbo (right column) or
k-NN (left column). The thick line represents the mean trajectory, and the shaded strip
around the mean represents the volatility.

5 Conclusion

In this paper, we have presented an original classification method based on the
analysis of the perturbations of the spectrum of an approximation of the Laplace-
Beltrami operator. Provided that the training examples of each class live on a
dedicated manifold whose dimension is much lower than the input space dimen-
sion, we propose to evaluate the modification induced by any test sample when
added to the manifold of each class, and to choose the class corresponding to the
manifold where the perturbation is the smallest. This makes it possible to de-
rive an interesting strategy for sample queries in an active learning scenario. The
method is very simple, easy to implement, and involves very few extra parameters
to tune. The experiments conducted with toy examples and real world datasets
showed performances comparable to or slightly better than classical methods of the
state-of-the-art, including SVM. Future work will focus on a more systematic
consideration of the parameters of the algorithm, as well as comparisons with
more powerful kernel-based methods.

References
1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Chichester
(2001)
2. Chavel, I.: Eigenvalues in Riemannian geometry. Academic Press, Orlando (1984)
3. Lafon, S., Lee, A.B.: Diffusion maps and coarse-graining: A unified framework for
dimensionality reduction, graph partitioning, and data set parameterization. IEEE
Transactions on Pattern Analysis and Machine Intelligence 28, 1393–1403 (2006)
4. Nadler, B., Lafon, S., Coifman, R., Kevrekidis, I.: Diffusion maps, spectral cluster-
ing and eigenfunctions of fokker-planck operators. In: NIPS (2005)
5. Reuter, M., Wolter, F.-E., Peinecke, N.: Laplace-Beltrami spectra as "shape-DNA"
of surfaces and solids. Computer-Aided Design 38(4), 342–366 (2006)
6. Rustamov, R.: Laplace-beltrami eigenfunctions for deformation invariant shape
representation. In: Proc. of the Fifth Eurographics Symp. on Geometry Processing,
pp. 225–233 (2007)
7. Knossow, D., Sharma, A., Mateus, D., Horaud, R.: Inexact matching of large and
sparse graphs using laplacian eigenvectors. In: Torsello, A., Escolano, F., Brun, L.
(eds.) GbRPR 2009. LNCS, vol. 5534, pp. 144–153. Springer, Heidelberg (2009)
8. Öztireli, C., Alexa, M., Gross, M.: Spectral sampling of manifolds. ACM, New York
(2010)
9. Coifman, R.R., Lafon, S.: Diffusion maps. Applied and Computational Harmonic
Analysis 21(1), 5–30 (2006)
10. Ham, J., Lee, D., Mika, S., Schölkopf, B.: A kernel view of the dimensionality reduc-
tion of manifolds. In: Proc. of the International Conference on Machine learning,
ICML 2004, pp. 47–57 (2004)
11. Belkin, M., Sun, J., Wang, Y.: Constructing Laplace operator from point clouds
in Rd. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Dis-
crete Algorithms, SODA 2009, pp. 1031–1040. Society for Industrial and Applied
Mathematics, Philadelphia (2009)
12. Dey, T., Ranjan, P., Wang, Y.: Convergence, stability, and discrete approximation
of laplace spectra. In: Proc. of ACM-SIAM Symposium on Discrete Algorithms,
SODA 2010, pp. 650–663 (2010)
13. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing 17(4),
395–416 (2007)
14. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data
representation. Neural computation 15(6), 1373–1396 (2003)
15. Neumaier, A.: Solving ill-conditioned and singular linear systems: A tutorial on
regularization. Siam Review 40(3), 636–666 (1998)
16. Lee, J.A., Verleysen, M.: Nonlinear dimensionality reduction. Springer, Heidelberg
(2007)

17. Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computing 10(5), 1299–1319 (1998)
18. Aizerman, M.A., Braverman, E.M., Rozonoèr, L.: Theoretical foundations of the
potential function method in pattern recognition learning. Automation and remote
control 25(6), 821–837 (1964)
19. Meyer, C.: Matrix Analysis and Applied Linear Algebra. Society for Industrial and
Applied Mathematics, Philadelphia (2000)
20. Haasdonk, B., Pȩkalska, E.: Classification with Kernel Mahalanobis Distance Clas-
sifiers. In: Advances in Data Analysis, Data Handling and Business Intelligence,
Studies in Classification, Data Analysis, and Knowledge Organization, pp. 351–361
(2008)
21. Settles, B.: Active learning literature survey. Computer Sciences Technical Report
1648, University of Wisconsin–Madison (2009)
22. Frank, A., Asuncion, A.: UCI machine learning repository (2010)
23. Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab–an S4 package for
kernel methods in R. Journal of Statistical Software 11(9) (2004)
24. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE
Transactions on Evolutionary Computation 1(1), 67–82 (1997)
Datum-Wise Classification:
A Sequential Approach to Sparsity

Gabriel Dulac-Arnold1, Ludovic Denoyer1, Philippe Preux2, and Patrick Gallinari1

1 Université Pierre et Marie Curie - UPMC, LIP6
Case 169 - 4 Place Jussieu - 75005 Paris, France
firstname.lastname@lip6.fr
2 LIFL (UMR CNRS) & INRIA Lille Nord-Europe
Université de Lille - Villeneuve d'Ascq, France
philippe.preux@inria.fr

Abstract. We propose a novel classification technique whose aim is to


select an appropriate representation for each datapoint, in contrast to
the usual approach of selecting a representation encompassing the whole
dataset. This datum-wise representation is found by using a sparsity
inducing empirical risk, which is a relaxation of the standard L0 regular-
ized risk. The classification problem is modeled as a sequential decision
process that sequentially chooses, for each datapoint, which features to
use before classifying. Datum-Wise Classification extends naturally to
multi-class tasks, and we describe a specific case where our inference has
equivalent complexity to a traditional linear classifier, while still using
a variable number of features. We compare our classifier to classical L1
regularized linear models (L1 -SVM and LARS) on a set of common bi-
nary and multi-class datasets and show that for an equal average number
of features used we can get improved performance using our method.

1 Introduction
Feature Selection is one of the main contemporary problems in Machine Learning
and has been approached from many directions. One modern approach to feature
selection in linear models consists in minimizing an L0 regularized empirical risk.
This particular risk encourages the model to have a good balance between a
low classification error and high sparsity (where only a few features are used for
classification). As the L0 regularized problem is combinatorial, many approaches
such as the LASSO [1] try to address the combinatorial problem by using more
practical norms such as L1 . These approaches have been developed with two
main goals in mind: restricting the number of features for improving classification
speed, and limiting the used features to the most useful to prevent overfitting.
These classical approaches to sparsity aim at finding a sparse representation of
the features space that is global to the entire dataset.
* This work was partially supported by the French National Agency of Research
(Lampada ANR-09-EMER-007).


We propose a new approach to sparsity where the goal is to limit the number
of features per datapoint, thus datum-wise sparse classification (DWSC). This
means that our approach allows the choice of features used for classification to
vary relative to each datapoint; data points that are easy to classify can be in-
ferred on without looking at very many features, and more difficult datapoints
can be classified using more features. The underlying motivation is that, while
classical approaches balance between accuracy and sparsity at the dataset level,
our approach optimizes this balance at the individual datum level, thus resulting
in equivalent accuracy at higher overall sparsity. This kind of sparsity is inter-
esting for several reasons: First, simpler explanations are always to be preferred
as per Occam’s Razor. Second, in the knowledge extraction process, such datum-
wise sparsity is able to provide unique information about the underlying structure
of the data space. Typically, if a dataset is organized onto two different subspaces,
the datum-wise sparsity principle will allow the model to automatically choose
to classify using only the features of one or the other subspace.
DWSC considers feature selection and classification as a single sequential deci-
sion process. The classifier iteratively chooses which features to use for classifying
each particular datum. In this sequential decision process, datum-wise sparsity
is obtained by introducing a penalizing reward when the agent chooses to in-
corporate an additional feature into the decision process. The model is learned
using an algorithm inspired by Reinforcement Learning [2].
The contributions of the paper are threefold: (i.) We propose a new approach
where classification is seen as a sequential process where one has to choose which
features to use depending on the input being inferred upon. (ii.) This new ap-
proach results in a model that obtains good performance in terms of classification
while maximizing datum-wise sparsity, i.e. the mean number of features used for
classifying the whole dataset. It also naturally handles multi-class classification
problems, solving them by using as few features as possible for all classes com-
bined. (iii.) We perform a series of experiments on 14 different corpora and
compare the model with those obtained by the LARS [3], and a L1 -regularized
SVM, thus providing a qualitative study of the behaviour of our algorithm.
The paper is organized as follow: First, we define the notion of datum-wise
sparse classifiers and explain the interest of such models in Section 2. We
then describe our sequential approach to classification and detail the learning
algorithm and the complexity of such an algorithm in Section 3. We describe
how this approach can be extended to multi-class classification in Section 4. We
detail experiments on 14 datasets, and also give a qualitative analysis of the
behaviour of this model in Section 6. The related work is given in Section 7.

2 Datum-Wise Sparse Classifiers


We consider the problem of supervised multi-class classification1 where one wants
to learn a classification function fθ : X → Y to associate one category y ∈ Y to
a vector x ∈ X , where X = Rn , n being the dimension of the input vectors. θ is
1
Note that this includes the binary supervised classification problem as a special case.

the set of parameters learned from a training set composed of input/output pairs
Train = {(xi , yi )}i∈[1..N ] . These parameters are commonly found by minimizing
the empirical risk defined by:
$$\theta^* = \underset{\theta}{\operatorname{argmin}}\ \frac{1}{N}\sum_{i=1}^{N} \Delta(f_\theta(x_i), y_i), \qquad (1)$$
where Δ is the loss associated to a prediction error.
This empirical risk minimization problem does not consider any prior as-
sumption or constraint concerning the form of the solution and can result in
overfitting models. Moreover, when facing a very large number of features, ob-
tained solutions usually need to perform computations on all the features for
classifying any input, thus negatively impacting the model’s classification speed.
We propose a different risk minimization problem where we add a penalization
term that encourages the obtained classifier to classify using on average as few
features as possible. In comparison to classical L0 or L1 regularized approaches
where the goal is to constraint the number of features used at the dataset level,
our approach performs sparsity at the datum level, allowing the classifier to use
different features when classifying different inputs. This results in a datum-wise
sparse classifier that, when possible, only uses a few features for classifying
easy inputs, and more features for classifying difficult or ambiguous ones.
We consider a different type of classifier function that, in addition to predicting
a label y given an input x, also provides information about which features have
been used for classification. Let us denote Z = {0, 1}^n. We define a datum-wise
classification function f of parameters θ as:

$$f_\theta : \mathcal{X} \to \mathcal{Y} \times \mathcal{Z}, \qquad f_\theta(x) = (y, z),$$

where y is the predicted output and z is an n-dimensional vector z = (z^1, ..., z^n),
where z^i = 1 implies that feature i has been taken into consideration for comput-
ing label y on datum x. By convention, we denote the predicted label as y_θ(x)
and the corresponding z-vector as z_θ(x). Thus, if z_θ^i(x) = 1, feature i has been
used for classifying x into category y_θ(x).
This definition of datum-wise classifiers has two main advantages: First, as we
will see in the next section, because fθ can explain its use of features with zθ (x),
we can add constraints on the features used for classification. This allows us to
encourage datum-wise sparsity which we define below. Second, while this is not
the main focus of our article, analysis of zθ (x) gives a qualitative explanation
of how the classification decision has been made, which we study in Section 6.
Note that the way we define datum-wise classification is an extension to the
usual definition of a classifier.

2.1 Datum-Wise Sparsity


Datum-wise sparsity is obtained by adding a penalization term to the empirical
loss defined in equation (1) that limits the average number of features used for
classifying:
$$\theta^* = \underset{\theta}{\operatorname{argmin}}\ \frac{1}{N}\sum_{i=1}^{N} \Delta(y_\theta(x_i), y_i) + \lambda\,\frac{1}{N}\sum_{i=1}^{N} \|z_\theta(x_i)\|_0. \qquad (2)$$

The term $\|z_\theta(x_i)\|_0$ is the L0 norm2 of $z_\theta(x_i)$, i.e. the number of features selected
for classifying x_i, that is, the number of elements in $z_\theta(x_i)$ equal to 1. In the
general case, the minimization of this new risk results in a classifier that on
average selects only a few features for classifying, but may use a different set of
features w.r.t. the input being classified. We consider this to be the crux of
the DWSC model: the classifier takes each datum into consideration differently
during the inference process.
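For a fixed classifier output, the risk of Eq. (2) is straightforward to evaluate; a minimal sketch (names are ours):

# Sketch of the datum-wise regularized risk of Eq. (2): 0/1 loss plus
# lambda times the average per-datum L0 norm of the feature-usage vectors.
import numpy as np

def datum_wise_risk(y_pred, y_true, z, lam):
    """z: (N, n) binary matrix, z[i, j] = 1 iff feature j was used on datum i."""
    errors = np.mean(y_pred != y_true)    # (1/N) sum Delta(y_theta(x_i), y_i)
    sparsity = np.mean(z.sum(axis=1))     # (1/N) sum ||z_theta(x_i)||_0
    return errors + lam * sparsity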




Fig. 1. The sequential process for a problem with 4 features (f1 , ..., f4 ) and 3 possible
categories (y1 , ..., y3 ). Left: The gray circle is the initial state for one particular input
x. Small circles correspond to terminal states where a classification decision has been
made. In this example, the classification (bold arrows) has been made by sequentially
choosing to acquire feature 3 then feature 2 and then to classify x in category y1 . The
bold (red) arrows correspond to the trajectory made by the current policy. Right:
The value of zθ (x) for the different states are illustrated. The value on the arrows
corresponds to the immediate reward received by the agent assuming that x belongs to
category y1 . At the end of the process, the agent has received a total reward of 0 − 2λ.

Note that the optimization of the loss defined in equation (2) is a combinatorial
problem that cannot be easily solved. In the next section of this paper, we propose
an original way to deal with this problem, based on a Markov Decision Process.

3 Datum-Wise Sparse Sequential Classification

We consider a Markov Decision Problem (MDP, [4])3 to classify an input x ∈ Rn .


At the beginning, we have no information about x, that is, we have no at-
tribute/feature values. Then, at each step, we can choose to acquire a particular
2
The L0 ‘norm’ is not a proper norm, but we will refer to it as the L0 norm in this
paper, as is common in the sparsity community.
3
The MDP is deterministic in our case.

feature of x, or to classify x. The act of classifying x in the category y ends an


“episode” of the sequential process. The classification process is a deterministic
process defined by:
– A set of states X × Z, where state (x, z) corresponds to the state where the
agent is currently classifying datum x and has selected features specified by
z. The number of currently selected features is thus $\|z\|_0$.
– A set of actions A where A(x, z) denotes the set of possible actions in state
(x, z). We consider two types of actions:
• Af is the set of feature selection actions Af = {f1 , . . . , fn } such that, for
a ∈ Af , a = fj corresponds to choosing feature j. Action fj corresponds to
a vector with only the j th element equal to 1, i.e. fj = (0, . . . , 1, . . . , 0).
Note that the set of possible feature selection actions on state (x, z),
denoted Af (x, z), is equal to the subset of currently unselected features,
i.e. Af (x, z) = {fj , s.t. zj = 0}.
• Ay is the set of classification actions Ay = Y, that correspond to assign-
ing a label to the current datum. Classification actions stop the sequential
decision process.
– A transition function, defined only for feature selection actions (since classi-
fication actions are terminal):

$$\mathcal{T} : \mathcal{X} \times \mathcal{Z} \times \mathcal{A}_f \to \mathcal{X} \times \mathcal{Z}, \qquad \mathcal{T}((x, z), f_j) = (x, z'),$$

where z′ is an updated version of z such that z′ = z + f_j.

Policy. We define a parameterized policy π_θ which, for each state (x, z), returns
the best action as defined by a scoring function s_θ(x, z, a):

$$\pi_\theta : \mathcal{X} \times \mathcal{Z} \to \mathcal{A} \quad \text{and} \quad \pi_\theta(x, z) = \underset{a}{\operatorname{argmax}}\ s_\theta(x, z, a).$$

The policy πθ decides which action to take by applying the scoring function to
every action possible from state (x, z) and greedily taking the highest scoring
action. The scoring function reflects the overall quality of taking action a in
state (x, z), which corresponds to the total reward obtained by taking action a
in (x, z) and thereafter following policy π_θ4:

$$s_\theta(x, z, a) = r(x, z, a) + \sum_{t=1}^{T} \left(r_\theta^t \mid (x, z), a\right).$$

Here $(r_\theta^t \mid (x, z), a)$ corresponds to the reward obtained at step t while having
started in state (x, z) and followed the policy with parameterization θ for t steps.
Taking the sum of these rewards gives us the total reward from state (x, z) until
the end of the episode. Since the policy is deterministic, we may refer to a
parameterized policy using simply θ. Note that the optimal parameterization θ∗
obtained after learning (see Sec. 3.3) is the parameterization that maximizes the
expected reward in all state-action pairs of the process.
4
This corresponds to the classical Q-function in Reinforcement Learning.

In practice, the initial state of such a process for an input x corresponds to


an empty z vector where no feature has been selected. The policy θ sequentially
picks, one by one, a set of features pertinent to the classification task, and then
chooses to classify once enough features have been considered.

Reward. The reward function reflects the immediate quality of taking action
a in state (x, z) relative to the problem at hand. We define a reward function
over the training set $(x_i, y_i) \in \mathcal{T}$, $r : \mathcal{X} \times \mathcal{Z} \times \mathcal{A} \to \mathbb{R}$, which reflects how
good a decision taking action a in state (x_i, z) for input x_i is relative to our
classification task. This reward is defined as follows5:
– If a corresponds to a feature selection action, then the reward is −λ.
– If a corresponds to a classification action, i.e. a = y, we have:

$$r(x_i, z, y) = \begin{cases} 0 & \text{if } y = y_i \\ -1 & \text{if } y \neq y_i \end{cases}$$

In practice, we set λ ≪ 1 to avoid situations where classifying incorrectly is a
better decision than choosing multiple features.
better decision than choosing multiple features.

3.1 Reward Maximization and Loss Minimization


As explained in section 2, our ultimate goal is to find the parameterization
θ∗ that minimizes the datum-wise empirical loss defined in equation (2). The
training process for the MDP described above is the maximization of a reward
function. Let us therefore show that maximizing the reward function is equivalent
to minimizing the datum-wise empirical loss.
$$\begin{aligned}
\theta^* &= \underset{\theta}{\operatorname{argmin}}\ \frac{1}{N}\sum_{i=1}^{N} \Delta(y_\theta(x_i), y_i) + \lambda\,\frac{1}{N}\sum_{i=1}^{N} \|z_\theta(x_i)\|_0 \\
&= \underset{\theta}{\operatorname{argmin}}\ \frac{1}{N}\sum_{i=1}^{N} \big(\Delta(y_\theta(x_i), y_i) + \lambda\,\|z_\theta(x_i)\|_0\big) \\
&= \underset{\theta}{\operatorname{argmax}}\ \frac{1}{N}\sum_{i=1}^{N} \big(-\Delta(y_\theta(x_i), y_i) - \lambda\,\|z_\theta(x_i)\|_0\big) \\
&= \underset{\theta}{\operatorname{argmax}}\ \frac{1}{N}\sum_{i=1}^{N} \begin{cases} 0 - \lambda\,\|z_\theta(x_i)\|_0 & \text{if } y_\theta(x_i) = y_i \\ -1 - \lambda\,\|z_\theta(x_i)\|_0 & \text{if } y_\theta(x_i) \neq y_i \end{cases} \\
&= \underset{\theta}{\operatorname{argmax}}\ \frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T_\theta(x_i)+1} r\big(x_i, z_\theta^{(t)}(x_i), \pi_\theta(x_i, z_\theta^{(t)})\big),
\end{aligned}$$

where $\pi_\theta(x_i, z_\theta^{(t)})$ is the action taken at time t by the policy $\pi_\theta$ for the training
example $x_i$.
⁵ Note that we can add $-\lambda \cdot \|z\|_0$ to the reward at the end of the episode, and give a constant intermediate reward of 0. These two approaches are interchangeable.

Such an equivalence between risk minimization and reward maximization shows that the optimal classifier θ* corresponds to the optimal policy in the MDP defined previously. This equivalence allows us to use classical MDP resolution algorithms in order to find the best classifier. We detail the learning procedure in Section 3.3.

3.2 Inference and Approximated Decision Processes

Due to the infinite number of possible inputs x, the number of states is also
infinite. Moreover, the reward function r(x, z, a) is only known for the values
of x that are in the training set and cannot be computed for any other input.
For these two reasons, it is not possible to compute the score function for all
state-action pairs in a tabular manner, and this function has to be approximated.
The scoring function that underlies the policy, $s_\theta(x, z, a)$, is approximated with a linear model:⁶

$$s_\theta(x, z, a) = \langle \Phi(x, z, a), \theta \rangle$$

and the policy defined by such a function consists in taking, in state (x, z), the action a that maximizes the scoring function, i.e. $a = \operatorname*{argmax}_{a' \in A} \langle \Phi(x, z, a'), \theta \rangle$.
Due to their infiniteness, the state-action pairs are represented in a feature space. We denote by Φ(x, z, a) the featurized representation of the state-action pair ((x, z), a). Many definitions may be used for this feature representation, but we propose a simple projection: we restrict the representation of x to only the selected features. Let μ(x, z) be the restriction of x according to z:

$$\mu(x, z)_i = \begin{cases} x_i & \text{if } z_i = 1 \\ 0 & \text{otherwise} \end{cases}$$

To be able to differentiate between an attribute of x that is not yet known and an attribute that is simply equal to 0, we must keep the information present in z. Let φ(x, z) = (z, μ(x, z)) be the intermediate representation that corresponds to the concatenation of z with μ(x, z). Now we simply need to encode the action a in a manner that allows each action to be easily distinguished by a linear classifier. To do this we use the block-vector trick [5], which consists in projecting φ(x, z) into a higher-dimensional space such that the position of φ(x, z) inside the global vector Φ(x, z, a) depends on the action a:

$$\Phi(x, z, a) = (0, \ldots, 0,\; \varphi(x, z),\; 0, \ldots, 0).$$

In Φ(x, z, a), the block φ(x, z) is at position ia · |φ(x, z)| where ia is the index
of action a in the set of all the possible actions. Thus, φ(x, z) is offset by an
amount dependent on the action a.
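A minimal NumPy sketch of this construction, assuming binary 0/1 entries in z (function names are ours):

```python
import numpy as np

def phi(x, z):
    # Intermediate representation phi(x, z) = (z, mu(x, z)): the mask z
    # concatenated with x restricted to the selected features.
    return np.concatenate([z, x * z])

def Phi(x, z, action_index, n_actions):
    # Block-vector trick: place phi(x, z) in the block of a larger zero
    # vector whose offset is determined by the action index.
    block = phi(x, z)
    out = np.zeros(n_actions * block.size)
    out[action_index * block.size:(action_index + 1) * block.size] = block
    return out
```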
⁶ Although non-linear models such as neural networks may be used, we have chosen to restrict ourselves to a linear model to be able to properly compare performance with that of other state-of-the-art linear sparse models.

3.3 Learning
The goal of the learning phase is to find an optimal policy parameterization θ∗
which maximizes the expected reward, thus minimizing the datum-wise regu-
larized loss defined in (2). As explained in Section 3.2, we cannot exhaustively
explore the state space during training, and therefore we use a Monte-Carlo
approach to sample example states from the learning space. We use the Approx-
imate Policy Iteration (API) algorithm with rollouts [6]. Sampling state-action
pairs according to a previous policy πθ(t−1) , API consists in iteratively learning
a better policy πθ(t) by way of the Bellman equation. The API With Rollouts
algorithm is composed of three main steps that are iteratively repeated:
1. The algorithm begins by sampling a set of random states: the x vector is
sampled from a uniform distribution in the training set, and z is also sampled
using a uniform binomial distribution.
2. For each state in the sampled set, the policy πθ(t−1) is used to compute the
expected reward of choosing each possible action from that state. We now
have a feature vector Φ(x, z, a) for each state-action pair in the sampled set,
and the corresponding expected reward denoted Rθ(t−1) (x, z, a).
3. The parameters θ(t) of the new policy are then computed using classical
linear regression on the set of states — Φ(x, z, a) — and corresponding ex-
pected rewards — Rθ(t−1) (x, z, a) — obtained previously. The generalizing
capacity of the classifier gives an estimated score to state-action pairs even
if we have never visited them.
After a certain number of iterations, the parameterized policy converges to a
final policy π which is used for inference.
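The following self-contained sketch illustrates these three steps on synthetic data. The toy dataset, the action encoding and all names are ours, and refinements used in the actual experiments (e.g., the alpha-mixture policy mentioned in Section 6) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task (assumed setup): 40 examples, 5 features, 2 classes.
X = rng.normal(size=(40, 5))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)
d, n_classes, lam = X.shape[1], 2, 0.05
actions = [('feature', j) for j in range(d)] + [('classify', c) for c in range(n_classes)]

def feat(x, z, a_idx):
    # phi(x, z) = (z, mu(x, z)) placed in the block of action a_idx.
    block = np.concatenate([z, x * z])
    out = np.zeros(len(actions) * block.size)
    out[a_idx * block.size:(a_idx + 1) * block.size] = block
    return out

def greedy(theta, x, z):
    # Legal actions: not-yet-selected features plus all classification actions.
    legal = [i for i, (kind, v) in enumerate(actions) if kind == 'classify' or z[v] == 0]
    return max(legal, key=lambda i: feat(x, z, i) @ theta)

def rollout(theta, x, y, z):
    # Total reward collected from state (x, z) under the greedy policy.
    z, total = z.copy(), 0.0
    while True:
        kind, v = actions[greedy(theta, x, z)]
        if kind == 'classify':
            return total + (0.0 if v == y else -1.0)
        total, z[v] = total - lam, 1.0

def api_iteration(theta, n_states=200):
    # (1) sample random states, (2) estimate each action's return by a
    # rollout under the current policy, (3) refit the linear scorer.
    feats, targets = [], []
    for _ in range(n_states):
        i = rng.integers(len(X))
        z = rng.integers(0, 2, size=d).astype(float)
        for a_idx, (kind, v) in enumerate(actions):
            if kind == 'feature' and z[v] == 1:
                continue
            if kind == 'classify':
                q = 0.0 if v == Y[i] else -1.0
            else:
                z2 = z.copy(); z2[v] = 1.0
                q = -lam + rollout(theta, X[i], Y[i], z2)
            feats.append(feat(X[i], z, a_idx))
            targets.append(q)
    return np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)[0]

theta = np.zeros(len(actions) * 2 * d)
for _ in range(10):
    theta = api_iteration(theta)
```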

4 Preventing Overfitting in the Sequential Model


In section 3, we explain the process by which, at each step, we either choose a
new feature or classify the current datum. This process is at the core of DWSC
but can suffer from overfitting if the number of features is larger than the number
of training examples. In such a case, DWSC would tend to learn to select the
more specific features for each training example. In classical L1 regularization
models that are not datum-wise, the classifier must use the same set of features
for classifying any data and thus overly specific features are not chosen because
they usually appear in only a few training examples.
We propose a very simple variant of the general model that allows us to avoid
overfitting. We still allow DWSC to choose how many features to use before
classifying an input x, but we constrain it to choose the features in the same
order for all the inputs. For that, we constrain the score of the feature selection
actions to depend only on the vector z of the state (x, z). An example of the
effect of such a constraint is presented in Fig. 2. This constraint is handled in
the following manner:

$$\forall (x, z, a): \quad \begin{cases} s_\theta(x, z, a) = s_\theta(z, a) & \text{if } a \in A_f \\ s_\theta(x, z, a) = s_\theta(x, z, a) & \text{if } a \in A_y \end{cases} \qquad (3)$$

where $s_\theta(x, z, a) = s_\theta(z, a)$ means that the score is computed using only the values of z and a (x is ignored). This corresponds to having two different types of state-action feature vectors Φ depending on the type of action:

$$\begin{cases} \text{if } a \in A_f, & \Phi(x, z, a) = (0, \ldots, 0,\; z,\; 0, \ldots, 0) \\ \text{if } a \in A_y, & \Phi(x, z, a) = (0, \ldots, 0,\; z,\; \mu(x, z),\; 0, \ldots, 0) \end{cases} \qquad (4)$$

Example   Features selected (Unconstrained Model)   Features selected (Constrained Model)
x1        2 3                                       2 3
x2        1 4 2 3                                   2 3 1 4
x3        3                                         2 3
x4        2 3 1                                     2 3 1

Fig. 2. Difference between the base Unconstrained Model (DWSM-Un) and the Con-
strained Model (DWSM-Con) described in section 4. The figure shows, for 4 different
inputs x1 , ..., x4 the features selected by the classifiers before classification. One can see
that the Constrained Model chooses the features in the same order for all the inputs.

Although this constraint forces DWSC to choose the features in the same
order, it will still automatically learn the best order in which to choose the
features, and when to stop adding features and classify. However, it will avoid
choosing very different feature sets for classifying different inputs (the first features chosen will be common to all the inputs being classified) and will thus avoid the overfitting problem.
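A sketch of the two feature maps of equation (4); the zero-padding of the z-only block to a common width, so that a single parameter vector can be used, is a design choice of ours:

```python
import numpy as np

def Phi_constrained(x, z, a_idx, is_feature_action, n_actions):
    # Feature-selection actions are scored from z alone; classification
    # actions see the full representation (z, mu(x, z)).
    block = z if is_feature_action else np.concatenate([z, x * z])
    width = 2 * z.size
    padded = np.zeros(width)
    padded[:block.size] = block
    out = np.zeros(n_actions * width)
    out[a_idx * width:(a_idx + 1) * width] = padded
    return out
```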

5 Complexity Analysis
Learning Complexity: As explained in Section 3.3, the learning method is based on Reinforcement Learning with rollouts. Such an approach is expensive in terms of computation because, at each iteration of the algorithm, it needs to simulate trajectories in the decision process and then learn the scoring function s_θ based on these trajectories. Without giving the details of the computation, the complexity of each iteration is O(N_s · (n² + c)), where N_s is the number of states used for rollouts (which in practice is proportional to the number of training examples), n is the number of features and c is the number of possible categories. This implies a learning method that is quadratic w.r.t. the number of features; the proposed approach is therefore not able to deal with problems with thousands of possible features. Reducing this complexity is an active research direction for which we have some leads.

Inference Complexity: Inference on an input x consists in sequentially choosing features, and then classifying x. At step t, one has to perform (n − t) + c linear computations in order to choose the best action, where (n − t) + c is the number of possible actions when t features have already been acquired. The inference complexity is thus O(N_f · (n + c)), where N_f is the mean number of features chosen by the system before classifying. In fact, due to the shape of the
Φ function presented in Section 3.2 and the linear nature of sθ , the score of the
actions can be efficiently incrementally computed at each step of the process by
just adding the contribution of the newly added feature. The complexity is thus
reduced to O(n + c). Moreover, the constrained model, which results in an ordering of the features, has a lower complexity of O(c): in that case the model does not have to choose between the different remaining features, and only chooses between classifying and acquiring the next feature w.r.t. the learned order.
While the learning complexity of our model is higher than that of baseline global linear methods, the inference speed is very close for the unconstrained model, and equivalent for the constrained one. In practice, most of the baseline methods choose a subset of variables in a couple of seconds to a couple of minutes, whereas our method takes from a dozen minutes to an hour, depending on the number of features and categories. Inference is indeed of the same speed, which is in our opinion the important factor.
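The incremental update alluded to above can be sketched as follows, assuming a linear scorer over φ(x, z) = (z, μ(x, z)) with one parameter block per action (names are ours):

```python
def update_scores(scores, theta_blocks, x, j):
    # After acquiring feature j, z_j flips 0 -> 1 and mu_j becomes x_j, so
    # each action's score changes by theta[j] + theta[d + j] * x[j]:
    # an O(1) update per action instead of a full dot product.
    d = len(x)
    for a, theta in enumerate(theta_blocks):
        scores[a] += theta[j] + theta[d + j] * x[j]
    return scores
```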

6 Experiments
Experiments were run on 14 different datasets obtained from the LibSVM Web-
site7 . Ten of these datasets correspond to a binary classification task, four to
a multi-class problem. The datasets are described in Table 1. For each dataset,
we randomly sampled different training sets by taking from 5% to 75% of the
examples as training examples, with the remaining examples being kept for test-
ing. We performed experiments with three different models: L1-SVM was used
as a baseline linear model with L1 regularization8. LARS was used to obtain
the optimal solution of the LASSO problem for all values of the regularization
coefficient λ at once9 . Datum-Wise Sequential Model (DWSM) was tested
with the two versions presented above: (i) DWSM-Un is the original uncon-
strained model and (ii) DWSM-Con is the constrained model for preventing
overfitting.
For the evaluation, we used a classical accuracy measure which corresponds
to 1 − error rate on the test set of each dataset. We perform 3 training/testing
set splits of a given dataset to obtain averaged figures. The sparsity has been
measured as the proportion of features not used for L1 -SVM and LARS in binary
classification, and the mean proportion of features not used to classify testing
examples in DWSM. For multi-class problems where one LARS/SVM model
is learned for each category, the sparsity is the proportion of features that have
not been used in any of the models.
For the sequential experiments, the number of rollout states (step 1 of the
learning algorithm) has been set to 2,000 and the number of policy iterations
has been fixed to 10. Note that experiments with more rollout states and/or more
⁷ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
⁸ Using LIBLINEAR [7].
⁹ We use the implementation from the authors of the LARS, available in R.

[Figure 3 shows two panels, Australian with train size 10% (left) and Splice with train size 75% (right), each plotting Accuracy (%) against Sparsity (0 to 1) for LARS, SVM L1, DWSM-Con and DWSM-Un.]

Fig. 3. Accuracy w.r.t. sparsity. In both plots, the left side of the x-axis corresponds to low sparsity, while the right side corresponds to high sparsity. The performance of the models usually decreases as sparsity increases, except in the case of overfitting.

Table 1. Datasets used for the experiments


Name Number of Features Number of examples Number of Classes Task
Australian 14 690 2 Binary
Breast Cancer 10 683 2 Binary
Diabetes 8 768 2 Binary
German Numer 24 1,000 2 Binary
Heart 13 270 2 Binary
Ionosphere 34 351 2 Binary
Liver Disorders 6 345 2 Binary
Sonar 60 208 2 Binary
Splice 60 1,000 2 Binary
Svm Guide 3 21 1,284 2 Binary
Segment 19 2,310 7 Multiclass
Vehicle 18 846 4 Multiclass
Vowel 10 1,000 11 Multiclass
Wine 13 178 3 Multiclass

iterations give similar results. Experiments were made using an alpha mixture
policy with α = 0.9 to ensure the stability of the learning process. We tested
the different models with different values of λ which controls the sparsity. Note
that even with a λ = 0 value, contrary to the baseline models, the DWSM model
does not use all of the features for classification.

6.1 Results
For each corpus and each training size, we have computed sparsity/accuracy curves showing the performance of the different models w.r.t. the sparsity of the solution. Only two representative curves are given in Figure 3. To summarize the performances over all the datasets, we give the accuracy of the different models for three levels of sparsity in Tables 2 and 3. Due to a lack of space, these tables do not present the LARS performance, which is equivalent to that of the L1-SVM. Note that in order to obtain the accuracy for a given level of sparsity, we have computed a linear interpolation on the different curves obtained for each corpus and each training size. This linear interpolation allows us to compare the baseline sparsity methods, which choose

Table 2. This table contains the accuracy of each model on the binary classification
problems depending on three levels of sparsity (80%, 60%, and 40%) using different
training sizes. The accuracy has been linearly interpolated from curves like the ones
given in Figure 3.

Corpus / Train Size    Sparsity = 0.8 (DWSM-Un, DWSM-Con, SVM L1)    Sparsity = 0.6 (DWSM-Un, DWSM-Con, SVM L1)    Sparsity = 0.4 (DWSM-Un, DWSM-Con, SVM L1)

australian
  0.05   85.31 85.13 85.52   84.63 84.83 85.42   84.01 84.32 85.15
  0.1    85.75 86.16 85.51   86.00 86.32 85.51   86.13 85.70 86.34
  0.25   85.39 86.16 85.33   86.76 86.99 85.33   86.56 86.49 86.10
  0.5    83.58 84.13 83.47   84.33 84.19 83.47   84.85 84.28 83.70
breast-cancer
  0.05   87.73 88.50 88.17   96.58 96.49 94.93   96.57 96.68 96.62
  0.1    89.25 89.16 88.19   94.70 95.13 91.91   95.42 95.63 92.88
  0.25   91.16 88.48 92.64   97.06 96.76 94.38   97.09 97.11 96.71
  0.5    82.65 82.60 90.26   96.92 95.98 93.41   96.50 97.01 94.84
diabetes
  0.05   68.61 69.10 65.85   71.30 71.36 67.46   72.15 0.00 71.07
  0.1    70.92 69.14 65.71   72.52 71.62 64.97   73.26 72.81 70.23
  0.25   68.39 68.58 64.88   71.83 72.42 65.88   74.80 74.96 75.09
  0.5    70.65 69.69 62.27   72.67 70.90 67.30   73.82 73.62 72.14
german.numer
  0.05   70.58 70.47 67.36   70.74 70.39 69.95   69.99 70.28 69.73
  0.1    69.82 69.62 69.10   70.81 70.39 71.52   71.79 0.00 72.85
  0.25   72.25 72.00 65.98   72.67 73.26 72.89   73.10 0.00 74.11
  0.5    70.03 70.62 69.72   71.50 72.37 71.97   72.96 74.05 72.68
heart
  0.05   48.33 48.17 45.42   51.17 50.67 65.73   0.00 0.00 68.24
  0.1    75.50 74.27 73.42   76.61 75.78 73.76   77.60 77.49 74.94
  0.25   76.17 78.50 76.26   81.31 81.70 83.33   82.24 83.00 83.64
  0.5    70.34 68.87 69.20   77.15 80.34 78.83   80.48 80.40 80.58
ionosphere
  0.05   69.52 71.36 73.55   73.44 73.02 72.23   74.77 75.16 72.59
  0.1    71.58 71.09 71.84   75.12 74.63 75.89   74.97 74.93 74.49
  0.25   79.65 80.29 75.94   85.18 85.44 81.58   85.58 85.69 82.78
  0.5    77.31 78.40 71.15   82.94 82.68 78.18   84.96 84.16 79.64
liver-disorders
  0.05   60.40 59.37 57.01   60.07 61.25 57.01   60.29 64.27 57.74
  0.1    56.70 56.24 55.41   55.85 55.98 56.43   56.69 55.00 55.86
  0.25   56.69 56.14 54.18   58.07 57.02 55.10   58.69 57.97 56.93
  0.5    58.93 59.55 60.84   60.10 58.81 60.96   59.33 60.84 61.33
sonar
  0.05   57.59 59.95 64.14   68.50 66.49 65.15   69.45 70.48 61.24
  0.1    61.69 64.40 64.12   68.68 73.93 64.12   74.25 75.20 63.53
  0.25   67.32 64.74 67.52   73.52 70.63 74.52   75.22 73.36 72.82
  0.5    68.19 64.71 65.77   72.18 69.76 69.37   73.73 71.60 65.77
splice
  0.05   67.23 68.41 67.82   70.14 68.66 65.93   70.51 69.89 64.47
  0.1    66.90 66.87 61.46   70.35 67.99 62.63   71.05 70.07 61.62
  0.25   73.87 73.89 70.49   74.81 75.30 72.28   75.60 76.64 71.74
  0.5    72.86 76.79 72.78   74.98 77.88 70.36   77.09 0.00 69.35
svmguide3
  0.05   77.15 77.13 77.17   77.32 77.25 78.25   77.48 77.37 78.21
  0.1    77.31 77.28 76.59   77.94 78.11 78.95   78.58 78.94 78.37
  0.25   76.67 76.56 75.96   77.44 77.14 77.40   78.21 77.72 77.91
  0.5    77.71 77.78 76.87   78.55 78.63 78.15   79.38 79.47 78.37

a fixed number of features — with the average number of features chosen by


DWSC. This compares the average amount of information considered by each
classifier. We believe this approach still provides a good appreciation of the
algorithm’s capacities.
Table 2 shows that, for a sparsity level of 80%, the DWSM-Un and DWSM-Con models outperform the baseline L1-SVM classifier. This is particularly true for 7 of the 10 datasets, while the results are more ambiguous on the three other datasets: breast, ionosphere and sonar. For a sparsity of 40%, similar results are obtained. Depending on the corpus and the training size, different configurations are observed. Some datasets can be easily classified using only a few features, such as australian. In that case, our approach gives similar results in comparison to L1 methods (see Figure 3, left). For some other datasets, our method clearly outperforms the baseline methods (Figure 3, right). On the splice dataset, our model is better than the best (non-sparse) SVM while using less than 20% of the features on average. This is due to the fact that our sequential process, which solves a different classification problem, is more appropriate for some particular datasets, particularly when the distribution of the data is split up amongst distinct subspaces. In this case, our model is able to choose more appropriate features for each input.

Table 3. This table contains the accuracy of each model on the multi-class classification
problems depending on three levels of sparsity (80%, 60%, and 40%) using different
training sizes.

Corpus / Train Size    Sparsity = 0.8 (DWSM-Un, DWSM-Con, L1-SVM)    Sparsity = 0.6 (DWSM-Un, DWSM-Con, L1-SVM)    Sparsity = 0.4 (DWSM-Un, DWSM-Con, L1-SVM)

segment
  0.1    42.06 41.23 35.31   53.87 53.02 45.49   54.83 56.57 56.98
  0.2    40.76 40.17 40.48   55.70 56.34 45.97   57.42 59.10 53.24
  0.5    43.29 0.00 37.17   54.09 0.00 45.15   56.43 0.00 50.52
  0.75   43.78 41.13 38.22   55.10 53.60 44.80   56.54 56.99 47.00
vehicle
  0.1    34.23 37.52 43.36   43.50 45.34 50.25   47.21 0.00 56.54
  0.2    38.32 39.27 53.04   45.84 45.68 53.36   48.68 47.91 52.83
  0.5    39.74 39.51 42.95   46.64 47.57 50.30   0.00 48.40 51.99
  0.75   40.32 40.37 41.04   49.96 49.31 53.68   51.86 51.53 53.77
vowel
  0.1    18.03 19.27 9.83   24.17 22.82 16.24   25.28 25.80 18.38
  0.2    0.00 15.27 14.71 20.17 15.93 0.00 22.59 15.93
  0.5    18.98 17.81 9.57   24.56 25.33 17.73   28.45 27.31 23.76
  0.75   19.85 19.49 14.41   28.01 31.45 24.58   32.09 32.74 26.69
wine
  0.1    70.22 70.66 73.58   76.42 77.87 89.38   78.66 76.67 91.36
  0.2    71.52 72.68 80.34   78.27 79.11 92.12   78.76 77.72 94.16
  0.5    72.99 74.41 74.40   79.43 80.60 86.90   82.15 79.50 91.38
  0.75   76.21 75.04 72.00   80.18 81.84 94.00   83.23 80.93 96.00

[Figure 4 shows two histograms over features 1-10: on the left, the proportion of inputs for which each feature is used, for DWSM-Con, DWSM-Un, SVM and LARS; on the right, the proportion of inputs classified using a given number of acquired features, for DWSM-Con and DWSM-Un, with separate bars for their wrong decisions.]

Fig. 4. Breast-Cancer, training size = 10%, sparsity ≈ 50%. Left: the distribution of use of each feature. For example, DWSM-Con uses feature 2 for classifying 100% of the test examples, while DWSM-Un uses this feature for classifying only 88% of the examples. Right: the mean proportion of features used for classifying. For example, DWSM-Con classifies 42% of the examples using exactly 2 features, while DWSM-Un classifies 21% of the examples using exactly 2 features.

When using small training sets with some datasets, such as sonar or ionosphere, where overfitting is observed (accuracy decreases as more features are used), DWSM-Con seems to be a better choice than the unconstrained version; it is thus the variant of the algorithm best suited to cases where the number of learning examples is small.
Concerning the multi-class problems, similar effects can be observed (see
Table 3). The model seems particularly interesting when the number of cate-
gories is high, as in segment and vowel. This is due to the fact that the average
sparsity is optimized by the sequential model for the multi-class problem while
L1 -SVM and LARS, which need to learn one model for each category, perform
separate sparsity optimizations for each class.
Figure 4 gives some qualitative results. First, from the left histogram, one
can see that some features are used in 100% of the decisions. This illustrates
the ability of the model to detect important features that must be used for
decision. Note that many of these features are also used by the L1 -SVM and the

LARS models. The sparsity gain in comparison to the baseline model is obtained
through the features 1 and 9 that are only used in about 20% of decisions. From
the right histogram, one can see that the DWSM model mainly classifies using
1, 2, 3 or 10 features, showing that the model is able to adapt its behaviour to
the difficulty of classifying a particular input. This is confirmed by the green
and violet histograms that show that for incorrect decisions (i.e. very difficult
inputs) the classifier almost always acquires all the features before classifying.
These difficult inputs seem to have been identified, but the set of features is not
sufficient for a good understanding. This behaviour opens appealing research
directions concerning the acquisition and creation of new features (see Section 8).

7 Related Work
Feature selection comes in three main flavors [8]: wrapper, filter, or embedded
approaches. Wrapper approaches involve searching the feature space for an
optimal subset of features that maximize classifier performance. The feature
selection step wraps around the classifier, using the classifier as a black-box
evaluator of the selected feature subset. Searching the entire feature space is
very quickly intractable and therefore various approaches have been proposed to
restrict the search (see [9,10]). The advantage of the wrapper approaches is that
the feature subset decision can take into consideration feature inter-dependencies
and avoid redundant features, however the problem remains of the exponential
size of the search space. Filter approaches rank the features by some scor-
ing function independent of their effect on the associated classifier. Since the
choice of features is not influenced by classifier performance, filter approaches
rely purely on the adequacy of their scoring functions. Filtering methods are
susceptible to not discriminating redundant features, and missing feature inter-
dependencies (since each feature is scored individually). Filter approaches are
however easier to compute and more statistically stable relative to changes in
the dataset. Embedded approaches include feature selection as part of the learning machine. These include algorithms solving the LASSO problem [1], and other linear models involving a regularizer based on a sparsity-inducing norm (ℓ_{p∈[0,1]}-norms [11], group LASSO, ...). Kernel machines provide a mixture of feature selection and construction as part of the classification problem. Decision trees are also considered embedded approaches, although they are also similar to filter approaches in their use of heuristic scores for tree construction. The main critique of embedded approaches is two-fold: they are susceptible to including redundant features, and not all the techniques described are easily applied to multi-class problems. In brief, both filtering and embedded approaches have their
drawbacks in terms of their ability to select the best subset of features, whereas
wrapper methods have their main drawback in the intractability of searching the
entire feature space. Furthermore, all existing methods perform feature selection
based on the whole training set, the same set of features being used to represent
any data.
Our sequential decision problem defines both feature selection and classifi-
cation tasks. In this sense, our approach resembles an embedded approach. In

practice, however, the final classifier for each single datapoint remains a sepa-
rate entity, a sort of black-box classifying machine upon which performance is
evaluated. Additionally, the learning algorithm is free to navigate over the en-
tire combinatorial feature space. In this sense our approach resembles a wrapper
method.
There has been some work using similar formalisms [12], but with different
goals and lacking in experimental results. Sequential decision approaches have
been used for cost-sensitive classification with similar models [13]. There have
also been applications of Reinforcement Learning to optimize anytime classifica-
tion [14]. We have previously looked at using Reinforcement Learning for finding
a stopping point in feature quantity during text classification [15].
Finally, in some sense, DWSC has some similarity with decision trees, as each newly labeled datapoint follows a different path in the feature space. However, the underlying mechanism is quite different both in terms of inference procedure and learning criterion. There has been some work in using RL for
generating decision trees [16], but that approach is still tied to decision tree
construction heuristics and the end product remains a decision tree.

8 Conclusion
In this article we introduced the concept of datum-wise classification, where we
learn both a classifier and a sparse representation of the data that is adaptive
to each new datum being classified. We took an approach to sparsity that con-
siders the combinatorial space of features, and proposed a sequential algorithm
inspired by Reinforcement Learning to solve this problem. We showed that find-
ing an optimal policy for our Reinforcement Learning problem is equivalent to
minimizing the L0 regularized loss of our classification problem. Additionally we
showed that our model works naturally on multi-class problems, and is easily
extended to avoid overfitting on datasets where the number of features is larger
than the number of examples. Experimental results on 14 datasets showed that
our approach is indeed able to increase sparsity while maintaining equivalent
classification accuracy.

References
1. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (January 1994)
2. Sutton, R., Barto, A.: Reinforcement Learning. MIT Press, Cambridge (1998)
3. Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least-angle regression. Annals
of statistics 32(2), 407–499 (2004)
4. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. Wiley, Chichester (1994)
5. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification: A new approach to
multiclass classification. Algorithmic Learning Theory, 1–11 (2002)
6. Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: Leveraging
modern classifiers. In: ICML 2003 (2003)
7. Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large
linear classification. JMLR 9, 1871–1874 (2008)

8. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3(7-8), 1157–1182 (2003)
9. Girgin, S., Preux, P.: Feature discovery in reinforcement learning using genetic
programming. In: O’Neill, M., Vanneschi, L., Gustafson, S., Esparcia Alcázar, A.I.,
De Falco, I., Della Cioppa, A., Tarantino, E. (eds.) EuroGP 2008. LNCS, vol. 4971,
pp. 218–229. Springer, Heidelberg (2008)
10. Gaudel, R., Sebag, M.: Feature Selection as a One-Player Game. In: ICML (2010)
11. Xu, Z., Zhang, H., Wang, Y., Chang, X., Liang, Y.: L1/2 regularization. Science
China Information Sciences 53(6), 1159–1169 (2010)
12. Ertin, E.: Reinforcement learning and design of nonparametric sequential decision
networks. In: Proceedings of SPIE, pp. 40–47 (2002)
13. Ji, S., Carin, L.: Cost-sensitive feature acquisition and classification. Pattern Recog-
nition 40(5), 1474–1485 (2007)
14. Póczos, B., Abbasi-Yadkori, Y., Szepesvári, C., Greiner, R., Sturtevant, N.: Learn-
ing when to stop thinking and do something! In: ICML 2009, pp. 1–8 (2009)
15. Dulac-Arnold, G., Denoyer, L., Gallinari, P.: Text Classification: A Sequential
Reading Approach. In: ECIR, pp. 411–423 (2011)
16. Preda, M.: Adaptive building of decision trees by reinforcement learning. In: Pro-
ceedings of the 7th WSEAS, pp. 34–39 (2007)
Manifold Coarse Graining for Online
Semi-supervised Learning

Mehrdad Farajtabar, Amirreza Shaban, Hamid Reza Rabiee, and Mohammad Hossein Rohban

Digital Media Lab, AICTC Research Center, Department of Computer Engineering,
Sharif University of Technology, Tehran, Iran
{farajtabar,shaban,rahban}@ce.sharif.edu, rabiee@sharif.edu

Abstract. When the number of labeled data is not sufficient, Semi-Supervised Learning (SSL) methods utilize unlabeled data to enhance classification. Recently, many SSL methods have been developed based on the manifold assumption in a batch mode. However, when data arrive sequentially and in large quantities, both computation and storage limitations become a bottleneck. In this paper, we present a new semi-supervised coarse graining (CG) algorithm to reduce the required number of data points for preserving the manifold structure. First, an equivalent formulation of Label Propagation (LP) is derived. Then a novel spectral view of the Harmonic Solution (HS) is proposed. Finally an algorithm to reduce the number of data points while preserving the manifold structure is provided, and a theoretical analysis on preservation of the LP properties is presented. Experimental results on real world datasets show that the proposed method outperforms the state-of-the-art coarse graining algorithm in different settings.

Keywords: Semi-Supervised Learning, Manifold Assumption, Harmonic Solution, Label Propagation, Spectral Coarse Graining, Online Classification.

1 Introduction
Semi-supervised learning is a topic of recent research that effectively addresses
the problem of limited data [1]. In order to use unlabeled data in the learning
process efficiently, certain assumptions on the relation between the possible la-
beling functions and the underlying geometry should hold [2]. In many real world
classification problems, data points lie on a low dimensional manifold. The man-
ifold assumption states that the labeling function varies smoothly with respect
to underlying manifold [3]. Manifold structure is modeled by the neighborhood
graph of the data points. SSL methods with the manifold assumption prove to be effective in many applications including image segmentation [4], handwritten digit recognition and text classification [5].
Online classification of data is required in common applications such as object
tracking [6], face recognition in surveillance systems [11], and image retrieval [7].


Usually unlabeled data is easily available in such classification problems. However, most of the classic SSL algorithms are not efficient in an online classification setting. This is due to repeated invocations of a computationally demanding label inference algorithm, which takes O(n³) time in standard implementations. Moreover, when the number of arrived data grows large, space complexity becomes an important issue. Consequently, designing efficient label prediction algorithms for the online setting is essential.
Recently online manifold classification algorithms have been proposed to ad-
dress these challenges. The manifold regularized passive-aggressive online algo-
rithm [8] uses a smoothness regularization term on the τ most recent data in
order to reduce the number of samples needed to be stored and processed. This
method fails when the windowed data are not representative of the true underly-
ing manifold. This case may happen when data arrive in a biased order. Authors
in [9] use RPtree [10] to partition the graph into clusters which grow incremen-
tally in size and cover the manifold structure. Despite promising experimental results, no theoretical guarantee is provided on the error bound for this method. A state-of-the-art method aimed at reducing the size of the data and coarsening the graph is proposed in [11]. Coarsening is done by replacing neighboring points in Euclidean space with a fixed number of centroids. Experiments show that considering geodesic distances on the manifold results in more accurate data reduction. Authors in [12] propose a data reduction method based on a mathematical framework with an interesting upper bound on the eigenvector distortion after every coarsening of data. However, minimizing this upper bound is hard to tackle. They use a variation of k-means to minimize this bound, which is prone to local minima. In addition, a drawback of their method is its prior determination of the number of new nodes, while it is better to concede this decision to the manifold structure itself. To the best of our knowledge, none of the previous works in this area take advantage of labeled data to reduce the size of the graph, i.e. coarsening of the graph is performed independently of the given classification task.
Our method, like some recent works on online manifold learning [7,9,11], relies on data reduction to overcome memory and time limitations. In this paper we propose a semi-supervised data reduction method that not only captures the geometric structure of data, but also considers the labeled data as a cue to better preserve the classification accuracy. Spectral decomposition is used to find similar nodes on the manifold to be merged. Assuming a maximum buffer length of k, we perform data reduction whenever this limit is reached, so the time to predict the label of each newly arrived datum will not exceed O(k³). Moreover, the complexity of our CG method is equivalent to the complexity of eigenvector decomposition, which similarly takes O(k³) time and is performed only when the buffer limit is reached. As a result, the overall time complexity is constant over time.
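The overall online loop can be sketched as follows; `predict_label` and `coarse_grain` are placeholders for the HS inference and the spectral CG step developed in the following sections, not the authors' code:

```python
def online_loop(stream, k, predict_label, coarse_grain):
    # Keep at most k points: predict each arriving datum from the current
    # buffer graph, then coarse-grain whenever the buffer limit is reached.
    # (Handling of labeled arrivals is elided in this sketch.)
    buffer = []
    for x in stream:
        y_hat = predict_label(buffer, x)   # O(k^3) harmonic solution
        buffer.append((x, y_hat))
        if len(buffer) >= k:
            buffer = coarse_grain(buffer)  # O(k^3) eigendecomposition + merge
        yield y_hat
```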
The rest of the paper is organized as follows. In Section 2 basics of HS and
LP are briefly introduced. A new formulation of LP and its spectral counterpart
is derived in Section 3. Section 4 is the core of this paper where we introduce
Manifold Coarse Graining for Online Semi-supervised Learning 393

coarse graining in exact and approximate modes and explain how it helps us to
preserve LP and manifold structure while reducing the number of data points. In
Section 5 experimental results are provided, after which the paper is concluded
in Section 6.

2 Basics and Notations


Let Xu = {x1 , . . . , xu } and Xl = {xu+1 , . . . , xu+l } be sets of unlabeled and
labeled data points respectively, where n = u + l is the total number of data
points. Also let y = (y(u + 1), . . . , y(u + l)) be the vector of labels on Xl. Our
goal is to predict labels of X = Xu ∪ Xl as f = (fu ; fl ) = (f (1), . . . , f (u), f (u +
1), . . . , f (u + l)) , where f (i) is the label associated to xi for i = 1, . . . , n.
Let W be the weight matrix of the k-NN graph of X,

$$W(i, j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right)$$
where σ is the bandwidth parameter. Define the diagonal matrix D with nonzero entries $D(i, i) = \sum_{j=1}^{n} W(i, j)$, and the Laplacian matrix $L_{un} = D - W$. Manifold regularization algorithms minimize the smoothness functional

$$S(f) = \frac{1}{2}\sum_{i,j} W(i, j)\big(f(i) - f(j)\big)^2 = f^T L_{un} f \qquad (1)$$

under some appropriate criteria [3,13,14]. Minimizing $f^T L_{un} f$ such that $f_l = y$ is called the Harmonic Solution for the manifold regularization problem [14].
Label Propagation [15] is a way for computing HS. In this algorithm labels are propagated from labeled to unlabeled nodes through edges in an iterative
Label Propagation [15] is a way for computing HS. In this algorithm labels
are propagated from labeled to unlabeled nodes through edges in an iterative
manner. Edges with larger weights propagate labels easier. In each step a new
label is computed for each node as a weighted average of its neighboring labels.
The stochastic matrix P is defined such that

$$P(i, j) = \frac{W(i, j)}{\sum_{k=1}^{n} W(i, k)}. \qquad (2)$$

P (i, j) can be interpreted as the effect of f (j) on f (i). The algorithm is stated
as follows:
1. Propagation: f (t+1) ← P f (t)
2. Clamping: fl = y
where $f^{(t)}$ is the estimated label vector at step t. If we decompose W and P according to labeled and unlabeled parts,

$$W = \begin{pmatrix} W_{uu} & W_{ul} \\ W_{lu} & W_{ll} \end{pmatrix}, \qquad P = \begin{pmatrix} P_{uu} & P_{ul} \\ P_{lu} & P_{ll} \end{pmatrix}, \qquad (3)$$

then under appropriate conditions [15], the solution of LP converges to the HS, is independent of the initial value $f^{(0)}$, and may be written as

$$f_u = (I - P_{uu})^{-1} P_{ul}\, y, \qquad f_l = y. \qquad (4)$$
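Equation (4) translates directly into a few lines of NumPy; the following sketch (ours) computes the harmonic solution given a weight matrix, the indices of the labeled nodes and their labels:

```python
import numpy as np

def harmonic_solution(W, y, labeled):
    # Row-normalize W into the stochastic matrix P, then solve
    # f_u = (I - P_uu)^{-1} P_ul y; labeled nodes keep their labels f_l = y.
    P = W / W.sum(axis=1, keepdims=True)
    u = [i for i in range(W.shape[0]) if i not in set(labeled)]
    Puu = P[np.ix_(u, u)]
    Pul = P[np.ix_(u, list(labeled))]
    return np.linalg.solve(np.eye(len(u)) - Puu, Pul @ np.asarray(y, float))
```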

3 Spectral View of Label Propagation

In this section the LP solution is derived in terms of the spectral decomposition of a variation of the stochastic matrix P. This helps us find a spectral property of the stochastic matrix, the invariance of which will guarantee that the solution of LP remains approximately constant throughout CG.
Consider the process of propagating labels. Each new label is computed as
the weighted average of its neighboring labels. However, for a labeled node the
process is undone by clamping its label to the true initial value.
These two steps for labeled nodes may be integrated in one step. For a labeled node i, we remove P(i, j) for all j and set P(i, i) = 1. This causes LP to have the update rule $f_l^{(t+1)}(i) = f_l^{(t)}(i)$ for labeled nodes. Using this updated stochastic matrix, we can remove the clamping procedure and state the entire process in a coherent fashion.
We mimic the effect of this new process using a new stochastic matrix denoted by Q, which we call the absorbing stochastic matrix:

$$Q \triangleq \begin{pmatrix} P_{uu} & P_{ul} \\ 0 & I \end{pmatrix} \qquad (5)$$

With the absorbing stochastic matrix, the entire process of LP may be rewritten as

$$f^{(t+1)} = Q f^{(t)}, \qquad (6)$$

where the initial value is $f^{(0)} = (f_u^{(0)}; y)$, and $f_u^{(0)}$ may be arbitrary. In this new formulation, estimated labels are computed as $\lim_{n\to\infty} Q^n f^{(0)}$. Defining

$$Q^\infty \triangleq \lim_{n\to\infty} Q^n,$$

we can write $(f_u; f_l) = Q^\infty f^{(0)}$. Since the result is independent of the initial states of unlabeled data, $f_u(j)$ can be rewritten as

$$f_u(j) = \sum_{k=u+1}^{l+u} Q^\infty(j, k)\, y(k). \qquad (7)$$

We wish to relate $Q^\infty(j, k)$ to the right eigenvectors of Q; to this end we need the following two lemmas.

Lemma 1. The matrix Q defined in (5) has the following properties:

– Every eigenvalue λ is real and |λ| ≤ 1.
– The dimension of the eigenspace corresponding to λ = 1 is equal to the number of labeled data l.
– The rows of $\begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix}$ are the left eigenvectors of Q corresponding to λ = 1.

Proof. The eigenvalues of Q are roots of the characteristic polynomial, i.e. p(λ) = det(Q − λI) = 0. Considering the special form of Q,

$$p(\lambda) = (1 - \lambda)^l \det(P_{uu} - \lambda I).$$

The magnitude of every eigenvalue of $P_{uu}$ is less than one, due to the fact that $P_{uu}^n \to 0$ as $n \to \infty$ [15]. Therefore, λ = 1 has multiplicity l and all other eigenvalues of Q have magnitude less than one. It is straightforward to show that the eigenvalues of the stochastic matrix and of this new variation are all real.
For the last part, it can be verified that

$$\begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix} \times \begin{pmatrix} P_{uu} & P_{ul} \\ 0 & I \end{pmatrix} = \begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix}.$$

Therefore, the rows of $\begin{pmatrix} 0_{l\times u} & I_{l\times l} \end{pmatrix}$ are the left eigenvectors of Q associated with λ = 1.
Definition 1. From now on we refer to eigenvectors corresponding to eigenvalues equal to one as unitary eigenvectors; these are different from unit eigenvectors, which have unit norm.

Lemma 2 (Spectral decomposition [16]). Every square matrix A of dimension n with n independent eigenvectors can be decomposed as

$$A = V_R D V_L^T \quad \text{with} \quad V_L^T V_R = I,$$

where D is the diagonal matrix of eigenvalues, and the columns of $V_R$ and $V_L$ are the right and left eigenvectors of A, respectively.
Corollary 1. By unfolding the above decomposition we get another expression for the spectral decomposition:

$$A = \sum_{i=1}^{n} \lambda_i p_i u_i^T,$$

where $\lambda_i$, $p_i$ and $u_i$ are the i-th eigenvalue, right eigenvector and left eigenvector, respectively.
Now we are ready to prove the main result of this part.

Theorem 1. $Q^\infty(j, k) = p_k(j)$ for $u + 1 \le k \le l + u$ and $1 \le j \le n$, where $p_k(j)$ denotes element j of the k-th right eigenvector, which is unitary.

Proof. By Lemma 2 we can write $Q = V_R D V_L^T$. Since $V_L^T V_R = I$, it is easily seen that $Q^n = V_R D^n V_L^T$, or equivalently $Q^n = \sum_{i=1}^{n} \lambda_i^n p_i u_i^T$. So as $n \to \infty$, all eigenvectors with eigenvalue less than one disappear and the unitary eigenvalues and eigenvectors remain:
and eigenvectors remain:


l+u
Q∞ = pi uTi .
i=u+1

By Lemma 1 the left eigenvector $u_i$ can be represented as a vector of zeros with the exception of the i-th element equal to one, for $u + 1 \le i \le l + u$. Therefore $Q^\infty(\cdot, k)$ is constructed from $u_k$, and all other $u_i$ have zero elements in the corresponding places. Consequently $Q^\infty(j, k) = p_k(j)$ for $u + 1 \le k \le l + u$.
Applying Theorem 1 to equation (7), the final solution of LP is stated as:

$$f_u(j) = \sum_{k=u+1}^{l+u} p_k(j)\, y(k). \qquad (8)$$

Therefore, $f_u$ can be expressed in terms of the right unitary eigenvectors of Q. As a result, $f_u$ remains unchanged if these eigenvectors are preserved in a CG process. This fact will become clear in the next section.
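As a numerical illustration of equations (5)-(8) (ours, not from the paper), one can build the absorbing matrix Q and verify that a large power of Q applied to f(0) = (0; y), in which only the unitary eigenvectors survive, yields the LP labels:

```python
import numpy as np

def lp_limit(W, y, labeled, power=10_000):
    # Reorder nodes as (unlabeled, labeled), build Q from eq. (5), and
    # approximate Q^infinity f(0) with a large matrix power.
    n = W.shape[0]
    u = [i for i in range(n) if i not in set(labeled)]
    order = u + list(labeled)
    P = (W / W.sum(axis=1, keepdims=True))[np.ix_(order, order)]
    Q = P.copy()
    Q[len(u):, :] = 0.0
    Q[len(u):, len(u):] = np.eye(len(labeled))
    f0 = np.concatenate([np.zeros(len(u)), np.asarray(y, float)])
    f = np.linalg.matrix_power(Q, power) @ f0
    return f[:len(u)]   # agrees with the harmonic solution of eq. (4)
```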

4 Manifold Coarse Graining


In some cases the amount of data is so large that storing and manipulating it consumes large memory and imposes high processing cost. We will show in the next subsections that some graph nodes can be merged without seriously affecting LP on the remaining and oncoming data.

4.1 Exact Coarse Graining


Consider the graph in Figure 1 constructed from data. Nodes 1 and 2 have the same neighbors and are both unlabeled. Suppose the rows of the absorbing stochastic matrix Q corresponding to these nodes are the same, i.e.

$$\frac{w_{13}}{w_{13} + w_{14}} = q_{13} = q_{23} = \frac{w_{23}}{w_{23} + w_{24}}, \qquad \frac{w_{14}}{w_{13} + w_{14}} = q_{14} = q_{24} = \frac{w_{24}}{w_{23} + w_{24}}.$$

Then these two nodes are affected identically by their neighbors in label propagation. Intuitively, merging these two nodes should not disturb the process of propagating the labels. After this step, weights should be summed up. This process is illustrated in Figure 1: node 0 is formed by summing the weights of nodes 1 and 2.
This intuition can be verified analytically. If f and f' are the estimated label functions before and after this merge respectively, then:

Before merge:
$$f(1) = q_{13} f(3) + q_{14} f(4), \qquad f(2) = q_{23} f(3) + q_{24} f(4),$$
$$f(3) = q_{31} f(1) + q_{32} f(2) + \cdots, \qquad f(4) = q_{41} f(1) + q_{42} f(2) + \cdots$$

After merge:
$$f'(0) = q_{03} f'(3) + q_{04} f'(4),$$
$$f'(3) = q_{30} f'(0) + \cdots, \qquad f'(4) = q_{40} f'(0) + \cdots$$

[Figure 1 depicts the neighborhood graph before the merge (left), with nodes 1 and 2 connected to nodes 3 and 4 by weights w13, w14, w23, w24, and after the merge (right), where node 0 replaces nodes 1 and 2 with edge weights w13 + w23 and w14 + w24.]

Fig. 1. Merging the two vertices 1 and 2 does not disturb label propagation.

It is straightforward to see that $q_{03} = q_{13} = q_{23}$ and $q_{04} = q_{14} = q_{24}$, so the columns in the first two rows of the above equations are equivalent. Also, since after merging we have $q_{31} + q_{32} = q_{30}$ and $q_{41} + q_{42} = q_{40}$, the columns of the last two rows impose the same effect on nodes 3 and 4. Thus if nodes 1 and 2 are unlabeled, $f^{(t)}(1) = f^{(t)}(2) = f'^{(t)}(0)$, $f'^{(t)}(3) = f^{(t)}(3)$ and $f'^{(t)}(4) = f^{(t)}(4)$ in all steps of LP in the original and reduced graphs.
This process can be modeled by the transformation $Q' = LQR$ where

$$L = \begin{pmatrix} \frac{d_1}{d_1 + d_2} & \frac{d_2}{d_1 + d_2} & 0 & \cdots & 0 \\ 0 & 0 & & & \\ \vdots & \vdots & & I_{n-2} & \\ 0 & 0 & & & \end{pmatrix}, \qquad R = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & & & \\ \vdots & & I_{n-2} & \\ 0 & & & \end{pmatrix} \qquad (9)$$

and $d_i = \sum_j W(i, j)$. One can see that the transformation simply merges the rows and columns of Q corresponding to nodes 1 and 2 such that all rows remain normalized. For an undirected graph, the stochastic matrix has the property that its first left eigenvector is proportional to $(d_1, \ldots, d_n)$. It is easy to see that this is also true for the unlabeled part of the absorbing stochastic matrix, $Q_{uu}$, which can be viewed as the scaled stochastic matrix of an undirected graph. Since the first u elements of the eigenvectors of Q are equal to the eigenvectors of $Q_{uu}$, this holds for unlabeled nodes. For unlabeled nodes $d_i = u_1(i)$; and since only unlabeled nodes are coarsened, $u_1(i)$ may alternatively be used in (9).
This transformation has interesting properties and is well studied in [17], which presents a similar algorithm based on random walks in social networks. In the general case, R and L can be defined such that the transform merges all the nodes with the same neighbors; there may be more than two nodes that have similar neighbors and can thus be merged, as sketched below.
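A sketch of this merge transformation for a single pair of nodes, following equation (9), with d the vector of node degrees (our code, for illustration):

```python
import numpy as np

def merge_pair(Q, d, i, j):
    # Build L and R as in eq. (9): rows i and j of Q are combined with
    # weights d_i, d_j and their columns are summed, yielding Q' = L Q R.
    n = Q.shape[0]
    keep = [k for k in range(n) if k != j]
    L = np.zeros((n - 1, n))
    R = np.zeros((n, n - 1))
    for r, k in enumerate(keep):
        L[r, k] = 1.0
        R[k, r] = 1.0
    r_i = keep.index(i)
    L[r_i, i] = d[i] / (d[i] + d[j])
    L[r_i, j] = d[j] / (d[i] + d[j])
    R[j, r_i] = 1.0        # node j folds into the cluster of node i
    return L @ Q @ R
```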
We will proceed using spectral analysis which will help us in later sections
where we introduce non-exact merging of nodes. The next lemma relates spectral
analysis to CG.

Lemma 3. Rows i and j of Q are equal if and only if $p_k(i) = p_k(j)$ for all k with $\lambda_k \neq 0$, where $p_k$ is a right eigenvector of Q.
Proof. Proof is immediate from the definition of eigenvectors and Corollary 1 of
spectral decomposition.
Lemma 3 states that nodes to be merged can be selected via eigenvectors of
the absorbing stochastic matrix Q, instead of comparing the rows of Q itself.
We decide to merge nodes if their corresponding elements are equal along all
the eigenvectors with nonzero eigenvalue. We will see how this spectral view of
merging helps us develop and analyze the non-exact case where we merge nodes
even if they aren’t exactly identical along eigenvectors.
We should fix some notation before we proceed. A prime will be used to indicate objects after CG; i.e. Q', p', u', n' are the stochastic matrix, right and left eigenvectors, and number of nodes after CG. Let $S_1, \ldots, S_{n'}$ be the n' clusters of nodes found after CG. Also let S be the set of all nodes that are merged into some cluster.
We wish to use ideas from section 3 to provide a spectral view of coarsening
stated so far in this section. We need the following lemma.
Lemma 4 [17]. If the conditions of Lemma 3 hold, Lp is a right unitary eigenvector of Q' with the same eigenvalue as p, where p is a right unitary eigenvector of Q.
First note that Lp simply retains elements of p which are not merged and removes
repetition of the same elements for nodes that are merged. So after CG, the
right eigenvectors of Q and associated eigenvalues are preserved. Recall from the
previous section that the right eigenvectors are directly related to the result of
LP. We are now ready to prove the following theorem.
Theorem 2. LP solution is preserved for nodes or cluster of nodes in exact CG,
i.e. when we merge nodes if their corresponding elements are the same along all
right eigenvectors with nonzero eigenvalues.
Proof. Consider equation (8) from the previous section for computing labels based upon right eigenvectors,

$$f_u(j) = \sum_{k=u+1}^{u+l} p_k(j)\, y(k). \qquad (10)$$

We know from Lemma 4 that $Lp_k$ is also a right unitary eigenvector of Q'. Suppose j' is the new index of the node or cluster of nodes in which node j resides after CG; similarly,

$$f'_{u'}(j') = \sum_{k=u'+1}^{u'+l} (Lp_k)(j')\, y(k). \qquad (11)$$

Considering $(Lp_k)(j') = p_k(j)$ we get the result $f_u(j) = f'_{u'}(j')$. This means that the labels of unlabeled nodes are preserved under CG.

This kind of data reduction will preserve LP results on the manifold of data and, as a consequence, the manifold structure in the reduced graph. This is elaborated upon in the next subsections. Equality along all eigenvectors with nonzero eigenvalues is a restrictive constraint for CG. In the next section we will see how this criterion may be relaxed.

4.2 Approximate Coarse Graining


In real problems, the case where the neighbors of two or more nodes are exactly the same rarely occurs. This motivates an approximate coarse graining, i.e. merging nodes when their corresponding elements in the eigenvectors are close enough. For example, along the i-th eigenvector we consider two elements approximately the same if their difference is no more than η_i, and we merge nodes if they are pairwise approximately the same.
Before we proceed, it is beneficial to consider the term $p_i - RLp_i$. $Lp_i$ is the approximate right eigenvector of Q'; multiplying by R unfolds the clusters. Defining $\varepsilon_i = p_i - RLp_i$, we would like to find an upper bound on $\varepsilon_i$ for nodes to be merged. The smaller $\|\varepsilon_i\|$ is, the more similar $Lp_i$ is to $p_i$; so minimizing $\varepsilon_i$ better preserves the i-th eigenvector. On the other hand, LP results depend on the unitary eigenvectors, so a good practice is to perform CG on the unitary eigenvectors only. While still approximately preserving LP, this allows more reduction. This approximation will become clearer when the error bounds are analyzed.
approximation will be clearer when error bounds are analyzed.
For simplicity, consider node 1, placed in a cluster $S_1 = \{1, \ldots, m\}$. Using

$$(RLp_i)(1) = \frac{u_1(1)}{\sum_{j=1}^{m} u_1(j)}\, p_i(1) + \cdots + \frac{u_1(m)}{\sum_{j=1}^{m} u_1(j)}\, p_i(m) \qquad (12)$$

we may write

$$\begin{aligned}
\varepsilon_i(1) &= p_i(1) - (RLp_i)(1) \\
&= \frac{\sum_{j=2}^{m} u_1(j)}{\sum_{j=1}^{m} u_1(j)}\, p_i(1) - \frac{u_1(2)}{\sum_{j=1}^{m} u_1(j)}\, p_i(2) - \cdots - \frac{u_1(m)}{\sum_{j=1}^{m} u_1(j)}\, p_i(m) \\
&= \frac{u_1(2)}{\sum_{j=1}^{m} u_1(j)} \big(p_i(1) - p_i(2)\big) + \cdots + \frac{u_1(m)}{\sum_{j=1}^{m} u_1(j)} \big(p_i(1) - p_i(m)\big) \\
&\le \eta_i\, \frac{\sum_{j=2}^{m} u_1(j)}{\sum_{j=1}^{m} u_1(j)} = \eta_i \left(1 - \frac{u_1(1)}{\sum_{j=1}^{m} u_1(j)}\right). \qquad (13)
\end{aligned}$$

The last inequality is due to the fact that in each cluster, along the i-th eigenvector, the differences between elements are no more than η_i. Inequality (13) bounds the difference between the elements of an eigenvector corresponding to a node before CG and the desired value after CG. Note that $\varepsilon_i$ is zero if CG is exact, or for a node that is not merged.
that is not merged.
Suppose p is a right eigenvector of Q. We would like to have Lp as a right eigenvector of Q', so as to better preserve the manifold structure and

LP. It is thus interesting to see whether Lp can be approximately considered as an eigenvector of Q' with approximately the same eigenvalue as p. Considering

$$Q'(Lp) = \lambda(Lp) + e, \qquad (14)$$

we would like to minimize e. Following [12] we have

$$\sum_{i=1}^{n'-l} \frac{e(i)^2}{u_1(i)} \le 2D, \qquad (15)$$

where

$$D = \sum_{i=1}^{k} \lambda_i^2 \sum_{j=1}^{n} u_1(j)\, \varepsilon_i(j)^2 \qquad (16)$$

and $\varepsilon_i = p_i - RLp_i$, where the $p_i$ and $\lambda_i$ are the right eigenvectors and associated eigenvalues along which CG is performed (as this bound hints, CG need not be performed along all eigenvectors; we will explain this point shortly). It is noticeable that the bound (15) is a general bound for any coarsening algorithm. It is also originally stated for the stochastic matrix of undirected graphs such as P; however, as stated in Section 4.1, the unlabeled part of Q can be considered as such a matrix.
Considering (14) and (15), if $\lambda \gg \sqrt{D}$ then Lp is a good approximation of p. Given the eigenvectors that must be preserved, we can determine how to choose η_i for a good approximation. The inequality $\lambda_l \ge \sqrt{D}\,\omega(1)$ should be satisfied. For example, we may seek sufficient conditions to satisfy

$$\lambda_l \ge \sqrt{nD} \qquad (17)$$

for every eigenvector $p_l$ that we wish to preserve. Using equation (16), we want to find η_i for all i such that (17) holds.
to find ηi for all i such that (17) holds.
For simplicity, consider cluster $S_1 = \{1, \ldots, m\}$. By using inequality (13),

$$\begin{aligned}
\sum_{j\in S_1} u_1(j)\,\varepsilon_i(j)^2 &\le \sum_{j\in S_1} u_1(j)\,\eta_i^2 \left(1 - \frac{u_1(j)}{\sum_{r\in S_1} u_1(r)}\right)^2 \\
&= \eta_i^2 \left( \sum_{j\in S_1} u_1(j) - 2\sum_{j\in S_1} \frac{u_1(j)^2}{\sum_{r\in S_1} u_1(r)} + \sum_{j\in S_1} \frac{u_1(j)^3}{\big(\sum_{r\in S_1} u_1(r)\big)^2} \right) \\
&= \eta_i^2 \left( \sum_{j\in S_1} u_1(j) - 2\sum_{j\in S_1} \frac{u_1(j)^2}{\sum_{r\in S_1} u_1(r)} + \frac{\big(\sum_{r\in S_1} u_1(r)\big)^3}{\big(\sum_{r\in S_1} u_1(r)\big)^2} - C \right) \\
&\le 2\,\eta_i^2 \sum_{j\in S_1} u_1(j) \qquad (18)
\end{aligned}$$

where the last inequality is due to the fact that

$$\sum_{j\in S_1} \frac{u_1(j)^2}{\sum_{r\in S_1} u_1(r)} > 0, \qquad C > 0.$$

Since $\varepsilon_i = 0$ for nodes which are not merged,

$$\sum_{j=1}^{n} u_1(j)\,\varepsilon_i(j)^2 \le 2\,\eta_i^2 \sum_{j\in X_u} u_1(j). \qquad (19)$$
Now we are ready to find an appropriate value for η_i to satisfy (17):

$$D = \sum_{i=1}^{k} \lambda_i^2 \sum_{j=1}^{n} u_1(j)\,\varepsilon_i(j)^2 \le 2\sum_{i=1}^{k} \lambda_i^2\, \eta_i^2 \sum_{j\in U} u_1(j). \qquad (20)$$

Let $M = \sum_{j\in U} u_1(j)$. For $\lambda_l \ge \sqrt{nD}$ to be satisfied for every l:

$$2\sum_{i=1}^{k} \lambda_i^2\, \eta_i^2\, M \le \frac{\lambda_l^2}{n}. \qquad (21)$$

It is easy to verify that choosing η_i such that

$$\eta_i^2 \le \frac{1}{2kM}\, \frac{\lambda_l^2}{\lambda_i^2\, n} \qquad (22)$$

holds for every l is a sufficient condition ensuring that $Lp_l$ is almost surely preserved; i.e., if $\tilde{\lambda}_{\min}$ is the minimum eigenvalue among the eigenvectors that must be preserved, then

$$\eta_i^2 \le \frac{1}{2kM}\, \frac{\tilde{\lambda}_{\min}^2}{\lambda_i^2\, n}. \qquad (23)$$

The bound derived in (23) shows how η_i should be chosen to ensure that $Lp_i$ is similar to a right eigenvector of Q'.
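For illustration, the bound (23) and a simple one-dimensional grouping along a single eigenvector can be sketched as follows; the sort-and-sweep grouping rule is one of many possibilities and is our choice:

```python
import numpy as np

def eta_from_bound(lams, k, M, lam_min, n):
    # eta_i^2 <= lam_min^2 / (2 k M lam_i^2 n), following (23).
    return np.sqrt(lam_min**2 / (2 * k * M * np.asarray(lams)**2 * n))

def group_along_eigenvector(p, eta):
    # Cluster nodes whose entries along one eigenvector differ by at most
    # eta: sweep the sorted entries and cut when the span exceeds eta.
    order = np.argsort(p)
    groups, current, start = [], [order[0]], p[order[0]]
    for idx in order[1:]:
        if p[idx] - start <= eta:
            current.append(idx)
        else:
            groups.append(current)
            current, start = [idx], p[idx]
    groups.append(current)
    return groups
```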

4.3 Preserving Manifold Structure


We have seen how the size of data may be reduced while preserving LP proper-
ties. Theorem 1 shows that the LP solution is directly related to unitary eigen-
vectors of the absorbing stochastic matrix. Thus by CG along these eigenvectors
we could retain labels while reducing the amount of data. This process is suf-
ficient to preserve LP but may disturb the true underlying manifold structure.
To overcome this we can do CG not only along unitary eigenvectors, but we also
along eigenvectors with larger eigenvalues.

To elaborate, note that the manifold structure is closely related to the evolution of LP in its early steps, and not just the limiting case where the number of steps tends to infinity. Consider the one-step process of propagating labels, $f^{(t+1)} = Qf^{(t)}$. The more properties of Q are preserved in Q', the more the underlying structure is retained. After k steps of propagation we have $f^{(t+k)} = Q^k f^{(t)}$. Using Corollary 1, we can write $Q^k = \sum_{i=1}^{n} \lambda_i^k p_i u_i^T$, so as k becomes larger, the effect of large eigenvalues and their associated eigenvectors becomes more important. To preserve LP in its early steps it is therefore reasonable to choose the eigenvectors with larger eigenvalues and perform CG along them. In this manner, in addition to LP, the general structure of the manifold is preserved. Figure 2 illustrates the process of CG on a toy dataset with one labeled node from each class. The figure shows the general structure of the manifold and its preservation under CG. Also note that the sparse section of green nodes is preserved, which is essential to capture the manifold structure.

(a) original graph (b) graph after coarse-graining

Fig. 2. Process of CG on a toy dataset with 800 nodes. Two labeled nodes are provided at the head and tail of the spiral, shown as red asterisks. Green circle and blue square nodes represent the different classes. The area of each circle is proportional to the number of nodes that reside in the corresponding cluster. After CG, 255 nodes remain, a reduction of 68%.

Performing CG along all the eigenvectors would better preserve the manifold structure. For merging two nodes, however, this requires that they be close along all the eigenvectors, resulting in less reduction and contradicting our initial goal of data reduction. So in practice only a few eigenvectors are chosen to be preserved, and as we have seen, the best choices are the eigenvectors associated with the larger eigenvalues. The importance of preserving the manifold structure becomes evident when labels are to be predicted for unseen data, e.g., in online learning.

5 Experiments
We evaluate our method empirically on 3 real-world datasets: digit, letter and image classification. The first is the UCI letter recognition dataset [18]. The next is USPS digit recognition, where we reduce the dimension of each datum to 64 with PCA. The Caltech dataset [19] is used for image classification; features are extracted using CEDD [20]. Adjacency matrices are constructed using 5-NN with the bandwidth set to the mean of the standard deviation of the data. 20 data points are labeled. In addition to the corresponding 20 unitary eigenvectors, the 5 other top eigenvectors are selected for CG. η_i is set to divide the values along the i-th eigenvector into I groups, where I is the final parameter that is varied to obtain different reduction sizes. In all experiments on digits and letters the average accuracy among 10 pairwise problems is reported. On Caltech we use 2 specific classes. Four experiments are designed to evaluate our method.

5.1 Eigenvector Preservation


Our CG method captures manifold structure based on eigenvector preservation.
To show how well eigenvectors are preserved we compare Lpi and pi for top ten
eigenvectors that are to be preserved in USPS dataset. We reduce the number
of nodes from 1000 to 92. Table 1 shows eigenvalues and cosine similarity of
eigenvectors before and after CG. It is easily seen that eigenvalues and eigenvec-
tors are well preserved. This guarantees a good accuracy of classification after
reduction as demonstrated in the next subsections.

Table 1. Eigenvalue and eigenvector preservation in CG for the top ten eigenvectors
along which CG is performed

i                            1      2      3      4      5      6       7      8      9      10
λ_i (before CG)           1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 -0.9999 0.9999 0.9999 0.9997
λ_i (after CG)            1.0000 1.0000 1.0000 1.0000 1.0000 0.9999 -0.9999 0.9999 0.9998 0.9997
(Lp_i)^T p_i / ‖Lp_i‖‖p_i‖ 0.9967 0.9925 0.9971 0.9910 0.9982 0.9964  0.9999 0.9909 0.8429 0.9982
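The similarity row of Table 1 is a plain cosine; as a sketch (helper name ours, with L standing for the paper's operator relating the coarse-grained eigenvectors back to the original graph):

    import numpy as np

    def cosine_preservation(L, p):
        # (Lp)^T p / (||Lp|| ||p||), the quantity in the last row of Table 1.
        Lp = L @ p
        return float(Lp @ p / (np.linalg.norm(Lp) * np.linalg.norm(p)))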

5.2 Online Classification

In this experiment, we design a real online scenario where the data buffer size is
at most 200 and CG is performed whenever the maximum buffer limit is reached. Data arrive
sequentially and the label of each new data point is predicted. The classification result at time
t is reported over all data up to this time. We compare our result with the graph
quantization method [11] and a baseline method which performs classification
without reducing the size. As Figure 3 shows, our method is quite effective, with
a performance comparable to the baseline. This efficiency is due to the fact that
the manifold structure and the label information are considered in the process of data
reduction. Note the inefficiency of the graph quantization method, which performs
data reduction with respect to Euclidean distances, an assumption that does not hold
when data lie on a manifold.

[Figure 3: accuracy vs. time step for baseline, coarse graining, and graph quantization; panels: (a) Letter recognition, (b) Digit recognition, (c) Image classification]

Fig. 3. Online classification. Data arrive sequentially and the maximum buffer size is 200.

5.3 Manifold Structure Preservation

In this experiment, CG is performed on 500 data points to reduce the data size to 100.
One test data point is added and its label is predicted. Accuracy is averaged
over 500 new data points added separately. We proceed in this manner intentionally,
to prevent the new data points from recovering the manifold structure. The result is thus an
indication of how well the manifold structure is preserved by CG. Figure 4 shows
the effectiveness of our CG method compared to the graph quantization method [11]
on USPS and UCI letters. Again, we attribute this to the "manifold-wise" nature
of our method.

[Figure 4: accuracy vs. number of clusters for coarse graining and graph quantization in panels (a) Letter recognition and (b) Digit recognition; accuracy vs. outlier ratio in panel (c) Outlier robustness]

Fig. 4. (a,b): Capability of the methods to preserve the manifold structure. 500 nodes are
coarse grained and the classification accuracy is averaged over 500 new data points added
separately. (c): Comparison of robustness to outliers on USPS.

5.4 Outlier Robustness

In this experiment we evaluate the robustness of our method to outliers in data from
USPS. Noise is added manually and the classification accuracy is calculated; outliers
are generated by adding multiples of the data variance. Figure 4(c) shows the robustness
of our method compared to the graph quantization method. In our method
outliers are merged and their effect is reduced, while in the graph quantization
method separate clusters are devoted to outliers.

6 Conclusion
In this paper, a novel semi-supervised CG algorithm is proposed to reduce the
number of data points while preserving the manifold structure. To this end, a new
formulation of LP is used to derive a new spectral view of the HS. We show that
the manifold structure is closely related to the eigenvectors of a variation of the
stochastic matrix. This structure is well preserved by any algorithm which guarantees
small distortions in the corresponding eigenvectors. Exact and approximate
coarse graining algorithms are provided alongside a theoretical analysis of
how well the LP properties are preserved. The proposed method is evaluated on
three real-world datasets and outperforms the state-of-the-art CG method in three
scenarios: online classification, manifold preservation, and robustness against outliers.
The performance of our method is comparable to that of an algorithm that utilizes
all the data in a simulated online scenario.
A theoretical analysis of robustness against noise, extending the spectral viewpoint
to other manifold learning methods, and deriving tighter error bounds on
CG, to name a few, are interesting problems that remain for future work.

Acknowledgments. We would like to thank M. Valko and B. Kveton for providing
us with experimental details of the quantization method, A. Soltani-Farani for
reviewing the manuscript, and the anonymous reviewers for their helpful comments.
This work was supported by the National Elite Foundation of Iran.

References
1. Zhu, X.: Semi-Supervised Learning Literature Survey. Technical Report 1530, De-
partment of Computer Sciences, University of Wisconsin Madison (2005)
2. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised Learning. MIT Press, Cam-
bridge (2006)
3. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold Regularization: a Geometric Frame-
work for Learning from Labeled and Unlabeled Examples. Journal of Machine
Learning Research 7, 2399–2434 (2006)
4. Duchenne, O., Audibert, J., Keriven, R., Ponce, J., Segonne, F.: Segmentation by
Transduction. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2008, pp. 1–8 (2008)
5. Belkin, M., Niyogi, P.: Using Manifold Structure for Partially Labeled Classifica-
tion. Advances in Neural Information Processing Systems 15, 929–936 (2003)
6. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised On-Line Boosting for Ro-
bust Tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I.
LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
7. He., X.: Incremental Semi-Supervised Subspace Learning for Image Retrieval. In:
Proceedings of the ACM Conference on Multimedia (2004)
8. Moh, Y., Buhmann, J.M.: Manifold Regularization for Semi-Supervised Sequential
Learning. In: ICASSP (2009)
9. Goldberg, A., Li, M., Zhu, X.: Online Manifold Regularization: A New Learning
Setting and Empirical Study. In: Proceeding of ECML (2008)
10. Dasgupta, S., Freund, Y.: Random Projection Trees and Low Dimensional Mani-
folds. Technical Report CS2007-0890, University of California, San Diego (2007)
406 M. Farajtabar et al.

11. Valko, M., Kveton, B., Ting, D., Huang, L.: Online Semi-Supervised Learning
on Quantized Graphs. In: Proceedings of the 26th Conference on Uncertainty in
Artificial Intelligence, UAI (2010)
12. Lafon, S., Lee, A.B.: Diffusion Maps and Coarse-Graining: A Unified Framework
for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization.
IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1393–1403
(2006)
13. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local and
global consistency. Neural Information Processing Systems (2004)
14. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-Supervised Learning Using Gaussian
Fields and Harmonic Functions. In: ICML (2003)
15. Zhu, X., Ghahramani, Z.: Learning from Labeled and Unlabeled Data with Label
Propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University
(2002)
16. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes,
The Art of Scientific Computing, 3rd edn. Cambridge University Press, Cambridge
(2007)
17. Gfeller, D., De Los Rios, P.: Spectral Coarse Graining of Complex Networks. Phys-
ical Review Letters 99, 3 (2007)
18. Frank, A., Asuncion, A.: UCI Machine Learning Repository (2010)
19. Fei, L., Fergus, R., Perona, P.: Learning Generative Visual Models From Few Train-
ing Examples: An Incremental Bayesian Approach Tested on 101 Object Cate-
gories. In: IEEE CVPR 2004, Workshop on Generative Model Based Vision (2004)
20. Chatzichristofis, S.A., Boutalis, Y.S.: CEDD: Color and Edge Directivity Descrip-
tor: A Compact Descriptor for Image Indexing and Retrieval. In: ICVS, pp. 312–322
(2008)
Learning from Partially Annotated Sequences

Eraldo R. Fernandes1 and Ulf Brefeld2
1 Pontifícia Universidade Católica do Rio de Janeiro, Brazil
efernandes@inf.puc-rio.br
2 Yahoo! Research, Barcelona, Spain
brefeld@yahoo-inc.com

Abstract. We study sequential prediction models in cases where only
fragments of the sequences are annotated with the ground-truth. The
task does not match the standard semi-supervised setting and is highly
relevant in areas such as natural language processing, where completely
labeled instances are expensive and require editorial data. We propose to
generalize the semi-supervised setting and devise a simple transductive
loss-augmented perceptron to learn from inexpensive partially annotated
sequences that could for instance be provided by laymen, the wisdom
of the crowd, or even automatically. Experiments on mono- and cross-
lingual named entity recognition tasks with automatically generated par-
tially annotated sentences from Wikipedia demonstrate the effectiveness
of the proposed approach. Our results show that learning from partially
labeled data is never worse than standard supervised and semi-supervised
approaches trained on data with the same ratio of labeled and unlabeled
tokens.

1 Introduction
The problem of labeling, annotating, and segmenting observation sequences
arises in many applications across various areas such as natural language process-
ing, information retrieval, and computational biology; exemplary applications
include named entity recognition, information extraction, and protein secondary
structure prediction.
Traditionally, sequence models such as hidden Markov models [26,14] and
variants thereof have been applied to label sequence learning [9] tasks. Learning
procedures for generative models adjust the parameters such that the joint like-
lihood of training observations and label sequences is maximized. By contrast,
from an application point of view, the true benefit of a label sequence predictor
corresponds to its ability to find the correct label sequence given an observation
sequence. Thus, many variants of discriminative sequence models have been ex-
plored, including maximum entropy Markov models [20], perceptron re-ranking
[7,8], conditional random fields [16,17], structural support vector machines [2,34],
and max-margin Markov models [32].
Learning discriminative sequential prediction models requires ground-truth
annotations, and compiling a corpus that allows for state-of-the-art performance
on a novel task is expensive not only financially but also in terms of the time it


Table 1. Different interpretations for "I saw her duck under the table" [15]

I saw [NP her] [VP duck under the table]. → She ducked under the table.
I [VP saw [NP her duck] [PP under the table]]. → The seeing is done under the table.
I saw [NP her duck [PP under the table]]. → The duck is under the table.

takes to manually annotate the observations. Frequently, annotating data with
ground-truth cannot be left to laymen due to the complexity of the domain.
Instead trained editors need to deal with the pitfalls of the domain at hand such
as morphological, grammatical, and word sense disambiguation when dealing
with natural language. Table 1 shows an ambiguous sentence with three different
interpretations that cannot be resolved without additional context.
Semi-supervised learning approaches [6] aim at reducing the need for large
annotated corpora by incorporating unlabeled examples in the optimization; to
deal with the unlabeled data one assumes that the data meets certain criteria.
A common assumption exploits that similar examples are likely to have simi-
lar labelings. This so-called cluster assumption can be incorporated into semi-
supervised structural prediction models by means of Laplacian priors [17,1],
entropy-based criteria [18], transduction [37], or SDP relaxations [35]. Al-
though these methods have been shown to improve over the performance of
purely supervised structured baselines, they do not reduce the amount of re-
quired labeled examples significantly as it is sometimes the case for univariate
semi-supervised learning. One of the key reasons is the variety and number of
possible annotations for the same observation sequence; there are |Σ|^T different
annotations for a sequence of length T with tag set Σ and many of them are
similar in the sense that they differ only in a few labels.
In this paper, we extend the semi-supervised learning setting and study learn-
ing from partially annotated data. That is, in our setting only some of the ob-
served tokens are annotated with the ground-truth while the rest of the sequence
is unlabeled. The rationale is as follows: if the target concept can be learned from
partially labeled sequences, annotation costs can be significantly reduced. Large
parts of an unlabeled corpus could for instance be labeled by laypeople using
the wisdom of the crowd via platforms like MechanicalTurk1 or CrowdFlower2.
Prospective workers could be asked to only annotate those parts of a sequence
they feel confident about and if two or more workers disagree on the labeling of
a sentence, the mismatches are simply ignored in the model generation. In the
example in Table 1, one could label the invariant token her=NP and leave the
ambiguous parts of the sentence unlabeled.
We devise a straightforward transductive extension of the structured loss-
augmented perceptron that allows the inclusion of partially labeled sequences in the
training process. This extension contains the supervised and the semi-supervised
structured perceptron as special cases. To demonstrate that we can learn from
inexpensive data, we evaluate our method on named entity recognition tasks.
1 https://www.mturk.com
2 http://www.crowdflower.com

We show in a controlled experiment that learning with partially labeled data is
always on par or better than standard supervised and semi-supervised baselines
(trained on data with the same ratio of labeled and unlabeled tokens). Moreover,
we show that mono- and cross-lingual named entity recognition can significantly
be improved by using additional corpora that are automatically extracted from
Wikipedia3 at practically no cost at all.
The remainder is structured as follows. Section 2 reviews related work and Sec-
tion 3 introduces label sequence learning. We devise the transductive perceptron
in Section 4. Section 5 reports on the empirical results and Section 6 concludes.

2 Related Work
Learning from partially annotated sequences has been studied by [30] who extend
HMMs to explicitly exclude states for some observations in the estimation of the
models. [22] propose to incorporate domain-specific ontologies into HMMs to
provide labels for the unannotated parts of the sequences, [10] cast learning an
HMM for partially labeled data into a large-margin framework and [33] present
an extension of maximum entropy Markov models (MEMMs) and conditional
random fields (CRFs). The latent-SVM [36] allows for the incorporation of latent
variables in the underlying graphical structure; the additional variables implicitly
act as indicator variables and conditioning on their actual value eases model
adaptation because it serves as an internal clustering.
The generalized perceptron for structured output spaces is introduced by [7,8].
Altun et al. [2] leverage this approach to support vector machines and explore
label sequence learning tasks with implicit 0/1 loss. McAllester et al. [19] propose
to incorporate loss functions into the learning process of perceptron-like algo-
rithms. Transductive approaches for semi-supervised structured learning are for
instance studied in [17,1,35,18,37], where the latter is the closest to our approach
as the authors study transductive support vector machines with completely la-
beled and unlabeled examples.
Generating fully annotated corpora from Wikipedia has been studied by
[24,27,21]. While [21] focus on English and exploit the semi-structured content
of the info-boxes, [24] and [27] propose heuristics to assign tags to Wikipedia
entries by manually defined patterns.

3 Preliminaries
The task in label sequence learning [9] is to find a mapping from a sequential
input x = ⟨x_1, . . . , x_T⟩ to a sequential output y = ⟨y_1, . . . , y_T⟩, where y_t ∈ Σ;
i.e., each element of x is annotated with an element of the output alphabet Σ,
which denotes the set of tags. We denote the set of all possible labelings of x by
Y(x).
The sequential learning task can be modeled in a natural way by a Markov
random field where we have edges between neighboring labels and between
3 http://www.wikipedia.com

[Figure 1: chain graph with hidden labels y_1, y_2, . . . , y_T and observations x_1, x_2, . . . , x_T]

Fig. 1. A Markov random field for label sequence learning. The x_t denote observations
and the y_t their corresponding hidden class variables.

label-observation pairs, see Figure 1. The conditional density p(y|x) factorizes
across the cliques [12] and different feature maps can be assigned to the different
types of cliques, φ_trans for transitions and φ_obs for emissions [2,16]. Finally,
interdependencies between x and y are captured by an aggregated joint feature
map φ : X × Y → R^d,

    φ(x, y) = ( Σ_{t=2}^{T} φ_trans(x, y_{t−1}, y_t) , Σ_{t=1}^{T} φ_obs(x, y_t) )

which gives rise to log-linear models of the form

    g(x, y; w) = w^T φ(x, y).

The feature map exhibits a first-order Markov property and, as a result, decoding
can be performed by a Viterbi algorithm [11,31] in O(T|Σ|^2), so that, once
optimal parameters w* have been found, these are used as plug-in estimates to
compute the prediction for a new and unseen sequence x′,

    ŷ = f(x′; w*) = argmax_{ỹ∈Y(x′)} g(x′, ỹ; w*).    (1)

The optimal function f(·; w*) minimizes the expected risk E[ℓ(y, f(x; w*))], where
ℓ is a task-dependent, structural loss function. In the remainder, we will focus
on the 0/1- and the Hamming loss to compute the quality of predictions,

    ℓ_{0/1}(y, ỹ) = 1_{[y ≠ ỹ]} ;    ℓ_h(y, ỹ) = Σ_{t=1}^{|y|} 1_{[y_t ≠ ỹ_t]}    (2)

where the indicator function 1_{[u]} = 1 if u is true and 0 otherwise.
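As a minimal illustration (helper names ours), Eq. (2) translates directly into code when label sequences are represented as lists of tags:

    def zero_one_loss(y, y_hat):
        # 1 iff the two sequences differ anywhere (left part of Eq. 2).
        return int(y != y_hat)

    def hamming_loss(y, y_hat):
        # Number of positions at which the sequences disagree (right part).
        return sum(int(a != b) for a, b in zip(y, y_hat))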

4 Transductive Loss-Augmented Perceptrons

4.1 The Structured Perceptron
The structured perceptron [7,2] is analogous to its univariate counterpart. Given
an infinite sequence (x_1, y_1), (x_2, y_2), . . . drawn i.i.d. from p(x, y), the structured

perceptron generates a sequence of models w_0 = 0, w_1, w_2, . . .. At time t, an
update is performed if the prediction ŷ_t = f(x_t; w_t) does not coincide with
the true output y_t; the update rule is given by

    w_{t+1} ← w_t + φ(x_t, y_t) − φ(x_t, ŷ_t).

Note that in case ŷ_t = y_t the model is not changed, that is, w_{t+1} ← w_t. After
an update, the model favors y_t over ŷ_t for the input x_t, and a simple extension
of Novikoff's theorem [25] shows that the structured perceptron is guaranteed to
converge to a zero-loss solution (if one exists) in at most t ≤ (r/γ̃)^2 ‖w*‖^2 steps,
where r is the radius of the smallest hypersphere enclosing the data points and
γ̃ is the functional margin of the data [8,2].
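A sketch of this update in code, assuming hypothetical helpers phi(x, y) for the joint feature map and decode(x, w) for the Viterbi argmax of Eq. (1):

    def perceptron_step(w, x, y, phi, decode):
        # Predict with the current model and update only on a mistake.
        y_hat = decode(x, w)
        if y_hat != y:
            w = w + phi(x, y) - phi(x, y_hat)
        return w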

4.2 Loss-Augmented Perceptrons
The above update formula intrinsically minimizes the 0/1-loss, which is generally
too coarse for differentiating the severity of erroneous annotations. To incorporate
task-dependent loss functions into the perceptron, the structured hinge loss
of a margin-rescaled SVM [34,19] can be used. The respective decoding problem
becomes

    ŷ = argmax_{ỹ∈Y(x_t)} [ ℓ(y_t, ỹ) − w_t^T φ(x_t, y_t) + w_t^T φ(x_t, ỹ) ]
      = argmax_{ỹ∈Y(x_t)} [ ℓ(y_t, ỹ) + w_t^T φ(x_t, ỹ) ].

Margin-rescaling can be intuitively motivated by recalling that the size of the
margin γ = γ̃/‖w‖ quantifies the confidence in rejecting an erroneously decoded
output ỹ. Re-weighting γ̃ with the current loss ℓ(y, ỹ) leads to a weaker rejection
confidence when y and ỹ are similar, while large deviations from the true
annotation imply a large rejection threshold. Rescaling the margin by the loss
implements the intuition that the confidence of rejecting a mistaken output is
proportional to its error.
Margin-rescaling can always be integrated into the decoding algorithm when
the loss function decomposes over the latent variables of the output structure, as is
the case for the Hamming loss in Eq. (2). The final model w* is a minimizer of
a convex relaxation of the theoretical loss (the generalization error) and is given by

    w* = argmin_{w̃} E[ max_{ỹ∈Y(x_t)} ( ℓ(y_t, ỹ) − w̃^T (φ(x_t, y_t) − φ(x_t, ỹ)) ) ].
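Since the Hamming loss decomposes per position, margin rescaling can be folded into the Viterbi recursion by adding a per-position penalty to the emission scores; a minimal sketch under that assumption (helper name ours, emission[t, tag] holding log-scores and y_true an integer tag sequence):

    import numpy as np

    def loss_augmented_emissions(emission, y_true):
        # Add a unit Hamming penalty to every wrong tag at every position;
        # plain Viterbi on the result then solves the loss-augmented argmax.
        scores = emission + 1.0                          # penalize all tags ...
        scores[np.arange(len(y_true)), y_true] -= 1.0    # ... except the true one
        return scores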

4.3 Transductive Perceptrons for Partially Labeled Data
We derive a straightforward transductive extension of the loss-augmented perceptron
that allows for dealing with partially annotated sequences. Instead of
the ground-truth annotation y of an observed sequence x, we are now given a
set z = {(t_j, σ_j)}_{j=1,...,m} with 1 ≤ t_j ≤ T and σ_j ∈ Σ of token annotations, such
that the time slices x_{t_j} of x are labeled with y_{t_j} = σ_j while the remaining parts
of the label sequence are unlabeled.

[Figure 2: trellis over tags σ_1, . . . , σ_k at time steps t−1, t, t+1 illustrating the constrained Viterbi decoding]

Fig. 2. The constrained Viterbi decoding (emissions are not shown). If time t is annotated
with σ_2, the light edges are removed before decoding to guarantee that the
optimal path passes through σ_2.

To learn from the partially annotated input stream (x_1, z_1), (x_2, z_2), . . ., we
perform a transductive step to extrapolate the fragmentary annotations to the
unlabeled tokens, so that we obtain a reference labeling as a makeshift for the
missing ground-truth. Following the transductive principle, we use a constrained
Viterbi algorithm [5] to decode a pseudo ground-truth y^p for the tuple (x, z),

    y^p = argmax_{ỹ∈Y(x)} w^T φ(x, ỹ)   s.t.   ∀(t, σ) ∈ z : ỹ_t = σ.

The constrained Viterbi decoding guarantees that the optimal path passes through
the already known labels by removing unwanted edges, see Figure 2. Assuming
that a labeled token is at position 1 < t < T, the number of removed edges
is precisely 2(k − 1)k, where k = |Σ|. Algorithmically, the constrained decoding
splits the sequence at each labeled token into two halves, which are then treated
independently of each other in the decoding process.
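A minimal sketch of such a constrained decoder (our own implementation, not the authors' code): transition[a, b] and emission[t, b] hold log-scores, and constraints is the set z of (position, tag) pairs. Masking all but the fixed tag at a constrained position has the same effect as removing the light edges of Fig. 2.

    import numpy as np

    def constrained_viterbi(transition, emission, constraints):
        T, K = emission.shape
        mask = np.zeros((T, K))
        for t, sigma in dict(constraints).items():
            mask[t, :] = -np.inf   # forbid every tag at position t ...
            mask[t, sigma] = 0.0   # ... except the annotated one
        delta = emission[0] + mask[0]
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            # scores[a, b]: best path ending with tag a at t-1 and tag b at t
            scores = delta[:, None] + transition + emission[t] + mask[t]
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0)
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]  # the pseudo ground-truth y^p as tag indices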
Given the pseudo labeling y^p for an observation x, the update rule of the
loss-augmented perceptron can be used to complement the transductive perceptron.
The inner loop of the resulting algorithm is shown in Table 2. Note that
augmenting the loss function into the computation of the argmax (step 2) gives
y^p = ŷ if and only if the implicit loss-rescaled margin criterion is fulfilled for all
alternative output sequences ỹ.
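The three steps of Table 2 then compose directly; a sketch reusing the constrained decoder above and a loss-augmented decoder for step 2 (both helper signatures are ours):

    def transductive_step(w, x, z, phi, constrained_decode, loss_augmented_decode):
        y_p = constrained_decode(x, z, w)          # step 1: pseudo ground-truth
        y_hat = loss_augmented_decode(x, y_p, w)   # step 2: loss-augmented argmax
        return w + phi(x, y_p) - phi(x, y_hat)     # step 3: perceptron update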

Kernelization. Analogously to the regular perceptron algorithm, its transductive
generalization can easily be kernelized. The weight vector at time t is given
by

    w_t = 0 + Σ_{j=1}^{t−1} ( φ(x_j, y_j^p) − φ(x_j, ŷ_j) )    (3)

        = Σ_{(x, y^p, ŷ)} α_x(y^p, ŷ) ( φ(x, y^p) − φ(x, ŷ) )    (4)

Table 2. The transductive perceptron algorithm

Input: Partially labeled example (x, z), model w
1: y^p ← argmax_{ỹ∈Y(x)} w^T φ(x, ỹ)   s.t.   ∀(t, σ) ∈ z : ỹ_t = σ
2: ŷ ← argmax_{ỹ∈Y(x)} [ ℓ_h(y^p, ỹ) + w^T φ(x, ỹ) ]
3: w′ ← w + φ(x, y^p) − φ(x, ŷ)
Output: Updated model w′

with appropriately chosen α's that act as virtual counters, detailing how many
times the prediction ŷ has been decoded instead of the pseudo-output y^p for
an observation x. Thus, the dual perceptron has virtually exponentially many
parameters; however, these are initialized with α_x(y, y′) = 0 for all triplets
(x, y, y′), so that the counters only need to be instantiated once the respective
triplet is actually seen. Using Eq. (4), the decision function depends only on
inner products of joint feature representations, which can then be replaced by
appropriate kernel functions k(x, y, x′, y′) = φ(x, y)^T φ(x′, y′).

Parameterization. Anecdotal evidence shows that unlabeled examples often
harm the learning process when the model is weak, as the unlabeled data outweigh
the labeled part and hinder adaptation to the target concept. A remedy
is to weight the influence of labeled and unlabeled data differently, or to increase
the influence of unlabeled examples during the learning process [13,37].
In our experiments we parameterize the Hamming loss to account for labeled
and unlabeled tokens,

    ℓ_h(y^p, ŷ) = Σ_{t=1}^{|y^p|} λ(z, t) 1_{[y_t^p ≠ ŷ_t]}

where λ(z, t) = λ_L if t is a labeled time slice, that is, (t, ·) ∈ z, and λ(z, t) = λ_U
otherwise. Appropriate values of λ_L and λ_U can be found using cross-validation
or holdout data.
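In code, this weighting is a one-line change to the plain Hamming loss (helper names ours; labeled_positions is the set of time slices t with (t, ·) ∈ z):

    def weighted_hamming(y_p, y_hat, labeled_positions, lam_L, lam_U):
        # lambda_L weights mistakes on labeled slices, lambda_U on unlabeled ones.
        return sum((lam_L if t in labeled_positions else lam_U) * int(a != b)
                   for t, (a, b) in enumerate(zip(y_p, y_hat)))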

Discussion. Trivially, the traditional supervised and semi-supervised counterparts
of the perceptron are obtained as special cases. That is, if |z| = |x| holds for all
examples, we recover the traditional supervised learning setting, and in
case either |z| = |x| or z = ∅ holds, we obtain the standard semi-supervised
setting. This observation allows us to design the experiments in the next section
simply by changing the data, that is, the distribution of the annotations across
tokens, while keeping the algorithm fixed. For supervised and semi-supervised
scenarios, we only need to alter the label distribution so that it gives rise to
either completely labeled or unlabeled sequences.
Using the results by Zinkevich et al. [38], the proposed transductive perceptron
can easily be distributed over several machines. Note that the inner loop of the
algorithm, displayed in Table 2, depends only on the input (x, z) and the actual

model w. Consequently, several models can be trained in parallel on disjoint
subsets of the data. A subsequent merging process aggregates the models, where
each model's impact is proportional to the amount of data it has been trained on.
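A sketch of such a merging step (our own illustration of the idea, not the authors' procedure): per-machine weight vectors are averaged with weights proportional to their share of the training data.

    def merge_models(models, num_examples):
        # models: list of weight vectors; num_examples: training size per machine.
        total = float(sum(num_examples))
        return sum(w * (n / total) for w, n in zip(models, num_examples))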

5 Empirical Results

In this section, we will show that (i) one can effectively learn from partial annota-
tions and that (ii) our approach is superior to the standard semi-supervised setting.
We thus compare the transductive loss-augmented perceptron to its supervised
and semi-supervised counterparts. Experiments with CoNLL data use the orig-
inal splits of the respective corpora into training, holdout, and test set, where
parameters are adjusted on the holdout sets. We report averages over 3 × 4 = 12
training repetitions, involving 3 perceptrons and 4 data sets, to account for
the random effects in the algorithm and the data generation; error bars indicate
the standard error.
Due to the different nature of the algorithms, we need to provide different
ground-truth annotations for the algorithms. While the transductive perceptron
is simply trained on arbitrarily (e.g., partially) labeled sequences, the supervised
baseline needs completely annotated sentences and the semi-supervised percep-
tron allows for the inclusion of additional unlabeled examples. In each setting, we
use the same observation sequences for all methods and only change the distri-
bution of the labels so that it meets the requirements of the respective methods;
however note that the number of labeled tokens is identical for all methods. We
describe the generation of the training sets in greater detail in the following
subsections. All perceptrons are trained for 100 epochs.

5.1 English CoNLL 2003
The first study is based on the CoNLL 2003 shared task [29], an English corpus
that includes annotations of four types of entities: person (PER), organization
(ORG), location (LOC), and miscellaneous (MISC). This corpus is assembled
from Reuters News stories and divided into three parts: 203,621 training, 51,362
development, and 46,435 test tokens.
We first study the impact of the ratio of labeled and unlabeled tokens in a con-
trolled setting. To generate the respective training sets for supervised and semi-
supervised settings, we proceed as follows. For each ratio, we draw sentences at
random until the amount of tokens matches (approximately) the required num-
ber of labeled examples. These sentences are then completely labeled and form
the training set for the supervised perceptron. The semi-supervised perceptron
additionally gets the remaining sentences from the original training set as un-
labeled examples. The partially labeled training data is generated by randomly
removing token annotations from the original training split until the desired ratio
of labeled/unlabeled tokens is obtained. Note that the underlying assumptions
on the annotations are much stronger for the completely annotated data.

[Figure 3: F1 vs. percentage of annotated tokens (10–100%) for the partially labeled, semi-supervised, and supervised settings]

Fig. 3. Results for CoNLL

Figure 3 shows F1 scores for different ratios of labeled and unlabeled tokens.
Although the baselines are more likely to capture transitions well because the
labeled tokens form complete annotations, they are significantly outperformed
by the transductive perceptron in case only 10-50% of the tokens are labeled. For
60-100% all three algorithms perform equally well which is still notable because
the partial annotations are inexpensive and easier to obtain. By contrast, the
semi-supervised perceptron performs worst and is not able to benefit from many
unlabeled examples.
We now study the impact of the amount of additional labeled tokens. In Fig-
ure 4, we fix the amount of completely annotated sentences at 20% (left figure)
and 50% (right), respectively, and vary the amount of additional partially an-
notated tokens. The supervised and semi-supervised baselines are constant as
they cannot deal with the additional data where the semi-supervised perceptron
treats the remaining 80% and 50% of the data as unlabeled sentences. Notice
that the semi-supervised baseline performs poorly; as in the previous experiment,
the additional unlabeled data seemingly harm the training process. Similar ob-
servations have for instance been made by [4,23] and particularly for structural
semi-supervised learning by [37]. By contrast, the transductive perceptron shows
in both figures an increasing performance for the partially labeled setting when
the amount of labeled tokens increases. The gain in predictive accuracy is highest
for settings with only a few completely labeled examples (Figure 4, left).

5.2 Wikipedia – Mono-Lingual Experiment
We now present an experiment using automatically annotated real-world data
extracted from Wikipedia. To show that incorporating partially labeled exam-
ples improves performance, we proceed as follows: The training set consists of
completely labeled sentences which are taken from the English CoNLL data and
partially labeled data that is extracted automatically from Wikipedia. One of

[Figure 4: F1 vs. percentage of additional labels for the partially labeled, semi-supervised, and supervised settings, with 20% (left) and 50% (right) completely labeled examples]

Fig. 4. Varying the amount of additional labeled tokens with 20% (left) and 50% (right)
completely labeled examples

Table 3. An exemplary partially labeled sentence extracted from Wikipedia. The country
Hungary is labeled as a location (LOC) due to the majority vote, while Bukkszek
could not be linked to a tagged article and remains unlabeled.

x = Bukkszek  is  a  small  village  in  the  north  of  Hungary
y =    ?       ?  ?    ?       ?      ?   ?     ?     ?    LOC

Tag counts for the article Hungary: PER 7, LOC 10498, ORG 42, MISC 2288, O 374

the major goals in the data generation is to render human interaction unneces-
sary or at least as low as possible. In the following we briefly describe a simple
way to automatically annotate Wikipedia data using existing resources.
Atserias et al. [3] provide a tagged version of the English Wikipedia that
preserves the link structure. We collect the tagged entities in the text that are
linked to a Wikipedia article. In case the tagged entity does not perfectly match
the hyperlinked text we treat it as untagged. This gives us a distribution of
tags for each Wikipedia article as the tagging is noisy and depends highly on the
context.4 The linked entities referring to Wikipedia articles are now re-annotated
with the most frequent tag of the referenced Wikipedia article. Table 3 shows
an example of an automatically annotated sentence. Words that are not linked
to a Wikipedia article (e.g., small) as well as words corresponding to Wikipedia
articles which have not yet been tagged (e.g., Bukkszek) remain unlabeled.
Table 4 shows some descriptive statistics of the extracted data. Since the
automatically generated data is only partially annotated, the average number of
4 For instance, a school could be tagged either as a location or as an organization, depending on the context.

Table 4. Characteristics of the English data sets

CoNLL Wikipedia
tokens 203,621 1,205,137,774
examples 14,041 58,640,083
tokens per example 14.5 20.55
entities 23,499 22,632,261
entities per example 1.67 0.38
MISC 14.63% 18.17%
PER 28.08% 19.71%
ORG 26.89% 30.98%
LOC 30.38% 31.14%

[Figure 5: F1 vs. number of added Wikipedia tokens, with and without the additional Wikipedia data, for the mono-lingual (left) and cross-lingual (right) experiments]

Fig. 5. Results for mono-lingual (left) and cross-lingual (right) Wikipedia experiments

entities in sentences is much lower compared to that of CoNLL. That is, there are
potentially many unidentified and missed entities in the data. By looking at the
numbers one could assume that particularly persons (PER) are underrepresented
in the Wikipedia data while organizations (ORG) and others (MISC) are slightly
overrepresented. Locations (LOC) are seemingly well captured.
The experimental setup is as follows. We use all sentences contained in the
CoNLL training set as completely labeled examples and add randomly drawn
partially labeled sentences that are automatically extracted from Wikipedia.
Figure 5 (left) shows F1 scores for varying numbers of additional data. The
leftmost point coincides with the supervised perceptron that only processes the
labeled CoNLL data. Adding partially labeled data shows a slight but significant
improvement over the supervised baseline. Interestingly, the observed improve-
ment increases with the number of partially labeled examples although these
come from a different distribution as shown in Table 4.

Table 5. Characteristics of the Spanish data sets

CoNLL Wikipedia
tokens 264,715 257,736,886
examples 8,323 9,500,804
tokens per example 31.81 27.12
entities 18,798 8,520,454
entities per example 2.26 0.89
MISC 11.56% 27.64%
PER 22.99% 23.71%
ORG 39.31% 32.63%
LOC 26.14% 16.02%

5.3 Wikipedia – Cross-Lingual Experiment
This experiment aims at studying whether we could enrich a small data set in
the target language (here: Spanish) by exploiting resources in a source language
(here: English). For the cross-language scenario we use the CoNLL’2002 corpus
[28] for evaluation. The corpus consists of Spanish news wire articles from the
EFE5 news agency and is annotated with four types of entities: person (PER),
organization (ORG), location (LOC), and miscellaneous (MISC). The additional
Wikipedia resource is generated automatically as described in the previous sec-
tion, however, we add an intermediate step to translate the English pages into
Spanish by exploiting Wikipedia language links. The automatic data generation
now consists of the following three steps: (1) Count the tagged entities that are
linked to an English Wikipedia article. (2) Translate the article into Spanish by
using the language links. In case such a link does not exist we ignore the article.
(3) Annotate mentions of the Spanish entity in the Spanish Wikipedia with the
most frequent tag of its English counterpart of step 1.
Table 5 shows some descriptive statistics of the extracted data from the Span-
ish Wikipedia. The number of contained entities is again much lower than in the
CoNLL data. Compared to Table 4, the percentage of persons (PER) matches
that of the Spanish CoNLL, however locations (LOC) and other entities (MISC)
show large deviations. This is probably due to missing language links between
the Wikipedias (the Spanish Wikipedia is much smaller than the one for English)
and caused by differences in the respective languages.
Our experimental setup is identical to that of the previous section, except
that we now use the training set of the Spanish CoNLL together with the au-
tomatically extracted data from the Spanish Wikipedia. Figure 5 (right) shows
the results. A relatively small number of additional partially labeled examples
does not have an impact on the performance of the transductive perceptron. We
credit this finding to noisy and probably weak annotations caused by the lan-
guage transfer. However, when we add more than 6 million automatically labeled
tokens, the generalized problem setting pays off and the performance increases,
slightly, but significantly.
5 http://efe.com/

5.4 Execution Time

Figure 6 reports on execution times on an Intel(R) Core(TM)2 Duo CPU (E8400
model) with 3.00 GHz and 6 MB cache memory. We use the same experimental
setup as for Figure 4 (left), that is, we use 20% of the English CoNLL sequences
as completely labeled examples and vary the number of additional annotations
on the remaining tokens. The figure shows that the execution time decreases for
an increasing number of labels, because the decoding becomes less expensive, until it
reaches that of the standard loss-augmented perceptron trained on the
completely labeled training set of the CoNLL data. The results also hold for the
Wikipedia experiments: using an additional 0.1% of the English Wikipedia (which
is about 5 times the size of the CoNLL training set) takes about 18 minutes. In
sum, we observe a linearly growing execution time in the size of the corpus for a
fixed ratio of labeled/unlabeled tokens.

[Figure 6: training time (min) vs. percentage of additional labels]

Fig. 6. Execution time

6 Conclusion
In this paper, we showed that surprisingly simple methods, such as the devised
transductive perceptron, allow for learning from sparse and partial labelings.
Our empirical findings show that a few, randomly distributed labels often lead to
better models than the standard supervised and semi-supervised settings based
on completely labeled ground-truth; the transductive perceptron was observed
to be always better than or on par with its counterparts trained on the same amount
of labeled data. Immediate consequences arise for the data collection: while the
standard semi-supervised approach requires completely labeled editorial data, we
can effectively learn from partial annotations that have been generated automatically
and without manual interaction; using additional, automatically labeled
data from Wikipedia led to a significant increase in performance in mono- and
cross-lingual named entity recognition tasks. We emphasize that these improvements
come at practically no additional labeling cost at all.

Future work will extend our study towards larger scales. It will certainly be
of interest to extend the empirical evaluation to other sequential tasks and output
structures. As the developed transductive perceptron is a relatively simple algorithm,
more sophisticated ways of dealing with partially labeled data are also
interesting research areas.

Acknowledgments. We thank Jordi Atserias for helping us to generate the
Wikipedia data. This work was partially funded by the Coordenação de
Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil.

References
1. Altun, Y., McAllester, D., Belkin, M.: Maximum margin semi–supervised learning
for structured variables. In: Advances in Neural Information Processing Systems
(2006)
2. Altun, Y., Tsochantaridis, I., Hofmann, T.: Hidden Markov support vector ma-
chines. In: Proceedings of the International Conference on Machine Learning
(2003)
3. Atserias, J., Zaragoza, H., Ciaramita, M., Attardi, G.: Semantically annotated
snapshot of the english wikipedia. In: European Language Resources Association
(ELRA), editor, Proceedings of the Sixth International Language Resources and
Evaluation (LREC 2008), Marrakech, Morocco (May 2008)
4. Baluja, S.: Probabilistic modeling for face orientation discrimination: Learning
from labeled and unlabeled data. In: Advances in Neural Information Processing
Systems (1998)
5. Cao, L., Chen, C.W.: A novel product coding and recurrent alternate decoding
scheme for image transmission over noisy channels. IEEE Transactions on Com-
munications 51(9), 1426–1431 (2003)
6. Chapelle, O., Schölkopf, B., Zien, A.: Semi–supervised Learning. MIT Press, Cam-
bridge (2006)
7. Collins, M.: Discriminative reranking for natural language processing. In: Pro-
ceedings of the International Conference on Machine Learning (2000)
8. Collins, M.: Ranking algorithms for named-entity extraction: Boosting and the
voted perceptron. In: Proceedings of the Annual Meeting of the Association for
Computational Linguistics (2002)
9. Dietterich, T.G.: Machine learning for sequential data: A review. In: Proceedings
of the Joint IAPR International Workshop on Structural, Syntactic, and Statisti-
cal Pattern Recognition (2002)
10. Do, T.-M.-T., Artieres, T.: Large margin training for hidden Markov models with
partially observed states. In: Proceedings of the International Conference on Ma-
chine Learning (2009)
11. Forney, G.D.: The Viterbi algorithm. Proceedings of IEEE 61(3), 268–278 (1973)
12. Hammersley, J.M., Clifford, P.E.: Markov random fields on finite graphs and lat-
tices (1971) (unpublished manuscript)
13. Joachims, T.: Transductive inference for text classification using support vector
machines. In: Proceedings of the International Conference on Machine Learning
(1999)
14. Juang, B., Rabiner, L.: Hidden Markov models for speech recognition. Techno-
metrics 33, 251–272 (1991)
Learning from Partially Annotated Sequences 421

15. King, T.H., Dipper, S., Frank, A., Kuhn, J., Maxwell, J.: Ambiguity management
in grammar writing. In: Proceedings of the ESSLLI 2000 Workshop on Linguistic
Theory and Grammar Implementation (2000)
16. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic
models for segmenting and labeling sequence data. In: Proceedings of the Inter-
national Conference on Machine Learning (2001)
17. Lafferty, J., Zhu, X., Liu, Y.: Kernel conditional random fields: representation
and clique selection. In: Proceedings of the International Conference on Machine
Learning (2004)
18. Lee, C., Wang, S., Jiao, F., Greiner, R., Schuurmans, D.: Learning to model
spatial dependency: Semi-supervised discriminative random fields. In: Advances
in Neural Information Processing Systems (2007)
19. McAllester, D., Hazan, T., Keshet, J.: Direct loss minimization for structured
perceptrons. In: Advances in Neural Information Processing Systems (2011)
20. McCallum, A., Freitag, D., Pereira, F.: Maximum entropy Markov models for
information extraction and segmentation. In: Proceedings of the International
Conference on Machine Learning (2000)
21. Mika, P., Ciaramita, M., Zaragoza, H., Atserias, J.: Learning to tag and tagging
to learn: A case study on wikipedia. IEEE Intelligent Systems 23, 26–33 (2008)
22. Mukherjee, S., Ramakrishnan, I.V.: Taming the unstructured: Creating structured
content from partially labeled schematic text sequences. In: Chung, S. (ed.) OTM
2004. LNCS, vol. 3291, pp. 909–926. Springer, Heidelberg (2004)
23. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Text classification from
labeled and unlabeled documents using EM. Machine Learning 39(2/3), 103–134
(2000)
24. Nothman, J., Murphy, T., Curran, J.R.: Analysing wikipedia and gold-standard
corpora for ner training. In: EACL 2009: Proceedings of the 12th Conference
of the European Chapter of the Association for Computational Linguistics, pp.
612–620. Association for Computational Linguistics, Morristown (2009)
25. Novikoff, A.B.: On convergence proofs on perceptrons. In: Proceedings of the
Symposium on the Mathematical Theory of Automata (1962)
26. Rabiner, L.: A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE 77(2), 257–285 (1989)
27. Richman, A.E., Schone, P.: Mining wiki resources for multilingual named entity
recognition. In: Proceedings of ACL 2008: HLT, pp. 1–9. Association for Compu-
tational Linguistics, Columbus (2008)
28. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: language-
independent named entity recognition. In: COLING-2002: Proceedings of the 6th
Conference on Natural Language Learning, pp. 1–4. Association for Computa-
tional Linguistics, Morristown (2002)
29. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared
task: Language-independent named entity recognition. In: Proceedings of CoNLL
2003, pp. 142–147 (2003)
30. Scheffer, T., Wrobel, S.: Active hidden Markov models for information extrac-
tion. In: Proceedings of the International Symposium on Intelligent Data Analysis
(2001)
31. Schwarz, R., Chow, Y.L.: The n-best algorithm: An efficient and exact procedure
for finding the n most likely hypotheses. In: Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (1990)
32. Taskar, B., Guestrin, C., Koller, D.: Max–margin Markov networks. In: Advances
in Neural Information Processing Systems (2004)
422 E.R. Fernandes and U. Brefeld

33. Truyen, T.T., Bui, H.H., Phung, D.Q., Venkatesh, S.: Learning discriminative
sequence models from partially labelled data for activity recognition. In: Pro-
ceedings of the Pacific Rim International Conference on Artificial Intelligence
(2008)
34. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods
for structured and interdependent output variables. Journal of Machine Learning
Research 6, 1453–1484 (2005)
35. Xu, L., Wilkinson, D., Southey, F., Schuurmans, D.: Discriminative unsupervised
learning of structured predictors. In: Proceedings of the International Conference
on Machine Learning (2006)
36. Yu, C.-N., Joachims, T.: Learning structural svms with latent variables. In: Pro-
ceedings of the International Conference on Machine Learning (2009)
37. Zien, A., Brefeld, U., Scheffer, T.: Transductive support vector machines for
structured variables. In: Proceedings of the International Conference on Machine
Learning (2007)
38. Zinkevich, M., Weimer, M., Smola, A., Li, L.: Parallelized stochastic gradient
descent. In: Advances in Neural Information Processing Systems, vol. 23 (2011)
The Minimum Transfer Cost Principle for
Model-Order Selection

Mario Frank, Morteza Haghir Chehreghani, and Joachim M. Buhmann

Department of Computer Science, ETH Zurich, Switzerland

Abstract. The goal of model-order selection is to select a model variant that
generalizes best from training data to unseen test data. In unsupervised learn-
ing without any labels, the computation of the generalization error of a solution
poses a conceptual problem which we address in this paper. We formulate the
principle of “minimum transfer costs” for model-order selection. This principle
renders the concept of cross-validation applicable to unsupervised learning prob-
lems. As a substitute for labels, we introduce a mapping between objects of the
training set to objects of the test set enabling the transfer of training solutions.
Our method is explained and investigated by applying it to well-known problems
such as singular-value decomposition, correlation clustering, Gaussian mixture-
models, and k-means clustering. Our principle finds the optimal model complex-
ity in controlled experiments and in real-world problems such as image denoising,
role mining and detection of misconfigurations in access-control data.

Keywords: clustering, generalization error, transfer costs, cross-validation.

1 Introduction
Clustering and dimensionality reduction are highly valuable concepts for exploratory
data analysis that are frequently used in many applications for pattern-recognition, vi-
sion, data mining, and other fields. Both problem domains require specifying the com-
plexity of solutions. When partitioning a set of objects into clusters, we must select an
appropriate number of clusters. Learning a low-dimensional representation of a set of
objects, for example by learning a dictionary, involves choosing the number of atoms
or codewords in the dictionary. More generally speaking, learning the parameters of a
model given some measurements requires selecting the number of parameters, i.e. one
must select the model-order.
In this paper we address the general issue of model-order selection for unsuper-
vised learning problems and we develop and advocate the principle of minimal transfer
costs (MTC). Our method generalizes classical cross-validation known from supervised
learning. It is applicable to a broad class of model-order selection problems even when
no labels or target values are given. In essence, MTC can be applied whenever a cost
function is defined. The MTC principle can be easily explained in abstract terms: A
good choice of the model-order based on a given dataset should also yield low costs on
a second dataset from the same distribution. We learn models of various model-orders
from a given dataset X(1) . These models with their respective parameters are then used
to interpret a second data set X(2) , i.e., to compute its costs. The principle selects the


model-order that achieves lowest transfer cost, i.e. the solution that generalizes best to
the second dataset. Too simple models underfit and achieve high costs on both datasets;
too complex models overfit to the fluctuations of X(1) which results in high costs on
X(2) where the fluctuations are different.
The conceptually challenging part of this procedure is related to the transfer of the
solution inferred from the objects of the first dataset to the objects of the second dataset.
This transfer requires a mapping function which generalizes the conceptually straight-
forward assignments in supervised learning. For several applications, we demonstrate
how to map two datasets to each other when no labels are given.
Our main contribution is to propose and describe the minimum transfer cost principle
(MTC) and to demonstrate its broad applicability on a set of different applications. We
select well-known methods such as singular-value decomposition (SVD), maximum-likelihood
inference, k-means, Gaussian mixture models, and correlation clustering, because
the understandability of our principle should not be limited by long explanations of the
complicated models it is applied to. In our real-world applications image denoising,
role mining, and error detection in access-control configurations we pursue the goal to
investigate the reliability of the model order selection scheme, i.e. whether for a prede-
termined method (such as SVD), our principle finds the model-order that performs best
on a second test data set.
In the remainder of the paper we first explain the principle of minimal transfer costs
and we address the conceptual question of how to map a trained model to a previously
unseen dataset. In the following Sections 3, 6, 7 we invoke the MTC principle to select
a plausible (“true”) number of centroids for the widely used Gaussian mixture model,
the optimal number of clusters for correlation clustering and for the k-means algorithm.
In Section 4, we apply MTC to SVD for image denoising and detecting errors in access-
control configurations. In Section 5, we use MTC for selecting the number of factors
for Boolean matrix factorization on role mining data.

2 Minimum Transfer Costs
2.1 Notational Preliminaries
Let O be a set of N objects with corresponding measurements. The measurements can
be characterized in two ways: (i) objects can be identified with the measurements,
and we can use the terms synonymously, i.e., the i-th object is described by the vector
x_i; (ii) measurements are pairwise (dis)similarities between objects. In the first case,
the objects O are directly characterized by the measurements X = {x_1, . . . , x_N} ∈ X.
In the second case, a graph G(O, X) with (dis)similarity measurements X := {X_ij}
characterizes the relations for all pairs of objects (i, j), 1 ≤ i ≤ N, 1 ≤ j ≤ N.
Furthermore, let {O(1), X(1)} and {O(2), X(2)} be two datasets from a common
source. Often in practical situations, only one such dataset is available; then we
randomly partition it into X(1) and X(2).
A data model is usually characterized as an optimization problem with an associated
cost function. We denote the potential outcome of optimizing a cost function by the term
solution. A cost function R(s, X, k) quantifies how well a particular solution s ∈ S
explains the measurements X. For parametric models, the solution includes a set of

model parameters which are learned through an inference procedure. The number k
quantifies the number of model parameters and thereby identifies the model order. In
clustering, for instance, k would be the number of clusters of the solution s(X).

2.2 Minimum Transfer Costs

A cost function imposes a partial order on all possible solutions given the data. Since
usually the measurements are contaminated by noise, one aims at finding solutions that
are robust against the noise fluctuations and thus generalize well to future data. Learning
theory demands that a well-regularized model explains not only the dataset at hand,
but also new datasets generated from the same source and thus drawn from the same
probability distribution.
Let s(1) be the solution (e.g., model parameters) learned from a given set of objects
O(1) = {i : 1 ≤ i ≤ N_1} and the corresponding measurements X(1). Let the set
O(2) = {i′ : 1 ≤ i′ ≤ N_2} represent the objects of a second dataset X(2) drawn from
the same distribution as X(1) . In a supervised learning scenario, the given class labels of
both datasets guide a natural and straightforward mapping of the trained solution from
the first to the second dataset: the model should assign objects of both sets with same
labels to the same classes. However, when no labels are available, it is unclear how to
transfer a solution. To enable the use of cross-validation, we propose to compute the
costs of a learned solution on a new dataset in the following way. We start with defining
    ψ : O(2) × X × X → O(1),   (i′, X(1), X(2)) ↦ ψ(i′, X(1), X(2))    (1)

This mapping function aligns each object from the second dataset with its nearest neigh-
bor in O(1) . We have to compute such a mapping in order to transfer a solution. Let’s
assume, for the moment, that the given model is a sum over independent partial costs
    R(s, X, k) = Σ_{i=1}^{N} R_i(s(i), x_i, k).    (2)

R_i(s(i), x_i, k) denotes the partial costs of object i and s(i) denotes the structure part of
the solution that relates to object i. For a parametric centroid-based clustering model,
s(i) would be the centroid that object i is assigned to. Using the object-wise mapping
function ψ to map objects i′ ∈ O(2) to objects in O(1), we define the transfer costs
R^T(s(1), X(2), k) of a solution s with model-order k as follows:

    R^T(s(1), X(2), k) := (1/N_2) Σ_{i′=1}^{N_2} Σ_{i=1}^{N_1} R_i(s(1)(i), x_{i′}^{(2)}, k) · I_{{ψ(i′, X(1), X(2)) = i}}.    (3)

For each object i′ ∈ O(2) we compute the costs of i′ with respect to the learned solution
s(X(1)). The mapping function ψ(i′, X(1), X(2)) ensures that the cost function treats the
measurement x_{i′}^{(2)} with i′ ∈ O(2) as if it were the object i ≡ ψ(i′, X(1), X(2)) ∈ O(1).
In the limit of many observations N_2, the transfer costs converge to E[R(s(1), X, k)],

the expected costs of the solution s(1) with respect to the probability distribution of the
measurements. Minimizing this quantity, with respect to the solution is what we are
ultimately interested in. The minimum transfer cost principle (MTC) selects the model-
order $k$ with the lowest transfer costs. MTC disqualifies models of too high complexity that perfectly explain $\mathbf{X}^{(1)}$ but fail to fit $\mathbf{X}^{(2)}$ (overfitting), as well as models of too low complexity which explain both of them insufficiently (underfitting).
We would like to emphasize the relation of our method to cross-validation in su-
pervised learning which is frequently used in classification or regression. In supervised
learning a model is trained on a set of given observations X(1) and labels (or output
variables) $\mathbf{y}^{(1)}$. Usually, we assume i.i.d. training and test data in classification and, therefore, the transfer problem disappears.

A variant and a special case of the mapping function: In the following, we will describe
two other mapping variants. In many problems such as clustering, a solution is a set
of structures where the objects inside a structure are statistically indistinguishable by
the algorithm. Therefore, the objects O(2) can directly be mapped to the structures
inferred from X(1) rather than to individual objects, since the objects in each structure
are unidentifiable. In this way, the mapping function assigns the objects O(2) to the
solution $s(\mathbf{X}^{(1)}) \in \mathcal{S}$:

$$\psi^s : O^{(2)} \times \mathcal{S} \times \mathcal{X} \to \mathcal{S}(O^{(1)}), \quad \left(i', s(\mathbf{X}^{(1)}), \mathbf{X}^{(2)}\right) \mapsto \psi\!\left(i', s(\mathbf{X}^{(1)}), \mathbf{X}^{(2)}\right). \qquad (4)$$
The generative mapping, another variant of the ψ function, is obtained in a natural way
by data construction. Given the true model parameters, we randomly sample pairs of
data items. This gives the identity mapping between the pairs in O(1) and O(2) and can
be used whenever the data is artificially generated.

$$\psi^G : O^{(2)} \to O^{(1)}, \quad i' \mapsto \psi(i') = i'. \qquad (5)$$
In practice, however, the data is usually generated in an unknown way. One has a single dataset $\mathbf{X}$ and subdivides it (possibly multiple times) into random subsets $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}$ which are not necessarily of equal cardinality. The nearest-neighbor mapping is obtained by assigning each object $i' \in O^{(2)}$ to the structure or object for which the cost $R(s, \mathbf{X}(O^{(1)} \cup i'), k)$ is minimized. In the cases where multiple objects or structures satisfy this condition, $i'$ is randomly assigned to one of them.

3 The Easy Case: Gaussian Mixture Models


We start with mixtures of Gaussians (GMM). We will see that for this model, the trans-
fer of the learned solution to a second dataset is straightforward and requires no partic-
ular mapping function. This case is still a good example to start with as it demonstrates
that cross-validation for unsupervised learning is a powerful technique that can compete
with well known model-selection scores such as BIC and AIC.
A GMM solution consists of the centers $\boldsymbol{\mu}_t$ and the covariances $\boldsymbol{\Sigma}_t$ of the Gaussians,
as well as the mixing coefficients πt . The model order is the number of Gaussians k and
the cost function is the negative log likelihood of the model
$$R(\boldsymbol{\mu}, \boldsymbol{\Sigma}, \mathbf{X}, k) = -\sum_{i=1}^{N} \ln\left( \sum_{t=1}^{k} \pi_t \, \mathcal{N}(x_i \,|\, \boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t) \right) \qquad (6)$$
As all model parameters are independent of the object index $i$, it is straightforward to compute the transfer costs on a second dataset. The learned model parameters provide a probability density estimate for the entire measurement space such that the individual likelihood of each new data item can be readily computed. The transfer costs are $R^T(s^{(1)}, \mathbf{X}^{(2)}, k) = R(\boldsymbol{\mu}^{(1)}, \boldsymbol{\Sigma}^{(1)}, \mathbf{X}^{(2)}, k)$.
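To make this concrete, the following minimal sketch (our illustration, not the authors' code; the helper name select_k_gmm and the candidate range k_max are our own choices) compares BIC, AIC, and the GMM transfer costs using scikit-learn, assuming X1 and X2 are the two halves of a random split:

import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_gmm(X1, X2, k_max=10):
    scores = {}
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=5).fit(X1)
        # transfer costs: the negative log-likelihood, Eq. (6),
        # of the hold-out half under the model learned from X1
        mtc = -gmm.score(X2) * len(X2)
        scores[k] = {"BIC": gmm.bic(X1), "AIC": gmm.aic(X1), "MTC": mtc}
    best_k = min(scores, key=lambda k: scores[k]["MTC"])
    return best_k, scores

Because the GMM yields a density over the whole measurement space, no explicit mapping ψ is needed here; the transfer cost is simply the negative log-likelihood of the hold-out half.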
We carry out experiments by generating 500 items from three Gaussians. As we increase their variances to increase their overlap, we learn GMMs with a varying number of Gaussians $k$ and compute the BIC score, the AIC score, as well as the transfer costs. Two exemplary results are illustrated in Figure 1: an easy setting in the upper row and a difficult setting with high overlap in the lower row. In the easy case, each of the methods selects the correct number of clusters. For increasing overlap, AIC exhibits a tendency to select too many components. At the variance depicted in the lower plots, BIC starts selecting $k < 3$, while MTC still estimates 3 Gaussians. For very high overlap, we observe that both BIC and MTC select $k = 1$ while AIC selects the maximum number of Gaussians that we offered. The interval of the standard deviation where BIC selects a lower number of Gaussians than MTC ranges from 60% of the distance between the centers (illustrated in Figure 1, bottom) to 85%. The reason for this discrepancy remains to be theoretically explored. The gap might be due to MTC being less accurate than the BIC score, which is exact in the asymptotic limit of many observations; alternatively, BIC might underfit due to non-asymptotic corrections. Visual inspection of the data suggests that this discrepancy regime poses a hard model-order selection problem.

4 Model Order Selection for Truncated SVD


4.1 Image Denoising with Rank-Limited SVD
SVD provides a powerful, yet simple method of denoising images. Given a noisy image,
one extracts small n × m patches from the image (where usually m = n) and computes
a rank-limited SVD on the matrix X containing the ensemble of all patches, i.e. the
pixel values of one patch are one row in X. SVD provides a dictionary that describes
the image content on a local level. Restricting the rank of the decomposition, the image
content is approximated and, hopefully, denoised. SVD has been frequently applied to
image denoising in the described way or as part of more sophisticated methods (e.g.
[8]). Thereby, selecting the rank of the decomposition poses a crucial modeling choice.
In [8], for instance, the rank is selected by experience of the authors and the issue of
automatic selection is shifted to further research. Here, we address this specific part of
the problem. The task is to select the rank of the SVD decomposition such that the de-
noised image is closest to the noise-free image. Please note that our goal is not primarily
to achieve the very best denoising error given an image (clearly, better image denoising
techniques than SVD exist). Therefore, we do not optimize on other parameters such as
the size of the patches. The main goal is to demonstrate that MTC selects the optimal
rank for a defined task, such as image denoising, conditioned on a predefined method.
Fig. 1. Selecting the number of Gaussians k. Data is generated from 3 Gaussians. Going from the
upper to the lower row, their overlap is increased. For very high overlap, BIC and MTC select
k = 1. The lower row illustrates the smallest overlap where BIC selects k < 3.

We extract N = 4096 patches of size D = 8 × 8 from the image and arrange each of
them in one row of a matrix X. We randomly split this matrix along the rows into two
sub-matrices $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ and select the rank $k$ that minimizes the transfer costs

$$R^T(s, \mathbf{X}, k) = \frac{1}{N_2} \left\| \psi_{NN}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}) \circ \mathbf{X}^{(2)} - \mathbf{U}_k^{(1)} \mathbf{S}_k^{(1)} \mathbf{V}_k^{(1)\,T} \right\|_2^2. \qquad (7)$$

The mapping ψN N (X(1) , X(2) ) reindexes all objects of the test set with the indices of
their nearest neighbors in the training set. We illustrate the results for the Lenna image
in Figure 2 by color-coding the peak-SNR of the image reconstruction. As one can
see, there is a crest ranging from a low standard deviation of the added Gaussian noise
and maximal rank (k = 64) down to the region with high noise and low optimal rank
(k = 1). The top of the crest marks the optimal rank for given noise (dashed magenta
line). The rank selected by MTC is highlighted by the solid black line (dashed lines
are three times the standard deviation). The selected rank is always very close to the
optimum. At low noise where the crest is rather broad, the deviation from the optimum
is maximal. There the selection problem is most difficult. However, in this parameter
range the choice of the rank has little influence on the error. For high noise, where a
deviation from the optimum has higher influence, our method finds the optimal rank.
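The following sketch (our illustration; the function name and the rank range are ours) computes the transfer costs of Eq. (7) for patch matrices X1 and X2, assuming rows are patches, using the nearest-neighbor mapping:

import numpy as np
from scipy.spatial.distance import cdist

def svd_transfer_cost(X1, X2, k):
    U, S, Vt = np.linalg.svd(X1, full_matrices=False)
    recon = U[:, :k] * S[:k] @ Vt[:k]     # rank-k approximation of the training patches
    nn = cdist(X2, X1).argmin(axis=1)     # nearest-neighbor mapping psi_NN
    # compare each test patch with the reconstruction of its nearest training patch
    return ((X2 - recon[nn]) ** 2).sum() / len(X2)

# rank selection for 8x8 patches (D = 64):
# best_k = min(range(1, 65), key=lambda k: svd_transfer_cost(X1, X2, k))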

4.2 Denoising Boolean Matrices with SVD


In this section, we investigate how well the appropriate rank of SVD is found in a
model-mismatch situation. Here, we apply SVD to Boolean data, namely to Boolean
Fig. 2. PSNR (logarithmic) of the denoised image as a function of the added noise and the rank
of the SVD approximation of the image patches. The crest of this error marks the optimal rank at
a given noise level and is highlighted (dashed magenta). The rank selected by MTC (solid black)
is close to this optimum.

access-control configurations. Such a configuration indicates which user has the permis-
sion to access which resources and it is encoded in a Boolean matrix X, where a 1-entry
means that the permission is granted to the user. In practice, a given user-permission as-
signment is often noisy, meaning that some individual user-permission assignments do
not correspond to the regularities of the data and should thus be regarded as excep-
tions or might even be errors. Such irregularities pose not only a security-relevant risk
but they also constitute a problem when such direct access control systems are to be
migrated to role based access control (RBAC) via so-called role mining methods [14].
As most existing role mining methods today are very sensitive to noise [9], they could
benefit a lot from denoising as a preprocessing step. In [16], SVD and other contin-
uous factorization techniques for denoising X are proposed. Molloy et al. compute a
rank-k approximation Uk Sk VkT of X. Then, a function g maps all individual entries
higher than 0.5 to 1 and the others to 0. The distance of the resulting denoised matrix
X̃k = g(Uk Sk VkT ) to the error-free matrix X∗ depends heavily on k. The authors pro-
pose two methods for selecting the rank k. The first method takes the minimal rank such
that the approximation X̃k covers 80% of the entries of X (this heuristic originates from
the rule of thumb that 20% of the entries of $\mathbf{X}$ are corrupted). The second method selects the smallest rank that decreases the relative approximation increment $\|\tilde{\mathbf{X}}_k - \tilde{\mathbf{X}}_{k+1}\|_1 / \|\mathbf{X}\|_1$ below 0.001.
We also compare with the rank selected by the Bi-crossvalidation method for SVD
presented by Owen and Perry [19]. This method, which we will term OP-CV, divides
the n × d input matrix X1:n,1:d into four submatrices, X1:p,1:q , X1:p,q+1:d , Xp+1:n,1:q ,
and $\mathbf{X}_{p+1:n,q+1:d}$ with $p < n$ and $q < d$. Let $\mathbf{M}^{\dagger}$ be the Moore–Penrose inverse of the matrix $\mathbf{M}$. OP-CV learns the truncated SVD $\hat{\mathbf{X}}^{(k)}_{p+1:n,q+1:d}$ from $\mathbf{X}_{p+1:n,q+1:d}$ and
Fig. 3. Denoising four different access-control configurations via rank-limited SVD. The ranks
selected by transfer costs and OP-CV are significantly closer to the optimal rank than the ranks
selected by the originally proposed methods [16].

computes the error score $\epsilon = \left\| \mathbf{X}_{1:p,1:q} - \mathbf{X}_{1:p,q+1:d} \left(\hat{\mathbf{X}}^{(k)}_{p+1:n,q+1:d}\right)^{\dagger} \mathbf{X}_{p+1:n,1:q} \right\|$. In our experiments, we compute $\epsilon$ for 20 permutations of the input matrix and select the rank with lowest median error.
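A minimal sketch of one such bi-cross-validation fold, under our reading of the description above (the squared Frobenius norm and the block sizes p, q are our choices; the row and column permutations would be drawn outside this function):

import numpy as np

def opcv_error(X, k, p, q):
    A, B = X[:p, :q], X[:p, q:]          # held-out block and its row neighbours
    C, D = X[p:, :q], X[p:, q:]          # column neighbours and the training block
    U, S, Vt = np.linalg.svd(D, full_matrices=False)
    D_k = U[:, :k] * S[:k] @ Vt[:k]      # truncated SVD of the training block
    return np.linalg.norm(A - B @ np.linalg.pinv(D_k) @ C) ** 2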
We compare the rank selected by the described approaches to the rank selected
by MTC with nearest-neighbor mapping and Hamming distance. The four different
datasets are taken from [16]. The first dataset ’University’ is the access control configu-
ration of a department, the other three are artificially created, each with differing gener-
ation processes as described in [16]. The sizes of the datasets are (users×permissions)
493 × 56, 500 × 347, 500 × 101, and 500 × 190. We display the results in Figure 3.
The optimal rank for denoising is plotted as a big red square. The statistics of the rank selected by MTC are plotted as small bounded squares; we select the median over 20 random splits of the dataset. As one can see, the minimum transfer cost rank is always significantly closer to the optimal rank than the ranks selected by the originally proposed methods. The performance of the 80-20 rule is very poor, and the performance of the increment threshold depends a lot on the dataset. The Bi-crossvalidation method by Owen and Perry (OP-CV) finds good ranks, though not as reliably as MTC. It has
been reported that, for smaller validation sets, OP-CV tends to overfit. We could observe
this effect in some of our experiments and also on the University dataset. However, on
the Tree dataset it is actually the method with the larger validation set that overfits.

5 Minimum Transfer Costs for Boolean Matrix Factorization

In this section, we use MTC to select the number of factors in Boolean matrix factor-
ization for role mining [14]. A real-world access-control matrix X with 3000 users and
Fig. 4. Model-order selection for Boolean matrix factorization

500 permissions defines the data set for role mining applications. We factorize this user-
permission matrix into a user-role assignment matrix Z and a user-permission assign-
ment matrix U by maximizing the likelihood derived in [22]. Five-fold cross-validation
is performed with 2400 users in the training set and 600 users in the test set. As in the
last section, the mapping function uses the nearest-neighbor rule with Hamming metric.
Here, the MTC score in Eq. (3) measures the number of bits in $x_{i'}^{(2)}$ that do not match the decomposition: $R_i(s^{(1)}(i), x_{i'}^{(2)}, k) = \sum_j \big| x_{i'j}^{(2)} - \bigvee_{t=1}^{k} (z_{it}^{(1)} \wedge u_{tj}^{(1)}) \big|$. This measure differs from the other experiments, where the cost function for optimization on the training data and the cost function for MTC are equal. MTC applies any desired cost function to the hold-out dataset.
The number of factors with the best generalization ability is $k = 248$. In the underfitting regime, the transfer costs have low variance because the structure in the data is the same across all random validation sets. In the overfitting regime, the transfer costs vary significantly, as the noisy bits in the validation set determine how well the overfitted model matches the data.
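As a sketch, the per-object mismatch cost and its transfer via nearest-neighbor Hamming mapping could be computed as follows (the OR-of-ANDs reconstruction is our reading of the Boolean decomposition; Z and U are the learned 0/1 factor matrices):

import numpy as np
from scipy.spatial.distance import cdist

def boolean_transfer_cost(X1, X2, Z, U):
    recon = (Z @ U > 0).astype(int)                      # OR over factors of (z_it AND u_tj)
    nn = cdist(X2, X1, metric="hamming").argmin(axis=1)  # Hamming nearest neighbors
    return np.abs(X2 - recon[nn]).sum() / len(X2)        # mismatched bits per test object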

6 Minimum Transfer Costs for Non-factorial Models

The representation of the measurements plays an important role for optimization. In


parametric or central clustering, the cost function can be written as a sum over independent object-wise costs $R_i(s, x_i, k)$ as shown in Eq. (2). When the measurements are characterized by pairwise (dis)similarities instead of explicit coordinates, such a functional form of the costs as in Eqs. (2) and (3) does not exist.
cost function for correlation clustering [3]. In the following, we explain how to obtain
the transfer costs for such models.
Correlation clustering partitions a graph with positive and negative edge labels. Given a graph $G(O, \mathbf{X})$ with similarity matrix $\mathbf{X} := \{X_{ij}\} \in \{\pm 1\}^{\binom{N}{2}}$ between objects $i$ and $j$ and a clustering solution $s$, the set of edges between two clusters $u$ and $v$ is defined as $E_{u,v} = \{(i,j) \in E : s(i) = u \wedge s(j) = v\}$, where $s(i)$ is the cluster index of object $i$. $E_{u,v}$, $v \neq u$, are inter-cluster edges and $E_{u,u}$ are intra-cluster edges. Given the noise parameter $p$ and the complexity parameter $q$, the correlation graph is generated in the following way (a sketch follows the list):
Fig. 5. Transfer costs and instability for various noise levels p: (a) p = 0.70, (b) p = 0.80, (c) p = 0.95. The complexity q is kept fixed at 0.30.

1. Construct a perfect graph, i.e. assign the weight +1 to all intra-cluster edges and −1 to all inter-cluster edges.
2. Change the weight of each inter-cluster edge in $E_{u,v}$, $v \neq u$, to +1 with probability $q$, increasing the structure complexity.
3. With probability $p$, replace the weight of each edge ($E_{u,v}$, $v \neq u$, and $E_{u,u}$) by a random weight.
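A sketch of this generation recipe (the cluster sizes and the uniform ±1 convention for "random weight" are our assumptions):

import numpy as np

def make_correlation_graph(sizes, p, q, rng=np.random.default_rng(0)):
    labels = np.repeat(np.arange(len(sizes)), sizes)
    X = np.where(labels[:, None] == labels[None, :], 1, -1)  # step 1: perfect graph
    inter = labels[:, None] != labels[None, :]
    X[inter & (rng.random(X.shape) < q)] = 1                 # step 2: complexity q
    noisy = rng.random(X.shape) < p
    X[noisy] = rng.choice([-1, 1], size=noisy.sum())         # step 3: noise p
    upper = np.triu(X, 1)
    return upper + upper.T, labels                           # symmetric, zero diagonal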

Let $N$ and $k$ be the number of objects and the number of clusters, respectively. The cost function counts the number of disagreements, i.e. the number of negative intra-cluster edges plus the number of positive inter-cluster edges:

$$R(s, \mathbf{X}, k) = -\frac{1}{2} \sum_{1 \le u \le k} \; \sum_{(i,j) \in E_{u,u}} (X_{ij} - 1) \; + \; \frac{1}{2} \sum_{1 \le u \le k} \; \sum_{1 \le v < u} \; \sum_{(i,j) \in E_{u,v}} (X_{ij} + 1). \qquad (8)$$
To transfer the clustering solution $s^{(1)}$ to the second dataset $\mathbf{X}^{(2)}$, we use the Hamming distances between objects $i'$ from $O^{(2)}$ and the clusters inferred from $\mathbf{X}^{(1)}$. The cluster index of object $i'$ is determined by:

$$s^{(1)}(i') = \arg\min_{1 \le v \le k} H(i', s_v^{(1)}), \quad \text{with} \qquad (9)$$

$$H(i', s_v^{(1)}) = -\frac{1}{2} \sum_{j \in s_v} (X_{i'j} - 1) + \frac{1}{2} \sum_{1 \le u \le k,\, u \neq v} \; \sum_{j \in s_u} (X_{i'j} + 1), \qquad (10)$$

where $s_v$ denotes the set of objects whose cluster index is $v$.
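A sketch of this transfer rule (Eqs. (9)-(10)); here X2row holds the ±1 similarities X_{i'j} of one test object i' to all training objects, and clusters[v] lists the training objects in cluster v (both names are ours):

import numpy as np

def transfer_cluster_index(X2row, clusters):
    costs = []
    for v, sv in enumerate(clusters):
        intra = -0.5 * (X2row[sv] - 1).sum()                     # negative edges inside v
        inter = sum(0.5 * (X2row[su] + 1).sum()
                    for u, su in enumerate(clusters) if u != v)  # positive edges outside v
        costs.append(intra + inter)                              # H(i', s_v), Eq. (10)
    return int(np.argmin(costs))                                 # Eq. (9)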


For our experiments, we construct a graph with 900 nodes and 3 clusters. We fix
the structure complexity at q = 0.30 and vary the noise level p from 0.7 to 0.95. We
then divide the graph into two smaller graphs of identical cardinality N1 = N2 = 450.
For clustering, we use Gibbs sampling since, according to our experiments, it usually
achieves lower costs than approximation algorithms such as CC-Pivot [1]. We run the
sampler with a number of clusters varying from 1 to 10 each for 10 different random
initializations. We compare the transfer costs with the instability measure proposed in
[15]. The results are summarized in Figure 5. At p = 0.70 the problem is simple, which
means that the Gibbs sampler, even when initialized with a large number of clusters,
always selects the correct number of clusters on its own. The extra clusters are simply
left empty. As a consequence, the transfer costs are indifferent for a number of clusters
larger than or equal to the correct number (Figure 5(a)). At p = 0.80 the problem
is complicated but still learnable. Here, the inferred clustering and also the transfer
costs vary for different choices of the number of clusters. As illustrated in Figure 5(b)
the minimal transfer cost selects the true number of clusters. For both p = 0.70 and
p = 0.80 the instability measure is consistent with the transfer costs. At p = 0.95 the
edge labels are almost entirely random, hiding all structure in the data. Therefore, as
Figure 5(c) confirms, the number of learnable clusters is 1. In this regime, instability
cannot determine the correct number of clusters as it is not defined for k = 1.

7 Transfer Costs for k-means Clustering


In this last example, we investigate a conceptually difficult task, namely the application of k-means to Gaussian data. A solution $s$ of k-means is an assignment vector $c \in \{1, \ldots, k\}^N$ and $k$ centroids $\boldsymbol{\mu}_t$, $t \in \{1, \ldots, k\}$. Thereby, $c(i) = t$ means that object $i$ is assigned to cluster $t$. The model order is the number of centroids $k$. The cost function of k-means is the sum of distances between each object and its centroid, i.e. $R(s, \mathbf{X}, k) = \sum_i d(\boldsymbol{\mu}_{c(i)}, x_i)$. The distance function $d$ depends on the data type (Hamming, squared Euclidean, ...). As k-means provides a disjoint partitioning of the objects into the $k$ clusters, one can rewrite the transfer cost formula:

1 
N2 N1  
(1) (2)
RT(s(1) , X(2) , k) = d μc(i) , xi I{ψ(i ,X(1) ,X(2) )=i}
N2  i=1
i =1
1   (1) (2) 
≈ d μt , xi I{ψs (i ,s(1) ,X(2) )=t} , (11)
N2  t
i

where $\psi$ is the nearest-neighbor mapping between objects and $\psi^s$ is the mapping of objects to the nearest centroid as defined in Eq. (4). We use the fact that the centroids represent the objects which are assigned to them. Therefore, the centroid closest to $x_{i'}^{(2)}$ is on average approximately as far away as the centroid of the nearest neighbor of $x_{i'}^{(2)}$ in $O^{(1)}$. For high $N_1$ and $N_2$ this approximation becomes very precise.
Fig. 6. Costs and transfer costs (computed with mappings: nearest-neighbor, generative, soft) for
k-means clustering of three Gaussians. Solid lines indicate the median and dashed lines are the
25% and 75% percentiles. The right panel shows the clustering result selected by soft mapping
MTC. Top: equidistant centers and equal variance. Middle: heterogeneous distances between
centers (hierarchical). Bottom: heterogeneous distances and variances.

The setup of the experiment is as follows: We sample 200 objects from three bi-
variate Gaussian distributions (see for instance Figure 6 top right). The task is to find
the appropriate number of clusters. By altering the variances and the pairwise distances
of the centers, we control the difficulty of this problem and especially tune it such
that selecting the number of clusters is hard. We investigate the selection of k by the
nearest-neighbor mapping of the objects from the second dataset to the centroids μ(1)
as well as by the generative mapping where the two data subsets are aligned by con-
struction. We report the statistics over 20 random repetitions of generating the data.
Our findings for three different problem difficulties are illustrated in Figure 6. As
expected, the costs on the training dataset monotonically decrease with k. When the
mapping is given by the generation process of the data (generative mapping), MTC
provides the true number of clusters in all cases. However, recall that the generative
mapping requires knowledge of the true model parameters and leaks information about
the true number of clusters to the costs. Interestingly, MTC with a nearest-neighbor
mapping follows almost exactly the same trend as the original costs on the first dataset
and therefore proposes selecting the highest model-order that we offer to MTC. The
higher the number of clusters, the closer the centroids of the nearest neighbors of each object are to the object itself. This reduces the transfer costs for high k. The only difference between
original costs and transfer costs stems from the average distance between nearest neigh-
bors (the data granularity). Only when the pairwise centroid distances become smaller
than this distance, the transfer costs increase again. Ultimately, the favored solution is
a vector quantization at the level of the data granularity. This is the natural behavior of
k-means, as its cost function contains no variance parameters. As we have seen in the first experiments
with Gaussian mixture models, fitting Gaussian data with MTC imposes no particular
difficulties when the appropriate model (here GMM) is used. The k-means behavior is
due to a model mismatch.

Probabilistic Mapping: A variant of MTC can be used to still make k-means applicable to estimating the true model order of Gaussian data. In the following, we extend the notion of a strict mapping to a probabilistic mapping between objects. Let $p_{i'i} := p(\psi(i', \mathbf{X}^{(1)}, \mathbf{X}^{(2)}) = i)$ be the probability that $\psi$ maps object $i'$ from the second dataset to object $i$ of the first dataset. We define $p_{i'i}$ as

$$p_{i'i} := Z^{-1} \exp\!\left(-\beta\, d(x_i^{(1)}, x_{i'}^{(2)})\right), \quad Z = \sum_i \exp\!\left(-\beta\, d(x_i^{(1)}, x_{i'}^{(2)})\right) \qquad (12)$$
This mapping distribution is parameterized by the computational temperature $\beta^{-1}$ and depends on the problem-specific dissimilarity function $d(x_i^{(1)}, x_{i'}^{(2)})$. A probabilistic mapping is more general than the deterministic function $\psi$. When $\beta$ has a finite value, objects are mapped to more than one other object. In the case of $\beta \to \infty$, it reduces to a deterministic nearest-neighbor mapping between $O^{(2)}$ and $O^{(1)}$. When $\beta = 0$, object $i' \in O^{(2)}$ is mapped to all $N_1$ objects in $O^{(1)}$ with equal probability, thereby maximizing the entropy of $p_{i'i}$.
Using this probabilistic mapping, we define the transfer costs $R^T(s^{(1)}, \mathbf{X}^{(2)}, k)$ of a factorial model with model-order $k$ as follows:

$$R^T(s^{(1)}, \mathbf{X}^{(2)}, k) = \frac{1}{N_2} \sum_{i'=1}^{N_2} \sum_{i=1}^{N_1} p_{i'i} \, R_i\!\left(s^{(1)}(i), x_{i'}^{(2)}, k\right). \qquad (13)$$
For k-means, taking the object-to-centroid approximation, this becomes

$$R^T(s^{(1)}, \mathbf{X}^{(2)}, k) \approx \frac{1}{N_2} \sum_{i'=1}^{N_2} \sum_{t=1}^{k} d\!\left(\boldsymbol{\mu}_t^{(1)}, x_{i'}^{(2)}\right) \frac{\exp\!\left(-\beta\, d(\boldsymbol{\mu}_t^{(1)}, x_{i'}^{(2)})\right)}{\sum_{t'=1}^{k} \exp\!\left(-\beta\, d(\boldsymbol{\mu}_{t'}^{(1)}, x_{i'}^{(2)})\right)} \qquad (14)$$
We fix the inverse temperature by the costs of the data with respect to a single cluster: $\beta = 0.75 \cdot R(s^{(1)}, \mathbf{X}^{(1)}, 1)^{-1}$. This choice defines the dynamic range of the model-order selection problem. When fixing $\beta$ roughly at the costs of one cluster, the resolution of individual pairwise distances resembles the visual situation where one looks at the entire data cloud as a whole.
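A sketch of Eq. (14) with squared Euclidean distance (the variable names are ours; the beta heuristic follows the text):

import numpy as np
from scipy.spatial.distance import cdist

def soft_transfer_cost(centroids, X2, beta):
    d = cdist(X2, centroids, metric="sqeuclidean")  # d(mu_t, x_i') for all i', t
    w = np.exp(-beta * d)
    w /= w.sum(axis=1, keepdims=True)               # Gibbs weights over centroids
    return (w * d).sum() / len(X2)

# beta fixed by the single-cluster costs on the training data, as above:
# beta = 0.75 / ((X1 - X1.mean(axis=0)) ** 2).sum()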

Results of probabilistic mapping MTC: The probabilistic mapping finds the true num-
ber of clusters when the variances of the Gaussians are roughly the same, even for a
substantial overlap of the Gaussians (Figure 6, top row). Please note that although the
differences of the transfer costs are within the plotted percentiles, the rank-order of the
number of clusters in each single experiment is preserved over the 20 repetitions, i.e.
the variance mainly results from the data and not from the selection of k.
When the problem scale varies on a local level, fixing the temperature at the k = 1
solution does not resolve the dynamic range of the costs. We illustrate this by two hard
problems: The middle problem in Figure 6 has a hierarchical structure, i.e. the pair-
wise distances between centers vary a lot. In the bottom problem in Figure 6, both the
distances and the individual variances of the Gaussians vary. In both cases the number
of clusters is estimated too low. When inspecting the middle plot, this choice seems
reasonable, whereas in the bottom plot clearly three clusters would be desirable. The
introduction of a computational temperature simulates the role of the variances in Gaus-
sian mixture models. However, as the temperature is the same for all clusters, it fails to
mimic situations where the variances of the Gaussians substantially differ. A Gaussian
mixture model would be more appropriate than modeling Gaussian data with k-means.

8 Related Work
In this section we point to related work on model selection for unsupervised learning. Models that assume an explicit parametric form are often controlled by a model-complexity penalty (a regularizer). The Akaike information criterion (AIC) [2] and the Bayesian information criterion (BIC) [21] both trade off the goodness of fit, measured in terms of a likelihood function, against the number of model parameters used. In [18], the model evidence for probabilistic PCA is maximized with respect to the number of components; introducing approximations, this score equals BIC. In [12] the number of principal components is selected by integrating over the sensitivity of the likelihood to the model parameters. Minimum description length (MDL) [20] selects the lowest model order that can explain the data. It essentially minimizes the negative log posterior of the model and is thus formally identical to BIC [13]. It is unclear how to generalize model-based criteria like [2,21,18,12] to non-probabilistic methods such as, for instance, correlation clustering, which is specified by a cost function instead of a likelihood.
For selecting the rank of truncated SVD, probably the most related approach is the
cross-validation method proposed in [19]. It is a generalization of the method in [11]
and was also applied to NMF. We explain it and compare with it in Section 4.2. A
method with single hold-out entries (i, j) is proposed in [7]. It trains a SVD on the
input matrix without row i and another one without column j. Then it combines U
from one SVD and V from the other and averages their singular values to obtain an
SVD which is independent of (i, j). The method in [7] has been reviewed in [19].
In [17], the authors abandon cross-validation for Boolean matrix factorization. They
found that i) the method in [19] is not applicable and ii) using the rows of the second
matrix of the factorization (here $\mathbf{U}$ in Section 5) to explain the hold-out data tolerates overfitting. From our experience, cross-validation fails when only the second matrix
is fixed and the first matrix is adapted to the new data. With a predefined mapping to
transfer both matrices to the new data without adapting them, cross-validation works
for Boolean matrix factorization as demonstrated in Section 5.
Specialized to selecting the number of clusters in clustering, gap statistics have been
proposed in [23]. Stability analysis has also shown promising results [6,15]. Stability, however, neglects to account for the informativeness of solutions. An information theoretic model
validation principle has been proposed in [4] to determine the tradeoff between stability
and informativeness based on an information theoretic criterion called approximation
capacity. So far, this principle has been applied to clustering [5] and SVD [10].

9 Conclusion

We defined the minimum transfer cost principle (MTC) and proposed several variants
of how to apply it. Our method extends the cross-validation principle to unsupervised
learning problems as it solves the problem of transferring a learned model from one
dataset to another one when no labels are given. We demonstrated how to apply the
principle to different problems such as maximum likelihood inference, k-means clustering,
correlation clustering, Gaussian mixture models, and rank-limited SVD, highlighting its
broad applicability. For each problem, we explained the appropriate mapping function
between datasets and we demonstrated how the principle can be employed with respect
to the specifications of the particular tasks. In all cases, MTC makes a sensible choice of
the model order. It finds the optimal rank for image denoising with SVD and for error
correction in access-control configurations. Future work will cover the application of
our principle to other models as well as to other tasks such as feature selection.

Acknowledgements. This work was partially supported by the Zurich Information


Security Center, by the DFG-SNF research cluster FOR916, and by the FP7 EU project
SIMBAD.

References
1. Ailon, N., Charikar, M., Newman, A.: Aggregating inconsistent information: Ranking and
clustering. Journal of the ACM 55, 23:1–23:27 (2008)
2. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Auto-
matic Control 19(6), 716–723 (1974)
3. Bansal, N., Blum, A., Chawla, S.: Correlation clustering. Machine Learning 56(1-3), 89–113
(2002)
4. Buhmann, J.M.: Information theoretic model validation for clustering. In: ISIT 2010 (2010)
5. Buhmann, J.M., Chehreghani, M.H., Frank, M., Streich, A.P.: Information theoretic model
selection for pattern analysis. In: JMLR: Workshop and Conference Proceedings, vol. 7, pp.
1–8 (2011)
438 M. Frank, M.H. Chehreghani, and J.M. Buhmann

6. Dudoit, S., Fridlyand, J.: A prediction-based resampling method for estimating the number
of clusters in a dataset. Genome biology 3(7) (2002)
7. Eastment, H.T., Krzanowski, W.J.: Cross-validatory choice of the number of components
from a principal component analysis. Technometrics 24(1), 73–77 (1982)
8. Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned
dictionaries. IEEE Transactions on Image Processing 15(12), 3736–3745 (2006)
9. Frank, M., Buhmann, J.M., Basin, D.: On the definition of role mining. In: SACMAT, pp.
35–44 (2010)
10. Frank, M., Buhmann, J.M.: Selecting the rank of truncated SVD by Maximum Approxima-
tion Capacity. In: IEEE International Symposium on Information Theory, ISIT (2011)
11. Gabriel, K.: Le biplot – outil d'exploration de données multidimensionnelles. Journal de la Société Française de Statistique 143, 5–55 (2002)
12. Hansen, L.K., Larsen, J.: Unsupervised learning and generalization. In: IEEE Intl. Conf. on
Neural Networks, pp. 25–30 (1996)
13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, New York (2001)
14. Kuhlmann, M., Shohat, D., Schimpf, G.: Role mining – revealing business roles for security
administration using data mining technology. In: SACMAT 2003, p. 179 (2003)
15. Lange, T., Roth, V., Braun, M.L., Buhmann, J.M.: Stability-based validation of clustering
solutions. Neural Computation 16(6), 1299–1323 (2004)
16. Molloy, I., et al.: Mining roles with noisy data. In: SACMAT 2010, pp. 45–54 (2010)
17. Miettinen, P., Vreeken, J.: Model Order Selection for Boolean Matrix Factorization. In:
SIGKDD International Conference on Knowledge Discovery and Data Mining (2011)
18. Minka, T.P.: Automatic choice of dimensionality for PCA. In: NIPS, p. 514 (2000)
19. Owen, A.B., Perry, P.O.: Bi-cross-validation of the SVD and the nonnegative matrix factor-
ization. Annals of Applied Statistics 3(2), 564–594 (2009)
20. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
21. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6, 461 (1978)
22. Streich, A.P., Frank, M., Basin, D., Buhmann, J.M.: Multi-assignment clustering for Boolean
data. In: ICML 2009, pp. 969–976 (2009)
23. Tibshirani, R., Walther, G., Hastie, T.: Estimating the Number of Clusters in a Dataset via
the Gap Statistic. Journal of the Royal Statistical Society, Series B 63, 411–423 (2000)
A Geometric Approach to Find Nondominated Policies
to Imprecise Reward MDPs

Valdinei Freire da Silva and Anna Helena Reali Costa

Universidade de São Paulo, São Paulo, Brazil


valdinei.freire@gmail.com, anna.reali@poli.usp.br

Abstract. Markov Decision Processes (MDPs) provide a mathematical frame-


work for modelling decision-making of agents acting in stochastic environments,
in which transition probabilities model the environment dynamics and a reward
function evaluates the agent’s behaviour. Lately, however, special attention has
been brought to the difficulty of modelling precisely the reward function, which
has motivated research on MDP with imprecisely specified reward. Some of these
works exploit the use of nondominated policies, which are optimal policies for
some instantiation of the imprecise reward function. An algorithm that calcu-
lates nondominated policies is πWitness, and nondominated policies are used to
take decision under the minimax regret evaluation. An interesting matter would
be defining a small subset of nondominated policies so that the minimax regret
can be calculated faster, but accurately. We modified πWitness to do so. We also
present the πHull algorithm to calculate nondominated policies adopting a geo-
metric approach. Under the assumption that reward functions are linearly defined
on a set of features, we show empirically that πHull can be faster than our modi-
fied version of πWitness.

Keywords: Imprecise Reward MDP, Minimax Regret, Preference Elicitation.

1 Introduction
Markov Decision Processes (MDPs) can be seen as a core to sequential decision prob-
lems with nondeterminism [2]. In an MDP, transitions among states are Markovian and evaluation is done through a reward function. Many decision problems can
be modelled by an MDP with imprecise knowledge. This imprecision can be stated as
partial observability regarding states [7], intervals of probability transitions [12] or a set
of potential reward functions [13].
Scenarios where reward functions are imprecise are quite common in a preference elicitation process [4,6]. Preference elicitation algorithms guide a process of sequential queries to a user so as to elicit his/her preference based on his/her answers. Even if the process is guided to improve the knowledge about the user's preference, after a finite sequence of queries an imprecise representation must be used [3,9]. The user's preference may be modelled, for example, as a reward function [10].

This work was conducted under project LogProb (FAPESP proc. 2008/03995-5). Valdinei
F. Silva thanks FAPESP (proc. 09/14650-1) and Anna H. R. Costa thanks CNPq (proc.
305512/2008-0).

In this paper we tackle the problem of Imprecise Reward MDPs (IRMDPs). If a decision must be taken in an IRMDP, an optimal action must be properly defined. The minimax regret approach evaluates decisions relative to the worst case, providing a balanced decision. First, when evaluating a decision, an adversary is considered that chooses the reward function that minimises the value of the decision. The regret step compares the chosen decision with the best adversarial option for each feasible reward function, and the decision with the least regret in the worst case is selected.
Since optimisation must go through reward functions and adversarial policies, it can-
not be solved by one linear program. Efficient solutions consider an iterative process in
which decisions are chosen using a linear programming and adversarial choices are
made through a mixed integer programming [13,10] or linear programs based on non-
dominated policies [11]. In the latter, the πWitness algorithm was used to generate non-
dominated policies. Nondominated policies are optimal policies for some instantiation
of the imprecise reward function.
Even if the set of nondominated policies is much smaller than the set of deterministic policies, the cardinality of the set to be considered is still a burden to deal with. Although πWitness is able to calculate the set of nondominated policies, nothing is said about efficiently choosing a small subset of nondominated policies. Using a small subset helps to choose the best minimax-regret decision properly and fast.
We propose the πHull algorithm to calculate an efficient small subset of nondom-
inated policies. In order to compare it with πWitness, we also modify πWitness to
generate a small subset of nondominated policies.
The paper is organised as follows. Section 2 introduces theory and notation used and
section 3 describes our modified version of πWitness. Section 4 presents our main con-
tribution, the πHull algorithm. Finally, experiments are given in section 5 and section 6
presents our conclusions.

2 Theoretic Background
In this section we summarise some significant theoretic background regarding MDPs
and Imprecise Reward MDPs.

2.1 Markov Decision Process


Markov Decision Process (MDP) is a common formulation regarding optimal deci-
sions. An MDP presents two main features [2]: (i) an underlying dynamic system,
and (ii) an evaluation function that is additive in time. An MDP is defined by a tuple
S, A, Pa (·), γ, β, r(·), where:
– A is a finite set of possible actions a;
– S is a finite set of process states s;
– Pa : S × S → [0, 1] models a stationary discrete-time stochastic process on state
st such that Pa (i, j) = P (st+1 = j|st = i, at = a);
– γ ∈ [0, 1) is a discount factor;
– β : S → [0, 1] is an initial probability distribution;
– r : S × A → Ê is a reward function; and
– the system dynamics is such that s0 ∈ S is drawn from distribution β(s) and if the
process is in the state i at time t and action a is chosen, then: the next state j is cho-
sen according to the transition probabilities Pa (i, j) and a payoff with expectation
r(i, a) is incurred.

A solution for an MDP consists in a policy $\pi : S \times A \to [0,1]$, i.e., at any time $t$, $\pi(s,a)$ indicates the probability of executing action $a_t = a$ after observing state $s_t = s$. A policy $\pi$ is evaluated according to its expected accumulated discounted reward, i.e., $V(\pi) = \mathbb{E}_{s_0 \sim \beta,\, a_t \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
A policy $\pi$ induces a discounted occupancy frequency $f^\pi(s,a)$ for each pair $(s,a)$, or in vector notation $\mathbf{f}^\pi$, i.e., the accumulated expected occurrences of each pair discounted in time. Let $F$ be the set of valid occupancy frequencies for a given MDP; then for any $\mathbf{f} \in F$ it holds that
$$[\mathbb{1} - \gamma \mathbf{P}]^{\top} \mathbf{f} = \beta,$$
where $\mathbf{P}$ is a $|S||A| \times |S|$ matrix indicating $P_a(s, s')$, $\mathbb{1}$ is a $|S||A| \times |S|$ matrix with one in self-transitions, i.e., $\mathbb{1}((s,a), s) = 1$ for all $a \in A$, and $\beta$ is a $|S|$ vector indicating $\beta(s)$.
Consider the reward function in vector notation, i.e., $\mathbf{r}$. In this case, the value of a policy $\pi$ is given by $V(\pi) = \mathbf{f}^\pi \cdot \mathbf{r}$. An optimal occupancy frequency $\mathbf{f}^*$ can be found by solving:

$$\max_{\mathbf{f}} \; \mathbf{f} \cdot \mathbf{r} \quad \text{subject to: } [\mathbb{1} - \gamma \mathbf{P}]^{\top} \mathbf{f} - \beta = 0, \;\; \mathbf{f} \geq 0. \qquad (1)$$

Given an optimal occupancy frequency $\mathbf{f}^*$ the optimal policy can be promptly defined. For any $s \in S$ and $a \in A$ an optimal policy $\pi^*$ is defined by:

$$\pi^*(s,a) = \begin{cases} \dfrac{f^*(s,a)}{\sum_{a' \in A} f^*(s,a')}, & \text{if } \sum_{a' \in A} f^*(s,a') > 0 \\[2mm] \dfrac{1}{|A|}, & \text{if } \sum_{a' \in A} f^*(s,a') = 0 \end{cases} \qquad (2)$$
2.2 Imprecise Reward MDP

An imprecise reward MDP (IRMDP) consists in an MDP in which the reward func-
tion is not precisely defined [13,10,11]. This can occur due to a preference elicita-
tion process, lack of knowledge regarding to evaluation of policies or reward functions
comprising the preferences of a group of people. An IRMDP is defined by a tuple
$\langle S, A, P_a(\cdot), \gamma, \beta, R \rangle$, where $R$ is a set of feasible reward functions. We consider that a reward function is imprecisely determined by $n_R$ strict linear constraints. Given an $n_R \times |S||A|$ matrix $\mathbf{A}$ and an $n_R \times 1$ vector $\mathbf{b}$, the set of feasible reward functions is defined by $R = \{\mathbf{r} \,|\, \mathbf{A}\mathbf{r} \leq \mathbf{b}\}$.
In an IRMDP, it is also necessary to define how to evaluate decisions. The minimax regret evaluation makes a trade-off between the best and the worst cases. Consider a feasible occupancy frequency $\mathbf{f} \in F$. One can calculate the regret $\mathrm{Regret}(\mathbf{f}, \mathbf{r})$ of taking such an occupancy frequency $\mathbf{f}$ relative to a reward function $\mathbf{r}$ as the difference between $\mathbf{f}$ and the optimal occupancy frequency under $\mathbf{r}$, i.e.,

$$\mathrm{Regret}(\mathbf{f}, \mathbf{r}) = \max_{\mathbf{g} \in F} \mathbf{g} \cdot \mathbf{r} - \mathbf{f} \cdot \mathbf{r}.$$

Since any reward function $\mathbf{r}$ can be chosen from $R$, the maximum regret $\mathrm{MR}(\mathbf{f}, R)$ evaluates the occupancy frequency $\mathbf{f}$, i.e.,

$$\mathrm{MR}(\mathbf{f}, R) = \max_{\mathbf{r} \in R} \mathrm{Regret}(\mathbf{f}, \mathbf{r}).$$

Then, the best policy should minimise the maximum regret criterion:

$$\mathrm{MMR}(R) = \min_{\mathbf{f} \in F} \mathrm{MR}(\mathbf{f}, R).$$
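When both sets are restricted to finite candidate sets (e.g., the vertices of R and a set of occupancy frequency vectors, as in the nondominated-policy approach below), these three definitions can be evaluated by direct enumeration. A sketch, where F and Rset are arrays whose rows are the candidates (both names are ours):

import numpy as np

def minimax_regret(F, Rset):
    V = F @ Rset.T                 # V[i, j] = f_i . r_j
    regret = V.max(axis=0) - V     # Regret(f_i, r_j) = max_g g . r_j - f_i . r_j
    MR = regret.max(axis=1)        # maximum regret over the reward candidates
    i = int(MR.argmin())           # minimax-regret choice among the candidates
    return i, MR[i]

This enumeration is exact only over the given candidates; the exact MMR optimises over the continuous sets F and R.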

In order to calculate MMR(R) efficiently, some works use nondominated policies, i.e.,
policies that are optimal for some feasible reward functions [11]. Formally, a policy f
is nondominated with respect to R iff
∃r ∈ R such that f · r ≥ f  · r, ∀f 

= f ∈ F. (3)

2.3 Reward Functions Based on Features


Reward functions are usually defined on a small set of $k$ features, where $k \ll |S||A|$. Features represent semantic events of interest, such as obtaining or consuming a resource.
Consider a feature function that maps each pair $(s,a)$ to a vector of $k$ observed features, i.e., $\phi : S \times A \to \mathbb{R}^k$. Then, the reward function $r(s,a)$ is considered to be a linear combination of such observed features:

$$r(s,a) = \mathbf{w} \cdot \phi(s,a),$$

where $\mathbf{w}$ is a weight vector.
The occupancy frequency vector $\mathbf{f}$ can be reduced to an expected feature vector $\boldsymbol{\mu} = \mathbf{f}^{\top} \boldsymbol{\phi}$, where $\boldsymbol{\phi}$ is considered to be the $|S||A| \times k$ matrix with rows denoting $\phi(s,a)$. Note that we can talk interchangeably about a policy $\pi$, its occupancy frequency vector $\mathbf{f}$, or its expected feature vector $\boldsymbol{\mu}$. A stationary policy $\pi$ is uniquely defined by an occupancy frequency vector $\mathbf{f}$, and the evaluation of a policy $\pi$ depends only on its expected feature vector $\boldsymbol{\mu}$, i.e., $V(\pi) = \mathbf{w} \cdot \boldsymbol{\mu}$.
Although this makes the definition of reward functions easier, the use of features still requires the assignment of scalar values, which is not an easy task. Unless the user is familiar with some concepts of decision theory, understanding this assignment of precise numerical values is not natural. However, defining a weight vector imprecisely is much easier than defining a reward function in the full state-action space.

3 The πWitness Algorithm


Regan and Boutilier [11] present the πWitness algorithm for identifying nondom-
inated policies. We describe below a modified version of it in order to choose properly
a small subset of nondominated policies.
3.1 Witness Reward Functions

Given any occupancy frequency $\mathbf{f} \in F$, define its corresponding policy $\pi_{\mathbf{f}}$ (see equation 2). Let $\mathbf{f}[s]$ be the occupancy frequency obtained by executing policy $\pi_{\mathbf{f}}$ with deterministic initial state $s_0 = s$, i.e., the occupancy frequency of policy $\pi_{\mathbf{f}}$ with an initial state distribution $\beta'(s) = 1$.
Let $\mathbf{f}^{s:a}$ be the occupancy frequency in the case that if $s$ is the initial state then action $a$ is executed and policy $\pi_{\mathbf{f}}$ is followed thereafter, i.e.,

$$\mathbf{f}^{s:a} = \beta(s) \left( \mathbf{e}^{s:a} + \gamma \sum_{s' \in S} \mathbf{f}[s'] P_a(s, s') \right) + \sum_{s' \neq s} \mathbf{f}[s'] \beta(s'),$$

where $\mathbf{e}^{s:a}$ is an $|S||A|$ vector with 1 in position $(s,a)$ and zeros elsewhere¹.
The occupancy frequency $\mathbf{f}^{s:a}$ can be used to find nondominated policies in two steps: (i) choose arbitrarily $\mathbf{r}_{init}$ and find an optimal occupancy frequency $\mathbf{f}_{\mathbf{r}_{init}}$ with respect to $\mathbf{r}_{init}$, keeping each optimal occupancy frequency in a set $\Gamma$; (ii) for each occupancy frequency $\mathbf{f} \in \Gamma$, (iia) find $s \in S$, $a \in A$ and $\mathbf{r} \in R$ such that:

$$\mathbf{f}^{s:a} \cdot \mathbf{r} > \mathbf{f}' \cdot \mathbf{r} \;\; \text{for all } \mathbf{f}' \in \Gamma, \qquad (4)$$

(iib) calculate the respective optimal occupancy frequency $\mathbf{f}_{\mathbf{r}}$, and add it to $\Gamma$. The algorithm stops when no reward function can be found such that equation 4 is true. The reward function in equation 4 is a witness that there exists at least one more nondominated policy to be found.
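The witness search of equation 4 can itself be posed as a linear program: maximise a margin δ such that (f^{s:a} − f')·r ≥ δ for every f' in Γ, subject to the feasibility constraints Ar ≤ b; a witness exists iff the optimum satisfies δ > 0 (the strict inequality is approximated by a positive margin). A sketch, with names of our choosing:

import numpy as np
from scipy.optimize import linprog

def find_witness(f_sa, Gamma, A, b):
    n = len(f_sa)
    c = np.zeros(n + 1); c[-1] = -1.0                            # maximise delta
    rows = [np.concatenate([fp - f_sa, [1.0]]) for fp in Gamma]  # (f'-f_sa).r + delta <= 0
    rows += [np.concatenate([row, [0.0]]) for row in A]          # feasibility: A r <= b
    b_ub = np.concatenate([np.zeros(len(Gamma)), b])
    res = linprog(c, A_ub=np.array(rows), b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    if res.success and res.x[-1] > 0:
        return res.x[:n]                                         # a witness reward vector
    return None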

3.2 Efficient Small Subset of Policies

Although the set of nondominated policies is much smaller than the set of all deterministic policies, it can still be very large, making the calculation of the minimax regret costly. It is interesting to find a small set of policies that approximates the set of nondominated policies efficiently. By efficient we mean that an occupancy frequency $\mathbf{f}_\Gamma$ chosen within a small subset $\Gamma$ is nearly as good as the exact minimax regret decision, i.e., $\mathrm{MR}(\mathbf{f}_\Gamma, R) - \mathrm{MMR}(R) \approx 0$.

¹ We changed the original formula [11]: $\mathbf{f}^{s:a} = \beta(s)\left(\mathbf{e}^{s:a} + \gamma \sum_{s' \in S} \mathbf{f}[s'] P_a(s, s')\right) + (1 - \beta(s))\,\mathbf{f}$. Note that such an equation implicitly considers a new distribution $\beta'$, defined by

$$\beta'(x) = \begin{cases} \beta(x) + \beta(x)(1 - \beta(s)), & \text{if } x = s \\ \beta(x)(1 - \beta(s)), & \text{otherwise} \end{cases}$$

In this case the occupancy frequency $\mathbf{f}^{s:a}$ has the meaning of executing action $a$ when starting in state $s$ with probability $\beta(s) + \pi_{\mathbf{f}}(s,a)(1 - \beta(s))$.
Consider a witness $\mathbf{r}^w$ and its respective optimal occupancy frequency $\mathbf{f}^w$. The difference

$$\Delta(\mathbf{f}^w, \Gamma) = \max_{\mathbf{r} \in R} \min_{\mathbf{f}' \in \Gamma} \left[ \mathbf{f}^w \cdot \mathbf{r} - \mathbf{f}' \cdot \mathbf{r} \right]$$
can be used to define the gain when adding f w to the set Γ . If a small subset of nondom-
inated policies is desired, Δ(f w , Γ ) may indicate a priority on which policies are added
to Γ . Instead of adding to Γ every occupancy frequency f w related to nondominated
policies, it is necessary to choose carefully among witnesses f w , and to add only the
witness that maximizes Δ(f w , Γ ).

3.3 The πWitnessBound Algorithm


Algorithm 1 summarises the πWitnessBound algorithm, our modified version of πWitness. It chooses $N_\Gamma$ nondominated policies. The findBest(r) function solves an MDP with reward function $\mathbf{r}$ (see equation 1). Instead of searching over all feasible occupancy frequency vectors in $F$, findWitnessReward($\mathbf{f}^{s:a}$, $\Gamma$) tries to find a witness $\mathbf{r}^w$ for $\mathbf{f}^{s:a}$ which guarantees equation 3 within the set $\Gamma$.

Algorithm 1. The πWitnessBound algorithm


Input: IRMDP, NΓ
r ← some arbitrary r ∈ R
f ← findBest(r)
Γ ←∅
Γagenda ← {f }
while |Γ | < NΓ do
f ← best item in Γagenda regarding to Δ(f , Γ )
add f to Γ
foreach s, a do
rw ← findWitnessReward(f s:a , Γagenda )
while witness found do
f best ← findBest(rw )
add f best to Γagenda
rw ← findWitnessReward(f s:a , Γagenda )

Output: Γ

It is worth noticing that each iteration of πWitness takes at least |S||A| calls to
findWitnessReward(·), and if it succeeds findBest(·) is also called. The num-
ber of policies in agenda can also increase fast, increasing the burden of calls to
findWitnessReward(·). In the next section we consider the hypothesis of the re-
ward function being defined with a small set of features, and we take this into account
to define a new algorithm with better run-time performance.
4 A Geometric Approach to Find Nondominated Policies

The problem of finding nondominated policies is similar to the problem of finding the
convex hull of a set of points. Here the points are occupancy frequency vectors.
We consider reward functions defined in terms of features, thus we can work in a space
of reduced dimensions. Our algorithm is similar to the Quickhull algorithm [1], but the
set of points is not known a priori.

4.1 Space of Feature Vector

Even with no information about the reward functions, but considering that they are de-
scribed by k features, we can analyse the corresponding IRMDP in the feature vector
space. The advantage of such analysis is that a conventional metric space can be con-
sidered. This is possible because the expected vector of features regarding to a policy
accumulates all the necessary knowledge about transitions in an MDP.
In this section we show through two theorems that if we take the set of all feasible expected feature vectors $M = \{\boldsymbol{\mu}^\pi \,|\, \pi \in \Pi\}$ and define its convex hull $\mathcal{M} = co(M)$², then the vertices $V$ of the polytope $\mathcal{M}$ represent the expected feature vectors of special deterministic policies. Such special policies are the nondominated policies under imprecise reward functions where the set $R$ is free of constraints.

Theorem 1. Let $\Pi$ be the set of stochastic policies and let $M = \{\boldsymbol{\mu}^\pi \,|\, \pi \in \Pi\}$ be the set of all expected feature vectors defined by $\Pi$. The convex hull of $M$ determines a polytope $\mathcal{M} = co(M)$, where $co(\cdot)$ stands for the convex hull operator. Let $V$ be the set of vertices of the polytope $\mathcal{M}$; then for any vertex $\boldsymbol{\mu} \in V$ there exists a weight vector $\mathbf{w}_{\boldsymbol{\mu}}$ such that:

$$\mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu} > \mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu}' \;\; \text{for any } \boldsymbol{\mu}' \neq \boldsymbol{\mu} \in M.$$

Proof. If $\boldsymbol{\mu}$ is a vertex of the polytope $\mathcal{M}$, there exists a hyperplane $H$ such that $H \cap \mathcal{M} = \{\boldsymbol{\mu}\}$. The hyperplane $H$ divides the feature vector space into two half-spaces $X_{H,1}$ and $X_{H,2}$. Let us define the set $M' = M - \{\boldsymbol{\mu}\}$. Because the set resulting from the intersection of $H$ and $\mathcal{M}$ has cardinality one, all feature vectors in $M'$ are in the same half-space, i.e., either $M' \subset X_{H,1}$ or $M' \subset X_{H,2}$.
Take any vector $\mathbf{w}$ orthogonal to $H$. Since $\boldsymbol{\mu} \in H$, for any $\boldsymbol{\mu}', \boldsymbol{\mu}'' \in M'$ we have $|\mathbf{w} \cdot \boldsymbol{\mu}' - \mathbf{w} \cdot \boldsymbol{\mu}| > 0$ and $\mathrm{sign}(\mathbf{w} \cdot \boldsymbol{\mu}' - \mathbf{w} \cdot \boldsymbol{\mu}) = \mathrm{sign}(\mathbf{w} \cdot \boldsymbol{\mu}'' - \mathbf{w} \cdot \boldsymbol{\mu})$. Take any $\boldsymbol{\mu}' \in M'$ and define:

$$\mathbf{w}_{\boldsymbol{\mu}} = \begin{cases} \mathbf{w}, & \text{if } \mathbf{w} \cdot \boldsymbol{\mu}' - \mathbf{w} \cdot \boldsymbol{\mu} < 0 \\ -\mathbf{w}, & \text{if } \mathbf{w} \cdot \boldsymbol{\mu}' - \mathbf{w} \cdot \boldsymbol{\mu} > 0 \end{cases}$$

In this case $\mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu} > \mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu}'$ for any $\boldsymbol{\mu}' \in M'$. $\square$




Theorem 2. Let $\Pi$, $M$, $\mathcal{M}$ and $V$ be defined as in the previous theorem. Let $\Gamma$ be the set of nondominated policies of an IRMDP where $R = \{r(s,a) \,|\, \mathbf{w} \in [-1,1]^k \text{ and } r(s,a) = \mathbf{w} \cdot \phi(s,a)\}$. Let $M_\Gamma = \{\boldsymbol{\mu}^\pi \,|\, \pi \in \Gamma\}$; then $V = M_\Gamma$.
2
The operator co(·) stands for the convex hull operator.
Proof. In theorem 1 we proved that $\boldsymbol{\mu} \in V \Rightarrow \boldsymbol{\mu} \in M_\Gamma$. Now we prove the reverse. Consider the set $D = M_\Gamma - V$; we have $D \subset M - V$ and we want to show that $D$ must be empty.
Suppose $D$ is not empty and consider a feature vector $\boldsymbol{\mu} \in D$. Suppose that there exists $\mathbf{w}_{\boldsymbol{\mu}}$ such that:

$$\mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu} - \mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu}' > 0 \;\; \text{for any } \boldsymbol{\mu}' \neq \boldsymbol{\mu} \in M. \qquad (5)$$

In this case $\mathbf{w}_{\boldsymbol{\mu}}$ and $\boldsymbol{\mu}$ uniquely define a hyperplane $H' = \{\boldsymbol{\mu}' \,|\, \mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu}' - \mathbf{w}_{\boldsymbol{\mu}} \cdot \boldsymbol{\mu} = 0\}$. Because $\boldsymbol{\mu}$ is not a vertex of $\mathcal{M}$, the half-spaces $X_{H',1}$ and $X_{H',2}$ defined by $H'$ are such that $\exists \boldsymbol{\mu}' \in M \cap X_{H',1}$ and $\exists \boldsymbol{\mu}'' \in M \cap X_{H',2}$. Therefore equation 5 is not true for any $\mathbf{w}_{\boldsymbol{\mu}}$ and $D = \emptyset$. $\square$
4.2 Finding Nondominated Policies


Theorem 2 shows that finding the set of nondominated policies is the same as finding
the set of vertices of the convex hull of feasible expected feature vector.
Given a set of vectors, the Quickhull algorithm finds the convex hull of such set. At
any iteration the Quickhull algorithm maintains a current polytope. For a chosen facet
of the current polytope, the Quickhull algorithm analyses vectors above the facet by cal-
culating the distance between each vector and the facet. The farthest vector is included
in the current polytope by creating new facets based on the facets that such vector can
see. The algorithm iterates while there exist vectors outside the current polytope [1].
Just as in the Quickhull algorithm, we consider an initial polytope M, and for each
facet we search for the farthest vector not in M. However, if we consider the set of all
policies Π and their respective expected feature vectors M , the number of vectors to be
analysed is too big. Instead, we find the farthest vector of a facet by choosing wisely a
weight vector w and solving an MDP with the reward function modelled by the weight
vector w. We show how to choose such weight vector in the next theorem.
Consider an initial polytope $\widehat{\mathcal{M}}$ whose vertices $\widehat{V}$ are nondominated policies, with hypervolume greater than 0. Note that the number of vertices must be at least $k + 1$. We start with $\widehat{\mathcal{M}} = co(\widehat{V})$ and add new vertices until the polytopes $\widehat{\mathcal{M}}$ and $\mathcal{M} = co(\{\boldsymbol{\mu}^\pi \,|\, \pi \in \Pi\})$ are the same. The idea is to test whether each facet of the polytope $\widehat{\mathcal{M}}$ is also a facet of $\mathcal{M}$. If it is not the case, we look for a new vertex $\boldsymbol{\mu}$ and add $\boldsymbol{\mu}$ to $\widehat{V}$. We can do this thanks to the following theorem.
Theorem 3. Let $\mathcal{H}$ be the smallest set of hyperplanes which constrain the polytope $\mathcal{M}$. Let $\widehat{\mathcal{H}}$ be the smallest set of hyperplanes that bound a polytope $\widehat{\mathcal{M}} \subset \mathcal{M}$. Take a hyperplane $H \in \widehat{\mathcal{H}}$. $H \in \mathcal{H}$ iff there is no policy $\pi$ such that:

$$\mathbf{w}_{H,\widehat{\mathcal{M}}} \cdot \boldsymbol{\mu}^\pi > \mathbf{w}_{H,\widehat{\mathcal{M}}} \cdot \boldsymbol{\mu}_H,$$

where $\boldsymbol{\mu}_H$ is any feature vector in $H$ and $\mathbf{w}_{H,\widehat{\mathcal{M}}}$ is such that: $\mathbf{w}_{H,\widehat{\mathcal{M}}}$ and $H$ are orthogonal, and for any $\boldsymbol{\mu} \in \widehat{\mathcal{M}}$ we have $\mathbf{w}_{H,\widehat{\mathcal{M}}} \cdot \boldsymbol{\mu}_H \geq \mathbf{w}_{H,\widehat{\mathcal{M}}} \cdot \boldsymbol{\mu}$.

Proof. Consider that there exists a policy π such that wH,M · μπ > wH,M · μH . Be-
cause of the definition of wH,M
, μ π is beyond 3
hyperplane H in the direction wH,M .

3
A vector μ is beyond a hyperplane H with respect to the direction w, if for any vector x ∈ H
it is true that w, μ > w, x.
Fig. 1. Constructing the set of feasible expected feature vectors in two dimensions. a) The initial polytope (polygon) $\widehat{\mathcal{M}}$ has vertices that minimise and maximise each feature separately. b) Example of a hyperplane (edge) of the polytope $\widehat{\mathcal{M}}$ which is not in the polytope $\mathcal{M}$. c) Example of a hyperplane (edge) of the polytope $\widehat{\mathcal{M}}$ which is in the polytope $\mathcal{M}$.

Therefore $H$ constrains the feature vector $\boldsymbol{\mu}^\pi$ and $\boldsymbol{\mu}^\pi \notin \widehat{\mathcal{M}}$. However, $\boldsymbol{\mu}^\pi \in \mathcal{M}$ because $\boldsymbol{\mu}^\pi$ is a feasible expected feature vector. Therefore $H \notin \mathcal{H}$.
Now suppose $H \notin \mathcal{H}$. Because $\widehat{\mathcal{M}} \subset \mathcal{M}$ and constraints in $\widehat{\mathcal{H}}$ are tighter than constraints in $\mathcal{H}$, there exists a feasible expected feature vector beyond $H$. Therefore there exists a policy $\pi$ such that $\mathbf{w}_{H,\widehat{\mathcal{M}}} \cdot \boldsymbol{\mu}^\pi > \mathbf{w}_{H,\widehat{\mathcal{M}}} \cdot \boldsymbol{\mu}_H$. $\square$


so that it becomes the set V desired:


Theorem 3 suggests a rule to improve the set V
is considered and the polytope M
– Initially a set of vertices V = co(V)
is computed
(figure 1a).
– For each hyperplane H  which constrains the polytope M and the vector wH 

orthogonal to H with direction outwards regarding to the polytope M (figures 1b

and 1c), calculate an optimal policy πwH  for wH and its corresponding expected


π∗
feature vector μ wH  .
π∗ π∗ (figure 1b).
– If μ wH  is beyond H  , then μ wH  can be added to V

– Otherwise the hyperplane H constrains the set M (figure 1c).
– also constrain the set M, then M
If all the hyperplanes that constrain M = M.
– The end of the process is guaranteed since the cardinality of the set of nondomi-
nated policies is finite.
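A sketch of this facet-expansion loop using scipy's Quickhull wrapper; the helper solve_mdp_for_weights (returning the optimal expected feature vector for a weight vector, e.g. via the linear program of equation 1) is assumed, and the refinement of Section 4.3 for infeasible orthogonal vectors is omitted:

import numpy as np
from scipy.spatial import ConvexHull

def pi_hull(V0, solve_mdp_for_weights, tol=1e-9):
    V = [np.asarray(v) for v in V0]       # initial k+1 vertices (Section 4.4)
    while True:
        hull = ConvexHull(np.array(V))
        grew = False
        for eq in hull.equations:         # eq = [outward normal w, offset d]: w.x + d <= 0 inside
            w, d = eq[:-1], eq[-1]
            mu = solve_mdp_for_weights(w) # farthest feasible feature vector along w
            if w @ mu + d > tol:          # mu lies beyond this facet
                V.append(mu)
                grew = True
        if not grew:
            return np.array(V)            # vertices = nondominated policies (Theorem 2)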

4.3 Normal Vectors and Reward Constraints


In the previous section we gave some directions on how to find nondominated policies when the set of reward functions is unconstrained. In fact, we consider reward functions to be constrained only to a description based on features. In order to find the farthest
feature vector, we considered an orthogonal vector to each facet. However, although
an orthogonal vector can be easily calculated, it may not be a feasible weight vector.
In this section, we find a potential farthest feature vector in three steps: (i) by using
the orthogonal vector wort , find the farthest potential feature vector μf ar , (ii) verify if
there exists a witness wwit to μf ar , and (iii) find a feasible expected feature vector by
maximising the witness wwit .
In the first step, instead of solving an MDP and finding a feasible expected feature
vector, which requires optimisation within |S| constraints, we work with potential fea-
ture vectors, i.e., vectors in the space under relaxed constraints. By doing so, we can
approximate such a solution to a linear optimisation within k constraints in the feature
space. First, in the feature space, the lower and upper bounds in each axis can be found
previously. We solve MDPs for weight vectors in every possible direction, obtaining
respectively upper and lower scalars bounds μtop i and μbottom
i for every feature i. Sec-
ond, when looking for vectors beyond a facet, only constraints applied to such a facet
should be considered. Then, given a facet constructed from expected feature vectors
μ1 , . . . , μk with corresponding weight vectors w1 , . . . , wk and the orthogonal vector
wort , the farthest feature vector μf ar is obtained from:

max wort · μf ar
µf ar
subject to: wi · μf ar < wi · μi , for i = 1, . . . , k . (6)
μbottom
i ≤ μfi ar ≤ μtop
i , for i = 1, . . . , k

The second step verifies if there exists a witness that maximises $\boldsymbol{\mu}^{far}$ compared to $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$. Note that $\boldsymbol{\mu}^{far}$ may not be a feasible expected feature vector ($\mathbf{f}^{s:a}$ is actually feasible in πWitnessBound). But $\boldsymbol{\mu}^{far}$ indicates an upper limit on the distance of the farthest feasible feature vector.
The third step solves an MDP and finds a new expected feature vector to be added to $\widehat{V}$. This step is the most expensive, and we will exploit the second step to avoid running into it unnecessarily. If a limited number of nondominated policies is required, not all the policies will be added to $\widehat{V}$. We can save run-time if we conduct the third step only when necessary, adopting the second step as an advice.

4.4 Initial Polytope

Previously we considered a given initial polytope with hypervolume greater than 0. In this section we will discuss how to obtain it. Two necessary assumptions regarding the existence of such a polytope are: (i) all features are relevant, i.e. none can be directly described by the others; and (ii) for each feature, different expected occurrences can be obtained with the MDP.
First, if we consider an initial set of weight vectors W = {w1} and its corresponding set of optimal expected feature vectors V̂ = {μ1}, the idea is to find the farthest feasible feature vector, regarding the MDP dynamics and the constrained weight vector, as previously done. However, there exists more than one orthogonal vector. In order to overcome this problem, we use the following linear program:
    min_{wort,−}  wort,− · ( Σ_{w∈W} w / |W| )                                  (7)
    subject to:  wort,− · (μi − μj) = 0 ,   ∀ μi ≠ μj ∈ V̂

Note that wort,− must be orthogonal to any ridge formed by feature vectors in V̂, but at the same time it tries to be opposite to the weight vectors already maximised. A version wort,+ in the average direction of W is also obtained.

We use two directions when looking for the farthest feature vectors because it is not possible to know which direction a ridge faces. By using equation 6 we look for the farthest feature vectors μfar,+ and μfar,−. Witnesses w+ and w− and optimal expected feature vectors μ+ and μ− are found for both of them. Then, the farthest feature vector of the two is added to V̂. Here, the distance is measured in the directions w+ and w− relative to the set V̂.
This process goes on until |W| = k + 1, when it is possible to construct a polytope. Algorithm 2 presents the initHull algorithm.

Algorithm 2. The initHull algorithm

Input: IRMDP, k, φ(·)
choose w such that r(s, a) = w · φ(s, a) ∈ R
make W = {w}
find the best expected feature vector μ for w
make V̂ = {μ}
while |W| < k + 1 do
    calculate wort,+ and wort,− (equation 7)
    calculate μfar,+ and μfar,−
    calculate witnesses w+ and w− (equation 3)
    calculate the best policies μ+ and μ− (equation 1)
    choose between μ+ and μ− the farthest feature vector μ
    get the respective weight vector w
    update W ← W ∪ {w}
    update V̂ ← V̂ ∪ {μ}
Output: W, V̂

4.5 The πHull Algorithm


Finally, before presenting the πHull algorithm, we will discuss some performance is-
sues. The analysis of each facet of a polytope consists of three steps: (i) find the out-
ward orthogonal direction and the farthest potential feature vector, (ii) find a witness
and (iii) find an optimal policy. The first step is done with a linear program with k con-
straints and upper and lower bounds, whereas the third step consists in a linear program
with |S| constraints. Unlike the πWitness algorithm, where witnesses must be compared against all policies in Γagenda, the second step should only look at policies in the facet.
Although there is a gain here when using findWitnessReward(·), the number of facets in the polytope M̂ increases exponentially. If only a few nondominated policies are required, this gain may outweigh the exponential increase. Therefore, it is interesting to provide a solution whose cost does not grow exponentially, even if it leads to some loss in the quality of the subset of nondominated policies.
The algorithm πHull keeps three sets of facets: Hnull, Hwit and Hbest. Hnull keeps facets which have not gone through any step of the analysis. Hwit keeps facets which have gone through the first step; they are ordered by the distance between the facet and the potential feature vector. Finally, Hbest keeps facets which have passed all three steps; they are ordered by the distance to the farthest expected feature vector.

In each iteration the πHull algorithm randomly processes Nnull facets from the set Hnull, then it processes in the given order Nwit facets from the set Hwit. Finally, it chooses from Hbest the best facet, i.e., the one with the farthest corresponding expected feature vector, and adds it to V̂. Algorithm 3 summarises the πHull algorithm; a sketch of the facet-queue bookkeeping it relies on is given after the algorithm.

Algorithm 3. The πHull algorithm

Input: IRMDP, NΓ, Nnull, Nwit, k, φ(·)
W, V̂ ← initHull(IRMDP, k, φ(·))
Hnull = co(V̂)
Hwit = ∅
Hbest = ∅
while |V̂| < NΓ do
    for Nnull facets H in Hnull do
        Hnull ← Hnull \ {H}
        wort ← vector orthogonal to H
        μfar ← farthest potential feature vector
        α ← distance between μfar and H
        associate α, μfar and wort with H
        Hwit ← Hwit ∪ {H}
    for first Nwit facets H in Hwit do
        Hwit ← Hwit \ {H}
        wwit ← findWitnessReward(μfar, H)
        μbest ← findBest(wwit)
        α ← distance between μbest and H
        associate α, μbest and wwit with H
        Hbest ← Hbest ∪ {H}
    μnew, wwit ← first facet in Hbest
    V̂ ← V̂ ∪ {μnew}
    W ← W ∪ {wwit}
    Hnull = co(V̂)
Output: V̂
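The ordering of Hwit and Hbest is not prescribed beyond "by distance"; one possible realisation (ours, not from the paper) keeps them as max-priority queues. A minimal Python sketch, assuming each facet is an arbitrary object and dist is the α value associated with it:

import heapq
import itertools

_counter = itertools.count()  # tie-breaker so facet objects are never compared

def push(queue, dist, facet):
    # heapq is a min-heap, so the distance is negated to pop the farthest first
    heapq.heappush(queue, (-dist, next(_counter), facet))

def pop_farthest(queue):
    neg_dist, _, facet = heapq.heappop(queue)
    return -neg_dist, facet

Hnull can remain a plain list, since its facets are popped at random.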

4.6 Complexity Analysis


We argued that a description of reward functions with features is much more compact than a description of reward functions in the full state-action space. In the πHull algorithm we take advantage of such a description to introduce a new algorithm to calculate nondominated policies. On the other hand, our algorithm depends on the number of facets of a convex hull.
The number of facets of a convex hull is known to grow exponentially with the space dimension k, but only linearly with the number of vertices |V̂| [5]. In the πHull algorithm, the number of vertices grows every iteration, which means that little time is spent in the first iterations. However, for k = 5 the linear factor is around 32, while for k = 10 the linear factor is around 10⁶.
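This growth is easy to observe empirically. The following small experiment (ours, not from the paper) builds the convex hull of random points with scipy and prints the number of facets per vertex as the dimension increases:

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
for k in (3, 5, 7):
    pts = rng.standard_normal((50, k))   # 50 random points in R^k
    hull = ConvexHull(pts)
    # facets per input point: the "linear factor" discussed above
    print(k, len(hull.simplices) / len(pts))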
It is clear that our adoption of Hnull and Hwit is necessary to calculate nondominated policies even with k = 10. On the other hand, the distance of the farthest vector of a facet will become smaller and smaller as new facets are added to V̂. Then, in the first iterations, when the number of facets is small, the farthest vector of all facets will be calculated and kept in Hbest. As the number of iterations grows and the farthest vector cannot be promptly calculated for all facets, the facets saved in Hbest can be good candidates to be added to V̂, if not the best ones.
Another interesting point in favour of the πHull algorithm is the relationship with
MDP solvers. While πWitnessBound relies on small changes in known nondominated
policies, the πHull algorithm relies on the expected feature vector of known nondomi-
nated policies. For instance, if a continuous-state and continuous-action IRMDP is used, how could one iterate over states and actions? What would the occupancy frequency be in this case?
The πHull algorithm allows any MDP solver to be used. For instance, if the MDP solver finds approximate optimal policies or expected feature vectors, the πHull algorithm would not be affected in the first iterations, where the distance of the farthest feature vectors is large.

5 Experiments
We performed experiments on synthetic IRMDPs. Apart from the number of features and the number of constraints, all IRMDPs are randomly drawn from the same distribution. We have |S| = 50, |A| = 5, γ = 0.9, and β is a uniform distribution. Every state can transit only to 3 other states drawn randomly, and such a transition function is also drawn randomly. φ(·) is defined in such a way that for any feature i, we have Σ_{s,a} φi(s, a) = 10.
We constructed two groups of 50 IRMDPs. A group with R defined on 5 features
and 3 linear constraints, and another group with R defined on 10 features and 5 linear
constraints.
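A minimal sketch of such a generator, under our own reading of the setup (the distribution of the transition probabilities over the 3 successors is not specified in the text; a Dirichlet draw is assumed here):

import numpy as np

def random_irmdp(n_states=50, n_actions=5, k=5, rng=np.random.default_rng()):
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            succ = rng.choice(n_states, size=3, replace=False)  # 3 successors
            P[s, a, succ] = rng.dirichlet(np.ones(3))           # random probs
    phi = rng.random((n_states, n_actions, k))
    phi *= 10.0 / phi.sum(axis=(0, 1), keepdims=True)  # sum_{s,a} phi_i = 10
    return P, phi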
The first experiment compares πWitnessBound, πHull without time limits (Nnull =
∞ and Nwit = ∞), and πHull with time limits (Nnull = 50 and Nwit = 10). This
experiment was run on the first group of IRMDPs, where k = 5. Figure 2 shows the
results comparing the run-time spent in each iteration and the error regarding the recommended decision and its maximum regret, i.e., in each iteration f∗ is chosen to be the occupancy frequency that minimises the maximum regret regarding the current set of nondominated policies, and effectiveness is measured by the error = MR(f∗, R) − MMR(R).⁴
The second experiment compares πWitnessBound and πHull with time limits
(Nnull = 100 and Nwit = 20). This experiment was run on the second group of
IRMDPs where k = 10. Figure 3 shows the results.
In both groups the effectiveness of the subset of nondominated policies was similar for the πWitnessBound algorithm and the πHull algorithm. In both cases, a few policies are enough to make decisions under the minimax regret criterion. The time spent per iteration in πHull with time limits is at least four times smaller when compared to

⁴ We estimate MMR(R) considering the union of policies found at the end of the experiment within all of the algorithms: πWitness and πHull.

Fig. 2. Results of the experiment with 5 features: error in MMR and time (s) per iteration versus the number of nondominated policies, for πHull (no limit), πHull (with limits) and πWitnessBound

Fig. 3. Results of the experiment with 10 features: error in MMR and time (s) per iteration versus the number of nondominated policies, for πHull (with limits) and πWitnessBound

πWitnessBound. In the case that πHull is used without time limit, the time spent per
iteration increases with iterations, but in the early iterations it is still smaller than that
found by πWitnessBound.
Our experiments were made with small MDPs (50 states and 5 actions). The run-time of each algorithm depends on the technique used to solve an MDP. However, as πWitnessBound does not take advantage of the reward description based on features, it also depends on how big the sets of states and actions are. On the other hand, πHull depends only on the number of features used. In conclusion, the higher the cardinality of the sets of states and actions, the greater the advantage of πHull over πWitnessBound.
Although πWitnessBound is not clearly dependent on the number of features, we can see in the second experiment a run-time four times that of the first experiment. When the number of features grows, the number of witnesses also increases, which requires a larger number of MDPs to be solved.

6 Conclusion

We presented two algorithms: πWitnessBound and πHull. The first is a slight modification to πWitness, while the second is a completely new algorithm. Both are effective when they define a small subset of nondominated policies to be used for calculating the minimax regret criterion. πHull shows a better run-time performance in our experiments, mainly due to the large difference between the number of features (k = 5 and k = 10) and the number of states (|S| = 50), since πWitnessBound depends on the latter.
Although πHull shows a better run-time performance and similar effectiveness, we have not presented a formal proof that πHull always performs better. Future work should seek three formal results related to the πHull algorithm. First, to prove that our facet analysis reaches the farthest feature vector when the constraints are considered. Second, to establish a formal relation between the number of nondominated policies and the error in calculating MMR(R). Third, to establish the speed with which πHull reaches a good approximation of the set V, which is very important given the exponential growth in the number of facets. We must also examine how the parameters Nnull and Nwit affect this calculation.
Besides the effectiveness and better run-time performance of πHull compared with πWitnessBound, there are also qualitative characteristics. Clearly the πHull algorithm cannot be used if a feature description is not at hand. However, a reward function is rarely defined directly on the state-action space. πHull would present problems if the number of features to be used were too large.
The main advantage of πHull regards the MDP solver to be used. In real problems an MDP solver would take advantage of the problem's structure, like factored MDPs [14], or would approximate solutions in order to make it feasible [8]. πWitnessBound must be adapted somehow to work with such solvers.
It is worth noticing that nondominated policies can be a good indicator for a preference elicitation process. They give us a hint about which policies to compare. For instance, the small set of nondominated policies can be used when an enumerative analysis must be done.

References
1. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. 22, 469–483 (1996), http://doi.acm.org/10.1145/235815.235821
2. Bertsekas, D.P.: Dynamic Programming - Deterministic and Stochastic Models. Prentice-
Hall, Englewood Cliffs (1987)
3. Boutilier, C., Patrascu, R., Poupart, P., Schuurmans, D.: Constraint-based optimization and
utility elicitation using the minimax decision criterion. Artificial Intelligence 170(8), 686–
713 (2006)
4. Braziunas, D., Boutilier, C.: Elicitation of factored utilities. AI Magazine 29(4), 79–92 (2008)
5. Buchta, C., Muller, J., Tichy, R.F.: Stochastical approximation of convex bodies. Mathema-
tische Annalen 271, 225–235 (1985), http://dx.doi.org/10.1007/BF01455988,
doi:10.1007/BF01455988

6. Chajewska, U., Koller, D., Parr, R.: Making rational decisions using adaptive utility elic-
itation. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence
and Twelfth Conference on Innovative Applications of Artificial Intelligence, pp. 363–369.
AAAI Press / The MIT Press, Austin, Texas (2000)
7. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable
stochastic domains. Artificial Intelligence 101(1-2), 99–134 (1998)
8. Munos, R., Moore, A.: Variable resolution discretization in optimal control. Machine Learn-
ing 49(2/3), 291–323 (2002)
9. Patrascu, R., Boutilier, C., Das, R., Kephart, J.O., Tesauro, G., Walsh, W.E.: New approaches
to optimization and utility elicitation in autonomic computing. In: Proceedings, The Twen-
tieth National Conference on Artificial Intelligence and the Seventeenth Innovative Appli-
cations of Artificial Intelligence Conference, pp. 140–145. AAAI Press / The MIT Press,
Pittsburgh, Pennsylvania, USA (2005)
10. Regan, K., Boutilier, C.: Regret-based reward elicitation for Markov decision processes. In: UAI 2009: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 444–451. AUAI Press, Arlington (2009)
11. Regan, K., Boutilier, C.: Robust policy computation in reward-uncertain mdps using non-
dominated policies. In: Fox, M., Poole, D. (eds.) AAAI, AAAI Press, Menlo Park (2010)
12. White III, C.C., Eldeib, H.K.: Markov decision processes with imprecise transition probabil-
ities. Operations Research 42(4), 739–749 (1994)
13. Xu, H., Mannor, S.: Parametric regret in uncertain Markov decision processes. In: 48th IEEE Conference on Decision and Control, CDC 2009 (2009)
14. Guestrin, C., Koller, D., Parr, R., Venkataraman, S.: Efficient solution algorithms for factored
MDPs. Journal of Artificial Intelligence Research 19, 399–468 (2003)
Label Noise-Tolerant Hidden Markov Models
for Segmentation: Application to ECGs

Benoît Frénay, Gaël de Lannoy, and Michel Verleysen

Machine Learning Group, ICTEAM Institute, Université catholique de Louvain


3 place du Levant, B-1348 Louvain-la-Neuve, Belgium

Abstract. The performance of traditional classification models can be adversely impacted by the presence of label noise in training observations. The pioneering work of Lawrence and Schölkopf tackled this issue in datasets with independent observations by incorporating a statistical noise model within the inference algorithm. In this paper, the specific case of label noise in non-independent observations is considered instead. For this purpose, a label noise-tolerant expectation-maximisation algorithm is proposed in the framework of hidden Markov models. Experiments are carried out on both healthy and pathological electrocardiogram signals with distinct types of additional artificial label noise. Results show that the proposed label noise-tolerant inference algorithm can improve the segmentation performance in the presence of label noise.

Keywords: label noise, hidden Markov models, expectation-maximisation algorithm, segmentation, electrocardiograms.

1 Introduction

In standard situations, supervised machine learning algorithms learn their pa-


rameters to fit previously labelled data, called training observations, as best as
possible. In real situations, however, it is difficult to guarantee perfect labelling,
e.g. because of the subjectivity of the labelling task, of the lack of information or
of communication noise. In particular, label errors are likely to arise in biomed-
ical applications involving the tedious and time-consuming labelling of a large
amount of data by one or several medical experts. The label noise issue is typi-
cally addressed in regression problems by assuming independent Gaussian noise
on the regression target. In classification problems, although standard algorithms
such as support vector machines are able to cope with outliers and feature noise
to some degree, the label noise issue is however mostly left untreated.
Previous work addressing the label noise issue incorporated a noise model
into a generative model which assumes independent and identically distributed
(i.i.d.) observations [1–3]. Nevertheless, this issue is mostly left untreated in the
case of models for the segmentation of sequential (non i.i.d.) observations such
as hidden Markov models (HMMs). In this work, a variant of HMMs which is

Gaël de Lannoy is funded by a Belgian F.R.I.A. grant.


robust to the label noise is proposed. To illustrate the relevance of the proposed
model, artificial electrocardiogram (ECG) signals generated using ECGSYN [4]
and real ECG recordings from the Physiobank database [5] are used in the
experiments. The label noise issue is indeed known to affect the segmentation of waveform boundaries by experts in ECG signals [6]. Nevertheless, the proposed
model also applies to any kind of sequential data facing the label noise issue, for
example biomedical signals such as EEGs, EMGs and many others.
This paper is organised as follows. Section 2 reviews related work. Section 3 in-
troduces hidden Markov models and two standard inference algorithms. Section
4 derives a new, label noise-tolerant algorithm. Section 5 quickly reviews elec-
trocardiogram signals and details the experimental settings. Finally, empirical
results are presented in Section 6 and conclusions are drawn in Section 7.

2 Related Work
Before presenting the state-of-the-art in classification with label noise, it is first
important to distinguish the label noise issue from the semi-supervised paradigm
where some data points in the training set are completely left unlabelled. Here,
we rather consider the framework where an unknown proportion of the observa-
tions are wrongly labelled. To our knowledge, existing approaches to this problem
are relatively few. These approaches can be divided into three categories: filtering
approaches, model-based approaches and plausibilistic approaches.
Filtering techniques act as a preprocessing of the training set to either remove
noisy observations or correct their labels. These methods involve the use of a
criterion to detect mislabelled observations. For example, [7] uses disagreement
in ensemble methods. Furthermore, [8] introduces an algorithm to iteratively
modify the examples whose class label disagrees with the class labels of most of
their neighbours. Finally, [9] uses information gain to detect noisy labels.
On the other hand, model-based approaches tackle the label noise by incor-
porating the mislabelling process as an integral part of the probabilistic model.
Pioneer work by [1] incorporated a probabilistic noise model in a kernel-Fisher
discriminant for binary classification. Later, [2] extended this model by relaxing
the Gaussian distribution assumption and carried out extensive experiments on
more complex datasets, which convincingly demonstrated the value of explicit
label noise modeling. More recently, the same model has been extended to multi-class datasets [10]. Bouveyron et al. also propose a distinct robust mixture discriminant analysis [3], which consists of two steps: (i) learning an unsupervised Gaussian mixture model and (ii) computing the probability that each cluster belongs to a given class.
Finally, plausibilistic approaches assume that the experts have explicitly
provided uncertainties over labels. Specific algorithms are then developed to
integrate and to focus on such uncertainties [11].
This work concentrates on model-based approaches to embed the noise pro-
cess into classifiers. Model-based approaches have a sound theoretical foundation
and tackle the noise issue in a more principled and transparent manner without

discarding potentially useful observations. Our contribution in this field is the


development of a label noise-tolerant hidden Markov model for labelling of se-
quential (non i.i.d.) observations.

3 Hidden Markov Models for Segmentation


This section introduces hidden Markov models for segmentation. Two widely
used inference algorithms are detailed: supervised learning and the Baum-Welch
algorithm. Their application to ECG segmentation is discussed in Section 5.

3.1 Hidden Markov Models


HMMs are probabilistic models of time series generating processes where two dis-
tinct sequences are considered: the states S1 . . . ST and observations O1 . . . OT .
Here T is the length of these sequences (see Fig. 1). At a given time step t, the
current observation Ot and the next state St+1 are considered to depend only on
the current state St . For example, in the case of ECGs, the process under study
is the human heart. Hence, states and observations correspond to the inner state
and electrical activity of the heart, respectively (see Section 5 for more details).

Fig. 1. Conditional dependencies in a hidden Markov model

Using the independence assumptions introduced above, an HMM is completely


specified by its set of parameters Θ = (q, a, b) where qi is the prior of state i,
aij is the transition probability from state i to state j and bi is the observation
distributions for state i [12]. Usually, bi is modelled by a Gaussian mixture
model (GMM) with parameters (πik , μik , Σik ) where πik , μik and Σik are the
prior, mean and covariance matrix of the kth Gaussian component, respectively.
Given a sequence of observations with expert annotations, the HMM inference
problem consists in learning the parameters Θ from data. The remaining of this
section presents two approaches for estimating the parameters. Once that an
HMM is inferred, the segmentation of new signals can be done using the Viterbi
algorithm, which looks for the most probable state sequence [12].

3.2 Algorithms for Inference


A simple solution to infer an HMM from data consists in assuming that the
expert annotations are correct and trustworthy. Given this assumption, q and
a are simply obtained by counting the state occurrences and transitions in the

data. Then, each observation distribution is fitted using the observations labelled
accordingly. This approach has the advantage of being simple to implement and
having a very low computational cost. However, if the labels are not perfect and
polluted by some label noise, the produced HMM may be significantly altered.
The Baum-Welch algorithm is another, unsupervised algorithm [12]. More
precisely, it assumes that the true labels are unknown, i.e. it ignores the ex-
pert annotations. The likelihood of the observations is maximised using an
expectation-maximisation (EM) scheme [13], since no closed-form maximum
likelihood estimator is available in this case. During the E step, the posteri-
ors P (St = i|O1 . . . OT ) and P (St−1 = i, St = j|O1 . . . OT ) are estimated for
each time step t and states i and j. Then, these posteriors are used during the
M step in order to estimate the prior vector q, the transition matrix a and the
observation distributions bi .
The main advantage of Baum-Welch is that wrong expert annotations should
have no impact on the inferred HMM. However, in practice, expert annotations
are used to compute an initial estimate of the HMM parameters, which is nec-
essary for the first E step. Moreover, ignoring expert annotations can also be a
disadvantage: if the expert uses a specific decomposition of the ECG dynamic,
such subtleties may be lost in the unsupervised learning process.

4 A Label Noise-Tolerant Algorithm

Two algorithms for HMM inference have been introduced in Section 3. However,
neither of them is satisfactory when label noise is introduced. On the one hand, supervised learning is bound to blindly trust the expert annotations. Therefore,
as shown in Section 6, label noise can degrade the segmentation quality. On
the other hand, the Baum-Welch algorithm fails to encode precisely the expert
knowledge. Indeed, as shown experimentally in Section 6, even predictions on
clean, easy-to-segment signals do not match accurately the expert annotations.
This section introduces a new algorithm for HMM inference which lies in-
between supervised learning and the Baum-Welch algorithm: expert annotations
are used, but the label noise is modelled during the inference process in order to
decrease the influence of wrong annotations.

4.1 Label Noise Modelling


Previous works showed the value of explicit label noise modelling for i.i.d. data
in classification [1–3]. Here, a similar approach is used for non-independent se-
quential data. Two distinct, yet related sequences of states are considered (see
Fig. 2): the sequence of observed, noisy annotations Y and the sequence of hid-
den, true labels S. In this paper, Yt is assumed to depend only on St , i.e. Yt is
a (possibly noisy) copy of St .
An additional quantity dij = P (Yt = j|St = i, Θ) is introduced for each
pair of states (i, j), which is called the annotation probability. In order to avoid
overfitting, the annotation probabilities take in this paper the restricted form

Fig. 2. Conditional dependencies in a label noise-tolerant hidden Markov model


    dij = 1 − pi             (i = j)
    dij = pi / (|S| − 1)     (i ≠ j)                                            (1)
where pi is the probability that the expert makes an error in state i and |S| is
the number of possible states. Hence dii = 1 − pi is the probability of correct
annotation in state i. Notice that dij is only used during inference. Here, Y is an
extra layer put on a standard HMM to model the label noise. For segmentation,
only the parameters linked to S and O are used, i.e. q, a, π, μ and Σ.
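As a concrete illustration, a minimal numpy sketch (ours) builds the matrix d of Equation 1 from a vector p of per-state expert error probabilities:

import numpy as np

def annotation_matrix(p):
    p = np.asarray(p, dtype=float)
    n = len(p)
    d = np.repeat((p / (n - 1))[:, None], n, axis=1)  # off-diagonal: p_i/(|S|-1)
    np.fill_diagonal(d, 1.0 - p)                      # diagonal: 1 - p_i
    return d  # d[i, j] = P(Y_t = j | S_t = i); each row sums to one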

4.2 Finding the HMM Parameters with a Label Noise Model


Finding an estimate for both the HMM and label noise model parameters is
achieved by maximising the incomplete log-likelihood

    log P(O, Y | Θ) = log Σ_S P(O, Y, S | Θ),                                   (2)

where the sum spans all possible sequences of true states. As a closed-form
solution does not exist, one can use the EM algorithm which is derived in the
rest of this section. Notice that only approximate solutions are obtained, for EM
algorithms are iterative procedures and may converge to local minima [13].

Definition of the Q(Θ, Θold) Function. The EM algorithm builds successive approximations of the incomplete log-likelihood in (2) and uses them to main-
tain an estimate of the parameters [12, 14]. In the settings introduced above, it
consists in alternatively (i) estimating the functional

    Q(Θ, Θold) = Σ_S P(S | O, Y, Θold) log P(O, Y, S | Θ)                       (3)

using the current estimate Θold (E step) and (ii) maximising Q(Θ, Θold ) with
respect to the parameters Θ in order to update their estimate (M step). Since


    P(O, Y, S | Θ) = q_{s1} ∏_{t=2}^{T} a_{s_{t−1} s_t} ∏_{t=1}^{T} b_{s_t}(ot) ∏_{t=1}^{T} d_{s_t yt},        (4)

where ot , yt , s1 , st−1 and st are the actual values taken by the random variables
Ot , Yt , S1 , St−1 and St , the expression of Q(Θ, Θold ) becomes
    Σ_{i=1}^{|S|} γ1(i) log qi + Σ_{t=2}^{T} Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} ξt(i, j) log aij
      + Σ_{t=1}^{T} Σ_{i=1}^{|S|} γt(i) log bi(ot) + Σ_{t=1}^{T} Σ_{i=1}^{|S|} γt(i) log d_{i yt}        (5)

where the posterior probabilities γ and ξ are defined as

    γt(i) = P(St = i | O, Y, Θold)                                              (6)

and

    ξt(i, j) = P(St−1 = i, St = j | O, Y, Θold).                                (7)

E Step. The γ and ξ variables must be computed in order to evaluate (5), which is necessary for the M step. In standard HMMs, these quantities are estimated during the E step by the forward-backward algorithm [12, 14]. Indeed, if forward variables α, backward variables β and scaling coefficients c are defined as

    αt(i) = P(St = i | O_{1...t}, Y_{1...t}, Θold)                              (8)

    βt(i) = P(O_{t+1...T}, Y_{t+1...T} | St = i, Θold)
            / P(O_{t+1...T}, Y_{t+1...T} | O_{1...t}, Y_{1...t}, Θold)          (9)

    ct = P(Ot, Yt | O_{1...t−1}, Y_{1...t−1}, Θold),                            (10)

one obtains

    γt(i) = αt(i) βt(i)                                                         (11)

and

    ξt(i, j) = α_{t−1}(i) c_t^{−1} aij bj(ot) d_{j yt} βt(j).                   (12)

Here, the scaling coefficients ct are introduced in order to avoid numerical issues. Indeed, for sufficiently large T (i.e. 10 or more), the dynamic range of both α and β will exceed the precision range of any machine. The scaling factors ct are therefore introduced to keep the values within reasonable bounds [12]. The incomplete likelihood can be computed using P(O, Y | Θold) = ∏_{t=1}^{T} ct.
The forward-backward algorithm consists in using the recursive relationship

    αt(i) ct = qi bi(o1) d_{i y1}                                   (t = 1)
    αt(i) ct = bi(ot) d_{i yt} Σ_{j=1}^{|S|} a_{ji} α_{t−1}(j)      (t > 1)     (13)

linking the α and c variables, and the recursive relationship

    βt(i) = 1                                                       (t = T)
    βt(i) = (1 / c_{t+1}) Σ_{j=1}^{|S|} aij bj(o_{t+1}) d_{j y_{t+1}} β_{t+1}(j)   (t < T)   (14)

linking the β and c variables. The scaling coefficients can be computed using the constraint Σ_{i=1}^{|S|} αt(i) = 1 jointly with (13).
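A minimal numpy sketch (ours, not the authors' code) of the scaled forward pass of Equation 13; the only difference with a standard HMM forward pass is the extra annotation factor d[i, y_t]. B is assumed to hold the precomputed emission likelihoods b_i(o_t):

import numpy as np

def forward_noisy(q, a, B, d, y):
    # q: state priors (n,); a: transition matrix (n, n)
    # B: emission likelihoods (T, n), with B[t, i] = b_i(o_t)
    # d: annotation probabilities (n, n); y: observed noisy labels (T,)
    T, n = B.shape
    alpha = np.zeros((T, n))
    c = np.zeros(T)
    unnorm = q * B[0] * d[:, y[0]]                       # Eq. (13), t = 1
    c[0] = unnorm.sum()
    alpha[0] = unnorm / c[0]
    for t in range(1, T):
        unnorm = B[t] * d[:, y[t]] * (alpha[t - 1] @ a)  # Eq. (13), t > 1
        c[t] = unnorm.sum()
        alpha[t] = unnorm / c[t]
    return alpha, c   # log-likelihood of (O, Y): np.log(c).sum()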

M Step. The values of γ and ξ computed during the E step can be used to maximise Q(Θ, Θold). Using (5), one obtains

    qi = γ1(i) / Σ_{i'=1}^{|S|} γ1(i')                                          (15)

and

    aij = Σ_{t=2}^{T} ξt(i, j) / Σ_{t=2}^{T} Σ_{j'=1}^{|S|} ξt(i, j')           (16)

for the state prior and transition probabilities. The GMM parameters become

    πil = Σ_{t=1}^{T} γt(i, l) / Σ_{t=1}^{T} γt(i),                             (17)

    μil = Σ_{t=1}^{T} γt(i, l) ot / Σ_{t=1}^{T} γt(i, l)                        (18)

and

    Σil = Σ_{t=1}^{T} γt(i, l) (ot − μil)ᵀ (ot − μil) / Σ_{t=1}^{T} γt(i, l),   (19)

where

    γt(i, l) = γt(i) πil bil(ot) / bi(ot).                                      (20)

Finally, the expert error probabilities are obtained using

    pi = Σ_{t | Yt ≠ i} γt(i) / Σ_{t=1}^{T} γt(i)                               (21)

and the annotation probabilities can be computed using (1).
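In vectorised form, the update (21) is a one-liner. A minimal numpy sketch (ours), assuming gamma has shape (T, |S|) and y holds the observed labels as integers:

import numpy as np

def expert_error_probabilities(gamma, y):
    n = gamma.shape[1]
    mismatch = np.asarray(y)[:, None] != np.arange(n)[None, :]  # Y_t != i
    return (gamma * mismatch).sum(axis=0) / gamma.sum(axis=0)   # p_i, Eq. (21)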

The EM Algorithm. The EM algorithm can be implemented using the equa-


tions detailed above. Θ must be initialised before the first E step. This problem
is already addressed in the literature for all the parameters, except d. A simple
solution, used in this paper, consists in initialising d using

    dij = 1 − pe             (i = j)
    dij = pe / (|S| − 1)     (i ≠ j)                                            (22)

where pe is a small probability of expert annotation error. Equivalently, one can


set pi = pe . For example, pe = .05 is used in the experiments in Section 6.

5 ECG Segmentation
This section (i) quickly reviews ECG segmentation and the use of HMMs in this
context and (ii) details the methodology used for the experiments in Section 6.

Fig. 3. Example of ECG signal, with annotations

5.1 ECG Signals


Electrocardiograms (ECGs) are periodic signals measuring the electrical activity
of the heart. These time series are typically associated with a sequence of labels, called annotations (see Fig. 3). Indeed, physicians distinguish different kinds of patterns called waves: P waves, QRS complexes and T waves. Moreover, physi-
cians talk about baselines when the signal is flat, outside of waves. Here, only
the B3 baseline between T and P waves is considered.
The ECG segmentation problem consists in predicting the labels for unlabeled
observations, using the annotated part of the signal. Indeed, ECGs usually last
for hours and it is of course impossible to annotate the entire signal manually.
In the context of ECG segmentation, (15) cannot be used directly. Indeed, only
one ECG is available for HMM inference: the ECG of the patient under treat-
ment. This is due to large inter-patient differences which prevent generalisation
from one patient to the other. Here, q is simply estimated as the percentage of
each observed annotation, as in the case of supervised learning.

5.2 State of the Art


One of the most widely used and successful tools for ECG segmentation is the HMM [6, 15]. Typically, each ECG is first filtered using a 3-30 Hz band-pass filter. Then it is transformed using a continuous wavelet transform (WT) with an order 2 coiflet wavelet. The dyadic scales from 2¹ to 2⁷ are kept in order to build the observations. Finally, the resulting observations may be normalised component-wise, which is done in this paper; a rough sketch of this preprocessing chain is given below. Fig. 4 shows the theoretical transitions in an HMM modelling an ECG.
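The following sketch (ours, not the original implementation) illustrates the chain; SciPy provides no continuous coiflet wavelet, so a Ricker wavelet stands in here for the order 2 coiflet:

import numpy as np
from scipy.signal import butter, filtfilt, cwt, ricker

def preprocess_ecg(signal, fs=250.0):
    # 3-30 Hz band-pass filtering (4th-order Butterworth, zero-phase)
    b, a = butter(4, [3.0, 30.0], btype='bandpass', fs=fs)
    filtered = filtfilt(b, a, signal)
    # wavelet transform at the dyadic scales 2^1 ... 2^7
    widths = 2.0 ** np.arange(1, 8)
    coeffs = cwt(filtered, ricker, widths)          # shape (7, T)
    # component-wise normalisation of each scale
    coeffs = (coeffs - coeffs.mean(axis=1, keepdims=True)) \
             / coeffs.std(axis=1, keepdims=True)
    return coeffs.T                                 # one 7-D observation per sample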
Label noise has already been considered by [16] in the ECG context by using
a semi-supervised approach. Annotations around boundaries are simply deleted,

Fig. 4. Theoretical transitions in an ECG signal



which results in an intermediate situation between supervised learning and the


Baum-Welch algorithm. Indeed, the remaining annotations are considered trust-
worthy and only the posteriors of the deleted annotations are estimated by EM.
The width of the deletion window has to be selected, whereas this paper uses a
noise model where the level of label noise is automatically estimated.

5.3 Experimental Settings


The two algorithms which are used for comparison are supervised learning and
the Baum-Welch algorithm. Each emission model uses a GMM with 5 compo-
nents. The EM algorithms are repeated 10 times and each repetition consists of
at most 300 iterations. The initial mean of each GMM component is randomly
chosen among the data in the corresponding class; the initial covariance matrix
of each GMM component is set as a small multiple of the covariance matrix of
the corresponding class.
Three classes of ECGs are used. Firstly, a set of 10 artificial ECGs is generated and annotated using the popular ECG waveform generator ECGSYN [4]. Secondly, 10 real ECGs are selected from the sinus MIT-QT database from Physiobank [5]. These Holter ECG recordings have been manually annotated by cardiologists with waveform boundaries for 30 to 50 selected beats in each recording. All recordings are sampled at 250 Hz. These ECGs were measured on real patients, but they are quite clean and easy to segment, as the patients were healthy. Thirdly, 10 ECGs are selected from the arrhythmia MIT-QT database from Physiobank [5]. These ECGs are more difficult to segment and the annotations are probably less reliable. Indeed, these patients were being treated for cardiac diseases and their ECGs often differ significantly from text-book ECGs. Only P waves, QRS complexes, T waves and B3 baselines are annotated.
Each ECG is segmented before and after the addition of artificial label noise.
Different types and levels of artificial label noise are added to the annotations
in order to test the robustness of each algorithm. The two label noises which are
used here are called horizontal and uniform noise:
– The horizontal noise moves the boundaries of P and T waves by a random number of milliseconds drawn from a uniform distribution. This type of noise is particularly interesting in the context of ECG segmentation since it mimics the errors made by medical experts in practice. The uniform distribution used in the experiments is symmetric around zero and its half-width is a given percentage of the considered wave duration.
– The uniform noise consists in randomly flipping a given percentage of the labels. This type of noise is the same as in previous experiments [1]; a sketch of this noise injection is given after this list.
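A minimal sketch (ours) of the uniform noise; the text only states that a given percentage of labels is flipped, so the replacement label is assumed here to be drawn uniformly among the other classes:

import numpy as np

def add_uniform_noise(labels, rate, n_classes=4, rng=np.random.default_rng()):
    # labels are assumed to be integers in {0, ..., n_classes - 1}
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < rate
    # adding a random offset in {1, ..., n_classes - 1} guarantees a new label
    offsets = rng.integers(1, n_classes, size=flip.sum())
    noisy[flip] = (noisy[flip] + offsets) % n_classes
    return noisy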
For each experiment, two measures are given: the average recall and precision.
For each wave, the recall is the percentage of observations belonging to that
wave which are correctly classified. The precision is the percentage of predicted
labels which are correct, for a given label. Both measures are estimated using
the ECGSYN annotations for the artificial ECGs, whereas human expert an-
notations are used for the real ECGs. In Section 6, recalls and precisions are systematically averaged over the four possible labels (P, QRS, T and B3).

For each algorithm, ECGs are split into training and test sets. The training
set is used to learn the HMM, whereas the test set allows testing the HMM on
independent data. For artificial ECGs, 10% of the signal is used for training,
whereas the remaining 90% is used for test. For real ECGs, 50% of the signal is
used for training, whereas the remaining 50% is used for test. This way, the sizes of the training sets are roughly equal for both artificial and real ECGs.

6 Experimental Results
This section compares the label noise-tolerant algorithm proposed in Section 4 to
the two standard algorithms described in Section 3. The tests are carried out on
three classes of ECGs, which are altered by two types of label noise. See Section
5 for more details about ECGs and the methodology used in this section.

6.1 Noise-Free Results


Tables 1 and 2 respectively show the recalls and precisions obtained on test
beats for artificial, sinus and arrhythmia ECGs using supervised learning, Baum-
Welch and the proposed algorithm. The annotations are the original annotations,
without additional noise. Each ECG signal is segmented 40 times in order to
evaluate the variability of the results. The results in the first three rows average the results of all runs for all ECGs, whereas the other rows average the results of all runs for two selected ECGs. For the two selected ECGs, standard deviations are given. The standard deviations shown on the first three lines are the average of
the standard deviations obtained for each ECG.
Results show that if one completely discards the available labels (i.e. with
the Baum-Welch algorithm), the information loss is important. Indeed, the un-
supervised algorithm always achieves significantly lower recalls and precisions.
The results in terms of recall and precision are approximately equal for the

Table 1. Recalls on original artificial, sinus and arrhythmia ECGs for supervised
learning, Baum-Welch and the proposed algorithm

                        supervised       Baum-          proposed
                        learning         Welch          algorithm
average     artificial  95.21 ± 0.31     89.17 ± 2.20   95.14 ± 0.43
            sinus       95.35 ± 0.28     92.71 ± 1.60   95.60 ± 0.59
            arrhythmia  89.51 ± 0.69     82.48 ± 2.57   89.10 ± 0.90
ECG 1       artificial  94.98 ± 0.24     88.09 ± 2.33   94.87 ± 0.40
            sinus       95.75 ± 0.30     92.28 ± 1.50   96.44 ± 0.44
            arrhythmia  94.36 ± 0.46     81.77 ± 3.36   92.80 ± 1.79
ECG 2       artificial  93.34 ± 0.88     87.44 ± 3.17   93.50 ± 1.04
            sinus       95.88 ± 0.14     93.43 ± 0.74   96.01 ± 0.29
            arrhythmia  90.07 ± 0.35     88.01 ± 1.42   91.22 ± 0.46

Table 2. Precisions on original artificial, sinus and arrhythmia ECGs for supervised
learning, Baum-Welch and the proposed algorithm

                        supervised       Baum-          proposed
                        learning         Welch          algorithm
average     artificial  95.53 ± 0.27     87.58 ± 3.06   95.34 ± 0.39
            sinus       95.86 ± 0.26     89.85 ± 2.23   94.38 ± 1.14
            arrhythmia  87.28 ± 0.75     77.37 ± 2.73   84.56 ± 1.45
ECG 1       artificial  95.51 ± 0.17     86.16 ± 3.10   95.14 ± 0.43
            sinus       96.81 ± 0.19     91.89 ± 1.50   95.96 ± 0.89
            arrhythmia  95.11 ± 0.96     72.13 ± 3.15   87.08 ± 4.03
ECG 2       artificial  94.76 ± 0.63     86.17 ± 4.26   94.67 ± 0.73
            sinus       96.42 ± 0.11     91.33 ± 1.87   95.82 ± 0.56
            arrhythmia  88.25 ± 0.32     86.14 ± 0.05   89.51 ± 0.49

proposed algorithm and supervised learning. One exception is the precision on


arrhythmia signals, where the labels themselves are less reliable, making the
performance assessment less reliable too.

6.2 Results with Horizontal Noise


Fig. 5, 6 and 7 show the results obtained for artificial, sinus and arrhythmia
ECGs, respectively. The annotations are polluted by horizontal noise, with the
maximum boundary movement varying from 0% to 50% of the modified wave.
For each figure, the first row shows the recall, whereas the second row shows the
precision, both obtained on test beats. Each ECG signal is noised and segmented
40 times in order to evaluate the variability of the results. The curves in the first
column average the results of all runs for all ECGs, whereas the curves in the
second and third columns average the results of all runs for two selected ECGs.
For the two last plots of each row, the error bars show the 95 % confidence
interval around the mean on the 40 runs. The error bars shown on the first plot
of each line are the average of the error bars obtained for each ECG.
Again, the performances of the unsupervised algorithm are the worst ones
for small levels of noise. However, the results obtained using Baum-Welch seem
to be less affected by the label noise. In most cases, the unsupervised algorithm
achieves better results than supervised learning for large levels of noise. The effect
of the noise level is probably due to the fact that the EM algorithm is initialised
using the labelled observations. Therefore, the final result is also influenced by
the label noise, for it depends on the initial starting point.
The performances of supervised learning and the label noise-tolerant algo-
rithm are both affected by the increasing label noise. However, for large levels of
noise, the label noise-tolerant algorithm achieves significantly better recalls and
precisions than supervised learning. Supervised learning is only better in terms
of precision for low levels of noise. Since the horizontal noise mimics errors made
by medical experts, the above results suggest that using the proposed algorithm
can improve the segmentation quality when the expert is not fully reliable.

Fig. 5. Recalls and precisions on artificial ECGs with horizontal noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of the maximum boundary move-
ment (0% to 50% of the modified wave). See text for details.

Fig. 6. Recalls and precisions on sinus ECGs with horizontal noise for supervised learn-
ing (black plain line), Baum-Welch (grey plain line) and the proposed algorithm (black
dashed line), with respect to the percentage of the maximum boundary movement (0%
to 50% of the modified wave). See text for details.

6.3 Results with Uniform Noise

Fig. 8, 9 and 10 show the recalls and precisions obtained for artificial, sinus and arrhythmia ECGs, respectively. The annotations are polluted by uniform noise,
with a percentage of flipped labels varying from 0% to 20%. For each figure, the
first row shows the recall, whereas the second row shows the precision, both
obtained on test beats. Each ECG signal is noised and segmented 40 times in

Fig. 7. Recalls and precisions on arrhythmia ECGs with horizontal noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of the maximum boundary move-
ment (0% to 50% of the modified wave). See text for details.

Fig. 8. Recalls and precisions on artificial ECGs with uniform noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of flipped labels (0% to 20%). See
text for details.

order to evaluate the variability of the results. The curves in the first column
average the results of all runs for all ECGs, whereas the curves in the second and
third columns average the results of all runs for two selected ECGs. For the two
last plots of each row, the error bars show the 95 % confidence interval around
the mean on the 40 runs. The error bars shown on the first plot of each line are
the average of the error bars obtained for each ECG.

Fig. 9. Recalls and precisions on sinus ECGs with uniform noise for supervised learning
(black plain line), Baum-Welch (grey plain line) and the proposed algorithm (black
dashed line), with respect to the percentage of flipped labels (0% to 20%). See text for
details.

Fig. 10. Recalls and precisions on arrhythmia ECGs with uniform noise for supervised
learning (black plain line), Baum-Welch (grey plain line) and the proposed algorithm
(black dashed line), with respect to the percentage of flipped labels (0% to 20%). See
text for details.

As for horizontal noise, the performances of Baum-Welch are significantly


worse and decrease as the percentage of label noise increases. For the proposed
algorithm, the recall and precision seem to be almost unaffected by the increas-
ing level of label noise. For supervised learning, the recall and precision slowly
decrease as the label noise increases. In terms of both recall and precision, the
label noise-tolerant algorithm performs better than supervised learning when
the level of noise is larger than 5%.

7 Conclusion

In this paper, a variant of the EM algorithm for label noise-tolerant HMM in-
ference is proposed. More precisely, each observed label is assumed to be a noisy
copy of the true, unknown state. The proposed EM algorithm relies on two steps
to automatically estimate the level of noise in the set of available labels. First,
during the E step, the posterior of the hidden state is estimated for each sam-
ple. Next, the M step computes the HMM parameters using the hidden true
states, and not the noisy labels themselves, which results in a model which is
less impacted by label noise.
Experiments are carried out on both healthy and pathological ECG signals artificially polluted by distinct types of label noise. Three types of inference algorithms
for HMMs are compared: supervised learning, the Baum-Welch algorithm and
the proposed noise-tolerant algorithm. The results show that the performances
of the three approaches are adversely impacted by the level of label noise. How-
ever, the proposed noise-tolerant algorithm can yield better performances than
the other two algorithms, which confirms the benefit of embedding the noise pro-
cess into the inference algorithm. This improvement is particularly pronounced
when the artificial label noise mimics errors made by medical experts, which
suggests that the proposed algorithm could be useful when expert annotations
are less reliable. The recall is improved for any label noise level, and the precision
is improved for large levels of noise.

References
1. Lawrence, N.D., Schölkopf, B.: Estimating a kernel Fisher discriminant in the pres-
ence of label noise. In: Proceedings of the Eighteenth International Conference on
Machine Learning, ICML 2001, pp. 306–313. Morgan Kaufmann Publishers Inc,
San Francisco (2001)
2. Li, Y., Wessels, L.F.A., de Ridder, D., Reinders, M.J.T.: Classification in the pres-
ence of class noise using a probabilistic kernel Fisher method. Pattern Recogni-
tion 40, 3349–3357 (2007)
3. Bouveyron, C., Girard, S.: Robust supervised classification with mixture mod-
els: Learning from data with uncertain labels. Pattern Recognition 42, 2649–2658
(2009)
4. McSharry, P.E., Clifford, G.D., Tarassenko, L., Smith, L.A.: Dynamical model for
generating synthetic electrocardiogram signals. IEEE Transactions on Biomedical
Engineering 50(3), 289–294 (2003)
5. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark,
R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, Phys-
ioToolkit, and PhysioNet: Components of a new research resource for complex
physiologic signals. Circulation 101(23), e215–e220 (2000)
6. Hughes, N.P., Tarassenko, L., Roberts, S.J.: Markov models for automated ECG
interval analysis. In: NIPS 2004: Proceedings of the 16th Conference on Advances
in Neural Information Processing Systems, pp. 611–618 (2004)
7. Brodley, C.E., Friedl, M.A.: Identifying Mislabeled Training Data. Journal of Ar-
tificial Intelligence Research 11, 131–167 (1999)

8. Barandela, R., Gasca, E.: Decontamination of training samples for supervised pat-
tern recognition methods. In: Proceedings of the Joint IAPR International Work-
shops on Advances in Pattern Recognition, pp. 621–630. Springer, London (2000)
9. Guyon, I., Matic, N., Vapnik, V.: Discovering informative patterns and data clean-
ing. In: Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.)
Advances in Knowledge Discovery and Data Mining, pp. 181–203 (1996)
10. Bootkrajang, J., Kaban, A.: Multi-class classification in the presence of labelling
errors. In: Proceedings of the 19th European Conference on Artificial Neural Net-
works, pp. 345–350 (2011)
11. Côme, E., Oukhellou, L., Denoeux, T., Aknin, P.: Mixture model estimation with
soft labels. In: Proceedings of the 4th International Conference on Soft Methods
in Probability and Statistics, pp. 165–174 (2008)
12. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
13. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum Likelihood from Incomplete
Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B
(Methodological) 39(1), 1–38 (1977)
14. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and
Statistics), 1st ed. 2006. corr. 2nd printing edition. Springer, Heidelberg (2007)
15. Clifford, G.D., Azuaje, F., McSharry, P.: Advanced Methods And Tools for ECG
Data Analysis. Artech House, Inc., Norwood (2006)
16. Hughes, N.P., Roberts, S.J., Tarassenko, L.: Semi-supervised learning of probabilis-
tic models for ECG segmentation. In: IEMBS 2004: Proceedings of the 26th Annual
International Conference of the IEEE Engineering in Medicine and Biology Society,
vol. 1, pp. 434–437 (2004)
Building Sparse Support Vector Machines for
Multi-Instance Classification

Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang

Monash University, Churchill VIC 3842, Australia


{zhouyu.fu,guojun.lu,kaiming.ting,dengsheng.zhang}@monash.edu

Abstract. We propose a direct approach to learning sparse Support


Vector Machine (SVM) prediction models for Multi-Instance (MI) clas-
sification. The proposed sparse SVM is based on a “label-mean” for-
mulation of MI classification which takes the average of predictions of
individual instances for bag-level prediction. This leads to a convex op-
timization problem, which is essential for the tractability of the opti-
mization problem arising from the sparse SVM formulation we derived
subsequently, as well as the validity of the optimization strategy we em-
ployed to solve it. Based on the “label-mean” formulation, we can build
sparse SVM models for MI classification and explicitly control their spar-
sities by enforcing the maximum number of expansions allowed in the pre-
diction function. An effective optimization strategy is adopted to solve
the formulated sparse learning problem which involves the learning of
both the classifier and the expansion vectors. Experimental results on
benchmark data sets have demonstrated that the proposed approach is
effective in building very sparse SVM models while achieving comparable
performance to the state-of-the-art MI classifiers.

1 Introduction
Multi-instance (MI) classification is a paradigm in supervised learning first in-
troduced by Dietterich [1]. In a MI classification problem, training examples are
presented in the form of bags associated with binary labels. Each bag contains
a collection of instances, whose labels are not perceived in priori. The key as-
sumption for MI classification is that a positive bag contains at least one positive
instance and a negative bag contains only negative instances.
We focus on SVM-based methods for MI classification. Existing methods in-
clude the MI-kernel [2], MI-SVM [3], mi-SVM [3], a regularization approach to
MI [4], MILES [5], MissSVM [6], and KI-SVM [7]. Most existing approaches are
based on the modification of standard SVM models and kernels in the single
instance (SI) case to adapt to the MI setting. The SVM classifiers obtained from
these methods take a similar analytic form as standard kernel SVMs. The pre-
diction function can be expressed by a generalized linear model with predictor
variables given by the kernel values evaluated for the Support Vectors (SVs),
training instances with nonzero coefficients in the expansion. The speed of SVM
prediction is proportional to the number of SVs. The sparser the SVM model


(i.e. the smaller the number of SVs), the more efficient the prediction. The issue
is more serious for MI classification for two reasons. Firstly, the number of SVs
in the trained classifier largely depends on the number of training instances.
Even with a moderate-size MI problem, a large number of instances may be
encountered, consequently leading to a non-sparse SVM classifier with many
SVs. Secondly, in MI classification, the prediction function is usually defined at
instance-level either explicitly [3] or implicitly [2]. In order to predict the bag
label, one needs to apply the classifier to each instance in the bag. Reducing a
single SV in the prediction function would result in savings proportional to the
size of the bag. Thus it is more important in MI prediction to remove redun-
dant SVs for efficiency in prediction. For practical MI applications where fast
prediction speed is required, it is highly desirable to have sparse SVM prediction
models. Existing SVM solutions to MI classification are unlikely to produce a
sparse prediction model in this respect. MILES [5] is the only exception, which
produces a sparser solution than other methods, by using an L1 norm on the co-
efficients to encourage the sparsity of the solution. However, it cannot explicitly
control the sparsity of the learned classifier.
In this paper, we propose a principled approach to learning sparse SVM for MI
classification. The proposed approach can achieve controlled sparsity while main-
taining competitive predictive performance as compared to existing non-sparse
SVM models for MI classification. To the best of our knowledge, this is the
first explicit formulation for sparse SVM learning in the MI setting. In the con-
text of SI classification, sparse SVMs have attracted much attention in the re-
search community [8–12]. The major difficulty in extending sparse learning to the MI scenario is the complexity of MI classification problems compared to SI ones. The MI assumption has to be explicitly taken into account in SVM modeling. This leads to complicated SVM formulations with extra non-convex constraints and hence difficult optimization problems. To enable sparse MI learning, additional constraints need to be imposed. This further complicates the modeling and leads to intractable optimization problems. Alternatively,
one can use some pre- [9] or post-processing [10] schemes to empirically learn
a sparse MI SVM. These schemes do not consider the special nature of an MI
classification problem and are likely to yield inferior performance for prediction.
The method most closely related to ours was proposed in Wu et al. [12] for
the SI case. Nevertheless, their method cannot be directly applied to the MI
case since the base model they considered is the standard SVM classifier. It
is well-known that learning a standard SVM is equivalent to solving a convex
quadratic problem [10] with the guarantee of unique global minimum. This is the
pre-requisite for the optimization strategy employed in [12]. In contrast, existing
SVM solutions to MI classification are either non-convex [3, 4, 6] or based on
complex convex approximations [7]. It is non-trivial to extend the sparse learning
framework [12] to MI classification based on the existing formulations. To this
end, we adopt a simple “label-mean” formulation which takes the average value
of instance label predictions for bag-level prediction. This is contrary to most
existing formulations which make bag-level prediction by taking the maximum
of instance predictions as indicated by the MI assumption. However, we justify
the validity of the use of “label-mean” by showing theoretically that target MI
concepts can be correctly classified via the proper choice of instance-level kernel
function. This is established by linking it with the MI-kernel [2] and extending
the theoretical results in [2]. Departing from the “label-mean” formulation, the
sparse SVM for MI learning can then be formulated by imposing additional
constraint on the classifier weight vector to explicitly control the complexity
of the resulting prediction function. The proposed approach can simultaneously
learn the MI SVM classifier as well as select the expansion vectors in the classifier
function in a single discriminative framework. A similar optimization scheme to
[12] can then be employed to effectively solve the newly formulated problem.

2 “Label-Mean” Formulation for MI Classification


In MI classification, we are given m training bags {B1 , . . . , Bm } and their cor-
responding labels {y1 , . . . , ym } with yi ∈ {−1, 1}, where Bi = {xi,1 , . . . , xi,ni }
is the ith bag containing ni instances. Each instance xi,p in the bag is also
associated with a hidden label yi,p . By the MI assumption, yi,p = −1 for all
p ∈ {1, . . . , ni } if yi = −1 and yi,p = 1 for some p if yi = 1. Note that only bag
labels are available for training and instance labels are not observable from the
data. The relation between them can be captured by a single equation as follows
    y_i = \max_{p=1}^{n_i} y_{i,p} \quad \forall i \qquad (1)

This naturally inspires a bottom-up “label-max” formulation for MI classifi-
cation, which forms the backbone for many existing SVM-based formulations
[3, 4, 6]. In this formulation, one aims at learning a prediction model at instance
level and making the bag prediction by taking the maximum value of predic-
tions obtained from individual instances. Specifically, the following condition is
encoded for bag Bi in the SVM objective function either as hard [3, 6] or soft
constraints [4] in accordance with the MI assumption in Equation 1
    F(B_i) = \max_{p=1}^{n_i} f(x_{i,p}) \qquad (2)

where f and F denote instance and bag-level prediction functions respectively.


As a result, the above constraints lead to a non-convex and non-differentiable
SVM model for MI classification and consequently result in a hard optimization
problem. One has to resort to heuristics [3] or complicated optimization routines
like the Concave-Convex Procedure (CCCP) [4, 6] to solve the hard optimization
problems arising from these SVM models.
Here, we adopt an alternative SVM formulation for the MI classification prob-
lem. Instead of taking the maximum of instance predictions for bag-level predic-
tion, we advocate the use of average operation over instance predictions. This
leads to the following constraint for bag-level prediction
    F(B_i) = \frac{1}{n_i} \sum_{p=1}^{n_i} f(x_{i,p}) \qquad (3)
Unlike the “label-max” constraints in Equation 2, the above constraints lead to
a convex formulation for learning MI classifiers, which we term the “label-mean”
formulation in the following.
We now formally define the optimization problem for the “label-mean” formu-
lation. Let Ω denote the space of instance features, and H denote the Reproduc-
ing Kernel Hilbert Spaces (RKHS) of functions f : Ω → R induced by the kernel
k : Ω × Ω → R. As with the case of standard SVM, we consider instance-level
linear prediction functions in the RKHS. Hence, we have
    f(x) = w^T \phi(x) + b \qquad (4)
where φ(x) ∈ H is the projection of instance x from input space Ω into the
RKHS, w ∈ H is the weight vector for the linear model in RKHS, and b is the
bias term. Note that both w and φ(x) can be infinite dimensional depending on
the mapping induced by the corresponding kernel k. The following optimization
problem in RKHS is solved for MI classification
    \min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \ell(y_i, F(B_i)) \qquad (5)

The first and second terms in the above objective function correspond to the
regularization and data terms respectively, with parameter C controlling
the trade-off between them. \ell(y_i, F(B_i)) is the bag-level loss function penal-
izing the discrepancy between bag label yi and prediction F (Bi ) as given by
Equation 3. We stick to 2-norm SVM throughout the paper, which uses the
following squared Hinge loss function
    \ell(y_i, F(B_i)) = \max\left(0, 1 - y_i F(B_i)\right)^2 \qquad (6)
The purpose of the formulation in Equation 5 is to learn an instance-level pre-
diction function f which gives good performance on bag-level prediction based
on the “label-mean” relation in Equation 3. The objective function is continu-
ously differentiable and convex with a unique global minimum. Both properties are
required for the successful application of the optimization strategy we used for
solving the sparse SVMs proposed in the next section based on the “label-mean”
formulation. The standard 1-norm SVM with the non-differentiable Hinge loss
can also be used here, which was adopted in Wu et al. [12] for solving sparse
SVM for SI learning. However, in this case, the optimization problem has to be
solved in the dual formulation. This is computationally much more expensive
than the primal formulation as we will discuss later.
The main difference between the proposed formulation and existing ones
[3, 4, 6] is in the relation between F (Bi ), bag-level prediction, and f (xi,p )’s,
the prediction values for instances. We have made a small but important modifi-
cation here by substituting the constraints in Equation 2 with those in Equation
3. This may look like a violation of the MI assumption at first. Nev-
ertheless, the rationale for taking the average will become self-evident once we
reveal the relationship between the proposed formulation and the existing MI
kernel method [2]. Specifically, we have the following lemma.
Lemma 1. Training the SVM in Equation 5 is equivalent to training a standard 2-norm SVM at bag level with the normalized set kernel

    k_{set}(B_i, B_j) = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} k(x_{i,p}, x_{j,q}),

where k is the instance-level kernel.
Proof. By substituting Equations 3 and 4 into Equation 5 and introducing slack
variables ξi ’s for the square Hinge losses, we can convert the unconstrained
problem into a constrained quadratic program in the following
    \min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i^2 \qquad (7)

    \text{s.t.} \quad y_i \left( \frac{1}{n_i} \sum_{p=1}^{n_i} w^T \phi(x_{i,p}) + b \right) \ge 1 - \xi_i \quad \forall i

By making use of the Lagrangian and KKT conditions, we can derive the dual
of the above problem in the following¹
    \max_{\alpha \ge 0} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i y_i \alpha_j y_j K_{i,j} \qquad (8)

    \text{s.t.} \quad \sum_i y_i \alpha_i = 0

with K_{i,j} = k_{set}(B_i, B_j) + \frac{1}{2C} \delta_{i,j}. This is exactly a 2-norm SVM for bags with kernel specified by k_{set}.
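To make Lemma 1 concrete, the following minimal sketch (Python with NumPy; the helper names gaussian_kernel and set_kernel are our own, not from the paper) computes the normalized set kernel between two collections of bags, each bag being an array of instance feature vectors:

    import numpy as np

    def gaussian_kernel(X, Y, gamma):
        # Instance-level Gaussian kernel matrix, k(x, z) = exp(-gamma * ||x - z||^2)
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)

    def set_kernel(bags_a, bags_b, gamma):
        # Normalized set kernel of Lemma 1:
        # k_set(B_i, B_j) = (1 / (n_i n_j)) * sum_{p,q} k(x_{i,p}, x_{j,q}),
        # i.e. the mean of the instance-level kernel matrix between two bags.
        K = np.zeros((len(bags_a), len(bags_b)))
        for i, Bi in enumerate(bags_a):
            for j, Bj in enumerate(bags_b):
                K[i, j] = gaussian_kernel(Bi, Bj, gamma).mean()
        return K

Feeding this Gram matrix to any standard 2-norm SVM solver that accepts a precomputed kernel then trains the “label-mean” classifier at bag level, as the lemma states.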
With the above lemma, we can proceed to show the following theorem, which
justifies the use of the mean operation for bag-level prediction in Equation 3.
Theorem 1. If positive and negative instances are separable with margin \epsilon with respect to feature map \phi in the RKHS induced by kernel k, then for sufficiently large integer r, positive and negative bags are separable with margin \epsilon using the bag-level prediction function in Equation 3 with instance kernel k^r.
Proof. The proof follows directly from Lemma 4.2 in [2], which states that if positive and negative instances are separable with margin \epsilon with respect to kernel k, then positive and negative bags can be separated with the same margin by the following MI kernel

    k_{MI}(B_i, B_j) = \frac{1}{|B_i||B_j|} \sum_{p=1}^{|B_i|} \sum_{q=1}^{|B_j|} k^r(x_{i,p}, x_{j,q}) \qquad (9)

This is basically the normalized set kernel for kernel k^r at instance level. The
conclusion is then established by the equivalence between the set kernel and the
proposed “label-mean” formulation, which takes the average of instance labels
for bag prediction, as shown in the previous lemma.
¹ The detailed steps are quite standard and similar to the derivation of the dual of 2-norm SVMs, and are thus omitted here.
Theorem 1 establishes the validity of the “label-mean” formulation for MI clas-
sification. Intuitively, if positive and negative instances are linearly separable,
we can build an SVM classifier based on the “label-mean” formulation with an
order-r polynomial kernel to guarantee separability at bag-level. In the nonlinear
case, if instances are separable with the Gaussian kernel, bags are separable with
a Gaussian kernel with a larger scale parameter in the formulation.
The weight vector w of the SVM classifier can be expressed in terms of the
dual variables αi ’s from the solution of the dual formulation in Equation 8 by
the KKT condition
    w = \sum_i \alpha_i y_i \frac{1}{n_i} \sum_{p=1}^{n_i} \phi(x_{i,p}) \qquad (10)
By substituting the equation into Equation 4, we have the following instance
and bag-level classifiers for the proposed formulation
    f(x) = \sum_i \alpha_i y_i \frac{1}{n_i} \sum_{p=1}^{n_i} k(x_{i,p}, x) + b \qquad (11)

    F(B) = \sum_i \alpha_i y_i k_{set}(B_i, B) + b \qquad (12)
         = \frac{1}{|B|} \sum_{q=1}^{|B|} \sum_i \alpha_i y_i \frac{1}{n_i} \sum_{p=1}^{n_i} k(x_{i,p}, x_q) + b

The complexity of the above prediction functions depends on the number of
nonzero αi ’s as well as the number of instances in each bag. A single nonzero
dual variable would render all instances in the corresponding bag to be SVs.
Thus the total number of SVs is given by
    N_{sv} = \sum_{i=1, \alpha_i \neq 0}^{m} n_i \qquad (13)

which is proportional to the number of instances in the training set. Moreover,
to make a bag-level prediction, one needs to evaluate f (x) for all instances in
the bag and average the instance prediction values. This creates a more adverse
situation for MI prediction as compared to the SI instance case with SVM. To
make prediction faster, it is essential to have a model with fewer vectors in the
expansion in Equation 11.
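As an illustration of this cost, below is a sketch of the bag-level prediction of Equation 12 (reusing the gaussian_kernel helper above; all names are ours):

    import numpy as np

    def bag_predict(B, train_bags, alpha, y, b, gamma):
        # F(B) = sum_i alpha_i * y_i * k_set(B_i, B) + b  (Equation 12).
        # Every instance of every bag with alpha_i != 0 enters the kernel
        # evaluations, so the cost scales with N_sv (Equation 13) times |B|.
        F = b
        for a_i, y_i, B_i in zip(alpha, y, train_bags):
            if a_i != 0.0:
                F += a_i * y_i * gaussian_kernel(B_i, B, gamma).mean()
        return F  # sign(F) gives the predicted bag label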

3 Sparse SVM for MI Classification


A straightforward approach to achieve a sparse prediction model is to fit the
trained classifier with a new classifier with reduced complexity. The Reduced
Set (RS) is one such method [10]. It approximates the weight vector in Equation
10 with the following new weight vector
    w = \sum_{k=1}^{N_{xv}} \beta_k \phi(z_k) \qquad (14)
where z_k's are Expansion Vectors (XV)² whose feature maps \phi(z_k)'s form the basis of the linear expansion for w, and N_{xv} denotes the number of XVs. Let \beta = [\beta_1, \dots, \beta_{N_{xv}}] be the vector of expansion coefficients and Z = [z_1^T, \dots, z_{N_{xv}}^T]^T the concatenation of XVs in a column vector. Both can be found by solving the following optimization problem
    (\beta, Z) = \arg\min_{\beta, Z} \left\| \sum_{k=1}^{N_{xv}} \beta_k \phi(z_k) - \sum_i \alpha_i y_i \frac{1}{n_i} \sum_{p=1}^{n_i} \phi(x_{i,p}) \right\|^2 \qquad (15)

This problem can be solved by a greedy approach developed in [10].


Despite the simplicity of RS, it is basically a postprocessing method used
for fitting any trained model. It does not produce classifier models by itself.
Moreover, RS simply minimizes fitting error in model learning without utilizing
any discriminant information. This may not be desirable for low sparsity cases
when the fitting errors are high.
Alternatively, a sparse model can be learned directly from the SVM formula-
tion by imposing the L1 norm on the coefficients due to its sparsity preserving
property. This strategy was employed by MILES [5], a popular SVM method
for MI classification. However, we are still confronted with the trade-off between
sparsity and accuracy. This is because sparsity is not explicitly imposed on the
model. Therefore for difficult problems, either sparsity is sacrificed for better
accuracy, or the other way around.

3.1 Model

Based on the aforementioned concerns, we propose a direct approach for building
sparse SVM models for MI classification based on the formulation presented in
the previous section. This is achieved by adding an explicit constraint to control
the complexity of linear expansion for the weight vector w in Equation 10.
In this way, we can control the number of kernel evaluations involved in the
computation of prediction functions in Equations 11 and 12. Specifically, we aim
at approximating w with a reduced set of feature maps while maintaining the
large margin of the SVM objective function. The new optimization problem can
thus be formulated as
    \min_{Z,w,b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \ell\left( y_i, \frac{1}{n_i} \sum_{p=1}^{n_i} \left( w^T \phi(x_{i,p}) + b \right) \right) \qquad (16)

Intuitively, we can view the above formulation as searching for the optimal so-
lution that minimizes the bag-level loss in a subspace spanned by φ(zk )’s in the
RKHS H induced by instance kernel k instead of the whole RKHS. By directly
specifying the number of XVs Nxv in w, the optimal solution is guaranteed to
² We have adopted the same terminology here as in [12] for the same reason. Technically, an XV, which can be an arbitrary point in the instance space, is different from an SV, which must be chosen from an existing instance in the training set.
reside in a subspace whose dimension is no larger than Nxv . The above formu-
lation can be regarded as a joint optimization problem that toggles between
searches for the optimal subspace and for the optimal solution in the subspace.
With N_{xv} \ll N_{sv}, we can build a much sparser model for MI prediction.
Let KZ denote the Nxv ×Nxv Gram matrix for XVs zk ’s, with the (i, j)th entry
given by k(zi , zj ), KBi ,Z denote the ni × Nxv Gram matrix between instances
in bag i and XVs, with the (p, j)th entry given by k(xi,p , zj ), and 1ni be the
ni dimensional column vector with value of 1 for each element. We can rewrite
optimization problem in Equation 16 in terms of β by substituting Equation
14 into the cost function in Equation 16. This leads to the following objective
function we solve for sparse SVM model for MI classification.
1 T
min Q(β, b, Z) = β KZ β (17)
β,b,Z 2
  1

+C  yi , 1Tni KBi ,Z β + b
i
ni

3.2 Optimization Strategy


The sparse SVM formulation in Equation 17 involves the joint optimization of
two inter-dependent sets of target variables - the classifier weights (β, b) and XVs
zk ’s. Change in one of them would influence the optimal solution of the other. A
straightforward strategy to tackle this problem is via alternating minimization
[13]. However, alternating optimization lacks convergence guarantee and usually
leads to slow convergence in practice [13, 14]. Hence, we adopt a more efficient
and effective strategy to solve the sparse SVM model here by viewing it as an
optimization problem for the optimal value function. Specifically, we convert
the original problem defined in Equation 17 into the following problem which
depends on variable Z only
    \min_Z g(Z) \quad \text{with} \quad g(Z) = \min_{\beta,b} Q(\beta, b, Z) \qquad (18)

g(Z), the new objective function, is special in the sense that it is the optimal
value of Q optimized over variables (β, b). The evaluation of g(Z) at a fixed point
Z is equivalent to training a 2-norm SVM given the XVs and computing the cost
of the trained model in Equation 17. To train the 2-norm SVM, we minimize
function Q(β, b, Z) over β and b by fixing Z. This can be done easily with vari-
ous numerical optimization routines. In our work, we used the limited memory
BFGS (L-BFGS) algorithm for its efficiency and super-linear convergence rate.
The implementation of L-BFGS requires the cost Q(β, b, Z) and the gradient
information below
    \frac{\partial Q}{\partial \beta} = K_Z \beta + C \sum_{i=1}^{m} \frac{1}{n_i} \ell'_{f_i}(y_i, f_i) K_{B_i,Z}^T \mathbf{1}_{n_i} \qquad (19)

    \frac{\partial Q}{\partial b} = C \sum_{i=1}^{m} \ell'_{f_i}(y_i, f_i) \qquad (20)
where the partial derivative of the squared Hinge loss function \ell with respect to f_i is given by

    \ell'_{f_i}(y_i, f_i) = \begin{cases} 2(f_i - y_i) & y_i f_i < 1 \\ 0 & y_i f_i \ge 1 \end{cases} \qquad (21)
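For illustration, a minimal sketch of this inner solve using SciPy's L-BFGS routine (the function inner_svm_solve and its argument layout are our own assumptions, not the authors' code):

    import numpy as np
    from scipy.optimize import minimize

    def inner_svm_solve(K_Z, K_bags, y, C):
        # Minimize Q(beta, b; Z) of Equation 17 over (beta, b) with Z fixed.
        # K_Z: (Nxv, Nxv) Gram matrix of the XVs; K_bags[i]: (n_i, Nxv) Gram
        # matrix between the instances of bag i and the XVs; y[i] in {-1, +1}.
        Nxv = K_Z.shape[0]

        def cost_and_grad(theta):
            beta, b = theta[:-1], theta[-1]
            Q = 0.5 * beta @ K_Z @ beta
            g_beta, g_b = K_Z @ beta, 0.0
            for Kb, yi in zip(K_bags, y):
                f_i = Kb.mean(axis=0) @ beta + b   # (1/n_i) 1^T K_{B_i,Z} beta + b
                if yi * f_i < 1.0:                 # squared Hinge loss, Equation 6
                    Q += C * (1.0 - yi * f_i) ** 2
                    dldf = 2.0 * (f_i - yi)        # Equation 21
                    g_beta += C * dldf * Kb.mean(axis=0)   # Equation 19
                    g_b += C * dldf                # Equation 20
            return Q, np.append(g_beta, g_b)

        res = minimize(cost_and_grad, np.zeros(Nxv + 1), jac=True, method="L-BFGS-B")
        return res.x[:-1], res.x[-1], res.fun      # beta*, b*, and g(Z)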

Denote \beta^* and b^* as the minimizer of Q(\beta, b, Z) at Z; the value of the function g(Z) is then given by

    g(Z) = \frac{1}{2} \beta^{*T} K_Z \beta^* + C \sum_i \ell\left( y_i, \frac{1}{n_i} \mathbf{1}_{n_i}^T K_{B_i,Z} \beta^* + b^* \right) \qquad (22)

Note that we assume the Gram matrix KZ to be positive definite, which is always
the case with the Gaussian kernel for distinct z_k's³. Q is then a strictly convex
function with respect to β and b, and the optimal solution at each Z is unique.
This makes g(Z) a proper function with a unique value for each Z. Moreover, the uniqueness of the optimal solution also makes a derivative analysis of g(Z) possible.
Existence and computation of the derivative of the optimal value function
have been well studied in the optimization literature. Specifically, Theorem 4.1
of [15] has provided the sufficient conditions for the existence of derivative of
g(Z). According to the theorem, the differentiability of g(Z) is guaranteed by
the uniqueness of the optimal solution \beta^* and b^* as we discussed earlier, and by the differentiability of Q(\beta, b, Z) with respect to \beta and b, which is ensured by the squared Hinge loss function we adopted. Moreover, the derivative of g(Z) can be computed at each given Z by substituting the minimizers \beta^* and b^* into Equation 22 and taking the derivative as if g(Z) did not depend on \beta and b:
    \frac{\partial g}{\partial z_k} = \sum_{i=1}^{N_{xv}} \beta^*_i \beta^*_k \frac{\partial k(z_i, z_k)}{\partial z_k} + C \sum_{i=1}^{m} \frac{1}{n_i} \ell'_{f_i}(y_i, f_i) \, \beta^*_k \sum_{p=1}^{n_i} \frac{\partial k(x_{i,p}, z_k)}{\partial z_k} \qquad (23)

The derivative terms in the above equation depend on the specific choice of
kernels. In our work, we have adopted the following Gaussian kernel

    k(x, z) = \exp\left( -\gamma \|x - z\|^2 \right)

where γ is the scale parameter for the kernel. Note that the power of a Gaussian kernel is still a Gaussian kernel. Hence, according to Theorem 1 in Section 2, if instances are separable with a Gaussian kernel with scale parameter γ, then bags are separable with a Gaussian kernel with scale parameter rγ for some r using the “label-mean” formulation. Thus in practice, we only have to search over the γ parameter of the Gaussian kernel. With the use of the Gaussian kernel, the partial derivative terms in Equation 23 can be replaced by

    \frac{\partial k(x, z_k)}{\partial z_k} = 2\gamma (x - z_k) \, k(x, z_k)

³ In practice, we can enforce the positive definiteness by adding a small value to the diagonal of K_Z.

Algorithm 1. Sparse SVM for MI Classification
  Input: data (B_i, y_i), N_xv, λ, s_max and t_max
  Output: classifier weights β, b and XVs Z
  Set t = 0 and initialize Z
  Solve min_{β,b} Q(β, b, Z^{(t)}) and denote the optimizer as (β^{(t)}, b^{(t)}) and the optimal value as g(Z^{(t)})
  repeat
    for s = 1 to s_max do
      Set Z' = Z^{(t)} − λ ∂g(Z)/∂Z
      Solve min_{β,b} Q(β, b, Z') and denote the optimizer as (β', b') and the optimal value as g(Z')
      if g(Z') < g(Z^{(t)}) then
        Set (β^{(t+1)}, b^{(t+1)}) = (β', b'), Z^{(t+1)} = Z'
        Set λ = 2λ if s equals 1
        break
      end if
      Set λ = λ/2
    end for
    Set t = t + 1
  until convergence or t ≥ t_max or s ≥ s_max

3.3 Algorithm and Extension

With g(Z) and its derivative given in Equations 22 and 23, we can develop a
gradient descent approach to solve the overall optimization problem for sparse
SVM in Equation 17. The detailed steps are outlined in Algorithm 1. Besides
input data and Nxv , additional input parameters include λ, the initial step size,
smax , the maximum number of line searches allowed, and tmax , the maximum
number of iterations. For line search, we implemented a strategy similar to back-
tracking, without imposing the condition on sufficient decrease. Any step size
resulting in a decrease in function value is immediately accepted. This is more
efficient than more sophisticated line search strategies, with a significant reduction
in the number of expensive function evaluations.
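Putting Equations 22 and 23 together with the line search just described, a compact sketch of Algorithm 1 could look as follows (it assumes the gaussian_kernel and inner_svm_solve helpers sketched earlier; all names are illustrative, not the authors' code):

    import numpy as np

    def grad_g(Z, beta, b, bags, y, C, gamma):
        # Gradient of g(Z) w.r.t. the XVs, Equation 23, with the Gaussian kernel
        # derivative dk(x, z_k)/dz_k = 2 * gamma * (x - z_k) * k(x, z_k).
        dZ = np.zeros_like(Z)
        K_ZZ = gaussian_kernel(Z, Z, gamma)
        for k in range(len(Z)):
            w = beta * beta[k] * K_ZZ[:, k]
            dZ[k] = 2.0 * gamma * w @ (Z - Z[k])        # first term of Eq. 23
            for B, yi in zip(bags, y):
                Kb = gaussian_kernel(B, Z, gamma)
                f_i = Kb.mean(axis=0) @ beta + b
                if yi * f_i < 1.0:
                    dldf = 2.0 * (f_i - yi)
                    dZ[k] += (C * dldf / len(B)) * beta[k] * \
                             (2.0 * gamma * Kb[:, k] @ (B - Z[k]))
        return dZ

    def sparse_mi_svm(bags, y, Z, C, gamma, lam, s_max=10, t_max=50):
        # Outer loop of Algorithm 1 with the backtracking-style line search.
        def evaluate(Zc):
            K_Z = gaussian_kernel(Zc, Zc, gamma) + 1e-8 * np.eye(len(Zc))  # footnote 3
            return inner_svm_solve(K_Z, [gaussian_kernel(B, Zc, gamma) for B in bags], y, C)
        beta, b, g = evaluate(Z)
        for t in range(t_max):
            dZ = grad_g(Z, beta, b, bags, y, C, gamma)
            for s in range(s_max):
                cand = Z - lam * dZ
                beta_c, b_c, g_c = evaluate(cand)
                if g_c < g:                    # accept any decrease in g(Z)
                    Z, beta, b, g = cand, beta_c, b_c, g_c
                    if s == 0:
                        lam *= 2.0             # grow the step after an easy success
                    break
                lam /= 2.0                     # otherwise halve and retry
            else:
                break                          # line search exhausted
        return beta, b, Z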
The optimization scheme we used has also been adopted previously for sparse
kernel machine [12] and simple Multiple Kernel Learning (MKL) [14]. The major
difference here is that the SVM problem is solved directly in its primal formula-
tion when evaluating the optimal value function g(Z), whereas in [12] and [14],
a dual formulation of SVM is solved instead by invoking standard SVM solvers.
This is mainly due to the use of squared Hinge loss in the cost function, which
makes the primal formulation continuously differentiable with respect to classi-
fier parameters. In contrast, both sparse kernel machine [12] and simple MKL
[14] used non-differentiable Hinge loss for the basic SVM model, which has to be
treated in the dual form to guarantee the differentiability of the optimal value
function. Solving the primal problem has a great advantage in computational com-
plexity as compared to the dual. For our formulation, it is cheaper to solve the
primal problem for SVM since it only involves Nxv + 1 variables. In contrast, the
complexity of the dual problem is much higher. It involves the computation and
cache of the kernel matrix in solving the SVM. Moreover, the gradient evalua-
tion with respect to the XVs is also much more costly, as it needs to aggregate
over the gradient value for each entry in the kernel matrix, which is the sum
of inner products between vectors of kernel function values over instances and
XVs. The complexity for gradient computation scales with O(Nxv N 2 ), where N
is the total number of instances in the training set. This is much higher than the
complexity of O(Nxv N ) in the primal case.
The proposed algorithm can also be extended to deal with multi-class MI
problems. Since a multi-class problem can be decomposed into several binary
problems using various decomposition schemes, each binary problem may intro-
duce a different set of XVs if the binary version of the algorithm is applied directly.
Therefore, the XVs have to be learned in a joint fashion for multi-class problems
to ensure that each binary classifier shares the same XVs. Consider M binary
problems and let β c , bc denote the weight and bias for the cth classifier respec-
tively (c = 1, . . . , M ); the optimization problem is only slightly different from
Equation 17

    \min_{\beta,b,Z} \; Q(\beta, b, Z) = \sum_{c=1}^{M} Q_c(\beta^c, b^c, Z) \qquad (24)
    = \sum_{c=1}^{M} \left[ \frac{1}{2} \beta^{cT} K_Z \beta^c + C \sum_i \ell\left( y_i^c, \frac{1}{n_i} \mathbf{1}_{n_i}^T K_{B_i,Z} \beta^c + b^c \right) \right]

The same strategy can be adopted for the optimization of the above equation by
introducing g(Z) as the optimal value function of Q over the \beta^c's and b^c's. The first
step of each iteration is the same, only that M SVM classifiers are trained instead
of one. These M classifiers can be trained separately by minimizing Qc (β c , bc , Z)
for c = 1, . . . , M . The argument on the existence of the derivative of g(Z) is still
true, which can be computed with a minor modification via

    \frac{\partial g}{\partial z_k} = \sum_{c=1}^{M} \sum_{i=1}^{N_{xv}} \beta_i^c \beta_k^c \frac{\partial k(z_i, z_k)}{\partial z_k} + C \sum_{c=1}^{M} \sum_{i=1}^{m} \frac{1}{n_i} \ell'(y_i^c, f_i^c) \, \beta_k^c \sum_{p=1}^{n_i} \frac{\partial k(x_{i,p}, z_k)}{\partial z_k} \qquad (25)
Fig. 1. Demonstration of the proposed sparse SVM classifier on synthetic (a) binary and (b) 3-class MI data sets

4 Experimental Results
4.1 Synthetic Data Examples
In our experiments, we first show two examples on synthetic data to demon-
strate the interesting properties of the proposed sparse SVM algorithm for MI
classification. Figure 1(a) shows an example of binary MI classification, where
each positive bag contains at least one instance in the center, while each neg-
ative bag contains only instances on the ring. Figure 1(b) shows an MI classifi-
cation problem with three classes. Instances are generated from four Gaussians
N (μi , σ 2 )(i = 1, . . . , 4) with μ1 = [−2, 2]T , μ2 = [2, 2]T , μ3 = [2, −2]T , μ4 =
[−2, −2]T , and σ = 0.25. Bags from class 3 contain only instances randomly
drawn from N (μ2 , σ 2 ) and N (μ4 , σ 2 ), whereas each bag from class 1 contains
at least one instance drawn from N (μ1 , σ 2 ) and each bag from class 2 contains
at least one instance drawn from N (μ3 , σ 2 ). For the binary example, a single
XV in the center is sufficient to discriminate between bags from two classes.
For the 3-class example, two XVs are needed for discrimination purposes, whose
optimal locations should overlap with μ1 and μ3 , the centers of class 1 and 2.
For both cases, we initialize our algorithm with poor XV locations as shown
by the circles on the first column of Figure 1. The next two columns show
the data overlaid with XVs updated after the first and final iterations respec-
tively. Based on the trained classifier, we can also partition the data space into
predicted regions for each class. The partition is also shown in our plots and
regions for different classes are described by different shades. It can be seen
that, despite the poor initialization, the algorithm is able to find good XVs
and make the right decision. The convergence is quite fast, with the first itera-
tion already making a pronounced improvement over the initialization and XV
locations are refined over the remaining iterations. This is further demonstrated
by the monotonically decreasing function values over iterations on the rightmost
column of Figure 1, with significant decrease in the first few iterations.

4.2 Results on Real-World Data


We now turn our attention to real-world data. Five data sets were used in our ex-
periment. These include two data sets on drug activity prediction (MUSK1 and
MUSK2), two data sets on image categorization (COREL-10 and COREL-20)
and one data set on music genre classification (GENRE). MUSK1 and MUSK2
were first introduced in [1] and have been widely used as the benchmark for MI
classification. MUSK1 contains 47 positive bags and 45 negative ones, with an
average of 5.2 instances in each bag. MUSK2 contains 39 positive bags and 63
negative ones, with an average of 64.7 instances in each bag. COREL-10 and
COREL-20 were from the COREL image collection and introduced for MI clas-
sification in [5]. COREL-20 contains 2000 JPEG images from 20 categories, with
100 images per category. COREL-10 is a subset of COREL-20 with images from
the first 10 categories. Each image is segmented into 2–13 regions from which
color and texture features are extracted. For our experiment, we simply used
the processed features obtained from the author’s website. GENRE was origi-
nally introduced in [16] and is first tested for MI classification in this paper. The
data set contains 1000 audio tracks equally distributed in 10 music genres. We
split each track, which is roughly 30 seconds long, into overlapping 3-second seg-
ments with 1-second steps. Audio features are then extracted from each segment
following the procedures in [16].
We have compared the proposed sparse SVM for MI classification (SparseMI)
with existing SVM implementations for MI classification, including MI-SVM
[3], mi-SVM [3], MI-kernel [2] and MILES [5], as well as two alternative naive
approaches to sparse MI classification - RS and RSVM. RS is discussed in the
beginning of Section 3, and RSVM is a reduced SVM classifier learned from
a randomly selected set of XVs [9], which is equivalent to running SparseMI
without any optimization. All three methods can control the sparsity of the pre-
diction model explicitly, so we have tested them with varied sparsity by ranging
from 10, 50 up to 100 XVs. For MUSK1 and MUSK2, we performed 10-fold
Cross Validation (CV) and recorded CV accuracies. For other data sets, we have
randomly selected 50% of the data for training and the remaining 50% for test-
ing and recorded test accuracies. The experiments were repeated 10 times for
each data set with different random partitions. Instance features are normalized
to zero mean and unit standard deviation for all data sets. The Gaussian kernel
was adopted for all SVM models being compared. Kernel parameter γ and reg-
ularization parameter C were chosen via 3-fold cross validation on the training
set. For SparseMI, we set smax = 10, tmax = 50 and λ equal to the average
pairwise distance between initial XVs. Test results for different methods used in
our experiment are reported in Table 1.
Table 1. Performance comparison for different methods on MI classification. Results are organized in four blocks. The first block shows the mean accuracies in percentage (first row for each method) as well as the number of SVs (second row for each method) obtained by existing SVM methods. The next three blocks show the mean accuracies for three sparse SVM models with a given number of XVs Nxv. For each Nxv value (i.e. block), the highest accuracy for each data set (i.e. column) is highlighted. More than one accuracy value may be highlighted in each column if the methods achieve equal performances according to the results of significance tests on the accuracy values over 10 test runs.

Data set              MUSK1    MUSK2    COREL10   COREL20   GENRE
mi-SVM     acc        87.30    80.52    75.62     52.15     80.28
           #SVs       400.2    2029.2   1458.2    2783.3    12439
MI-SVM     acc        77.10    83.20    74.35     55.37     72.48
           #SVs       277.1    583.4    977       2300.1    3639.8
MI-Kernel  acc        89.93    90.26    84.30     73.19     77.05
           #SVs       362.4    3501     1692.6    3612.7    13844
MILES      acc        85.73    87.64    82.86     69.16     69.87
           #SVs       40.8     42.5     379.1     868       1127.9
Nxv = 10   RS         88.68    86.39    75.13     55.20     54.68
           RSVM       74.66    77.70    69.25     48.53     46.01
           SparseMI   88.44    88.52    80.10     62.49     71.28
Nxv = 50   RS         89.61    89.92    79.87     66.27     65.27
           RSVM       87.23    86.71    76.94     62.54     63.58
           SparseMI   89.98    88.02    84.19     71.66     75.28
Nxv = 100  RS         90.18    89.16    78.81     65.35     67.77
           RSVM       89.02    88.26    77.63     63.82     67.41
           SparseMI   90.40    87.98    84.31     72.22     76.06

From the results, we can see that the proposed SparseMI is quite promising
with the optimal balance between performance and sparsity. Compared to exist-
ing SVM methods like MI-kernel and MILES, SparseMI has comparable accuracy
rates yet with significantly fewer XVs in the prediction function. Compared to
alternative sparse implementations like RS and RSVM, SparseMI achieves better
performances in the majority of cases. The gap in performance is more evident
with a smaller number of XVs, as can be observed for the case of Nxv = 10. With
increasing value of Nxv , the performance of SparseMI is further improved. For
Nxv = 100, SparseMI performs comparably with MI-Kernel, which can be re-
garded as the dense version of SparseMI. As a result, MI-Kernel produces more
complex prediction models with far more SVs than SparseMI. Compared with
MILES, which has better sparsity than other SVM methods, SparseMI not only
achieves better performance but also obtains sparser models. This is especially true
in the cases of multiclass MI classification, for reasons we have discussed in the
previous section. Moreover, with SparseMI, we can explicitly control its spar-
sity by specifying the number of XVs, while the sparsity of MILES can only be
specified implicitly with the regularization parameter.
Fig. 2. Stepwise demonstration of the SparseMI optimization algorithm with plots of cost function values (solid lines) and test accuracies (broken lines) over each iteration. Panels: (a) MUSK1, (b) MUSK2, (c) COREL-20, (d) GENRE.

To further demonstrate the effectiveness of SparseMI in optimizing the XVs
over iterations, we show some stepwise examples in Figure 2. For each data set
(except COREL-10 which behaves similarly to COREL-20), we plotted the cost
function value and test accuracy over each iteration for a 50%–50% random
partitioning of the training and testing sets. It can be seen clearly that the cost
function value monotonically decreases with each iteration, accompanied by the
tendency of improvement for the test accuracy over the iterations. The decrease
of cost function value and improvement of test accuracy are especially obvious
for the first few iterations. After about 10 iterations, the test accuracy gradually stabilizes and does not change much over further iterations, despite small perturbations.

5 Conclusions

In this paper, we have proposed the first explicit formulation for sparse SVM
learning in MI classification. Our formulation is descended from sparse kernel
machines [12] and builds on top of an equivalent formulation of MI kernel [2].
Unlike MI kernel, the proposed technique can produce a sparse prediction model
with controlled complexity while still performing competitively compared to MI
kernel and other non-sparse methods.
Acknowledgment. This work is supported by the Australian Research Council under the Discovery Project (DP0986052) entitled “Automatic music feature extraction, classification and annotation”.

References
1. Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence 89, 31–71 (1997)
2. Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: Intl. Conf. Machine Learning, pp. 179–186 (2002)
3. Andrews, S., Tsochantaridis, I., Hofmann, T.: Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems, pp. 561–568 (2003)
4. Cheung, P.M., Kwok, J.T.Y.: A regularization framework for multiple-instance learning. In: Intl. Conf. Machine Learning (2006)
5. Chen, Y., Bi, J., Wang, J.: MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(12), 1931–1947 (2006)
6. Zhou, Z.H., Xu, J.M.: On the relation between multi-instance learning and semi-supervised learning. In: Intl. Conf. Machine Learning (2007)
7. Li, Y.F., Kwok, J.T., Tsang, I., Zhou, Z.H.: A convex method for locating regions of interest with multi-instance learning. In: European Conference on Machine Learning (2009)
8. Smola, A.J., Schölkopf, B.: Sparse greedy matrix approximation for machine learning. In: Intl. Conf. Machine Learning, pp. 911–918 (2000)
9. Lee, Y.J., Mangasarian, O.L.: RSVM: Reduced support vector machines. In: SIAM Conf. Data Mining (2001)
10. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge (2002)
11. Keerthi, S., Chapelle, O., DeCoste, D.: Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research 7, 1493–1515 (2006)
12. Wu, M., Schölkopf, B., Bakir, G.: A direct method for building sparse kernel learning algorithms. Journal of Machine Learning Research 7, 603–624 (2006)
13. Bezdek, J.C., Hathaway, R.J.: Convergence of alternating optimization. Neural, Parallel & Scientific Computations 11(4) (2003)
14. Rakotomamonjy, A., Bach, F.R., Canu, S., Grandvalet, Y.: SimpleMKL. Journal of Machine Learning Research 9, 2491–2521 (2008)
15. Bonnans, J.F., Shapiro, A.: Optimization problems with perturbations: A guided tour. SIAM Review 40(2), 202–227 (1998)
16. Tzanetakis, G., Cook, P.: Music genre classification of audio signals. IEEE Trans. Speech and Audio Processing 10(5), 293–302 (2002)
Lagrange Dual Decomposition for Finite Horizon
Markov Decision Processes

Thomas Furmston and David Barber

Department of Computer Science, University College London,
Gower Street, London, WC1E 6BT, UK

Abstract. Solving finite-horizon Markov Decision Processes with sta-
tionary policies is a computationally difficult problem. Our dynamic dual
decomposition approach uses Lagrange duality to decouple this hard
problem into a sequence of tractable sub-problems. The resulting proce-
dure is a straightforward modification of standard non-stationary Markov
Decision Process solvers and gives an upper-bound on the total expected
reward. The empirical performance of the method suggests that not only
is it a rapidly convergent algorithm, but that it also performs favourably
compared to standard planning algorithms such as policy gradients and
lower-bound procedures such as Expectation Maximisation.

Keywords: Markov Decision Processes, Planning, Lagrange Duality.

1 Markov Decision Processes


The Markov Decision Process (MDP) is a core concept in learning how an
agent should act so as to maximise future expected rewards [1]. MDPs have
a long history in machine learning and related areas, being used to model many
sequential decision making problems in robotics, control and games, see for
example [1–3]. Fundamentally, an MDP describes how the environment responds
when the agent performs an action a_t ∈ A while the environment is in state s_t ∈ S. More formally, an MDP is described by an initial state distribution
p1 (s1 ), transition distributions p(st+1 |st , at ), and a reward function Rt (st , at ).
For a discount factor γ the reward is defined as R_t(s_t, a_t) = γ^{t-1} R(s_t, a_t) for
a stationary reward R(st , at ), where γ ∈ [0, 1). At the tth time-point a deci-
sion is made according to the policy πt , which is defined as a set of conditional
distributions over the action space,
πt (a|s) = p(at = a|st = s, πt ).
A policy is called stationary if it is independent of time, i.e. πt (at |st ) = π(a|s),
∀t ∈ {1, . . . , H} for some policy π. For planning horizon H and discrete states
and actions, the total expected reward (the policy utility) is given by
    U(\pi_{1:H}) = \sum_{t=1}^{H} \sum_{s_t, a_t} R_t(s_t, a_t) \, p(s_t, a_t | \pi_{1:t}) \qquad (1)

Fig. 1. (a) An influence diagram representation of an unconstrained finite horizon (H = 4) MDP. Rewards depend on the state and action, R_t(s_t, a_t). The policy p(a_t|s_t, π_t) determines the decision and the environment is modeled by the transition p(s_{t+1}|s_t, a_t). Based on a history of actions, states and rewards, the task is to maximize the expected summed rewards with respect to the policy π_{1:H}. (b) For a stationary policy there is a single policy π that determines the actions for all time-points, which adds a large clique to the influence diagram.

where p(st , at |π1:t ) is the marginal of the joint state-action trajectory distribution
    p(s_{1:H}, a_{1:H}|\pi) = p(a_H|s_H, \pi_H) \left[ \prod_{t=1}^{H-1} p(s_{t+1}|s_t, a_t) \, p(a_t|s_t, \pi_t) \right] p_1(s_1). \qquad (2)

Given an MDP, the learning problem is to find a policy π (in the stationary case)
or set of policies π1:H (in the non-stationary case) that maximizes (1). That is,
we wish to find

    \pi^*_{1:H} = \arg\max_{\pi_{1:H}} U(\pi_{1:H}).

In the case of infinite horizon H = ∞, it is natural to assume a stationary policy,
and many classical algorithms exist to approximate the optimal policy [1].
In this paper we concentrate exclusively on the finite horizon case, which is an
important class of MDPs with many practical applications. We defer the appli-
cation of dual decomposition techniques to the infinite horizon case to another
study.

1.1 Non-stationary Policies


It is well known that an MDP with a non-stationary policy can be solved using dynamic programming [3] in O(S^2 A H) for a number of actions A = |A|, states S = |S|, and horizon H, see algorithm(1). This follows directly from the linear chain structure of the corresponding influence diagram fig(1a) [4]. The optimal policies resulting from this procedure are deterministic, where for each given state all the mass is placed on a single action.

Algorithm 1. Dynamic Programming non-stationary MDP solver
  β_H(s_H, a_H) = R_H(s_H, a_H)
  for t = H − 1, . . . , 1 do
    a*_{t+1}(s_{t+1}) = argmax_{a_{t+1}} β_{t+1}(s_{t+1}, a_{t+1})
    β_t(s_t, a_t) ≡ R_t(s_t, a_t) + Σ_{s_{t+1}} p(s_{t+1}|s_t, a_t) β_{t+1}(s_{t+1}, a*_{t+1}(s_{t+1}))
  end for
  a*_1(s_1) = argmax_{a_1} β_1(s_1, a_1)
  The optimal policies are deterministic with π*_t(a_t|s_t) = δ(a_t − a*_t(s_t))
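For concreteness, a minimal NumPy sketch of this backward recursion (the array layout is our own assumption: P[a] holds p(s'|s, a) and R[t] the reward table R_t(s, a)):

    import numpy as np

    def solve_nonstationary_mdp(P, R, H):
        # Backward dynamic programming for the finite-horizon MDP (Algorithm 1).
        # P: (A, S, S) transition tensor with P[a, s, s'] = p(s'|s, a);
        # R: (H, S, A) rewards. Returns the deterministic policies a_t*(s).
        A, S, _ = P.shape
        beta = R[H - 1].copy()                # beta_H(s, a) = R_H(s, a)
        policy = np.zeros((H, S), dtype=int)
        policy[H - 1] = beta.argmax(axis=1)
        for t in range(H - 2, -1, -1):
            v_next = beta.max(axis=1)         # beta_{t+1}(s', a*_{t+1}(s'))
            beta = R[t] + np.stack([P[a] @ v_next for a in range(A)], axis=1)
            policy[t] = beta.argmax(axis=1)
        return policy                         # O(S^2 A H) as stated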

1.2 Stationary Policies

While unconstrained finite-horizon MDPs can be solved easily through dynamic programming¹, this is not true when the policy is constrained to be stationary.
For stationary policies the finite horizon MDP objective is given by
    U(\pi) = \sum_{t=1}^{H} \sum_{s_t, a_t} R_t(s_t, a_t) \, p(s_t, a_t|\pi), \qquad (3)

where p(st , at |π) is the marginal of the trajectory distribution, which is given by
    p(s_{1:t}, a_{1:t}|\pi) = \pi(a_t|s_t) \left[ \prod_{\tau=1}^{t-1} p(s_{\tau+1}|a_\tau, s_\tau) \, \pi(a_\tau|s_\tau) \right] p_1(s_1).

Looking at the influence diagram of the stationary finite horizon MDP problem
fig(1b) it can be seen that restricting the policy to be stationary causes the
influence diagram to lose its linear chain structure. Indeed the stationary policy
couples all time-points together and a dynamic programming solution no longer
exists, making the problem of finding the optimal policy π ∗ much more complex.
Another way of viewing the complexity of stationary policy MDPs is in terms of Bellman's principle of optimality [3]. A control problem is said to satisfy the principle of optimality if, given a state and time-point, the optimal action is independent of the trajectory preceding that time-point. In order to construct a dynamic programming algorithm it is essential that the control problem satisfies this principle. It is easy to see that while MDPs with non-stationary policies satisfy this principle of optimality, hence permitting a dynamic programming solution, this is not true for stationary policy MDPs.
¹ In practice, the large state-action space can make the O(S^2 A H) complexity impractical, an issue that is not addressed in this study.
While it is no longer possible to exactly solve stationary policy finite-horizon MDPs using dynamic programming, there still exist algorithms that can approximate optimal policies, such as policy gradients [5] or Expectation Maximisation (EM) [6, 7]. However, policy gradients is susceptible to getting trapped in local optima, while EM can have a poor convergence rate [8]. It is therefore of theoretical, and practical, interest to construct an algorithm that approximately solves this class of MDPs and does not suffer from these problems.

2 Dual Decomposition
Our task is to solve the stationary policy finite-horizon MDP. Here our intuition
is to exploit the fact that solving the unconstrained (non-stationary) MDP is
easy, whilst solving the constrained MDP is difficult. To do so we use dual
decomposition and iteratively solve a series of unconstrained MDPs, which have
a modified non-stationary reward structure, until convergence. Additionally our
dual decomposition provides an upper bound on the optimal reward U (π ∗ ).
The main idea of dual decomposition is to separate a complex primal opti-
misation problem into a set of easier slave problems (see appendix(A) and e.g.
[9, 10]). The solutions to these slave problems are then coordinated to generate a
new set of slave problems in a process called the master problem. This procedure
is iterated until some convergence criterion is met, at which point a solution to
the original primal problem is obtained.
As we mentioned in section(1) the stationary policy constraint results in a
highly connected influence diagram which is difficult to optimise. It is there-
fore natural to decompose this constrained MDP by relaxing this stationarity
constraint. This can be achieved through Lagrange relaxation where the set of
non-stationary policies, π1:H = (π1 , . . . , πH ), is adjoined to the objective func-
tion
    U(\pi_{1:H}, \pi) = \sum_{t=1}^{H} \sum_{s_t, a_t} R(s_t, a_t) \, p(s_t, a_t|\pi_{1:t}), \qquad (4)

with the additional constraint that πt = π, ∀t ∈ {1, ..., H}. Under this constraint,
we see that (4) reduces to the primal objective (3). We also impose that all πt
and π are restricted to the probability simplex. The constraints that each πt is
a distribution will be imposed explicitly during the slave problem optimisation.
The stationary policy π will be constrained to be a distribution through the
equality constraints π = πt .

2.1 Naive Dual Decomposition


A naive procedure to enforce the constraints that \pi_t = \pi, \forall t \in \{1, \dots, H\} is to form the Lagrangian

    U(\lambda_{1:H}, \pi_{1:H}, \pi) = \sum_{t=1}^{H} \sum_{s,a} \left\{ R_t(s,a) \, p_t(s,a|\pi_{1:t}) + \lambda_t(s,a) \left[ \pi_t(a|s) - \pi(a|s) \right] \right\},
where we have used the notation pt (s, a|π1:t−1 ) ≡ p(st = s, at = a|π1:t−1 ), and
we have included the Lagrange multipliers, λ1:H .
To see that this Lagrangian cannot be solved efficiently for π1:H through
dynamic programming consider the optimisation over πH , which takes the form
    \max_{\pi_H} \sum_{s,a} \left\{ R_H(s,a) \, \pi_H(a|s) \, p_H(s|\pi_{1:H-1}) + \lambda_H(s,a) \, \pi_H(a|s) \right\}.

As p_H(s|\pi_{1:H-1}) depends on all previous policies, this slave problem is compu-
tationally intractable and does not have a simple structure to which dynamic
programming may be applied. Note also that, whilst the constraints are linear
in the policies, the marginal p(st , at |π1:t ) is non-linear and no simple linear pro-
gramme exists to find the optimal policy. One may consider the Jensen inequality
    \log p_t(s|\pi_{1:t-1}) \ge H(q) + \sum_{\tau=1}^{t} \left[ \langle \log p(s_\tau|s_{\tau-1}, \pi_\tau) \rangle_q + \langle \log \pi_\tau \rangle_q \right]

for variational q and entropy function H(q) (see for example [11]) to decouple the
policies, but we do not pursue this approach here, seeking a simpler alternative.
From this discussion we can see that the naive application of Lagrange dual
decomposition methods does not result in a set of tractable slave problems.

3 Dynamic Dual Decomposition

To apply dual decomposition it is necessary to express the constraints in a way
that results in a set of tractable slave problems. In section(2.1) we considered
the naive constraint functions

    g_t(a, s, \pi, \pi_t) = \pi_t(a|s) - \pi(a|s), \qquad (5)

which resulted in an intractable set of slave problems. We now consider the
following constraint functions

    h_t(a, s, \pi, \pi_{1:t}) = g_t(a, s, \pi, \pi_t) \, p_t(s|\pi_{1:t-1}).

Provided pt (s|π1:t−1 ) > 0, the zeros of the two sets of constraint functions,
g_{1:H} and h_{1:H}, are equivalent². Adjoining the constraint functions h_{1:H} to the
objective function (4) gives the Lagrangian

    L(\lambda_{1:H}, \pi_{1:H}, \pi) = \sum_{t=1}^{H} \sum_{s,a} \left\{ (R_t(s,a) + \lambda_t(s,a)) \, \pi_t(a|s) \, p_t(s|\pi_{1:t-1}) - \lambda_t(s,a) \, \pi(a|s) \, p_t(s|\pi_{1:t-1}) \right\} \qquad (6)


² In the case that p_t(s|\pi_{1:t-1}) = 0, the policy \pi_t(a|s) is redundant since the state s cannot be visited at time t.
We can now eliminate the original primal variables π from (6) by directly per-
forming the optimisation over π, giving the following set of constraints


    \sum_t \lambda_t(s,a) \, p_t(s|\pi_{1:t-1}) = 0, \qquad \forall (s,a) \in S \times A. \qquad (7)

Once the primal variables π are eliminated from (6) we obtain the dual objective
function
    L(\lambda_{1:H}, \pi_{1:H}) = \sum_{t=1}^{H} \sum_{s,a} (R_t(s,a) + \lambda_t(s,a)) \, \pi_t(a|s) \, p_t(s|\pi_{1:t-1}), \qquad (8)
t=1 s,a

where the domain is restricted to π1:H , λ1:H that satisfy (7) and π1:H satisfying
the usual distribution constraints.

3.1 The Slave Problem


We have now obtained a dual decomposition of the original constrained MDP.
Looking at the form of (8) and considering the Lagrange multipliers λ1:H as fixed,
the optimisation problem over π1:H is that of a standard non-stationary MDP.
Given the set of Lagrange multipliers, λ1:H , we can define the corresponding
slave MDP problem
    U_\lambda(\pi_{1:H}) = \sum_{t=1}^{H} \sum_{s,a} \tilde{R}_t(a,s) \, \pi_t(a|s) \, p_t(s|\pi_{1:t-1}), \qquad (9)

where the slave reward is given by

    \tilde{R}_t(a,s) = R_t(a,s) + \lambda_t(a,s). \qquad (10)


The set of slave problems \max_{\pi_{1:H}} U_\lambda(\pi_{1:H}) is then readily solved in O(A S^2 H) time using algorithm(1) for modified rewards \tilde{R}_t.

3.2 The Master Problem


The master problem consists of minimising the Lagrangian upper bound w.r.t.
the Lagrange multipliers. From the general theory of Lagrange duality, the dual
is convex in λ, so that a simple procedure such as a subgradient method should
suffice to find the optimum. At iteration i, we therefore update

    \lambda^{i+1}_t = \lambda^i_t - \alpha_i \, \partial_{\lambda_t} L(\pi_{1:H}, \lambda_{1:H}) \qquad (11)

where \alpha_i is the ith step size parameter and \partial_{\lambda_t} L(\pi_{1:H}, \lambda_{1:H}) is the subgradient of the dual objective w.r.t. \lambda_t. As the subgradient contains the factor p_t(s|\pi^{i-1}_{1:t}), which is positive and independent of the action, we may consider the simplified update equation
    \lambda^{i+1}_t = \lambda^i_t - \alpha_i \pi^i_t.

Once the Lagrange multipliers have been updated through this subgradient step
they need to be projected down on to the feasible set, which is defined through
the constraint (7). This is achieved through a projected subgradient method, see
e.g. [12, 13]. We enforce (7) through the projection
    \lambda^{i+1}_t(s,a) \leftarrow \lambda^{i+1}_t(s,a) - \sum_{\tau=1}^{H} \rho_\tau(s) \, \lambda^{i+1}_\tau(s,a), \qquad (12)

where we define the state-dependent time distributions

    \rho_\tau(s) \equiv \frac{p_\tau(s|\pi_{1:\tau-1})}{\sum_{\tau'=1}^{H} p_{\tau'}(s|\pi_{1:\tau'-1})}. \qquad (13)

One may verify this ensures the projected \lambda^{i+1}_t satisfy the constraint (7).
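A small sketch of this projected subgradient step (the array shapes are our own assumptions: the multipliers lam and the one-hot slave policies are (H, S, A) arrays, the state marginals p_state are (H, S)):

    import numpy as np

    def time_distribution(p_state):
        # rho_tau(s) of Equation 13, normalizing the state marginals over time;
        # the small floor guards against states that are never visited.
        return p_state / np.maximum(p_state.sum(axis=0, keepdims=True), 1e-12)

    def master_step(lam, policies, p_state, alpha):
        # Simplified subgradient step followed by the projection of Equation 12.
        lam = lam - alpha * policies
        rho = time_distribution(p_state)
        correction = (rho[:, :, None] * lam).sum(axis=0, keepdims=True)
        return lam - correction   # now sum_t lam_t(s,a) p_t(s) = 0 for all (s, a)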

3.3 Algorithm Overview


We now look at two important aspects of the dual decomposition algorithm;
obtaining a primal solution from a dual solution and interpreting the role the
Lagrange multipliers play in the dual decomposition.

Obtaining a Primal Solution - A standard issue with dual decomposition algorithms is obtaining a primal solution once the algorithm has terminated. When
strong duality holds, i.e. the duality gap is zero, then π = πt ∀t ∈ {1, . . . , H}
and a solution to the primal problem can be obtained from the dual solution.
However, this will not generally be the case and we therefore need to specify a
way to obtain a primal solution. We considered two approaches; in the first we
take the mean of the dual solutions
    \pi(a|s) = \frac{1}{H} \sum_{t=1}^{H} \pi_t(a|s), \qquad (14)

while in the second we optimise the dual objective function w.r.t. π to obtain
the dual primal policy π(a|s) = δa,a∗ (s) , where

    a^*(s) = \arg\min_a \sum_t \lambda_t(s,a) \, p_t(s|\pi_{1:t-1}), \qquad (15)

and the λt are taken before projection.
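In code, the dual primal policy of Equation 15 is a one-liner over the same arrays (a sketch under the shape assumptions used above):

    def dual_primal_policy(lam, p_state):
        # a*(s) = argmin_a sum_t lam_t(s, a) p_t(s|pi_{1:t-1}), Equation 15,
        # with lam taken before the projection step.
        scores = (p_state[:, :, None] * lam).sum(axis=0)   # (S, A)
        return scores.argmin(axis=1)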


A summary of the complete dynamic dual decomposition procedure is given
in algorithm(2).
Algorithm 2. Dual Decomposition Dynamic Programming
  Initialize the Lagrange multipliers λ = 0.
  repeat
    Solve Slave Problem: Solve the finite horizon MDP with non-stationary rewards
      R̃_t(s, a) = λ^i_t(s, a) + R_t(s, a),
    using algorithm(1), to obtain the optimal non-stationary policies π^i_{1:H}.
    Subgradient Step: Update the Lagrange multipliers:
      λ^{i+1}_t = λ^i_t − α_i π^i_t.
    Projected Subgradient Step: Project the Lagrange multipliers to the feasible set:
      λ^{i+1}_t ← λ^{i+1}_t − Σ_{τ=1}^H ρ_τ λ^{i+1}_τ.
  until |U(π) − L(λ_{1:H}, π_{1:H})| < ε, for some convergence threshold ε.
  Output a feasible primal solution π(a|s).
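Combining the slave solver and the master step sketched above gives a compact, illustrative (not the authors') implementation of Algorithm 2:

    import numpy as np

    def dual_decomposition_dp(P, R, p1, H, max_iters=100):
        # P: (A, S, S) transitions; R: (H, S, A) rewards; p1: (S,) initial dist.
        A, S, _ = P.shape
        lam = np.zeros((H, S, A))
        for i in range(1, max_iters + 1):
            actions = solve_nonstationary_mdp(P, R + lam, H)   # slave problem
            pi = np.zeros((H, S, A))                           # one-hot policies
            pi[np.arange(H)[:, None], np.arange(S), actions] = 1.0
            p_state = np.zeros((H, S))                         # forward marginals
            p_state[0] = p1
            for t in range(H - 1):
                p_sa = p_state[t][:, None] * pi[t]             # p_t(s, a)
                p_state[t + 1] = sum(P[a].T @ p_sa[:, a] for a in range(A))
            alpha = R.max() / i     # one possible schedule (used in Sect. 4)
            lam = master_step(lam, pi, p_state, alpha)
            # (a duality-gap test |U(pi) - L| < eps would terminate the loop here)
        return pi.mean(axis=0)      # time-averaged primal policy, Equation 14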

Interpretation of the Lagrange Multipliers - We noted earlier that
the slave problems correspond to an unconstrained MDP problem with non-
stationary rewards given by (10). The Lagrange multiplier λt (a, s) therefore ei-
ther encourages, or discourages, πt to perform action a (given state s) depending
on the sign of λt (a, s). Now consider how the Lagrange multiplier λt (s, a) gets
updated at the ith iteration of the dual decomposition algorithm. Prior to the
projected subgradient step the update of Lagrange multipliers takes the form

    \lambda^{i+1}_t(s,a) = \lambda^i_t(s,a) - \alpha_i \pi^i_t(a|s),

where \pi^i_{1:H} denotes the optimal non-stationary policy of the previous round of slave problems. Noting that the optimal policy is deterministic gives

    \lambda^{i+1}_t(s,a) = \begin{cases} \lambda^i_t(s,a) - \alpha_i & \text{if } a = \arg\max_a \pi^i_t(a|s) \\ \lambda^i_t(s,a) & \text{otherwise.} \end{cases}

Once the Lagrange multipliers are projected down to the space of feasible
parameters through (12) we have
    \lambda^{i+1}_t(s,a) = \begin{cases} \bar{\lambda}^i_t(s,a) + \alpha_i \left( \sum_{\tau \in N^i_a(s)} \rho_\tau - 1 \right) & \text{if } a = \arg\max_a \pi^i_t(a|s) \\ \bar{\lambda}^i_t(s,a) + \alpha_i \sum_{\tau \in N^i_a(s)} \rho_\tau & \text{otherwise} \end{cases}

where N^i_a(s) = \{ t \in \{1, \dots, H\} \,|\, \pi^i_t(a|s) = 1 \} is the set of time-points for which action a was optimal in state s in the last round of slave problems. We use the notation \bar{\lambda}^i_t = \lambda^i_t - \langle \lambda^i_t \rangle_\rho, where \langle \cdot \rangle_\rho means taking the average w.r.t. the distribution (13).
distribution (13).
Noting that \rho_t is a distribution over t means that the terms \sum_{\tau \in N^i_a(s)} \rho_\tau and \sum_{\tau \in N^i_a(s)} \rho_\tau - 1 are positive and negative respectively. The projected
sub-gradient step therefore either adds (or subtracts) a positive term to the
projection, λ̄, depending on the optimality of action a (given state s at time t)
in the slave problem from the previous iteration. There are hence two possibilities
for the update of the Lagrange multiplier; either the action was optimal and a
lower non-stationary reward is allocated to this state-action-time triple in the
next slave problem, or conversely it was sub-optimal and a higher reward term
is attached to this triple. The master algorithm therefore tries to readjust the
Lagrange multipliers so that (for each given state) the same action is optimal for
all time-points, i.e. it encourages the non-stationary
 policies to take the same
form. Additionally, as |Nai (s)| → H then τ ∈Nai (s) ρτ → 1, which means that
a smaller quantity is added (or subtracted) to the Lagrange multiplier. The
converse happens in the situation |Nai (s)| → 0. This means that as |Nai (s)| → 1
the time-points t ∈ / Nai (s) will have a larger positive term added to the reward
for this state-action pair, making it more likely that this action will be optimal
given this state in the next slave problem. Additionally, those time-points t ∈
Nai (s) will have a smaller term subtracted from their reward, making it more
likely that this action will remain optimal in the next slave problem. The dual
decomposition algorithm therefore automatically weights the rewards according
to a ‘majority vote’. This type of behaviour is typical of dual decomposition
algorithms and is known as resource allocation via pricing [12].

4 Experiments
We ran our dual decomposition algorithm on several benchmark problems, in-
cluding the chain problem [14], the mountain car problem [1] and the puddle
world problem [15]. For comparison we included planning algorithms that can
also handle stationarity constraints on the policy, in particular Expectation Max-
imisation and Policy Gradients.

Dual Decomposition Dynamic Programming (DD DP)


The overall algorithm is summarised in algorithm(2) in which dynamic pro-
gramming is used to solve the slave problems. In the master problem we used
a predetermined sequence of step sizes for the subgradient step. Taking into
account the discussion in section(3.3) we used the following step sizes

αn = n−1 max R(s, a).


s,a

In the experiments we obtained the primal policy using both the time-
averaged policy (14) and the dual primal policy (15). We found that both
policies obtained a very similar level of performance and so we show only
the results of (14) to make the plots more readable. We declared that the
algorithm had converged when the duality gap was less than 0.01.
Fig. 2. The chain problem state-action transitions with rewards R(s_t, a_t). The initial state is state 1. There are two actions a, b, with each action being flipped with probability 0.2. [Diagram: action a advances along the chain s1 → s2 → · · · → s5 with reward 0 and self-loops at s5 with reward 10; action b returns to s1 with reward 2.]

Expectation Maximisation (EM)


The first comparison we made was with the MDP EM algorithm [6, 7, 11,
16, 17] where the policy update at each M step takes the form
    \pi^{new}(a|s) \propto \sum_{t=1}^{H} \sum_{\tau=1}^{t} q(s_\tau = s, a_\tau = a, t),
t=1 τ =1

and q is a reward-weighted trajectory distribution. For a detailed derivation
of this finite horizon model based MDP EM algorithm see e.g. [7, 11].

Policy Gradients (PG) - Fixed Step Size.


In this algorithm updates are taken in the direction of the gradient of the
value function w.r.t. to the policy parameters. We parameterised the policies
with a softmax parameterisation
    \pi(a|s) = \frac{\exp \gamma(s,a)}{\sum_{a'} \exp \gamma(s,a')}, \qquad \gamma(s, a=1) = 0, \qquad (16)

and took the derivative w.r.t. the parameters γ. During the experiments we
used a predetermined step size sequence for the gradient steps. We considered
two different step sizes parameters

α1 (n) = n−1 , α2 (n) = 10n−1 ,

and selected the one that gave the best results for each particular experiment.

Policy Gradients (PG) - Line Search


We also ran the policy gradients algorithm (using the parameterisation (16))
with a line search procedure during each gradient step³.
Expectation Maximisation - Policy Gradients (EM PG)
The algorithm was designed to accelerate the rate of convergence of the EM
algorithm, while avoiding issues of local optima in policy gradients [18]. The
algorithm begins by performing EM steps until some switching criterion is
met, after which policy gradient updates are used.
³ We used the conjgrad.m routine in the Netlab toolbox - http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/.

Fig. 3. Chain experiment with total expected reward plotted against run time (in
seconds). The plot shows the results of the DD DP algorithm (blue), the EM algorithm
(green), policy gradients with fixed step size (purple), policy gradients with a line search
(black) and the switching EM-PG algorithm (red). The experiment was repeated 100
times and the plot shows the mean and standard deviation of the results.

There are different criteria for switching between the EM algorithm and
policy gradients. In the experiments we use the criterion given in [19] where
the EM iterates are used to approximate the amount of hidden data, λ̂, together with an estimate of its error, ε̂:

$$\hat{\lambda} = \frac{|\pi^{(t+1)} - \pi^{(t)}|}{|\pi^{(t)} - \pi^{(t-1)}|}, \qquad \hat{\epsilon} = \frac{|\pi^{(t+1)} - \pi^{(t)}|}{1 + |\pi^{(t+1)}|}. \qquad (17)$$

In the experiment we switched from EM to policy gradients when λ̂ > 0.9 and ε̂ < 0.01. During the policy gradient steps we used fixed step size parameters.
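The switching test is easy to state in code. The sketch below reads the |·| in (17) as a norm over the policy table; the thresholds are those used in the experiment.

```python
import numpy as np

def should_switch(pi_prev2, pi_prev, pi_curr):
    # Criterion (17): lambda-hat approximates the amount of hidden data,
    # eps-hat estimates its error; switch from EM to policy gradients
    # when lambda-hat > 0.9 and eps-hat < 0.01.
    lam_hat = np.linalg.norm(pi_curr - pi_prev) / np.linalg.norm(pi_prev - pi_prev2)
    eps_hat = np.linalg.norm(pi_curr - pi_prev) / (1.0 + np.linalg.norm(pi_curr))
    return lam_hat > 0.9 and eps_hat < 0.01
```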

4.1 Chain Problem


The chain problem [14] has 5 states, each having 2 possible actions, as shown in fig(2). The initial state is 1 and every action is flipped with ‘slip’ probability p_slip = 0.2, making the environment stochastic. If the agent is in state 5 it receives
a reward of 10 for performing action ‘a’, otherwise it receives a reward of 2 for
performing action ‘b’ regardless of the state. In the experiments we considered a
planning horizon of H = 25 for which the optimal stationary policy is to travel
down the chain towards state 5, which is achieved by always selecting action ‘a’.
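Under our reading of fig(2), the chain MDP can be written down directly; the slip mixes both the transitions and, assuming the reward follows the executed action, the expected rewards:

```python
import numpy as np

S, A, p_slip = 5, 2, 0.2               # actions: 0 = 'a', 1 = 'b'

P = np.zeros((A, S, S))                # intended dynamics: P[a, s, s'] = p(s'|s, a)
R = np.zeros((S, A))
for s in range(S):
    P[0, s, min(s + 1, S - 1)] = 1.0   # 'a' moves down the chain (self-loop at s5)
    P[1, s, 0] = 1.0                   # 'b' returns to state 1
    R[s, 1] = 2.0                      # reward 2 for 'b' in every state
R[S - 1, 0] = 10.0                     # reward 10 for 'a' in state 5

# Each action slips to the other one with probability 0.2.
P_eff = (1 - p_slip) * P + p_slip * P[::-1]
R_eff = (1 - p_slip) * R + p_slip * R[:, ::-1]
```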
The results of the experiment are shown in fig(3). We can see that the DD
DP algorithm consistently outperforms the other algorithms, converging to the
global optimum in an average of 3 iterations. On the other hand the policy
gradient algorithms often got caught in local optima in this problem, which is

Fig. 4. A graphical illustration of the mountain car problem. The agent (driver) starts the problem at the bottom of a valley in a stationary position. The aim of the agent is to get itself to the rightmost peak of the valley.

because the gradient of the initial policy is often in the direction of the local
optima around state 1. The EM and EM-PG algorithms perform better than
policy gradients, being less susceptible to local optima. Additionally, they were
able to get close to the optimum in the time considered, although neither of
these algorithms actually reached the optimum.

4.2 Mountain Car Problem

In the mountain car problem the agent is driving a car and its state is described
by its current position and velocity, denoted by x ∈ [−1, 1] and v ∈ [−0.5, 0.5]
respectively. The agent has three possible actions, a ∈ {−1, 0, 1}, which corre-
spond to reversing, stopping and accelerating respectively. The problem is de-
picted graphically in fig(4) where it can be seen that the agent is positioned in
a valley. The leftmost peak of the valley is given by x = −1, while the rightmost
peak is given by x = 1. The continuous dynamics are nonlinear and are given by

$$v^{new} = v + 0.1a - 0.0028\cos(2x - 0.5), \qquad x^{new} = x + v^{new}.$$

At the start of the problem the agent is in a stationary position, i.e. v = 0, and
its position is x = 0. The aim of the agent is to maneuver itself to the rightmost
peak, so the reward is set to 1 when the agent is in the rightmost position and
0 otherwise. In the experiment we discretised the position and velocity ranges
into bins of width 0.1, resulting in S = 231 states. A planning horizon H = 25
was sufficient to reach the goal state.
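For concreteness, a sketch of the discretisation described above; clipping the state to its bounds is our assumption, as the text does not specify the boundary behaviour:

```python
import numpy as np

x_bins = np.arange(-1.0, 1.0 + 1e-9, 0.1)   # 21 position bins
v_bins = np.arange(-0.5, 0.5 + 1e-9, 0.1)   # 11 velocity bins -> 231 states

def step(x, v, a):
    # Continuous dynamics from the text, with a in {-1, 0, 1}.
    v_new = np.clip(v + 0.1 * a - 0.0028 * np.cos(2 * x - 0.5), -0.5, 0.5)
    x_new = np.clip(x + v_new, -1.0, 1.0)
    return x_new, v_new

def state_index(x, v):
    # Nearest-bin discretisation of a continuous (x, v) pair.
    return np.argmin(np.abs(x_bins - x)) * len(v_bins) + np.argmin(np.abs(v_bins - v))
```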
As we can see from fig(5) the conclusions from this experiment are similar to
those of the chain problem. Again the DD DP algorithm consistently outper-
formed all of the comparison algorithms, converging to the global optimum in
roughly 7 iterations, while the policy gradients algorithms were again suscep-
tible to local optima. The difference in convergence rates between the DD DP
algorithm and both the EM and EM-PG algorithms is more pronounced here,
which is due to an increase in the amount of hidden data in this problem.


Fig. 5. Mountain car experiment with total expected reward plotted against run time
(in seconds). The plot shows the results for the DD DP algorithm (blue), the EM algo-
rithm (green), policy gradients with fixed step size (purple), policy gradients with a line
search (black) and the switching EM-PG algorithm (red). The experiment was repeated
100 times and the plot shows the mean and standard deviations of the experiments.

4.3 Puddle World


In the puddle world problem [15] the state space is a continuous 2-dimensional grid (x, y) ∈ [0, 1]² that contains two puddles. We considered two circular puddles (of radius 0.1) where the centres of the puddles were generated uniformly at random over the grid [0.2, 0.8]². The agent in this problem is a robot that is
depicted by a point mass in the state space. The aim of the robot is to navigate
itself to a goal region, while avoiding areas of the state space that are covered
in puddles. The initial state of the robot was set to the point (0, 0). There are
four discrete actions (up, down, left and right), each moving the robot 0.1 in that direction. The dynamics were made stochastic by adding the Gaussian noise
N (0, 0.01) to each direction. A reward of 1 is received for all states in the goal
region, which is given by those states satisfying x + y ≥ 1.9. A negative reward
of −40(1 − d) is received for all states inside a puddle, where d is the distance
from the centre of the puddle. In the experiment we discretised the x and y
dimensions into bins of width 0.05, which gave a total of S = 441 states. In this
problem we found that setting the planning horizon to H = 50 was sufficient to
reach the goal region.
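The reward structure admits a direct sketch; summing the penalties when the robot lies inside more than one puddle is our assumption:

```python
import numpy as np

def puddle_reward(pos, puddle_centres, radius=0.1):
    # +1 in the goal region x + y >= 1.9; -40(1 - d) inside a puddle,
    # where d is the distance to that puddle's centre.
    if pos[0] + pos[1] >= 1.9:
        return 1.0
    r = 0.0
    for c in puddle_centres:
        d = np.linalg.norm(pos - c)
        if d < radius:
            r -= 40.0 * (1.0 - d)
    return r
```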
The results of the puddle world experiment are shown in fig(6). For the range
of step sizes considered we were unable to obtain any reasonable results for the
policy gradients algorithm with fixed step sizes or the EM-PG algorithm, so we
omit these from the plot. The policy gradients with line search performed poorly
and consistently converged to a local optimum. The DD DP algorithm converged
after around 7 seconds of computation (this is difficult to see because of the scale


Fig. 6. Puddle world experiment with total expected reward plotted against run time
(in seconds). The plot shows the results for the DD DP algorithm (blue), the EM
algorithm (green) and policy gradients with line search (black).

needed to include the EM results) which corresponds to around 30 iterations of the algorithm. In this problem the MDP EM algorithm has a large amount of
hidden data and as a result has poor convergence, consistently failing to converge
to the optimal policy after 1000 EM steps (taking around 300 seconds).

5 Discussion
We considered the problem of optimising finite horizon MDPs with stationary
policies. Our novel approach uses dual decomposition to construct a two stage
iterative solution method with excellent empirical convergence properties, often
converging to the optimum within a few iterations. This compares favourably
against other planning algorithms that can be readily applied to this problem
class, such as Expectation Maximisation and policy gradients. In future work
we would like to consider more general settings, including partially-observable
MDPs and problems with continuous state-action spaces. Whilst both of these
extensions are non-trivial, this work suggests that the application of Lagrange
duality to these cases could be a particularly fruitful area of research.

References
1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press,
Cambridge (1998)
2. Vlassis, N.: A Concise Introduction to Multiagent Systems and Distributed Artifi-
cial Intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learn-
ing 1(1), 1–71 (2007)

3. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 2nd edn. Athena
Scientific, Belmont (2000)
4. Shachter, R.D.: Probabilistic Inference and Influence Diagrams. Operations Re-
search 36, 589–604 (1988)
5. Williams, R.: Simple Statistical Gradient Following Algorithms for Connectionist
Reinforcement Learning. Machine Learning 8, 229–256 (1992)
6. Toussaint, M., Storkey, A., Harmeling, S.: Expectation-Maximization Methods for Solving (PO)MDPs and Optimal Control Problems. In: Bayesian Time Series Models. Cambridge University Press, Cambridge (in press, 2011), userpage.fu-berlin.de/~mtoussai
7. Furmston, T., Barber, D.: Efficient Inference in Markov Control Problems. In:
Uncertainty in Artificial Intelligence. North-Holland, Amsterdam (2011)
8. Furmston, T., Barber, D.: An analysis of the Expectation Maximisation algorithm
for Markov Decision Processes. Research Report RN/11/13, Centre for Computa-
tional Statistics and Machine Learning, University College London (2011)
9. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont
(1999)
10. Sontag, D., Globerson, A., Jaakkola, T.: Introduction to Dual Decomposition for
Inference. In: Sra, S., Nowozin, S., Wright, S. (eds.) Optimisation for Machine
Learning, MIT Press, Cambridge (2011)
11. Furmston, T., Barber, D.: Variational Methods for Reinforcement Learning. AIS-
TATS 9(13), 241–248 (2010)
12. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press,
Cambridge (2004)
13. Komodakis, N., Paragios, N., Tziritas, G.: MRF Optimization via Dual Decom-
position: Message-Passing Revisited. In: IEEE 11th International Conference on
Computer Vision, ICCV, pp. 1–8 (2007)
14. Dearden, R., Friedman, N., Russell, S.: Bayesian Q learning. AAAI 15, 761–768
(1998)
15. Sutton, R.: Generalization in Reinforcment Learning: Successful Examples Using
Sparse Coarse Coding. NIPS (8), 1038–1044 (1996)
16. Hoffman, M., Doucet, A., De Freitas, N., Jasra, A.: Bayesian Policy Learning with
Trans-Dimensional MCMC. NIPS (20), 665–672 (2008)
17. Hoffman, M., de Freitas, N., Doucet, A., Peters, J.: An Expectation Maximization
Algorithm for Continuous Markov Decision Processes with Arbitrary Rewards.
AISTATS 5(12), 232–239 (2009)
18. Salakhutdinov, R., Roweis, S., Ghahramani, Z.: Optimization with EM and
Expectation-Conjugate-Gradient. ICML (20), 672–679 (2003)
19. Fraley, C.: On Computing the Largest Fraction of Missing Information for the EM Algorithm and the Worst Linear Function for Data Augmentation. Research Report EDI-INF-RR-0934, University of Washington (1999)
20. Barber, D.: Bayesian Reasoning and Machine Learning. Cambridge University
Press, Cambridge (2011)

A Appendix: Dual Decomposition


We follow the description in [20]. A general approach is to decompose a difficult
optimisation problem into a set of easier problems. In this approach we first
identify tractable ‘slave’ objectives Es (x), s = 1, . . . , S such that the ‘master’
objective E(x) decomposes as


$$E(x) = \sum_s E_s(x) \qquad (18)$$

Then the x that optimises the master problem is equivalent to optimising each slave problem E_s(x_s) under the constraint that the slaves agree, x_s = x, s = 1, . . . , S [9]. This constraint can be imposed by a Lagrangian
 
$$L(x, \{x_s\}, \lambda) = \sum_s E_s(x_s) + \sum_s \lambda_s (x_s - x) \qquad (19)$$

Finding the stationary point w.r.t. x gives the constraint $\sum_s \lambda_s = 0$, so that we may then consider

$$L(\{x_s\}, \lambda) = \sum_s \left( E_s(x_s) + \lambda_s x_s \right) \qquad (20)$$

Given λ, we then optimise each slave problem

$$x_s^* = \operatorname*{argmax}_{x_s} \left( E_s(x_s) + \lambda_s x_s \right) \qquad (21)$$

The Lagrange dual is given by

$$L_s(\lambda_s) = \max_{x_s} \left( E_s(x_s) + \lambda_s x_s \right) \qquad (22)$$

In this case the dual bound on the primal is



$$\sum_s L_s(\lambda_s) \ge E(x^*) \qquad (23)$$

where x* is the solution of the primal problem, $x^* = \operatorname*{argmax}_x E(x)$. To update λ one may use a projected sub-gradient method to minimise each $L_s(\lambda_s)$,

$$\lambda_s = \lambda_s - \alpha x_s^* \qquad (24)$$

where α is a chosen positive constant. Then we project,


$$\bar{\lambda} = \frac{1}{S} \sum_s \lambda_s, \qquad \lambda_s^{new} = \lambda_s - \bar{\lambda} \qquad (25)$$


which ensures that $\sum_s \lambda_s^{new} = 0$.
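As a compact illustration of this scheme (a sketch, not the authors' code), one iteration alternates the slave maximisations (21) with the projected subgradient step (24)-(25); the slave solvers are hypothetical callables returning argmax_x (E_s(x) + λ_s·x):

```python
import numpy as np

def dd_iteration(slave_argmax, lambdas, alpha):
    # slave_argmax: list of S functions, each returning x_s^* for its lambda_s.
    xs = np.array([f(lam) for f, lam in zip(slave_argmax, lambdas)])
    lambdas = lambdas - alpha * xs             # subgradient step (24)
    lambdas = lambdas - lambdas.mean(axis=0)   # projection (25): sum_s lambda_s = 0
    return xs, lambdas
```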
Unsupervised Modeling of Partially Observable
Environments

Vincent Graziano, Jan Koutnı́k, and Jürgen Schmidhuber

IDSIA, SUPSI, University of Lugano,


Manno, CH-6928, Switzerland
{vincent,hkou,juergen}@idsia.ch

Abstract. We present an architecture based on self-organizing maps for


learning a sensory layer in a learning system. The architecture, tempo-
ral network for transitions (TNT), enjoys the freedoms of unsupervised
learning, works on-line, in non-episodic environments, is computation-
ally light, and scales well. TNT generates a predictive model of its inter-
nal representation of the world, making planning methods available for
both the exploitation and exploration of the environment. Experiments
demonstrate that TNT learns nice representations of classical reinforce-
ment learning mazes of varying size (up to 20 × 20) under conditions of
high-noise and stochastic actions.

Keywords: Self-Organizing Maps, POMDPs, Reinforcement Learning.

1 Introduction
Traditional reinforcement learning (RL) is generally intractable on raw high-
dimensional sensory input streams. Often a sensory input processor, or more
simply, a sensory layer is used to build a representation of the world, a simplifier
of the observations of the agent, on which decisions can be based. A nice sensory
layer produces a code that simplifies the raw observations, usually by lowering
the dimensionality or the noise, and maintains the aspects of the environment
needed to learn an optimal policy. Such simplifications make the code provided
by the sensory layer amenable to traditional learning methods.
This paper introduces a novel unsupervised method for learning a model of
the environment, Temporal Network Transitions (TNT), that is particularly well-
suited for forming the sensory layer of an RL system. The method is a gener-
alization of the Temporal Hebbian Self-Organizing Map (THSOM), introduced
by Koutnı́k [7]. The THSOM places a recurrent connection on the nodes of the
Self-Organizing Maps (SOM) of Kohonen [5]. The TNT generalizes the recurrent
connections between the nodes in the THSOM map to the space of the agent’s
actions. In addition, TNT brings with it a novel aging method that allows for
variable plasticity of the nodes. These contributions make SOM-based systems
better suited to on-line modeling of RL data.
The quality of the sensory layer learned is dependent on at least (I) the representational power of the sensory layer, and (II) the ability of the RL layer to get
the agent the sensory inputs it needs to best improve its perception of the world.


There are still few works where unsupervised learning (UL) methods have
been combined with RL [1,8,4]. Most of these approaches deal with the UL
separately from the RL, by alternating the following two steps: (1) improving the
UL layer on a set of observations, modifying its encoding of observations, and (2)
developing the RL on top of the encoded information. (1) involves running UL,
without being concerned about the RL, and (2) involves running RL, without
being concerned about the UL. It is assumed that the implicit feedback inherent
in the process leads to both usable code and an optimal policy. For example, see
the SODA architecture, introduced by Provost et al. [12,11], which first uses a
SOM-like method for learning the sensory layer and then goes on to create high-
level actions to transition the agent between internal states. Such approaches
mainly ignore aspect (II). Still fewer approaches actually deal with the mutual
dependence of the UL and RL layers, and how to solve the bootstrapping problem
that arises when integrating the two [9].
The TNT system immediately advances the state-of-the-art for SOM-like sys-
tems as far as aspect (I) is concerned. In this paper we show how the TNT
significantly outperforms the SOM and THSOM methods for learning state rep-
resentations in noisy environments. It can be introduced into any pre-existing
system using SOM-like methods without any fuss and will generate, as we shall
see in Section 4, nice representations of challenging environments. Further, the
TNT is a natural candidate for addressing both aspects (I) and (II) simultane-
ously, since a predictive model of the environment grows inside of it. We only
touch on this in the final section of the paper, leaving it for a further study.
In Section 2 we describe the THSOM, some previous refinements, the Topo-
logical THSOM (T2 HSOM) [2], as well as some new refinements. Section 3 intro-
duces the TNT, a generalization of the T2 HSOM architecture that is suited for
on-line modeling of noisy, high-dimensional RL environments. In Section 4 we study
the performance of the TNT on partially observable RL maze environments, in-
cluding an environment with 400 underlying states and large amounts of noise
and stochastic actions. We make direct comparisons to both SOM and T2 HSOM
methods. Section 5 discusses future directions of research.

2 Topological Temporal Hebbian Self-Organizing Map


We review the original THSOM and T2 HSOM architectures here, along with
some refinements which appear for the first time here. This helps to introduce
the TNT, and aids comprehension when the three methods are compared in Sec-
tion 4. Briefly, a THSOM is a Self-organizing Map (SOM) that uses a recurrent
connection (trained using a Hebbian update rule) on its nodes.
The THSOM consists of N nodes placed at the vertices of a finite lattice in
Euclidean space. Each node i has a prototype vector wi ∈ RD , where D is the
dimension of the observations. In what follows, the variable for the time step, t,
is suppressed only when its omission cannot create confusion.

2.1 Network Activation


The activation yi (t) of node i at step t consists of a spatial component, yi,S (t),
and a temporal component, yi,T (t). For observation x(t) at time t the spatial
activation is

$$y_{S,i}(t) = \sqrt{D} - \|x(t) - w_i(t)\|, \qquad (1)$$
and the temporal activation is

$$y_{T,i}(t) = y(t-1) \cdot m_i(t), \qquad (2)$$


where y(t − 1) is the normalized network activation from the previous time step,
and m_i(t) is row i of the N × N temporal weight matrix M, whose entry m_{i,j}
represents the strength of the connection from node j to node i.
In the original formulation of the THSOM, the network activation, before normalization, was given by y_i = y_{S,i} + y_{T,i}. The factor √D in the spatial activation was used to help equalize the influence of the spatial and temporal components.
Experimental evidence has shown that the network often benefits from a finer
weighting of the two components. A parameter η is introduced to alter the bal-
ance between the components,

$$y_i = \eta\, y_{S,i} + (1 - \eta)\, y_{T,i}. \qquad (3)$$



The offset √D in the spatial component (obviated by η) is kept for historical
reasons, so that when η = 0.5 the original balance is recovered. Putting η = 1
makes the activation of the map, and therefore also the learning (as we shall
later see), entirely based on the spatial component, i.e., a SOM. In the case of
a deterministic HMM (a cycle of underlying states) a perfectly trained THSOM
can ‘predict’ the next underlying state blindly, that is, without making use of
the spatial component, η = 0, when given the initial underlying state.

Normalizing the Activation. It is necessary to normalize the activations


with each step to maintain the stability of the network, otherwise the values
become arbitrarily large. In the original formulation of the THSOM the activa-
tion of each node yi was normalized by the maximum activation over all nodes,
yi = yi / maxk yk . Normalization is now done using the softmax function and
a parameter τ for the temperature. The temperature decreases as the network
is trained; training provides an increasingly accurate model of the transition
matrix, and a cooler temperature allows the network to make good use of the
model.
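Putting equations (1)-(3) and the softmax normalisation together, a minimal sketch of one activation step (our notation: W stacks the prototype vectors row-wise, M is the temporal weight matrix):

```python
import numpy as np

def thsom_activation(x, W, M, y_prev, eta, tau):
    D = W.shape[1]
    y_S = np.sqrt(D) - np.linalg.norm(x - W, axis=1)  # spatial component, eq (1)
    y_T = M @ y_prev                                  # temporal component, eq (2)
    y = eta * y_S + (1 - eta) * y_T                   # balanced activation, eq (3)
    z = np.exp((y - y.max()) / tau)                   # softmax; tau decays with training
    return z / z.sum()
```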

2.2 Learning
The spatial and temporal components are learned separately. The spatial com-
ponent is trained in the same way as a conventional SOM and the training of
the temporal component is based on a Hebbian update rule. Before each up-
date the node b with the greatest activation at step t is determined. This node

b(t) = arg maxk yk (t) is called the best matching unit (BMU). The neighbors of
a node are those nodes which are nearby on the lattice, and are not necessarily
the nodes with the most similar prototype vector.

Learning the Spatial Component. The distance d_i of each node i on the lattice to b is computed using the Euclidean distance, d_i = ||c(i) − c(b)||, where
c(k) denotes the location of node k on the lattice. The BMU and its neighbors
are trained in proportional to the cut-Gaussian spatial neighborhood function
cS , defined as follows:
$$c_{S,i} = \begin{cases} \exp\left(-\frac{d_i^2}{2\sigma_S^2}\right) & \text{if } d_i \le \nu_S, \\ 0 & \text{if } d_i > \nu_S, \end{cases} \qquad (4)$$
where σS and νS are functions of the age of the network (discussed in Sec-
tion 3.3). Typically, these functions are monotone decreasing with respect to t.
The value of σS determines the shape of the Gaussian and νS is the topological
neighborhood cut-off. The prototype vector corresponding to node i is updated
using the following rule:

$$w_i(t+1) = w_i(t) + \alpha_S\, c_{S,i} \left( x(t) - w_i(t) \right), \qquad (5)$$


where αS is the learning rate, and is also a function of the age of the network.

Learning the Temporal Component. To learn the temporal weight matrix


M , the BMUs b(t − 1) and b(t) are considered and three Hebbian learning rules
are applied:

1. Temporal connections from node j = b(t − 1) to i = b(t) and its neighbors


are strengthened (see Figure 1(a)):

$$m_{i,j}(t+1) = m_{i,j}(t) + \alpha_T\, c_{T,i} \left( 1 - m_{i,j}(t) \right). \qquad (6)$$


2. Temporal connections from all nodes j, except b(t − 1), to i = b(t) and its
neighbors are weakened (see Figure 1(b)):

$$m_{i,j}(t+1) = m_{i,j}(t) - \alpha_T\, c_{T,i}\, m_{i,j}(t). \qquad (7)$$


3. Temporal connections from j = b(t − 1) to all nodes outside some neighbor-
hood of b(t) are weakened (see Figure 1(c)):

$$m_{i,j}(t+1) = m_{i,j}(t) - \alpha_T \left( 1 - c_{T,i} \right) m_{i,j}(t). \qquad (8)$$

The temporal neighborhood function c_{T,i} is computed in the same way as the spatial neighborhood c_{S,i} (Equation 4), except that temporal parameters are used.
That is, the temporal learning has its own parameters, σT , νT , and αT , all of
which, again, are functions of the age of the network.
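One reading of how the three rules combine in code (a sketch; since the cut-Gaussian c_T vanishes outside the cut-off, weighting by c_T and 1 − c_T reproduces the three cases):

```python
import numpy as np

def update_temporal(M, c_T, b_prev, alpha_T):
    # M[i, j]: strength of the connection from node j to node i;
    # c_T: cut-Gaussian neighbourhood vector centred on the current BMU b(t).
    M[:, b_prev] += alpha_T * c_T * (1.0 - M[:, b_prev])      # rule (6): excite
    others = np.ones(M.shape[1], dtype=bool)
    others[b_prev] = False
    M[:, others] -= alpha_T * c_T[:, None] * M[:, others]     # rule (7): inhibit
    M[:, b_prev] -= alpha_T * (1.0 - c_T) * M[:, b_prev]      # rule (8): inhibit outside
    return M
```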


Fig. 1. T2 HSOM learning rules: The BMU nodes are depicted with filled circles. (a)
excitation of temporal connections from the previous BMU to the current BMU and its
neighbors, (b) inhibition of temporal connections from all nodes, excluding the previous
BMU, to the current BMU and its neighbors, (c) inhibition of temporal connections
from the previous BMU to all nodes outside some neighborhood of the current BMU.
The connections are modified using the values of a neighborhood function (Gaussian
given by σT ), a cut-off (νT ), and a learning rate (αT ). Figure adapted from [2].

The original THSOM [6,7] contained only the first 2 temporal learning rules,
without the use of neighborhoods. In [2], Ferro et al. introduced the use of
neighborhoods for the training of temporal weights as well as rule (3). They
named the extension the Topological Temporal Hebbian Self-organizing Map
(T2 HSOM).

3 Temporal Network for Transitions


SOMs lack the recurrent connection found in the T2 HSOM. As a result, when
using a SOM to learn an internal state-space representation for RL the following
factor plays an important role: noise may conflate the observations arising from
the underlying states in the environment, preventing an obvious correlation be-
tween observation and underlying state from being found. Noise can make the
disambiguation of the underlying state impossible, regardless of the number of
nodes used. The recurrent connection of the T2 HSOM keeps track of the previous
observation, allowing the system to learn a representation of the environment
that can disambiguate states that a SOM cannot. The theoretical representa-
tional power of the T2 HSOM architecture is that it can model HMMs (which
can be realized as POMDPs with a single action) but not POMDPs. While the
T2 HSOM does consider the previous observation for determining the current
internal state it ignores the action that was taken between the observations. The
TNT architecture extends the T2 HSOM by making explicit use of the action
taken between observations. As a result, TNT can in theory model a POMDP.

3.1 Network Activation


As in the THSOM, two components are used to decide the activation of the
network.

Spatial Activation. As with SOMs, the spatial matching for both the TNT
and THSOM can be carried-out with metrics other than the Euclidean distance.
For example a dot-product can be used to measure the similarity between an
observation and the prototype vectors. In any case, the activation and learning
rules need to be mutually compatible. See [5] for more details.

Temporal Activation. The key difference between the THSOM and TNT
architectures is the way in which the temporal activation is realized. Rather
than use a single temporal weight matrix M , as is done in the THSOM, to
determine the temporal component, the TNT uses a separate matrix Ma , called
a transition-map, for each action a ∈ A. The ‘temporal’ activation is now given by

$$y_{T,i}(t) = y(t-1) \cdot m_i^a(t), \qquad (9)$$

where m_i^a is row i of transition-map M_a and y(t − 1) is the network activation
at the previous time step (see Equation 2).

Combining the Activations. We see two general ways to combine the com-
ponents: additively and multiplicatively. The formulations of the THSOM and
T2 HSOM considered only the additive method, and can be used successfully
with the TNT as well. The two activations are summed using a balancing term
η, precisely as they were in Equation 3.
Since the transition-maps are effectively learning a model of the transition
probabilities between the nodes on the lattice it is reasonable to interpret both
the spatial and the temporal activations as likelihoods and use an element-wise
product to combine them. The spatial activation roughly gives a likelihood that
an observation belongs to a particular node, with better matching units being
considered more likely. In the case of the Euclidean distance, the closer the
observation is to a prototype vector the more likely the correspondence, whereas
with the dot-product the better matched units take on values close to 1.
For example, when using a Euclidean metric with a multiplicative combination
the spatial component can be realized as

$$y_{S,i}(t) = \exp\left( -\eta\, \|x(t) - w_i(t)\| \right), \qquad (10)$$


where η controls how quickly the “likelihood” values drop-off with respect to the
distance. This is important for balancing the spatial and temporal components
when using a multiplicative method.
After the two activations are combined the value needs to be normalized.
Normalization by the maximum value, the length of the activation vector, and
the softmax all produce excellent results. In our experiments, for simplicity, we
have chosen to normalize by the length of the activation vector.
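The multiplicative activation used in the experiments is then, as a sketch (M holds one transition-map per action, indexed by the action a taken between observations):

```python
import numpy as np

def tnt_activation(x, W, M, a, y_prev, eta):
    y_S = np.exp(-eta * np.linalg.norm(x - W, axis=1))  # spatial likelihood, eq (10)
    y_T = M[a] @ y_prev                                 # temporal likelihood, eq (9)
    y = y_S * y_T                                       # element-wise combination
    return y / np.linalg.norm(y)                        # normalise by vector length
```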

The experiments in Section 4 were carried out using a multiplicative method


with the spatial activation determined by Equation 10. The multiplicative ap-
proach produces excellent results, and though the additive method also produces
significant results when compared to the SOM and THSOM we have found it
to be less powerful than the multiplicative approach. As such, we do not report
the results of the additive approach in this paper.

3.2 Learning
The learning step for the TNT is, mutatis mutandis, the same as it was for the
T2 HSOM. Rather than training the temporal weight matrix M at step t, we
train the appropriate transition-map Ma (which is given by the action at ).

3.3 Aging the TNT


Data points from RL environments do not arrive in an independent and iden-
tically distributed (i.i.d.) fashion. Therefore, it is important to impart any sys-
tem learning an internal representation of the world with a means to balance
plasticity and stability. Prototype nodes and transition-maps that have been up-
dated over many samples should be fairly stable, and less likely to forget while
maintaining the ability to slowly adapt to a changing environment. Likewise,
untrained nodes and transition-maps should adapt quickly to new parts of the
environment.
To achieve this variable responsiveness a local measure of age or plasticity
is required. In classical SOM-like systems the learning parameters decay with
respect to the global age of the network. In [5], the use of individual learning
rates1 , αS , is discussed but dismissed, as they are unnecessary for the successful
training of SOMs on non-RL data. Decay with respect to a global age is sufficient
in batch-based settings where the ordering of the data is unimportant. Assigning
an age to the nodes allows for a finer control of all the parameters. As a result,
powerful learning dynamics emerge. See Figure 2 for examples.
Formally, training in SOM-like systems amounts to moving points P , proto-
type vectors w, towards a point Q, the input vector x, along the line connecting
them:

P ← (1 − h)P + hQ,
where h is the so-called neighborhood function. E.g., see Equation 5. This neigh-
borhood function, for both the spatial and temporal learning, is simply the
product of the learning rate of the prototype vector, i, and the cut-Gaussian of
the BMU, b,

$$h_i(t) = \alpha_i(t)\, c_b(t).$$


¹ Marsland [10] uses counters to measure the activity of the nodes in a growing SOM. The counters are primarily used to determine when to instantiate new nodes. They are also used to determine the learning rate of the nodes, though in a less sophisticated way than introduced here.


Fig. 2. Plasticity: (a) when the BMU is young, it strongly affects training of a large perimeter in the network grid. The nodes within are dragged towards the target (×); the old nodes are stable due to their low learning rate. (b) when the BMU is old, a small neighborhood of nodes is weakly trained. In well-known parts of the environment new nodes tend not to be recruited for representation, while in new parts the young nodes and their neighbors are quickly recruited, and the older nodes are left in place.

Each node i is given a spatial age, ξ_{S,i}, and a temporal age, ξ^a_{T,i}, for each action a ∈ A. After determining the activation of the TNT and the BMU, b, the ages of the nodes are simply incremented by the value of their spatial and temporal neighborhood functions, h_{S,i} and h_{T,i} respectively:

$$\xi_i(t+1) = \xi_i(t) + \alpha_i(t)\, c_b(t). \qquad (11)$$

3.4 Varying the Parameters


Now that the nodes have individual ages we can determine the learning rate
and cut-Gaussian locally. One approach is to decay the spatial, σS , νS , αS , and
temporal, σT , νT , αT , learning parameters using the exponential function. For
each parameter Λ choose an initial value Λ◦ , a decay rate Λk , and an asymptotic
value Λ∞ . The value of any parameter is then determined by

$$\Lambda(\xi) = (\Lambda_\circ - \Lambda_\infty) \exp(-\xi/\Lambda_k) + \Lambda_\infty, \qquad (12)$$


where ξ is the appropriate age for the parameter being used. Note that setting
Λk to “∞” makes the parameter constant.
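Equation (12) and the age increment (11) amount to a few lines; a sketch:

```python
import numpy as np

def decay(xi, init, final, k):
    # Eq (12): a parameter's value as a function of a node's age xi.
    # k = np.inf gives exp(0) = 1, i.e. the parameter stays at its initial value.
    return (init - final) * np.exp(-xi / k) + final

# Age increment, eq (11): after each step a node ages by its
# neighbourhood-function value, e.g. xi[i] += alpha_i * c_b[i].
```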

4 Experiments
The aim of the experiments is to show how the recurrent feedback (based on
previous observations and actions) empowers the TNT to discern underlying-
states from noisy observations.


Fig. 3. 5 × 5 maze experiment: (a) two dimensional maze with randomly placed
walls, (b) noisy observations of a random walk (σ = 1/3 cell width), (c) trained TNT.
Disks depict the location of the prototype vectors, arrows represent the learned tran-
sitions, and dots represent transitions to the same state.

4.1 Setup
The TNT is tested on two-dimensional mazes. The underlying-states of the maze
lie at the centers of the tiles constituting the maze. There are four possible ac-
tions: up, down, left, and right. A wall in the direction of intended movement,
prevents a change in underlying-state. After each action the TNT receives a
noisy observation of the current state. In the experiments a random walk (Fig-
ure 3(b)) is presented to the TNT. The TNT tries to (1) learn the underlying-
states (2-dimensional coordinates) from the noisy observations and (2) model the
transition probabilities between states for each of the actions. The observations
have Gaussian noise added to the underlying-state information. The networks
were trained on a noise level of σ = 1/3 of the cell width, so that 20% of the observations were placed in a cell of the maze that did not correspond to the underlying-state.
Mazes of size 5 × 5, 10 × 10, and 20 × 20 were used. The lengths of the training random walks were 10^4, 5 × 10^4, and 10^5, respectively.
A parameter γ determines the reliability, or determinism, of the actions. A
value of 0.9 means that the actions move in a direction other than the one
indicated by their label with a probability of 1 − 0.9 = 0.1.
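A sketch of one interaction with the maze, under our assumptions that a failed action picks uniformly among the other three directions and that walls (including the maze border) are given as a set of blocked cell-to-cell transitions:

```python
import numpy as np

MOVES = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

def maze_step(state, action, walls, gamma, sigma, rng):
    if rng.random() > gamma:                           # action slips with prob 1 - gamma
        action = rng.choice([a for a in MOVES if a != action])
    nxt = (state[0] + MOVES[action][0], state[1] + MOVES[action][1])
    if (state, nxt) not in walls:                      # a wall blocks the move
        state = nxt
    obs = np.array(state) + rng.normal(0.0, sigma, 2)  # noisy observation of the cell
    return state, obs
```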
The parameters used for training the TNT decay exponentially according to
Equation 12. Initially, the spatial training has a high learning rate at the BMU,
0.9, and includes the 12 nearest nodes on the lattice. This training decays to a final learning rate of 0.001 at the BMU with a neighborhood of only the 4 adjacent nodes. The temporal training affects only the BMU, with an initial learning rate of 0.25 that decays to 0.001. The complete specification of the learning
parameters is recorded in Table 1.
After a network is trained, disambiguation experiments are performed. The
TNT tries to identify the underlying-state from the noisy observations on newly
generated random walks. The amount of noise and the level of determinism are

Table 1. TNT learning parameters

       α_S     σ_S    ν_S    α_T     σ_T    ν_T
Λ◦     0.90    1.75   2.10   0.25    2.33   0.5
Λ∞     0.001   0.75   1.00   0.001   0.75   0.5
Λk     10      10     10     10      10     ∞

varied throughout the disambiguation experiments. 10 trials of each disambigua-


tion experiment were run, with the average results reported. Added noise was
drawn from normal distributions with σ ∈ {1/6, 1/3, 1/2, 1, 3/2}, and the determinism
of the actions was either 1.0 or 0.9. Putting σ = 1/6 results in only slightly
noisy observations, whereas with σ = 3/2 more than 80% of the observations are
placed in the wrong cell, effectively hiding the structure of the maze, see Table 2.
We compare the TNT to a SOM and a T2 HSOM on mazes, with varying
levels of noise and determinism. The SOM and T2 HSOM were also trained on-
line using the aging methods introduced for the TNT. We then examine the
performance of the TNT on larger environments with determinism γ = 0.9,
while varying the amount of noise.

4.2 Results

In the 5 × 5 mazes the variable plasticity allows for the prototype vectors (the
spatial component) of all three methods to learn the coordinates of the under-
lying states arbitrarily well. The performance of the SOM reached its theoretical
maximum while being trained on-line on non-i.i.d data. That is, it misidentified
only those observations which fell outside the correct cell. At σ = 1/2 the SOM
can only identify around 55% of the observations correctly. The TNT on the
other-hand is able to essentially identify 100% of all such observations in a com-
pletely deterministic environment (γ = 1.0), and 85% in the stochastic setting
(γ = 0.9). Both the SOM and THSOM are unaffected by the level of stochas-
ticity, since they do not model the effects of particular actions. See Table 2 for
further comparisons.
After training, in both deterministic and stochastic settings, the transition-
maps accurately reflect the transition tables of the underlying Markov model.
The tables learned with deterministic actions are essentially perfect, while the
ones learned in the stochastic case, suffer somewhat, see Table 3. This is reflected
in the performance degradation with higher amounts of noise.
The degradation seen as the size of the maze increases is partially a result of using the same learning parameters for environments of different size; the trouble
is that the younger nodes need to recruit enough neighboring nodes to rep-
resent new parts of the environment while stabilizing before some other part
of the maze is explored. The learning parameters need to be sensitive to the
size of the environment. Again, this problem arises since we are training on-line
and the data is not i.i.d. This problem can be addressed somewhat by making
the ratio of nodes-to-states greater than 1. We return to this issue in Section 5.

Table 2. Observation disambiguation: For a variance of Gaussian noise relative


to the maze cell width, the percentage of ambiguous observations is shown on the
first line. A SOM can, at best, achieve the same performance. The second line shows
the percentage of observations correctly classified by the T2 HSOM (single recurrent
connection). The third line shows the percentage of observations correctly classified by
the TNT when the actions are deterministic, γ = 1. The last line gives the performance
of the TNT when the actions are stochastic, γ = 0.9.

σ =             1/6     1/3     1/2     1      3/2
SOM             99.6    79.4    55.6    25.5   16.6
T2 HSOM         96.6    80.4    59.2    27.8   17.9
TNT, γ = 1      100.0   100.0   99.9    99.0   98.2
TNT, γ = 0.9    96.4    94.4    85.0    73.1   56.0

Table 3. Observation disambiguation for larger mazes: Percentage of observa-


tions assigned to correct states in stochastic setting, γ = 0.9

σ =             1/6    1/3    1/2    1      3/2
TNT, 5 × 5      96.4   94.4   85.0   73.1   56.0
TNT, 10 × 10    93.8   92.2   77.0   56.9   29.4
TNT, 20 × 20    86.8   83.6   72.9   45.3   23.0

It is important to note that the percentage of ambiguous observations that can actually be disambiguated decreases sharply as the noise increases in stochastic
settings. Simply, when the noise is sufficiently high there is no general way to dis-
ambiguate an observation following a misrepresented action from an observation
following a well-represented action with high noise. See Figure 4.

5 Discussion

Initial experiments show that the TNT is able to handle much larger environ-
ments without a significant degradation of results. The parameters, while they do not require precise tuning, do need to be reasonably matched to
the environment. The drop-off seen in Table 3 can be mostly attributed to not
having used better suited learning parameters, as the same decay functions were
used in all the experiments. Though the aging rules and variable plasticity have
largely addressed the on-line training problem, they have not entirely solved
it. As a result, we plan to explore a constructive TNT, inspired by the “grow

Fig. 4. The trouble with stochastic actions: The “go right” action can transition
to 4 possible states. Noisy observations coupled with stochastic actions can make it
impossible to discern the true state.

when required” [10] and the “growing neural gas” [3] architectures. Nodes will
be added on the fly making the recruitment of nodes to new parts of the envi-
ronment more organic in nature; as the agent goes somewhere new it invokes a
new node. We expect such an architecture to be more flexible, able to handle a
larger range of environments without an alteration of parameters, bringing us closer to a general learning system.
In huge environments where continual-learning plays an increasing role, the
TNT should have two additional, related features. (1) The nodes should be able
to forget, so that resources might be recruited to newly visited parts of the
environment, and (2) a minimal number of nodes should be used to represent
the same underlying state. (1) can be solved by introducing a youthening aspect
to the aging. Simply introduce another term in the aging function which slightly
youthens the nodes, so that nodes near the BMU increase in age overall, while
nodes further away decrease. (2) is addressed by moving from a cut-Gaussian to
a similar “Mexican hat” function for the older nodes. This will push neighboring
nodes away from the expert, making it more distinguished.
We have found that the Markov model can be learned when the nodes in
the TNT are not in 1-to-1 correspondence with the underlying states. Simple
clustering algorithms, based on the proximity of the prototype vectors and the
transition-maps, are able to detect likely duplicates. An immediate and impor-
tant follow-up to this work would consider continuous environments. We expect
the topological mixing, inherent to SOM-based architectures, to give dramatic
results.
A core challenge in extending reinforcement learning (RL) to real-world agents
is uncovering how such an agent can select actions to autonomously build an ef-
fective sensory mapping through its interactions with the environment. The use
of artificial curiosity [13] with planning to address this problem has been carried
out in [9], where the sensory layer was built-up using vector quantization (a SOM
without neighborhoods). Clearly, as we established in this paper, a TNT can
learn a better sensory layer than any SOM. The transition-maps effectively model
the internal-state transitions and therefore make planning methods naturally
available to a learning system using a TNT. A promising line of inquiry, therefore,

is to derive a curiosity signal from the learning updates inside the TNT to supply
the agent with a principled method to explore the environment so that a nicer
representation of it can be learned.

Acknowledgments. This research was funded in part through the following


grants: SNF– Theory and Practice of Reinforcement Learning (200020-122124/1),
and SNF– Recurrent Networks (200020-125038/1).

References
1. Fernández, F., Borrajo, D.: Two steps reinforcement learning. International Journal
of Intelligent Systems 23(2), 213–245 (2008)
2. Ferro, M., Ognibene, D., Pezzulo, G., Pirrelli, V.: Reading as active sensing: a
computational model of gaze planning during word discrimination. Frontiers in
Neurorobotics 4 (2010)
3. Fritzke, B.: A growing neural gas network learns topologies. In: Advances in Neural
Information Processing Systems, vol. 7, pp. 625–632. MIT Press, Cambridge (1995)
4. Gisslén, L., Graziano, V., Luciw, M., Schmidhuber, J.: Sequential Constant Size
Compressors and Reinforcement Learning. In: Proceedings of the Fourth Confer-
ence on Artificial General Intelligence (2011)
5. Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Heidelberg (2001)
6. Koutník, J.: Inductive modelling of temporal sequences by means of self-organization. In: Proceedings of the International Workshop on Inductive Modelling (IWIM 2007), pp. 269–277. CTU in Prague, Ljubljana (2007)
7. Koutnı́k, J., Šnorek, M.: Temporal hebbian self-organizing map for sequences. In:
ICANN 2006, vol. 1, pp. 632–641. Springer, Heidelberg (2008)
8. Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforce-
ment learning. In: The 2010 International Joint Conference on Neural Networks
(IJCNN), pp. 1–8 (July 2010)
9. Luciw, M., Graziano, V., Ring, M., Schmidhuber, J.: Artificial Curiosity with Plan-
ning for Autonomous Perceptual and Cognitive Development. In: Proceedings of
the International Conference on Development and Learning (2011)
10. Marsland, S., Shapiro, J., Nehmzow, U.: A self-organising network that grows when
required. Neural Netw. 15 (October 2002)
11. Provost, J.: Reinforcement Learning in High-Diameter, Continuous Environments.
Ph.D. thesis, Computer Sciences Department, University of Texas at Austin,
Austin, TX (2007)
12. Provost, J., Kuipers, B.J., Miikkulainen, R.: Developing navigation behavior
through self-organizing distinctive state abstraction. Connection Science 18 (2006)
13. Schmidhuber, J.: Formal Theory of Creativity, Fun, and Intrinsic Motivation
(1990–2010). IEEE Transactions on Autonomous Mental Development 2(3), 230–
247 (2010)
Tracking Concept Change with Incremental Boosting by
Minimization of the Evolving Exponential Loss

Mihajlo Grbovic and Slobodan Vucetic

Department of Computer and Information Sciences


Temple University, Philadelphia, USA
{mihajlo.grbovic,slobodan.vucetic}@temple.edu

Abstract. Methods involving ensembles of classifiers, such as bagging and boost-


ing, are popular due to the strong theoretical guarantees for their performance
and their superior results. Ensemble methods are typically designed by assum-
ing the training data set is static and completely available at training time. As
such, they are not suitable for online and incremental learning. In this paper we
propose IBoost, an extension of AdaBoost for incremental learning via optimiza-
tion of an exponential cost function which changes over time as the training data
changes. The resulting algorithm is flexible and allows a user to customize it
based on the computational constraints of the particular application. The new al-
gorithm was evaluated on stream learning in presence of concept change. Exper-
imental results showed that IBoost achieves better performance than the original
AdaBoost trained from scratch each time the data set changes, and that it also
outperforms previously proposed Online Coordinate Boost, Online Boost and its
non-stationary modifications, Fast and Light Boosting, ADWIN Online Bagging
and DWM algorithms.

Keywords: Ensemble Learning, Incremental Learning, Boosting, Concept Change.

1 Introduction

There are many practical applications in which the objective is to learn an accurate
model using training data set which changes over time. A naive approach is to retrain
the model from scratch each time the data set is modified. Unless the new data set is
substantially different from the old data set, retraining can be computationally waste-
ful. It is therefore of high practical interest to develop algorithms which can perform
incremental learning. We define incremental learning as the process of updating the ex-
isting model when the training data set is changed. Incremental learning is particularly
appealing for Online Learning, Active Learning, Outlier Removal and Learning with
Concept Change.
There are many single-model algorithms capable of efficient incremental learning,
such as linear regression, Naı̈ve Bayes and kernel perceptrons. However, it is still an
open challenge how to develop efficient ensemble algorithms for incremental learning.
In this paper we consider boosting, an algorithm that trains a weighted ensemble of
simple weak classifiers. Boosting is very popular because of its ease of implementation
and very good experimental results. However, it requires sequential training of a large


number of classifiers which can be very costly. Rebuilding a whole ensemble upon
slight changes in training data can put an overwhelming burden to the computational
resources. As a result, there exists a high interest for modifying boosting for incremental
learning applications.
In incremental learning with concept change, where properties of a target variable
which we are predicting can unexpectedly change over time, a typical approach is to
use a sliding window and train a model using examples within the window. Upon each
window repositioning the data set changes only slightly and it is reasonable to attempt
to update the existing model instead of training a new one. Many ensemble algorithms
have been proposed for learning with concept change [1–5]. However, in most cases, the
algorithms are based on heuristics and applicable to a limited set of problems. The boosting algorithm proposed in [4] uses each new data batch to train an additional classifier
and to recalculate the weights for the existing classifiers. These weights are recalcu-
lated instead of updated, thus discarding the influence of the previous examples. In [5]
a new data batch is weighted depending on the current ensemble error and used to train
a new classifier. Instead of classifier weights, probability outputs are used for making
ensemble predictions. OnlineBoost [6], which uses a heuristic method for updating the
example weights, was modified for evolving concepts in [8] and [7]. The Online Coor-
dinate Boost (OCB) algorithm proposed in [9] performs online updates of weights of a
fixed set of base classifiers trained offline. The closed form weight update procedure is
derived by minimizing the approximation on AdaBoost’s loss. Because OCB does not
have a mechanism for adding and removing base classifiers and one cannot straightfor-
wardly be derived, the algorithm is not suitable for concept change applications.
In this paper, an extension of the popular AdaBoost algorithm for incremental learn-
ing is proposed and evaluated on concept change applications. It is based on the treat-
ment of AdaBoost as the additive model that iteratively optimizes an exponential cost
function [10]. Given this, the task of IBoost can be stated as updating of the current
boosting ensemble to minimize the modified cost function upon change of the training
data. The issue of model update consists of updating the existing classifiers and their
weights or adding new classifiers using the updated example weights. We intend to
experimentally show that IBoost, in which the ensemble update always leads towards
minimization of the exponential cost, can significantly outperform heuristically based
modifications of AdaBoost for incremental learning with concept change which do not
consider this.

1.1 Preliminaries

The AdaBoost algorithm is formulated in [10] as an ensemble of base classifiers trained


in a sequence using weighted data set versions. At each iteration, it increases the weights
of examples which were misclassified by the previously trained base classifier. Final
classifier is defined as a linear combination of all base classifiers.
While AdaBoost has been developed using arguments from the statistical learning
theory, it has been shown that it can be interpreted as fitting an additive model through
an iterative optimization of an exponential cost function.
For a two-class classification setup, let us assume a data set D is available for train-
ing, D = {(xi , yi ), i = 1, ..., N }, where xi is a K-dimensional feature vector and

Algorithm 1. AdaBoost algorithm

Input: D = {(x_i, y_i), i = 1, ..., N}, initial data weights w_i^0 = 1/N, number of iterations M

FOR m = 0 TO M − 1
(a) Fit a classifier f_{m+1}(x) to training data by minimizing

$$J_{m+1} = \sum_{i=1}^{N} w_i^m I(y_i \neq f_{m+1}(x_i)) \qquad (1)$$

(b) Evaluate the quantities:

$$\epsilon_{m+1} = \sum_{i=1}^{N} w_i^m I(y_i \neq f_{m+1}(x_i)) \Big/ \sum_{i=1}^{N} w_i^m \qquad (2)$$

(c) and then use these to evaluate

$$\alpha_{m+1} = \ln\left( \frac{1 - \epsilon_{m+1}}{\epsilon_{m+1}} \right) \qquad (3)$$

(d) Update the example weights

$$w_i^{m+1} = w_i^m e^{\alpha_{m+1} I(y_i \neq f_{m+1}(x_i))} \qquad (4)$$

END
Make predictions for a new point X using:

$$Y = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m f_m(X) \right) \qquad (5)$$

yi ∈ {+1, −1} is its class label. The exponential cost function is defined as


$$E_m = \sum_{i=1}^{N} e^{-y_i F_m(x_i)}, \qquad (6)$$

where Fm (x) is the current additive model defined as a linear combination of m base
classifiers produced so far,
$$F_m(x) = \sum_{j=1}^{m} \alpha_j f_j(x), \qquad (7)$$

where base classifier fj (x) can be any classification model with output values +1 or
−1 and αj are constant multipliers called the confidence parameters. The ensemble
prediction is made as the sign of the weighted committee, sign(Fm (x)).
Given the additive model Fm (x) at iteration m−1 the objective is to find an improved
one, Fm+1 (x) = Fm (x) + αm+1 fm+1 (x), at iteration m. The cost function can be
expressed as


N 
$$E_{m+1} = \sum_{i=1}^{N} e^{-y_i (F_m(x_i) + \alpha_{m+1} f_{m+1}(x_i))} = \sum_{i=1}^{N} w_i^m e^{-y_i \alpha_{m+1} f_{m+1}(x_i)}, \qquad (8)$$
where
$$w_i^m = e^{-y_i F_m(x_i)} \qquad (9)$$

are called the example weights. By rearranging Em+1 we can obtain an expression that
leads to the familiar AdaBoost algorithm,

$$E_{m+1} = (e^{\alpha_{m+1}} - e^{-\alpha_{m+1}}) \sum_{i=1}^{N} w_i^m I(y_i \neq f_{m+1}(x_i)) + e^{-\alpha_{m+1}} \sum_{i=1}^{N} w_i^m. \qquad (10)$$

For fixed αm+1 , classifier fm+1 (x) can be trained by minimizing (10). Since αm+1 is
fixed, the second term is constant and the multiplication factor in front of the sum in
the first term does not affect the location of minimum, the base classifier can be found
as fm+1 (x) = arg minf (x) Jm+1 , where Jm+1 is defined as the weighted error func-
tion (1). Depending on the actual learning algorithm, the classifier can be trained by
directly minimizing the cost function (1) (e.g. Naı̈ve Bayes) or by resampling the train-
ing data according to the weight distribution (e.g. decision stumps). Once the training
of the new base classifier fm+1 (x) is finished, αm+1 can be determined by minimizing
(10) assuming fm+1 (x) is fixed. By setting ∂Em+1 /∂αm+1 = 0 the closed form solu-
tion can be derived as (3), where εm+1 is defined as in (2). After we obtain fm+1 (x)
and αm+1 , before continuing to round m + 1 of the boosting procedure and training
of fm+2 , the example weights wim have to be updated. By making use of (9), weights
for the next iteration can be calculated as (4), where I(y_i ≠ f_{m+1}(x_i)) is an indicator function which equals 1 if the i-th example is misclassified by f_{m+1} and 0 otherwise.
Thus, weight wim+1 depends on the performance of all previous base classifiers on i-th
example. The procedure of training an additive model by stage-wise optimization of
the exponential function is executed in iterations, each time adding a new base clas-
sifier. The resulting learning algorithm is identical to the familiar AdaBoost algorithm
summarized in Algorithm 1.
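For concreteness, the following is a compact, hedged sketch of Algorithm 1 (not the authors' implementation); `fit_weak` is a hypothetical routine returning a classifier h with h(X) ∈ {−1, +1} trained on the weighted data, e.g. a decision stump:

```python
import numpy as np

def adaboost(X, y, fit_weak, M):
    N = len(y)
    w = np.full(N, 1.0 / N)                 # initial weights w_i^0 = 1/N
    ensemble = []
    for _ in range(M):
        h = fit_weak(X, y, w)               # minimise the weighted error (1)
        miss = (h(X) != y)
        eps = np.dot(w, miss) / w.sum()     # eq (2)
        alpha = np.log((1.0 - eps) / eps)   # eq (3)
        w = w * np.exp(alpha * miss)        # eq (4)
        ensemble.append((alpha, h))
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))  # eq (5)
```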
Consequences. There is an important aspect of AdaBoost relevant to development of
its incremental variant. Due to the iterative nature of the algorithm, ∂Em+1 /∂αj = 0
will only hold for the most recent classifier, j = m + 1, but not necessarily for the
previous ones, j = 1, ..., m. Thus, AdaBoost is not a global optimizer of the confidence
parameters αj [12]. As an alternative to the iterative optimization, one could attempt to
globally optimize all α parameters after the addition of a base classifier. In spite of being
more time consuming, this would lead to better overall performance.
Weak learners (e.g decision stumps) are typically used as base classifiers because of
their inability to overfit the weighted data, which could produce a very large or infinite
value of αm+1 .
As observed in [10], AdaBoost guarantees exponential progress towards minimiza-
tion of the training error (6) with addition of each new weak classifier, as long as they
classify the weighted training examples better than random guessing (αm+1 > 0). The
convergence rate is determined in [13]. Note that αm+1 can also be negative if fm+1
does worse than 50% on the weighted set. In this case (m+1)-th classifier automatically
changes polarity because it is expected to make more wrong predictions than the correct
ones. Alternatively, fm+1 can be removed from the ensemble. A common approach is
to terminate the boosting procedure when weak learners with positive confidence pa-
rameters can no longer be produced.
In the next section, we introduce IBoost, an algorithm which naturally extends Ad-
aBoost to incremental learning where the cost function changes as the data set changes.

2 Incremental Boosting (IBoost)


Let us assume an AdaBoost committee with m base classifiers Fm (x) has been trained
on data set Dold = {(xi , yi ), i = 1, ..., N } and that we wish to train a committee
upon the data set changed to Dnew by addition of Nin examples, Din = {(xi , yi ), i =
1, ..., Nin }, and removal of Nout examples, Dout ⊂ D. The new training data set is
Dnew = Dold − Dout + Din .
One option is to discard Fm (x) and train a new ensemble from scratch. Another op-
tion, more appealing from the computational perspective, is to reuse the existing ensem-
ble. If the reuse option is considered, it is very important in the design of incremental
AdaBoost to observe that the cost function changes upon change of the data set.
Upon change of data set the cost function changes from

$$E_m^{old} = \sum_{i \in D_{old}} e^{-y_i F_m(x_i)} \qquad (11)$$

to

$$E_m^{new} = \sum_{i \in D_{new}} e^{-y_i F_m(x_i)}. \qquad (12)$$

There are several choices one could make regarding reuse of the current ensemble
Fm (x):
1. update αt , t = 1, ..., m, to better fit the new data set;
2. update base classifiers in Fm (x);
3. update both αt , t = 1, ..., m, and the base classifiers in Fm (x);
4. add a new base classifier fm+1 and its αm+1 .
The second and third alternatives are not considered here because they would require potentially costly updates of the base classifiers, as well as updates of the example weights and the confidence parameters of the base classifiers. In the remainder of the paper, it
will be assumed that trained base classifiers are fixed. It will be allowed, however, to
remove the existing classifiers from an ensemble.
The first alternative involves updating confidence parameters αj , j = 1, ..., m, in
such way that they now minimize (12). This can be achieved in two ways.
Batch Update updates each α_j using the gradient descent algorithm α_j^{new} = α_j^{old} − η · ∂E_m^{new}/∂α_j , where η is the learning rate. Following this, the resulting update rule is

α_j^{new} = α_j^{old} + η \sum_{i \in D_{new}} y_i f_j(x_i) e^{-y_i \sum_{k=1}^{m} α_k^{old} f_k(x_i)} .    (13)

One update of the m confidence parameters takes O(N · m) time. If the training set
changed only slightly, only a few updates should be sufficient for the convergence.
The number of batch updates to be performed should be selected depending on the
computational constraints.
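As an illustration, one batch pass of (13) might look as follows in Python (a sketch under the conventions of the AdaBoost listing above, not the authors' implementation):

def batch_update(alphas, fs, X, y, eta):
    # One simultaneous gradient step on all confidence parameters, cf. (13).
    # alphas: list of floats; fs: fixed base classifiers; y in {-1, +1}.
    F = sum(a * f(X) for a, f in zip(alphas, fs))   # ensemble margin F_m(x_i)
    w = np.exp(-y * F)                              # example weights, cf. (9)
    for j, f in enumerate(fs):
        alphas[j] += eta * np.dot(w, y * f(X))      # -eta * dE/d(alpha_j)
    return alphas

Note that all partial derivatives are evaluated at the current α values before any of them is changed, so one call performs exactly one O(N · m) gradient step.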
Stochastic Update is a faster alternative for updating each α_j . It uses stochastic gradient descent instead of the batch version. The update of α_j using only the example (x_i , y_i ) ∈ D_new is

α_j^{new} = α_j^{old} + η y_i f_j(x_i) e^{-y_i \sum_{k=1}^{m} α_k^{old} f_k(x_i)} .    (14)
At the extreme, we can run the stochastic gradient using only the new examples,
(xi , yi ) ∈ Din . This kind of updating is especially appropriate for an aggressive it-
erative schedule where data are arriving one example at a time at a very fast rate and it
is infeasible to perform batch update.
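The stochastic counterpart of the sketch above touches a single example per step; a minimal rendering of (14):

def stochastic_update(alphas, fs, x_i, y_i, eta):
    # One stochastic gradient step on all alphas using a single example, cf. (14).
    w_i = np.exp(-y_i * sum(a * f(x_i) for a, f in zip(alphas, fs)))
    for j, f in enumerate(fs):
        alphas[j] += eta * y_i * f(x_i) * w_i
    return alphas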
The fourth alternative (adding a new base classifier) is attractive because it allows
training a new base classifier on the new data set in a way that optimally utilizes the ex-
isting boosting ensemble. Before training fm+1 we have to determine example weights.
Weight Calculation. There are three scenarios when there is a need to calculate or
update the example weights.
First, if confidence parameters were unchanged since the last iteration, we can keep
the weights of the old examples and only calculate the weights of the new ones using

w_i^m = e^{\sum_{t=1}^{m} α_t I(y_i ≠ f_t(x_i))} ,   i ∈ D_in .    (15)

Second, if confidence parameters were updated, then all example weights have to be
calculated using (9).
Third, if any base classifier fj was removed, the example weights can be updated by
applying
w_i^m = w_i^{m-1} e^{-α_j I(y_i ≠ f_j(x_i))} ,    (16)
which is as fast as (4).
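The three cases can be sketched as follows (our rendering; boolean arrays stand in for the indicator function):

def weights_for_new_examples(alphas, fs, X_in, y_in):
    # Case 1: alphas unchanged, weight only the incoming examples D_in, cf. (15).
    s = sum(a * (f(X_in) != y_in) for a, f in zip(alphas, fs))
    return np.exp(s)

def recompute_all_weights(alphas, fs, X, y):
    # Case 2: alphas changed, recompute every weight from scratch, cf. (9).
    return np.exp(-y * sum(a * f(X) for a, f in zip(alphas, fs)))

def cancel_removed_classifier(w, alpha_j, f_j, X, y):
    # Case 3: classifier f_j removed, divide out its contribution, cf. (16).
    return w * np.exp(-alpha_j * (f_j(X) != y))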
Adding Base Classifiers. After updating the example weights, we can proceed to train
a new base classifier in the standard boosting fashion. When deciding how exactly to
update the ensemble when the data changes, one should consider a tradeoff between the
accuracy and computational effort. The first question is whether to train a new classifier
and the second whether to update α values, and if the answer is affirmative, which
update mode to use and how many update iterations to run.
In applications where the data set changes very frequently and only slightly, it
can become infeasible to train a new base classifier after each change. In that case one
can intentionally wait until enough incoming examples are misclassified by the current
model Fm and only then decide to add a new base classifier fm+1 . In the meantime, the
computational resources can be used to update α parameters.
Removing Base Classifiers. In order to avoid unbounded growth in the number of base classifiers, we propose a strategy that removes a base classifier each time a predetermined budget is exceeded. Similar strategies, which remove the oldest base model [5] or the one with the poorest performance on the current data [1, 2, 4], have been proposed.
Additionally, in case of data with concept change, a classifier fj can become out-
dated and receive negative αj , as a result of (13) or (14), because it is trained on older
examples that were drawn from a different concept.
Following this discussion, our strategy is to remove classifiers if one of the two
scenarios occurs:

• Memory is full and we want to add f_{m+1}: remove the classifier with the lowest α.
• Remove f_j if α_j becomes negative during the α updates.

If classifier f_j is removed, this is equivalent to setting α_j = 0. To account for this change, the α parameters of the remaining classifiers are updated using (13) or (14) and the example weights are recalculated using (9). If time does not permit any modification of the α parameters, the influence of the removed classifier on the example weights can be canceled using (16).
Convergence. An appealing feature of IBoost is that it retains the AdaBoost convergence properties. At any given time and for any given set of m base classifiers f_1, f_2, ..., f_m, as long as the confidence parameters α_1, α_2, ..., α_m are positive and minimize E_m^{new}, the addition of a new base classifier f_{m+1} by minimizing (1) and the calculation of α_{m+1} using (3) will lead towards minimization of E_m^{new}. This ensures the convergence of IBoost.

2.1 IBoost Flowchart

Figure 1 presents a summary of the IBoost algorithm. We take a slightly wider view and point out all the options a practitioner could select, depending on the particular application and computational constraints.
Initial data and example weights are used to train the first base classifier. After the
data set is updated, the user always has a choice of just updating the confidence param-
eters, training a new classifier, or doing both.
First, the user decides whether to train a new classifier or not. If the choice is not to,
the algorithm just updates the confidence parameters using (13) or (14) and modifies the
example weights (9). Otherwise, it proceeds to check if the budget is full and potentially
removes the base classifier with minimum α. Next, the user can choose whether to
update α parameters. If the choice is to perform the update, α parameters are updated
using (13) or (14) which is followed by recalculation of example weights (9). Otherwise,
before proceeding to training a new base classifier, the algorithm still has to calculate
weights for the new examples Din using (15). Finally, the algorithm proceeds with
training fm by minimizing (1), calculating αm (3) and updating example weights (4).

[Figure: flowchart of the IBoost update cycle: after each data update the user either only updates the α parameters via (13)/(14) and recomputes the weights via (9), or trains a new classifier; if the budget is full the classifier with minimum α is removed, α may then be updated via (13)/(14) with weights recomputed via (9), or the weights of D_in are computed via (15), using (16) to cancel a removed classifier; finally the new base classifier is trained by minimizing (1), its α is computed via (3), and the weights are updated via (4)]

Fig. 1. IBoost algorithm flowchart



IBoost (Fig. 1) was designed to provide great flexibility with respect to budget, training and prediction speed, and stream properties. In this paper, we present an IBoost
variant for Concept Change. However, using the flowchart we can also easily design
variants for applications such as Active Learning or Outlier Removal.

2.2 IBoost for Concept Change

Learning under concept change has received a great deal of attention during the last
decade, with a number of developed learning strategies (see overview [14]). IBoost
falls into the category of adaptive ensembles with instance weighting.
When dealing with data streams with concept change it is beneficial to use a sliding
window approach. At each step, one example is being added and one is being removed.
The common strategy is to remove the oldest one; however, other strategies exist. Se-
lection of window size n presents a tradeoff between achieving maximum accuracy on
the current concept and fast recovery from distribution changes.
As we previously discussed, IBoost is highly flexible as it can be customized to meet memory and time constraints. For concept change applications we propose the IBoost variant summarized in Algorithm 2. In this setup, after each window repositioning the data within the window is used to update the α parameters and potentially train a new base classifier f_{m+1}. The Stochastic version performs b updates of each α using the newest example only, while the Batch version performs b iterations of α updates using all the examples in the window. The new classifier is added when the AddCriterion (k mod p = 0) ∧ (y_k ≠ F_m(x_k)) is satisfied, where (x_k, y_k) is the new data point, F_m(x_k) is the current ensemble prediction and p is the parameter which controls how often base models are potentially added. This is a common criterion used in ensemble algorithms which perform base model addition and removal [4, 5].
The free parameters (M, n, p and b) are quite intuitive and should be relatively easy to select for a specific application. Larger p values can speed up the process at the cost of a slight decrease in performance. As the budget M increases, so does the accuracy, at the cost of increased prediction time, model update time and storage requirements. Finally, the selection of b is a tradeoff between accuracy, concept change recovery and time.
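Under the conventions of the earlier sketches (batch_update from above and a hypothetical train_weak base learner), the Batch variant of Algorithm 2 can be rendered roughly as follows; the Stochastic variant would call stochastic_update with the newest example instead:

def iboost_cc(stream, train_weak, n, M, p, b, eta):
    # Sliding-window IBoost sketch for concept change (cf. Algorithm 2).
    win_x, win_y, alphas, fs = [], [], [], []
    for k, (x, y) in enumerate(stream, start=1):
        win_x.append(x); win_y.append(y)
        if len(win_x) > n:                         # slide the window
            win_x.pop(0); win_y.pop(0)
        X, Y = np.array(win_x), np.array(win_y)
        margin = sum(a * f(X[-1:])[0] for a, f in zip(alphas, fs))
        if k % p == 0 and (not fs or np.sign(margin) != y):   # AddCriterion
            if len(fs) == M:                       # budget full: drop min-alpha model
                j = int(np.argmin(alphas)); alphas.pop(j); fs.pop(j)
            for _ in range(b):                     # refresh confidences, cf. (13)
                batch_update(alphas, fs, X, Y, eta)
            w = np.exp(-Y * sum(a * f(X) for a, f in zip(alphas, fs)))  # cf. (9)
            f_new = train_weak(X, Y, w)            # train on the reweighted window
            eps = max(np.dot(w, f_new(X) != Y) / w.sum(), 1e-12)
            alphas.append(np.log((1 - eps) / eps)); fs.append(f_new)    # cf. (3)
        else:
            for _ in range(b):                     # only update confidences
                batch_update(alphas, fs, X, Y, eta)
        while alphas and min(alphas) < 0:          # remove outdated classifiers
            j = int(np.argmin(alphas)); alphas.pop(j); fs.pop(j)
    return alphas, fs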

3 Experiments

In this section, IBoost performance in four different concept change applications will
be evaluated. Three synthetic data sets and one real-world data set, with different drift types (sudden, gradual and rigorous), were used. The data generation and all the experiments
were repeated 10 times. The average test set classification accuracy is reported.

3.1 Data Sets

SEA synthetic data [15]. The data consists of three attributes, each one in the range
from 0 to 10, and the target variable yi which is set to +1 if xi1 + xi2 ≤ b and −1
otherwise, where b ∈ {7, 8, 9, 9.5}. The data stream used has 50, 000 examples. For
the first 12,500 examples, the target concept uses b = 8. For the second 12,500

Algorithm 2. IBoost variant for Concept Change applications


Input: Data stream D = {(xi , yi ), i = 1, ..., N }, window size n, budget M , frequency of
model addition p, number of gradient descent updates b

(0) initialize window Dnew = {(xi , yi ), i = 1, ..., n} and window data weights wi0 = 1/n
(a) k = n, Train f1 (1) using Dnew , calculate α1 (3), update weights winew (4), m = 1
(b) Slide the window: k = k + 1, Dnew = Dold + (xk , yk ) − (xk−n , yk−n )
(c) If (k mod p = 0) ∧ (y_k ≠ F_m(x_k)),
(c.1) If (m = M )
(c.1.1) Remove fj with minimum αj , m = m − 1
(c.2) Update αj , j = 1, ..., m using (13) or (14) b times
(c.3) Recalculate winew using (9)
(c.4) Train fm+1 (1), calculate αm+1 (3), update weights winew (4)
(c.5) m = m + 1
(d) Else
(d.1) Update αj , j = 1, ..., m (13) or (14) b times, recalculate winew using (9)
(e) If any αj < 0, j = 1, ..., m
(e.1) Remove fj , m = m − 1
(e.2) Update αj , j = 1, ..., m (13) or (14) b times, recalculate winew using (9)
(f) Jump to (b)

examples, b = 9; the third, b = 7; and the fourth, b = 9.5. After each window slide, the
current ensemble is tested on 2,500 test-set examples from the current concept.
Santa Fe time series data (collection A) [16] was used to test IBoost performance
on a real-world gradual concept change problem. The goal is to predict the measurement gi ∈ R based on the 9 previous observations. The original regression problem with
a target value gi was converted to classification such that yi = 1 if gi ≤ b, where
b ∈ {−0.5, 0, 1} and yi = −1 otherwise. The data stream contains 9, 990 examples.
For the first 3, 330 examples, b = −0.5; for the second 3, 330 examples, b = 0; and
b = 1 for the remaining ones. Testing is done using a holdout data with 825 examples
from the current concept. Gradual drifts were simulated by smooth transition of b over
1, 000 examples.
Random RBF synthetic data [3]. This generator can create data which contains
a rigorous concept change type. First, a fixed number of centroids are generated in feature space, each assigned a single class label, weight and standard deviation. The examples are then generated by selecting a center at random, taking weights into account, and displacing it in a random direction from the centroid by a random displacement length, drawn from a Gaussian distribution with the centroid's standard deviation. Drift is
introduced by moving the centers with constant speed. In order to test IBoost on large
binary data sets, we generated 10 centers, which are assigned class labels {−1, +1} and
a drift parameter 0.001, and simulated one million RBF data examples. Evaluation was
done using interleaved test-then-train methodology: every example was used for testing
the model before it was used for training the model.
LED data. The goal is to predict the digit displayed on a seven segment LED display,
where each binary attribute has a 10% chance of being inverted. The original 10-class
problem was converted to binary by representing digits {1, 2, 4, 5, 7} (non-round digits)

as +1 and digits {3, 6, 8, 9, 0} (round digits) as −1. Four attributes (out of 7) were se-
lected to have drifts. We simulated one million examples and evaluated the performance
using interleaved test-then-train. The data is available in UCI repository.

3.2 Algorithms
IBoost was compared to non-incremental AdaBoost, Online Coordinate Boost, Online-
Boost and its two modifications for concept change (NSOnlineBoost and FLC), Fast
and Light Boosting, DWM and AdWin Online Bagging.
OnlineBoost [6] starts with some initial base models f_j, j = 1, ..., m, which are assigned weights λ_j^{sc} = 0 and λ_j^{sw} = 0. When a new example (x_i, y_i) arrives, it is assigned an initial example weight of λ_d = 1. Then, OnlineBoost uses a Poisson distribution for sampling and updates each model f_j k = Poisson(λ_d) times using (x_i, y_i). Next, if f_j(x_i) = y_i, the example weight is updated as λ_d = λ_d / (2(1 − ε_j)) and λ_j^{sc} = λ_j^{sc} + λ_d; otherwise λ_d = λ_d / (2 ε_j) and λ_j^{sw} = λ_j^{sw} + λ_d, where ε_j = λ_j^{sw} / (λ_j^{sw} + λ_j^{sc}), before proceeding to the next base model f_{j+1}. Confidence parameters α for each base classifier are obtained using (3) and the final predictions are made using (5). Since OnlineBoost updates all the base models using each new observation, their performance on the previous examples changes, and so should the weighted sums λ_m^{sc} and λ_m^{sw}. Still, the unchanged sums are used to calculate α, and thus the resulting α are not optimized.
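For comparison with IBoost, the weight-passing loop just described can be sketched as follows (a sketch; the incremental update/predict methods of the base models are assumed, not a real API):

def onlineboost_step(models, lam_sc, lam_sw, x, y, rng):
    # Process one example with OnlineBoost: each base model is updated
    # Poisson(lam_d) times, and lam_d is rescaled by the model's error.
    lam_d = 1.0
    for j, f in enumerate(models):
        f.update(x, y, times=rng.poisson(lam_d))   # assumed incremental API
        if f.predict(x) == y:
            lam_sc[j] += lam_d
            eps = lam_sw[j] / (lam_sw[j] + lam_sc[j])
            lam_d /= 2 * (1 - eps)                 # shrink weight of easy examples
        else:
            lam_sw[j] += lam_d
            eps = lam_sw[j] / (lam_sw[j] + lam_sc[j])
            lam_d /= 2 * eps                       # boost weight of hard examples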
NSOnlineBoost [7] In the original OnlineBoost algorithm, initial classifiers are in-
crementally learned using all examples in an online manner. Base classifier addition
or removal is not used. This is why poor recovery from concept change is expected.
The NSOnlineBoost modification introduces a sliding window and base classifier addition and removal. The training is conducted in the OnlineBoost manner until the update period p_ns is reached. Then, the ensemble F_m classification error on the examples in the window is calculated and compared to that of the ensemble F_m − f_j, where m includes all the base models trained using at least K_ns = 100 points. If removing any f_j improves the ensemble performance on the window, it is removed and a new classifier is added with initial values λ_m^{sc} = 0, λ_m^{sw} = 0 and ε_m = 0.
Fast and Light Classifier (FLC) [8] is a straightforward extension of OnlineBoost
that uses an Adaptive Window (AdWin) change detection technique [17] to heuristically
increase example weights when the change is detected. The base classifiers and their
confidence parameters are initialized and updated in the same way as in OnlineBoost.
When a new example arrives (λd = 1), AdWin checks, for every possible split into
”large enough” sub-windows, if their average classification rates differ by more than
a certain threshold d, set to k window standard deviations. If the change is detected,
the new example updates all m base classifiers with weights that are calculated using
λ_d = (1 − ε_j)/ε_j , where ε_j = λ_j^{sw} / (λ_j^{sw} + λ_j^{sc}), j = 1, ..., m. The window then
drops the older sub-window and continues to grow back to its maximum size with the
examples from the new concept. If the change is not detected, the example weights and
base classifiers are updated in the same manner as in OnlineBoost.
AdWin Bagging [3] is the OnlineBagging algorithm proposed in [6] which uses the
AdWin technique [17] as a change detector and to estimate the error rates for each base
model. It starts with initial base models fj , j = 1, ..., m. Then, when a new example

(xi , yi ) arrives, each model fj is updated k = P oisson(1) times using (xi , yi ). Final
prediction is given by simple majority vote. If the change is detected, the base classifier
with the highest error rate εj is removed and a new one is added.
Online Coordinate Boost (OCB) [9] requires initial base models fj , j = 1, ..., m,
trained offline using some initial data. The initial training also provides the starting
confidence parameter values αj , j = 1, ..., m, and the sums of weights of correctly and incorrectly classified examples for each base classifier (λ_j^{sc} and λ_j^{sw}, respectively). When
a new example (xi , yi ) arrives, the goal is to find the appropriate updates Δαj for αj
such that the AdaBoost loss (6) with the addition of the last example is minimized. Be-
cause these updates cannot be found in the closed form, the authors derived closed form
updates that minimize an approximate loss instead of the exact one. Such optimization requires keeping and updating the sums of weights (λ_{(j,l)}^{sc} and λ_{(j,l)}^{sw}) which involve two weak hypotheses j and l, and the introduction of the order parameter o. To avoid numerical
errors, the algorithm requires initialization with the data of large enough length nocb
and selection of the proper order parameter.
FLB [5] The algorithm assumes that data are arriving in disjoint blocks of size
n_flb . Given a new block B_j , example weights are calculated depending on
the ensemble error rate εj , where the weight of a misclassified example xi is set to
wi = (1 − εj )/εj and the weight of a correctly classified sample is left unchanged.
A new base classifier is trained using the weighted block. The process repeats until a
new block arrives. If the number of classifiers reaches the budget M , the oldest one is
removed. The base classifier predictions are combined by averaging the probability pre-
dictions and selecting the class with the highest probability. There is also a change de-
tection algorithm running in the background, which discards the entire ensemble when
a change is detected. It is based on the assumption that the ensemble performance θ on
the batch follows Gaussian distribution. The change is detected when the distribution
of θ changes from one Gaussian to another, which is detected using a threshold τ .
DWM [1] is a heuristically-based ensemble method for handling concept change.
It starts with a single classifier f1 trained using the initial data and α1 = 1. Then,
each time a new example x_k arrives it updates the weights α of the existing classifiers: a classifier that incorrectly labels the current example receives a reduction of its weight by the multiplicative constant β = 0.5. After every p_dwm examples, classifiers whose weights fall under a threshold θ_r = 0.01 are removed and if, in addition, y_k ≠ F_m(x_k), a new classifier f_{m+1} with α_{m+1} = 1 is trained using the data in the window. Finally, all classifiers are updated using (x_k , y_k ) and their weights α are normalized. The global
prediction for the current ensemble Fm (xk ) is always made by the weighted majority
(5). When the memory for storing base classifiers is full and a new classifier needs to
be stored, the classifier with the lowest α is removed.
In general, the described concept change algorithms can be divided into several
groups based on their characteristics (Table 1).

3.3 Results
We performed an in-depth evaluation of IBoost and competitor algorithms for differ-
ent values of window size n = {100, 200, 500, 1, 000, 2, 000}, base classifier budget

Table 1. Dividing concept change algorithms into overlapping groups

Characteristics                  | IBoost | OnlineBoost | NSOBoost | FLC | AdWinBagg | OCB | DWM | FLB
Change Detector Used             |        |             |          |  •  |     •     |     |     |  •
Online Base Classifier Update    |        |      •      |    •     |  •  |     •     |     |  •  |
Classifier Addition and Removal  |   •    |             |    •     |     |     •     |     |  •  |  •
Sliding Window                   |   •    |             |    •     |  •  |     •     |     |  •  |

M = {20, 50, 100, 200, 500} and update frequency p = {1, 10, 50}. We also evaluated
the performance of IBoost for different values of b = {1, 5, 10}. Both IBoost Batch (13)
and IBoost Stochastic (14) were considered.
In the first set of experiments IBoost was compared to the benchmark AdaBoost
algorithm, OCB and FLB on the SEA data set. Simple Decision Stumps (single-level
Decision Trees) were used as base classifiers. Their number was limited to M . Both
AdaBoost and IBoost start with a single example and when the number of examples
reaches n they begin using a window of size n. Each time the window slides and the
AddCriterion with p = 1 is satisfied, AdaBoost retrains M classifiers from scratch,
while IBoost trains a single new classifier (Algorithm 2).
OCB was initialized using the first nocb = n data points and then it updated confi-
dence parameters using the incoming data. Depending on the budget M , the OCB order
parameter o was set to the value that resulted in the best performance. Increasing o re-
sults in improved performance. However, performance deteriorates if it is increased too
much. In FLB, disjoint data batches of size nflb = n were used (equivalent to p = nflb ).
Five base models were trained using each batch while constraining the budget as ex-
plained in [5]. Class probability outputs for Decision Stumps were calculated based on
the distance from the split and the threshold for the change detector was selected such
that the false positive rate is 1%.
Figure 2 compares the algorithms in the M = 200, n = 200 setting. Performances
for different values of M and n are compared in Table 2 based on the test accuracy, con-
cept change recovery (average test accuracy on the first 600 examples after introduction
of new concept) and training times.

[Figure: test accuracy (%) vs. time step on the SEA stream; legend with training times: IBoost-DS Batch 1,113 s; IBoost-DS Stochastic 898 s; AdaBoost-DS 913 s; OCB-DS 590 s; FLB-DS 207 s]

Fig. 2. Predictive Accuracy on SEA Data Set, M = 200, n = 200



Table 2. Performance Comparison on SEA dataset

                                      window size n = 200 (budget M)    budget size M = 200 (window n)
Algorithm                             20    50    100   200   500       100   200   500   1000  2000
IBoost Stochastic  test accuracy (%)  94.5  96.4  96.7  97.1  97.5      96.9  97.1  97.3  97.5  98
(b = 5)            recovery (%)       92.5  93.1  93.3  93.5  93.4      93.4  93.5  92.4  90.1  89.6
                   time (s)           39    90    183   372   751       221   372   396   447   552
IBoost Batch       test accuracy (%)  95.9  97.4  97.8  97.9  98        97.2  97.9  98.1  98.3  98.5
(b = 5)            recovery (%)       91.5  92.1  92.9  92.5  93.4      92.8  92.5  91.2  88.8  88.4
                   time (s)           77    188   401   898   2.1K      801   885   1K    1.7K  2.3K
AdaBoost           test accuracy (%)  94.5  95    95    94.9  94.9      92.8  94.9  96.7  97    97.5
                   recovery (%)       92    92.1  92.2  91.9  91.9      91.7  91.9  89.9  88.1  86.3
                   time (s)           91    192   432   913   2.1K      847   913   1K    1.3K  1.8K
OCB                test accuracy (%)  92.7  93.9  94.3  94.4  94.1      91.3  94.4  95.4  95.8  96.8
                   recovery (%)       84.3  86.4  89.8  91.2  91.2      88.7  91.2  90.1  84.4  93.5
                   time (s)           47    120   259   590   2K        584   590   567   560   546
FLB                test accuracy (%)  82.6  89.4  92.9  94.4  94.9      94.7  94.4  90.5  87.5  83.4
                   recovery (%)       82.3  85.3  86.1  84.7  84.9      85.2  84.7  83.8  83.5  81.9
                   time (s)           73    104   156   207   435       183   207   262   390   456

Both IBoost versions achieved much higher classification accuracy than AdaBoost and were faster to train. This can be explained by the fact that AdaBoost deletes the influence of all previously seen examples outside the current window by discarding the whole ensemble and retraining. The better performance of IBoost compared to OCB can be explained by the difference in updating confidence parameters and the fact that OCB never adds or removes base classifiers. The inferior FLB results show that removing the entire ensemble when the change is detected is not the most effective solution.
IBoost Batch was more accurate than IBoost Stochastic. However, the training time
of IBoost Batch was significantly higher. Considering this, IBoost Stochastic represents
a reasonable tradeoff between performance and time. Fastest recovery for all three con-
cept changes was achieved by IBoost Stochastic. This is because the confidence param-
eters updates of IBoost Stochastic are performed using only the most recent example.
Some general conclusions are that the increase in budget M resulted in larger training
times and accuracy gain for all algorithms. Also, as the window size n grew, the per-
formance of both IBoost and retrained AdaBoost improved at cost of increased training
time and slower concept change recovery. With a larger window the recovery perfor-
mance gap between the two approaches increased, while the test accuracy gap reduced
as the retrained AdaBoost generalization error decreased.
As discussed in Section 2.2, one can select different IBoost b and p parameters depending on the stream properties. Table 3 shows how the performance on SEA data changed as they were adjusted. We can conclude that bigger values of b improved the performance at the cost of increased training time, while bigger values of p degraded the performance (some only slightly, e.g., p = 10) while yielding big time savings.
In the second set of experiments, IBoost Stochastic (p = 1, b = 5) was compared
to the algorithms from Table 1 on both SEA and Santa Fe data sets. Naïve Bayes was

Table 3. IBoost performance for different b and p values on SEA dataset (M = 200, n = 200)

                                      p = 1                  b = 1
Algorithm                             b=1    b=5    b=10     p=10   p=50   p=100
IBoost Stochastic  test accuracy (%)  96.7   97.1   97.4     96.5   94.7   93.1
                   recovery (%)       93.1   93.5   93.7     92.8   92.7   92.1
                   time (s)           201    372    635      104    45     22
IBoost Batch       test accuracy (%)  97.6   97.9   98.2     97.1   95.6   93.7
                   recovery (%)       92.3   92.5   92.9     92.6   91.6   91.4
                   time (s)           545    898    1.6K     221    133    96

[Figure: test accuracy (%) vs. time step on the SEA stream; legend with training times: IBoost-NB Stochastic 104 s; OCB-NB 164 s; AdWin OnlineBagg-NB 1,113 s; FLC-NB 1,156 s; OnlineBoost-NB 929 s; DWM-NB 21 s; FLB-NB 323 s; NSOnlineBoost-NB 4,621 s]

Fig. 3. Predictive Accuracy on SEA Data Set, M = 50

chosen to be the base classifier in these experiments because of its ability to be incre-
mentally improved, which is a prerequisite for some of the competitors (Table 1). All
algorithms used a budget of M = 50. The algorithms that use a moving window had
a window of size n = 200. Additional parameters for different algorithms were set as
follows: OCB was initialized offline with nocb = 2K data points and used o = 5, FLB
used batches of size n_flb = 200, NSOnlineBoost used p_ns = 10 and DWM p_dwm = 50.
Results for the SEA data set are presented in Fig. 3. As expected, OnlineBoost had
poor concept change recovery because it never removes or adds new models. Its two
non-stationary versions, NSOnlineBoost and FLC, outperformed it. FLC was particu-
larly good during the first three concepts where it had almost the same performance as
IBoost. However, its performance deteriorated after introduction of the fourth concept.
The opposite happened for DWM which was worse than IBoost in all but the fourth
concept. We can conclude that IBoost outperformed all algorithms, while being very
fast (it came in second, after DWM).
Results for the Santa Fe data set are presented in Fig. 4. Similar conclusions as pre-
viously can be drawn. In the first concept several algorithms showed almost identical
performance. However, when the concept changed IBoost was the most accurate. Ta-
ble 4 summarizes test accuracies on both data sets.
AdWin Online Bagging had an interesting behavior in both SEA and Santa Fe data
sets. It did not suffer as large an accuracy drop due to concept drift as the other algorithms, and the direction of improvement suggests that it would reach IBoost performance if the duration of each particular concept were longer.

[Figure: test accuracy (%) vs. time step on the Santa Fe stream; legend with training times: IBoost-NB Stochastic 52.4 s; OCB-NB 40.3 s; AdWin OnlineBagg-NB 394 s; FLC-NB 387 s; OnlineBoost-NB 361 s; DWM-NB 12.8 s; FLB-NB 38.1 s; NSOnlineBoost-NB 2,315 s]

Fig. 4. Predictive Accuracy on Santa Fe Data Set, M = 50

Table 4. SEA and Santa Fe performance summary based on the test accuracy

Data Set   IBoost Stochastic  OnlineBoost  NSOBoost  FLC   AdWinBagg  OCB   DWM   FLB
SEA        98.0               95.6         96.9      97.4  94.5       95.2  96.9  94.9
Santa Fe   94.1               81.8         85.1      83.4  80.0       80.6  88.8  87.6

[Figure: test accuracy (%) vs. time step on the LED stream; legend with total times: IBoost-NB Stochastic 2,216 s; OCB-NB 2,329 s; AdWin OnlineBagg-NB 4,913 s; FLC-NB 4,835 s; OnlineBoost-NB 4,421 s; DWM-NB 1,277 s]

Fig. 5. LED Data Set, 10% noise, 4 drifting attributes, M = 20

To study the performance on large problems, we used only the IBoost Stochastic (p = 10, b = 1) version, because fast processing of the data was required. The budget was set to M = 20 and for algorithms that require a window we used n = 200. Parameters for the remaining algorithms were set as: OCB (n_ocb = 2K and o = 5), DWM (p_dwm = 500).
Figure 5 shows the results for LED data. We can conclude that AdWin Online Bag-
ging outperformed OnlineBoost, FLC and DWM, and was very similar to OCB. IBoost
Stochastic was the most accurate algorithm (except in the first 80K examples).
In Figure 6 we present the results for RBF data. IBoost Stochastic was the most
accurate model, by a large margin. It was the second fastest, after DWM. AdWin Online
Bagging was reasonably accurate, and it was the second-best performing overall.

[Figure: test accuracy (%) vs. time step on the RBF stream; legend with total times: IBoost-NB Stochastic 1,420 s; OCB-NB 2,030 s; AdWin OnlineBagg-NB 2,224 s; FLC-NB 2,060 s; OnlineBoost-NB 1,925 s; DWM-NB 1,108 s]

Fig. 6. RBF Data Set, 10 centroids, drift 0.001, M = 20

4 Conclusion
In this paper we addressed the important problem of incremental learning. We proposed an extension of AdaBoost to incremental learning; the idea is to reuse and upgrade the existing ensemble when the training data are modified. The new algorithm was evaluated on concept change applications. The results showed that IBoost is more efficient, more accurate, and more resistant to concept change than the original AdaBoost, mainly because it retains memory of the examples that are removed from the training sliding window.
It also performed better than previously proposed OnlineBoost and its non-stationary
versions, DWM, Online Coordinate Boosting, FLB and AdWin Online Bagging. Our
future work will include extending IBoost to perform multi-class classification, com-
bining it with the powerful AdWin change detection technique and experimenting with
Hoeffding Trees as base classifiers.

References
[1] Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: A new ensemble method for track-
ing concept drift. In: ICDM, pp. 123–130 (2003)
[2] Scholz, M.: Knowledge-Based Sampling for Subgroup Discovery. In: Local Pattern Detec-
tion, pp. 171–189. Springer, Heidelberg (2005)
[3] Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R.: New ensemble methods for
evolving data streams. In: ACM SIGKDD (2009)
[4] Wang, H., Fan, W., Yu, P.S., Han, J.: Mining concept-drifting data streams using ensemble
classifiers. In: Proc. ACM SIGKDD, pp. 226–235 (2003)
[5] Chu, F., Zaniolo, C.: Fast and light boosting for adaptive mining of data streams. In: Proc.
PAKDD, pp. 282–292 (2004)
[6] Oza, N., Russell, S.: Experimental comparisons of online and batch versions of bagging and
boosting. In: ACM SIGKDD (2001)
[7] Pocock, A., Yiapanis, P., Singer, J., Lujan, M., Brown, G.: Online Non-Stationary Boosting.
In: Intl. Workshop on Multiple Classifier Systems (2010)
[8] Attar, V., Sinha, P., Wankhade, K.: A fast and light classifier for data streams. Evolving
Systems 1(4), 199–207 (2010)
[9] Pelossof, R., Jones, M., Vovsha, I., Rudin, C.: Online Coordinate Boosting. In: On-line
Learning for Computer Vision Workshop, ICCV (2009)

[10] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of
boosting. The Annals of Statistics 28, 337–407 (2000)
[11] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Machine Learn-
ing: Proceedings of the Thirteenth International Conference, pp. 148–156 (1996)
[12] Schapire, R.E., Singer, Y.: Improved Boosting Algorithms Using Confidence-rate Predic-
tions. Machine Learning Journal 37, 297–336 (1999)
[13] Schapire, R.E.: The convergence rate of adaboost. In: COLT (2010)
[14] Zliobaite, I.: Learning under Concept Drift: an Overview, Technical Report, Vilnius Uni-
versity, Faculty of Mathematics and Informatics (2009)
[15] Street, W., Kim, Y.: A streaming ensemble algorithm (SEA) for large-scale classification.
In: ACM SIGKDD, pp. 377–382 (2001)
[16] Weigend, A.S., Mangeas, M., Srivastava, A.N.: Nonlinear gated experts for time series:
discovering regimes and avoiding overfitting. In: IJNS, vol. 6, pp. 373–399 (1995)
[17] Bifet, A., Gavaldà, R.: Learning from time changing data with adaptive windowing. In:
SIAM International Conference on Data Mining, pp. 443–448 (2007)
Fast and Memory-Efficient Discovery of the
Top-k Relevant Subgroups in a Reduced
Candidate Space

Henrik Grosskreutz and Daniel Paurat

Fraunhofer IAIS, Schloss Birlinghoven, 53754 St. Augustin, Germany


henrik.grosskreutz@iais.fraunhofer.de,
daniel.paurat@iais-extern.fraunhofer.de

Abstract. We consider a modified version of the top-k subgroup discov-


ery task, where subgroups dominated by other subgroups are discarded.
The advantage of this modified task, known as relevant subgroup discov-
ery, is that it avoids redundancy in the outcome. Although it has been
applied in many applications, so far no efficient exact algorithm for this
task has been proposed. Most existing solutions do not guarantee the
exact solution (as a result of the use of non-admissible heuristics), while
the only exact solution relies on the explicit storage of the whole search
space, which results in prohibitively large memory requirements.
In this paper, we present a new top-k relevant subgroup discovery
algorithm which overcomes these shortcomings. Our solution is based on
the fact that if an iterative deepening approach is applied, the relevance
check – which is the root of the problems of all other approaches – can
be realized based solely on the best k subgroups visited so far. The
approach also allows for the integration of admissible pruning techniques
like optimistic estimate pruning. The result is a fast, memory-efficient
algorithm which clearly outperforms existing top-k relevant subgroup
discovery approaches. Moreover, we analytically and empirically show
that it is competitive with simpler approaches which do not consider the
relevance criterion.

1 Introduction
In applications of local pattern discovery tasks, one is typically interested in
obtaining a small yet meaningful set of patterns. The reason is that resources
for post-processing the patterns are typically limited, both when the patterns are manually reviewed by human experts and when they are used as input to a subsequent data-mining step, following a multi-step approach like LeGo [11].
Reducing the number of raw patterns to a subset of manageable size can be
done using different approaches: one is to use a quality function to assess the
value of the patterns, and to discard all but the k highest-quality patterns. This
is known as the “top-k” approach. It can be further subdivided depending on
the pattern type and the quality function considered. In this paper, we consider
the case where the data has a binary label, the quality function accounts for


the support in the different classes, and the patterns have the form of itemsets.
This setting is known as top-k supervised descriptive rule discovery, correlated
pattern mining or subgroup discovery [5]. In the following, we will stick with the
expression subgroup discovery and with the terminology used in this community.
Restricting to the top-k patterns (or subgroups, in our specific case) is not the
only approach to reduce the size of the output. A different line of research aims at
the identification and removal of patterns which are of little interest compared to
other patterns (cf. [6,4]). This idea is formalized using constraints based on the
interrelation between patterns. A particularly appealing approach along this line
is the theory of relevance [14,7]. The idea of this approach, which applies only to
binary labeled data, is to remove all patterns that are dominated (or covered ) by
another pattern. Here, a pattern is considered to dominate another pattern if the dominating pattern covers at least all positives (i.e., target-class individuals) covered by the dominated pattern, but no additional negatives.
The theory of relevance not only allows one to get rid of multiple equivalent de-
scriptions (as does the theory of closed sets), but also of trivial specializations
which provide no additional insight over their generalizations. Due to this advan-
tage, relevance has been used as a filtering criterion in several subgroup discovery
applications [14,12,2]. In many settings, however, the number of relevant sub-
groups is still far larger than desired. In this case, a nearby solution is to combine
it with the top-k approach.
Up to now, however, no satisfying solution has been developed for the task of
top-k relevant subgroup discovery. Most algorithms apply non-admissible prun-
ing heuristics, with the result that high-quality subgroups can be overlooked
(e.g. [14,16]). The source of these problems is that relevance is a property not
defined locally, but with respect to the set of all other subgroups. The only
non-trivial algorithm which provably finds the exact solution to this task is
that of Garriga et al. [7]. This approach is based on the insight that all relevant
subgroups must be closed on the positives; moreover, the relevance of a closed-on-
the-positive can be determined based solely on the information about all closed-
on-the-positives. This gives rise to an algorithmic approach which exhaustively
traverses all closed-on-the-positives, stores them, and relies on this collection to
distinguish the relevant subgroups from the irrelevant closed-on-the-positives.
Obviously, this approach suffers from the drawback that a potentially very large
number of subgroups has to be stored in memory. For complex datasets, the high
memory requirements are a much more severe problem than the runtime. An-
other drawback of this approach is that it does not allow for admissible pruning
techniques based on a dynamically increasing threshold [22,17,8].
In this paper, we make the following contributions:
– We analyze existing top-k relevant subgroup discovery algorithms and show
how all except for the memory-demanding approach of [7] fail to guarantee
an exact solution;
– We present a simple solution to the relevance check which requires an amount
of memory only linear in k and the number of features. An additional ad-
vantage of this approach is that it can easily be combined with admissible
pruning techniques;

– Thereupon we present a new, memory-efficient top-k relevant subgroup dis-


covery algorithm. We demonstrate that it is faster than the only existing
exact algorithm (Garriga et al. [7]), while avoiding the memory issues of
the latter. Moreover, we show that our approach is competitive with all ex-
isting exhaustive subgroup discovery approaches, in particular with simpler
algorithms which do not consider the relevance criterion.

The remainder of the paper is structured as follows: after reviewing basic def-
initions in Section 2, we illustrate the task of relevant subgroup discovery in
Section 3. Successively, we present our new approach and analyze its properties
in Section 4, before we present empirical results in Section 5.

2 Preliminaries

In this section, we will define the task of subgroup discovery, review the theory
of relevance and discuss its connection to closure operators.

2.1 Subgroup Discovery

Subgroup discovery [10] aims at discovering descriptions of interesting sub-


portions of a dataset. We assume all records d1 , . . . , dm of the dataset to be
described by a set of n binary features (f1 (di ), . . . , fn (di )) ∈ {0, 1}n . A sub-
group description sd is a subset of the feature set, i.e. sd ⊆ {f1 , . . . , fn }. In the
following, we will sometimes simply write subgroup to refer to a subgroup de-
scription. A data record d satisfies sd if f (d) = 1 for all f ∈ sd, that is, subgroup
descriptions are interpreted conjunctively. Thus, we sometimes use the notation
fi1 &, . . . &fik instead of {fi1 , . . . , fik }. Finally, DB[sd] denotes the set of records
d ∈ DB of a database DB satisfying a subgroup description sd.
The interestingness of a subgroup description sd in the context of a database
DB is measured by a quality function q that assigns a real-valued quality q(sd, DB)
to sd. The quality functions usually combine the size of the subgroup and its un-
usualness with respect to a designated target variable, the class or label. In this
paper, we only consider the case of binary labels, that is, the label of a record
d is a special feature class(d) with range {+, −}. Some of the most common
quality functions for binary labeled data are of the form:

|DB[sd]|^a · ( |TP(DB, sd)| / |DB[sd]| − |TP(DB, ∅)| / |DB| )    (1)

where TP(DB, sd) := {d ∈ DB[sd] | class(d) = +} denotes the true positives of


the subgroup sd, a is a constant such that 0 ≤ a ≤ 1, and TP(DB, ∅) simply
denotes all positives in the dataset. The family of quality functions characterized
by Equation 1 includes some of the most popular quality functions: for a = 1, it is
order equivalent to the Piatetsky-Shapiro quality function [10] and the weighted
relative accuracy WRACC [13], while for a = 0.5 it corresponds to the binomial
test quality function [10].
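In Python, this family can be sketched as follows (the boolean record-by-feature matrix representation is our choice, not prescribed by the paper):

import numpy as np

def quality(sd, X, pos, a=1.0):
    # Quality of the form (1): |DB[sd]|^a * (|TP(sd)|/|DB[sd]| - |TP(empty)|/|DB|).
    # sd: list of feature (column) indices, interpreted conjunctively.
    covered = X[:, sd].all(axis=1) if sd else np.ones(len(X), dtype=bool)
    n = covered.sum()
    if n == 0:
        return 0.0
    p_sd = (covered & pos).sum() / n
    p_0 = pos.sum() / len(X)
    return n ** a * (p_sd - p_0)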

2.2 Optimistic Estimate Pruning


A concept closely related to quality functions is that of an optimistic estimate
[22]. This is a function that provides a bound on the quality of a subgroup
description and of all its specializations. Formally, an optimistic estimator for
a quality function q is a function oe mapping a database DB and a subgroup
description sd to a real value such that for all DB, sd and specializations sd′ ⊇ sd, it holds that oe(DB, sd) ≥ q(DB, sd′ ). Optimistic estimates allow to drastically
improve the performance of subgroup discovery by means of pruning [17,8].
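One simple admissible bound for the family (1) assumes the best specialization keeps every true positive and sheds every false positive; the sketch below is our illustration of the concept, not necessarily the estimator used by the algorithms discussed later:

def optimistic_estimate(sd, X, pos, a=1.0):
    # Upper bound on (1) over sd and all its specializations: a refinement
    # can at best cover exactly the true positives of sd, giving tp^a * (1 - p_0).
    covered = X[:, sd].all(axis=1) if sd else np.ones(len(X), dtype=bool)
    tp = (covered & pos).sum()
    p_0 = pos.sum() / len(X)
    return tp ** a * (1 - p_0)

Since any specialization covers at most tp true positives and at least as many records as true positives, its quality under (1) cannot exceed this value for 0 ≤ a ≤ 1.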

2.3 The Theory of Relevance


The theory of relevance [15,14] is aimed at eliminating irrelevant patterns, resp.
subgroups. A subgroup sdirr is considered as irrelevant if it is dominated (or
covered ) by another subgroup sd in the following sense:
Definition 1. The subgroup sdirr is dominated by the subgroup sd in database
DB iff (i) TP(DB, sdirr ) ⊆ TP(DB, sd) and (ii) FP(DB, sd) ⊆ FP(DB, sdirr ).
Here, TP is defined as in Section 2.1, while FP(DB, sd) = {c ∈ DB[sd] | class(c) = −}
denotes the false positives.
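With covers represented as boolean arrays (as in the quality sketch above), the dominance test reduces to two inclusion checks:

def dominates(cov_sd, cov_irr, pos):
    # True iff the subgroup with cover cov_sd dominates the one with cover
    # cov_irr: (i) TP(irr) subset of TP(sd), and (ii) FP(sd) subset of FP(irr).
    tp_ok = not np.any(cov_irr & pos & ~cov_sd)
    fp_ok = not np.any(cov_sd & ~pos & ~cov_irr)
    return tp_ok and fp_ok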

2.4 Closure Operators and Their Connection to Relevance


As shown by Garriga et al. [7], the notion of relevance can be restated in terms
of the following mapping between subgroup descriptions:
Γ + (X) := {f | ∀d ∈ TP(DB, X) : f [d] = 1}. (2)
Γ+ is a closure operator, i.e., a function defined on the power set of features
P({f1 , . . . , fn }) such that for all X, Y ∈ P({f1 , . . . , fn }), (i) X ⊆ Γ (X) (exten-
sivity), (ii) X ⊆ Y ⇒ Γ (X) ⊆ Γ (Y ) (monotonicity), and (iii) Γ (X) = Γ (Γ (X))
(idempotence) holds. The fixpoints of Γ + , i.e. the subgroup descriptions sdrel
such that sdrel = Γ + (sdrel ), are precisely the closed-on-the-positives mentioned
earlier. The main result in [7] is that
Proposition 1. The space of relevant patterns consists of all patterns sdrel sat-
isfying the following: (i) sdrel is closed on the positives, and (ii) there is no generalization sd ⊂ sdrel closed on the positives such that |FP(sd)| = |FP(sdrel )|.

The connection between relevancy and closure operators is particularly inter-


esting because closure operators have extensively been studied in the area of
closed pattern mining (cf. [19]). However, unlike here, in closed pattern mining
the closure operator is defined solely on the support, without accounting for la-
bels. The fixpoints of the closure operator based on the support are called closed
patterns. Many algorithms have been developed to traverse the fixpoints of some
arbitrary closure operator, e.g. LCM [21]. Making use of this fact, Garriga et al.
[7] have proposed a simple two-step approach to find the relevant patterns: first,
find and store all closed-on-the-positives; second, remove all dominated closed-
on-the-positives using Proposition 1. We will refer to this approach as CPosSd.
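For illustration, in the boolean-matrix representation used earlier, the closure on the positives (2) is a single conjunction over the covered positives (a sketch; names are ours):

def gamma_pos(sd, X, pos):
    # Gamma+(sd): all features shared by every positive record covered by sd, cf. (2).
    covered = X[:, sd].all(axis=1) if sd else np.ones(len(X), dtype=bool)
    tp = covered & pos
    if not tp.any():                 # no positives covered: every feature qualifies
        return list(range(X.shape[1]))
    return sorted(np.flatnonzero(X[tp].all(axis=0)))

A description sd is closed on the positives precisely when gamma_pos returns sd itself.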

3 Relevant Subgroup Discovery


As motivated in the introduction, we are interested in the following task:
Task 1. Top-k Relevant Subgroup Discovery Given a database DB, a qual-
ity function q, and an integer k > 0, find a set of subgroup descriptions R of
size k, such that
– all subgroup descriptions in R are relevant wrt. DB, and
– all subgroup descriptions not in R either have a quality no higher than
minsd∈R q(DB, sd), or are dominated by some subgroup description in R.

We will now illustrate this task using a simple example, before we show how
existing pruning approaches can produce incorrect results.

3.1 An Illustrative Example


Our example database, shown in Table 1, describes opinion polls. There is one
record for every participating person. Beside the class, Approval, there are four
features that characterize the records: Children:yes and Children:no indicates
whether or not the participant has children; University indicates that the partici-
pant has a university degree, and finally High Income indicates an above-average
income. To keep the example simple, we have not included features describing
the negative counterparts of the last two features.
This simple example dataset induces a lattice of candidate subgroup descrip-
tions, containing a total of 16 nodes. These include several redundant descrip-
tions (like University and HighIncome & University), which are avoided if we
consider the sub-space of closed subgroups. Figure 1 visualizes this space, thereby
representing both the type and the WRACC quality:
– type: The visualization distinguishes between subgroup descriptions which
are closed, closed on the positives and relevant. Every node represents a
closed subgroup; those closed on the positives are rendered with a double
border; finally, relevant subgroups are rendered using a rectangular shape.
– quality: The color intensity of a node corresponds to the quality of the sub-
group: higher-qualities correspond to more intense gray shades.

Table 1. Example: A simple opinion poll dataset

Approval   Children=yes   Children=no   University   High Income
+          ✓ ✓ ✓
+          ✓ ✓ ✓
+          ✓
−          ✓ ✓
−          ✓
−          ✓
−          ✓

[Figure: lattice of the closed subgroup descriptions of the example, from the empty description down to Children=yes & HighIncome & University and Children=no & HighIncome & University; double borders mark descriptions closed on the positives, rectangles mark relevant subgroups, and gray intensity encodes WRACC quality]

Fig. 1. Subgroup lattice for the example. Relevant subgroups are highlighted

The figure illustrates that high-quality subgroups need not be relevant: For ex-
ample, the two subgroups Children:yes & HighIncome & University and Chil-
dren:no & HighIncome & University have high quality but are irrelevant, as they
are merely a fragmentation of the relevant subgroup HighIncome & University.

3.2 Existing Approaches, Challenges and Pitfalls


The existing approaches can briefly be divided into two classes: extensions of
classical algorithms which apply some kind of pruning, and the closed-on-the-
positives approach of Garriga et al. [7].

Pruning-based Approaches. State-of-the-art top-k subgroup discovery algorithms do not traverse the whole space of candidate patterns but apply pruning to re-
duce the number of patterns effectively visited (cf. [8,3,18]). The use of such
techniques results in a dramatic reduction of the execution time and is an indis-
pensable tool for fast exhaustive subgroup discovery [17,8,3].
To better understand the issues that can arise, we first briefly review the
concept of a dynamically increasing quality threshold used during optimistic es-
timate pruning. Recall that classical subgroup discovery algorithms traverse the
space of candidate subgroup descriptions, collecting the best subgroups. Once
at least k subgroups have been collected, only subgroups are of interest whose
quality exceeds that of the k-th best subgroup visited so far. The quality of the
k-th subgroup thus represents a threshold, which increases monotonically in a
dynamic fashion during traversal of the search space. Combined with an op-
timistic estimator (see Section 2.2), this dynamic threshold can allow pruning
large parts of the search space.
The use of a dynamic threshold is of key importance in optimistic estimate
pruning, as it is impossible to calculate a suitable threshold beforehand. Figure 1

illustrates the efficiency of this approach: if top-1 subgroup discovery is per-


formed in this example, then the dynamic threshold allows pruning all but the
direct children of the root node. Unfortunately, storing only the k best subgroups
visited and using the quality of the k-best subgroup as dynamic threshold is
problematic in relevant subgroup discovery:

1. If a relevant subgroup is visited which dominates more than one subgroup


so far collected, then all dominated subgroups have to be removed. This can
have the effect that when the computation ends, the result queue erroneously
contains less than k subgroups.
2. If the quality threshold is increased to the quality of the k-th subgroup in
the result queue, but the queue contains non-relevant subgroups, then some
relevant subgroups can erroneously be pruned.

The two above problems can be observed in our example scenario. Issue 1 arises
if we search for the top-2 subgroups in Figure 1, and the nodes are visited in
the following order: Children=yes, Children:yes & HighIncome & University,
Children:no & HighIncome & University and High Income & University. When
the computation ends, the result will only contain High Income & University, but
miss the second relevant subgroup, Children=yes. Issue 2 arises if Children:yes
& HighIncome & University and Children:no & HighIncome & University are
added to the queue before Children=yes is visited: the effect is that the minimum
quality threshold is increased to a level that will incorrectly prune Children=yes.
The above issues are the reason why most existing algorithms, like BSD [16],
do not guarantee an exact solution for the task of top-k relevant subgroup dis-
covery. This is problematic both because the outcome will typically be of less
value, and because it is not uniquely determined; in fact, the outcome can differ
among implementations and possibly even among execution traces. This effect
is amplified if a beam search is applied (e.g. [14]), where the subgroups con-
sidered are not guaranteed not to be dominated by some subgroup outside the
beam.

Approaches based on the closed-on-the-positives. The paper of Garriga et. al. [7]
is the first that proposes a non-trivial approach to correctly solve the relevant
subgroup discovery task. The authors investigate the relation between closure
operators (cf. [19]) and relevance, and show that the relevant subgroups are
a subset of the subgroups closed on the positives. While the focus of the pa-
per is on structural properties and not on computational aspects, the authors
also propose the simple two-step algorithm CPosSd described in Section 2.4.
The search space considered by this algorithm — the closed-on-the-positives
— is a subset of the closed subgroups, thus it operates on a smaller candidate
space than all earlier approaches. The downside is that it does not account
for optimistic estimate pruning, and, probably more seriously, that it has very
high memory requirements, as the whole set of closed-on-the-positives has to be
stored.

4 An Iterative Deepening Approach


We aim at a solution that:
1. Avoids the high memory requirements of CPosSd [7];
2. Considers a reduced search space, namely the closed-on-the-positives;
3. Applies pruning based on a dynamically increasing quality threshold.
The last bullet implies that we need a way to efficiently determine whether a
subgroup is relevant or not the moment it is visited. Proposition 1 tells us that
this can be done if a pattern is guaranteed to be visited only after each of its generalizations has been visited. One straightforward solution is thus to traverse
the candidate space in a general-to-specific way. While this traversal strategy
slightly differs from a breadth-first-traversal, it has the same drawbacks, in par-
ticular that the memory requirements can be linear in the size of the candidate
space. In the worst case, this results in memory requirements exponential in the
number of features, which is clearly problematic.
To avoid the above memory issue, we build our solution upon an iterative
deepening depth-first traversal of the space of closed-on-the-positives. Iterative
deepening depth-first search is well known to have more convenient memory
requirements [20], and moreover it ensures that whenever a subgroup description
is visited, all its generalizations have been visited before.

4.1 A Relevance Check Based On The Top-k Subgroups Visited


One challenge remains, namely devising a memory-efficient way to test the rele-
vance of a newly visited pattern. Obviously, we cannot store all generalizations
of every subgroup in memory (and simply apply Proposition 1): as every pattern
can have exponentially many generalizations, this would again result in memory
issues. Instead, our solution is based on the following observation:

Proposition 2. Let DB be a dataset, q a quality function of the form of Equation 1 (with 0 ≤ a ≤ 1) and minQ some real value. Then, the relevance of any closed-on-the-positive sd with quality ≥ minQ can be computed from the set

G∗ = {sdgen ⊂ sd | sdgen is relevant in DB and q(DB, sdgen ) > minQ}

of all generalizations of sd with quality ≥ minQ. In particular, sd is irrelevant if and only if there is a relevant subgroup sdgen in G∗ with the same negative support, where the negative support of a pattern sd is defined as |FP(DB, sd)|.

The above proposition tells us that we can perform the relevance check based only on the top-k relevant subgroups visited so far: the iterative deepening traversal ensures that a pattern sd is only visited once all generalizations have
been visited; so if the quality of the newly visited pattern sd exceeds that of
the k-best subgroup visited so far, then the set of the best k relevant subgroups
visited includes all generalizations of sd with higher quality – that is, a superset
of the set G∗ mentioned in Proposition 2; hence, we can check the relevance

of sd. On the other hand, if the quality of sd is lower than that of the k-best
subgroup visited, then we do not care about its relevance anyway.
To prove the correctness of Proposition 2, we first present two lemmas:
Lemma 3. If a closed-on-the-positive sdirr is irrelevant, i.e. if there is a generalization sd ≺ sdirr closed on the positives with the same negative support as sdirr, then there is also at least one relevant generalization sdrel ≺ sdirr with the same negative support.
Proof. Let N be the set of all closed-on-the-positives generalizations of sdirr with the same negative support as sdirr. There must be at least one sdrel in N such that none of the patterns in N is a generalization of sdrel. From Proposition 1, we can conclude that sdrel must be relevant and dominates sdirr. □
Lemma 4. If a pattern sdrel dominates another pattern sdirr , then sdrel has
higher quality than sdirr .
Proof. We have that |DB[sdrel ]| ≥ |DB[sdirr ]|, because sdrel is a subset of sdirr
and support is antimonotonic. Thus, to show that sdrel has higher quality, it
is sufficient to show |TP(DB, sdrel )| /|DB[sdrel ]| > |TP(DB, sdirr )| /|DB[sdirr ]|.
From Proposition 1, we can conclude that sdrel and sdirr have the same num-
ber of false positives; let F denote this number. Using F , we can restate the
above inequality as |TP(DB, sdrel )| /(|TP(DB, sdrel )| + F ) > |TP(DB, sdirr )| /
(|TP(DB, sdirr )| + F ). All that remains to show is thus that |TP(DB, sdrel )| >
|TP(DB, sdirr )|. By definition of relevance, |TP(DB, sdrel )| ≥ |TP(DB, sdirr )|,
and because sdrel and sdirr are different and closed on the positives, the in-
equality must be strict, which completes the proof. □
Based upon these lemmas, it is straightforward to prove Proposition 2:
Proof. We first show that if sd is irrelevant, then there is a generalization in G∗ with the same negative support. From Lemma 3 we know that if sd is irrelevant, then there is at least one relevant generalization of sd with the same negative support dominating sd. Let sdgen be such a generalization. Lemma 4 implies that q(DB, sdgen) ≥ q(DB, sd) ≥ minQ, hence sdgen is a member of the set G∗.
It remains to show that if sd is relevant, then there is no generalization in G∗ with the same negative support. This follows directly from Proposition 1. □
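To make this check concrete, the following Python sketch implements the test of Proposition 2. The helper names are ours: patterns are assumed to be frozensets of features, and neg_support is assumed to map each pattern to its precomputed negative support |FP(DB, ·)|.

def is_relevant(sd, neg_support, visited_top):
    """Relevance test of Proposition 2 (sketch).

    sd          -- candidate pattern, a frozenset of features
    neg_support -- dict mapping a pattern to |FP(DB, pattern)|
    visited_top -- relevant patterns visited so far with quality >= quality(sd);
                   by the iterative deepening order this set contains all
                   generalizations of sd that belong to G*
    """
    for gen in visited_top:
        # gen is a proper generalization of sd iff it is a strict subset
        if gen < sd and neg_support[gen] == neg_support[sd]:
            return False  # sd is dominated and hence irrelevant
    return True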

4.2 The Algorithm


Algorithm 1 shows the pseudo-code for the approach outlined above. The main
program is responsible for the iterative deepening. The actual work is done in the
procedure findSubgroupsWithDepthLimit, which traverses the space of closed-
on-the-positives in a depth-first fashion using a stack (aka LIFO data structure).
Thereby, it ignores (closed-on-the-positive) subgroups longer than the length
limit, and it avoids multiple visits of the same node using some standard tech-
nique like the prefix-preserving property test [21]. Moreover, the function applies
standard optimistic estimate pruning and dynamic quality threshold adjustment.
The relevance check is done in line 6, relying on Proposition 2.
Algorithm 1. Iterative Deepening Top-k RelevantSD (ID-Rsd)
Input: integer k and database DB over features {f1, . . . , fn}
Output: the top-k relevant subgroups

main:
1: var result = queue with maximum capacity k (initially empty)
2: var minQ = 0
3: for limit = 1 to n do
4:    findSubgroupsWithDepthLimit(result, limit)
5: return result

procedure findSubgroupsWithDepthLimit(result, limit):
1: var stack = new stack initialized with root node
2: while stack not empty do
3:    var next = pop from stack
4:    if next's optimistic estimate exceeds minQ and its length does not exceed limit then
5:       add all successor patterns of next to stack (avoiding multiple visits)
6:    if next has quality above minQ and is not dominated by any p ∈ result then
7:       add next to result
8:       update minQ if possible
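As a complement to the pseudo-code, the following is a minimal, runnable Python sketch of the same control flow. The functions successors (closure computation with revisit avoidance), quality, oest (the optimistic estimate) and is_relevant (the check of Proposition 2, cf. the sketch in Section 4.1) are placeholders for the components described above, not the authors' implementation.

import heapq

def id_rsd(k, n_features, root, successors, quality, oest, is_relevant):
    # result is a min-heap of (quality, pattern) pairs, capped at k entries
    result, min_q = [], 0.0
    for limit in range(1, n_features + 1):      # iterative deepening
        stack = [root]
        while stack:                            # depth-first traversal
            nxt = stack.pop()
            if oest(nxt) > min_q and len(nxt) <= limit:
                stack.extend(successors(nxt))   # expand within the limit
                q = quality(nxt)
                if q > min_q and is_relevant(nxt, result):
                    heapq.heappush(result, (q, sorted(nxt)))
                    if len(result) > k:
                        heapq.heappop(result)   # drop the weakest pattern
                    if len(result) == k:
                        min_q = result[0][0]    # dynamically raise the threshold
    return sorted(result, reverse=True)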

4.3 Complexity

We will now turn to the complexity of our algorithm. Let n denote the number of features in the dataset and m the number of records. The memory complexity is O(n² + kn), given that the maximum recursion depth is n, the maximum size of the result queue is k, and every subgroup description has length O(n).
Let us now consider the runtime complexity. For every node visited we compute the quality, test for relevance and consider at most n augmentations. The quality computation can be done in O(nm), while the relevance check can be done in O(kn). The computation of the successors in Line 5 involves the execution of n closure computations, which amounts to O(n²m). Altogether, the cost per node is thus O(n²m + kn). Finally, the number of nodes considered is bounded by O(|Cp| · n), where Cp is the set of closed-on-the-positives and the factor n is caused by the iterative deepening approach.¹
Table 2 compares the runtime and space complexity of our algorithm with
CPosSd. Moreover, we show the complexity of classical and closed subgroup
discovery algorithms. Although these algorithms solve a different, simpler task,
it is interesting to observe that they do not have a lower complexity.
¹ In case the search space has the shape of a tree, the number of nodes visited by an iterative deepening approach is well known to be proportional to the size of the tree. Here, however, a tree shape is not guaranteed, which is why we use the looser bound involving the additional factor n.
Table 2. Complexity of the different subgroup discovery approaches

Algorithm      Memory       Runtime               Pruning
ID-Rsd         O(n² + kn)   O(|Cp| [n³m + n²k])   yes
CPosSd         Θ(n |Cp|)    Θ(|Cp| n²m)           no
Classical SD   O(n² + kn)   O(|S| nm)             yes
Closed SD      O(n² + kn)   O(|C| n²m)            yes

The expression S used in the table denotes the set of all subgroup descriptions, while C denotes the set of closed subgroups.
Let us consider the figures in more detail, starting with the memory complexity: Except for CPosSd, all approaches can apply depth-first search (possibly iterated) and thus have moderate memory requirements. In contrast, CPosSd has to collect all closed-on-the-positives, each of which has a description of length n. Please note that no pruning is applied, meaning that n |Cp| is not a loose upper bound on the number of nodes stored in memory but the precise amount (which is why we use the Θ-notation in the table). As the number of closed-on-the-positives can be exponential in n, this approach can quickly become infeasible.
Let us now turn to the runtime complexity. First, let us compare the runtime complexity of our approach with classical and closed subgroup discovery algorithms. Probably the most important difference is that they operate on different spaces. While the complexity of our approach is otherwise higher by a linear factor (a quadratic one, compared to classical subgroup discovery), the space we consider, i.e. the closed-on-the-positives Cp, can be exponentially smaller than the one considered by the other approaches (i.e. C, respectively its superset S).
This is illustrated by the following family of datasets:
Proposition 5. For all n ∈ N+, there is a dataset DBn of size n + 1 over n features such that the ratio of closed to closed-on-the-positives is O(2ⁿ).

Construction 1. We define the dataset DBn = d1, . . . , dn, dn+1 over the n + 1 features f1, . . . , fn, class as

fj(di) = 0 if i = j, and fj(di) = 1 otherwise;   class(di) = −

for i = 1, . . . , n, and dn+1 = (1, . . . , 1, +).


In these datasets, every non-empty subgroup description is closed and has positive quality. The total number of closed subgroups is thus 2ⁿ − 1, while there is only one closed-on-the-positives, namely {f1, . . . , fn}. □
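For illustration, a small NumPy sketch that builds DBn, encoding the feature matrix and the class labels (− and + are encoded as −1 and +1; the helper name is ours):

import numpy as np

def construct_db(n):
    """Dataset DB_n of Construction 1: records d_1..d_n are negative and
    have f_j(d_i) = 0 exactly when i == j; d_{n+1} is all ones and positive."""
    X = np.ones((n + 1, n), dtype=int)
    np.fill_diagonal(X[:n], 0)           # d_i lacks exactly feature f_i
    y = np.array([-1] * n + [+1])        # class labels: n negatives, 1 positive
    return X, y

Every pair of distinct feature subsets has a different extension on the negative records, so every non-empty conjunction is closed; on the single positive record, however, all conjunctions have the same (positive) support, so only the full conjunction is closed on the positives.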

Finally, compared to CPosSd, we see that in the worst case our iterative deepening approach causes an additional factor of n (the second term involving k is not much of a problem, as in practice k is relatively small). For large datasets, this disadvantage is however clearly outweighed by the reduction of the memory
footprint. Moreover, as we will show in the following section, in practice this worst case seldom happens: on real datasets, our approach is mostly not slower than CPosSd, but instead much faster (due to its use of pruning).

5 Experimental Results
In this section we empirically compare our new relevant subgroup discovery
algorithm with existing algorithms. In particular, we considered the following
two questions:
– How does our algorithm perform compared to CPosSd?
– How does our algorithm perform compared to classical and closed subgroup
discovery algorithms?
We will not investigate and quantify the advantage of the relevant subgroups
over standard or closed subgroups, as the value of the relevance criterion on
similar datasets has been demonstrated elsewhere (cf. [7]).

5.1 Implementation and Setup


We implemented our algorithm in JAVA, using conditional datasets but no sophisticated data structures like fp-trees [9] or bitsets [16]. As a minor optimization, during the iterative deepening the length limit is increased in a way that length limits for which no patterns exist are skipped (this is realized by keeping track, in every iteration, of the length of the shortest expanded pattern exceeding the current length limit).
In the following investigation, we use a dozen datasets from the UCI Machine Learning Repository [1], which are presented along with their most important properties in Figure 2. All numerical attributes were discretized using minimal entropy discretization.

Dataset       target class   # rec.   # feat.
credit-g      bad            1000     58
lung-cancer   1              32       159
lymph         mal lymph      148      50
mushroom      poisonous      8124     117
nursery       recommend      12960    27
optdigits     9              5620     64
sick          sick           3772     66
soybean       brown-spot     638      133
splice        EI             3190     287
tic-tac-toe   positive       958      27
vote          republican     435      48
waveform      0              5000     40

Fig. 2. Datasets

We ran the experiments using two quality
functions: the binomial test quality function and the WRACC quality. For prun-
ing, we used the tight optimistic estimate from [8] for the WRACC quality, while for the binomial test quality we used the function √|TP(DB, sd)| · (1 − |TP(DB, ∅)|/|DB|), which can be verified to be a tight optimistic estimate using some basic maths.
These optimistic estimates were used in all implementations to make sure that
the results are comparable. The experiments were run on a Core2Duo 2.4 GHz
PC with 4 GB of RAM.
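Both estimates are simple functions of a subgroup's true positive count; a small sketch, where p0 denotes the overall positive rate |TP(DB, ∅)|/|DB|. The WRAcc form follows the tight estimate of [8], under the assumption that the best specialization keeps all true positives and sheds all false positives; the function names are ours.

from math import sqrt

def oe_binomial(tp, n_pos, n_total):
    # tight optimistic estimate for the binomial test quality:
    # sqrt(|TP(sd)|) * (1 - p0)
    return sqrt(tp) * (1 - n_pos / n_total)

def oe_wracc(tp, n_pos, n_total):
    # tight optimistic estimate for WRAcc: (|TP(sd)| / |DB|) * (1 - p0)
    return (tp / n_total) * (1 - n_pos / n_total)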
(a) Bin test quality, k=10 (b) Bin test quality, k=100

Fig. 3. Number of nodes considered during relevant subgroup discovery (brackets in-
dicate memory issues)

5.2 Comparison with CPosSd


In this section, we compare our algorithm with that of Garriga et al. In order to abstract from the implementation, we compare the number of visited nodes rather than the runtime or the exact amount of memory used.
First, in Figure 3 we show the number of nodes considered by our algorithm
(“ID-Rsd”) and by the other approach (“CPosSd”). We used the binomial test
quality, and distinguished between two values for k (10 and 100); the results for the other quality function are comparable and omitted for space reasons. The figure
shows that for three datasets ('optdigits', 'splice' and 'waveform'), the number of nodes considered by CPosSd was almost 100 million. As the algorithm CPosSd has to keep all visited nodes in memory, the computation failed: our machine ran out of memory for these three datasets. The number of nodes plotted in
Figure 3 was obtained by merely counting the number of nodes traversed (instead
of computing the top-k relevant subgroups). This illustrates that the memory
footprint of CPosSd is often prohibitive.
In our approach, on the other hand, there is no need to keep all visited patterns in memory, and hence all computations succeeded. Moreover, in total our approach considers far fewer nodes than CPosSd. The overall reduction in the number of nodes visited amounts to roughly two orders of magnitude (please note that we used a logarithmic scale in the figures).

5.3 Comparison with Other Subgroup Miners


Next, we compared our algorithm with subgroup miners that solve a differ-
ent but related task, namely classical subgroup discovery and closed subgroup
discovery. As representative algorithms, we used DpSubgroup [8] and the depth-
first closed subgroup miner from [4], which is essentially an adaptation of LCM
[21] to the task of subgroup discovery. We remark that these algorithms are also
(a) q=bt (b) q=wracc

Fig. 4. Number of nodes considered by (non-relevant) SD algorithms (k=10)

Table 3. Total num. nodes visited, and percentage compared to StdSd (k=10)

                        q=bt                                  q=wracc
                   StdSd        CloSd      Id-Rsd     StdSd       CloSd    Id-Rsd
total # nodes      346,921,363  1,742,316  590,068    15,873,969  459,434  120,967
percentage         100%         0.5%       0.17%      100%        2.9%     0.76%
total runtime (s)  2,717        286        118        147         100      45
percentage         100%         10.5%      4.4%       100%        68%      30%

representative for approaches like the algorithms SD and BSD discussed in Sec-
tion 3.2, which apply some ad-hoc and possibly incorrect relevance filtering, but
otherwise operate on the space of all subgroup descriptions.²
Figure 4 shows the number of nodes considered when k is set to 10 and the binomial test or the WRACC quality function is used. Again, we use
a logarithmic scale. The results for k = 100 are similar and omitted for space
reasons. Please note that for our algorithm (“ID-Rsd”), all nodes are closed-on-
the-positives, while for the closed subgroup discovery approach (“CloSd”) they
are closed and for the classic approach (“StdSd”) they are arbitrary subgroup
descriptions.
The results differ strongly depending on the characteristics of the data. For
several datasets, our approach results in a decrease of the number of nodes con-
sidered. The difference to the classical subgroup miner DpSubgroup is particularly
apparent, where it often amounts to several orders of magnitude.
There are, however, several datasets where our algorithm traverses more nodes
than the classical approaches. Again, the effect is particularly considerable when
compared with the classical subgroup miner. Besides the overhead caused by the multiple iterations, one reason for this effect is that the quality of the k-th pattern found differs between the algorithms: for the relevant subgroup algorithm, the k-best quality tends to be lower, because this approach suppresses
² Note that BSD becomes faster if its relevance check is disabled [16].
high-quality but irrelevant subgroups. One could argue that it would be fairer to use a larger k-value for the non-relevant algorithms, as their output contains more redundancy.
Overall, our algorithm is competitive with (or somewhat faster than) the other approaches, as the aggregated figures in Table 3 show. Although the cost per node is lower for classical subgroup discovery than for the other approaches, overall this does not compensate for the much larger number of nodes traversed.

6 Conclusions
In this paper, we have presented a new algorithm for the task of top-k relevant
subgroup discovery. The algorithm is the first that finds the top-k relevant sub-
groups based on a traversal of the closed-on-the-positives, while avoiding the
high memory requirements of the approach of Garriga et al. [7]. Moreover, it
allows for the use of optimistic estimate pruning, which reduces the fraction of
closed-on-the-positives effectively considered.
The central idea of our algorithm is the memory-efficient relevance test, which requires only the information about the k best patterns visited so far. Please note that restricting the candidate space to (a subset of) the closed-
on-the-positives not only reduces the size of the search space: it also ensures the
correctness of our memory-efficient relevance check. The best k patterns can
only be used to determine the relevance of a new high-quality patterns if all
patterns visited are closed-on-the-positive subgroups – not if we were to consider
arbitrary subgroup descriptions. The restriction to closed-on-the-positives is thus
a prerequisite for the correctness of our approach.
The new algorithm performs quite well compared to existing approaches.
It not only avoids the high memory requirements of the approach of Garriga
et al. [7], but also clearly outperforms this approach. Moreover, it is competitive
with the existing non-relevant top-k subgroup discovery algorithms. This is par-
ticularly remarkable as it produces more valuable patterns than those simpler
approaches.

Acknowledgments. Part of this work was supported by the German Science


Foundation (DFG) under ’GA 1615/1-1’ and by the European Commission under
’ICT-FP7-LIFT-255951’.

References
1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
2. Atzmueller, M., Lemmerich, F., Krause, B., Hotho, A.: Towards Understanding
Spammers - Discovering Local Patterns for Concept Characterization and Descrip-
tion. In: Proc. of the LeGo Workshop at ECML-PKDD (2009)
3. Atzmueller, M., Lemmerich, F.: Fast subgroup discovery for continuous target con-
cepts. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds.) ISMIS 2009. LNCS,
vol. 5722, pp. 35–44. Springer, Heidelberg (2009)
4. Boley, M., Grosskreutz, H.: Non-redundant subgroup discovery using a closure system. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS, vol. 5781, pp. 179–194. Springer, Heidelberg (2009)
5. Bringmann, B., Nijssen, S., Zimmermann, A.: Pattern based classification: a unifying perspective. In: LeGo Workshop colocated with ECML/PKDD (2009)
6. Bringmann, B., Zimmermann, A.: The chosen few: On identifying valuable pat-
terns. In: ICDM (2007)
7. Garriga, G.C., Kralj, P., Lavrač, N.: Closed sets for labeled data. J. Mach. Learn.
Res. 9, 559–580 (2008)
8. Grosskreutz, H., Rüping, S., Wrobel, S.: Tight optimistic estimates for fast sub-
group discovery. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD
2008, Part I. LNCS (LNAI), vol. 5211, pp. 440–456. Springer, Heidelberg (2008)
9. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation.
In: SIGMOD Conference, pp. 1–12 (2000)
10. Klösgen, W.: Explora: A multipattern and multistrategy discovery assistant. In:
Advances in Knowledge Discovery and Data Mining, pp. 249–271 (1996)
11. Knobbe, A., Cremilleux, B., Fürnkranz, J., Scholz, M.: From local patterns to
global models: The lego approach to data mining. In: From Local Patterns to
Global Models: Proceedings of the ECML/PKDD-2008 Workshop (2008)
12. Kralj, P., Lavrač, N., Zupan, B., Gamberger, D.: Experimental comparison of three
subgroup discovery algorithms: Analysing brain ischemia data. In: Information
Society, pp. 220–223 (2005)
13. Lavrac, N., Kavsek, B., Flach, P., Todorovski, L.: Subgroup discovery with CN2-
SD. Journal of Machine Learning Research 5(Feb), 153–188 (2004)
14. Lavrac, N., Gamberger, D.: Relevancy in constraint-based subgroup discovery. In:
Constraint-Based Mining and Inductive Databases (2005)
15. Lavrac, N., Gamberger, D., Jovanoski, V.: A study of relevance for learning in
deductive databases. J. Log. Program. 40(2-3), 215–249 (1999)
16. Lemmerich, F., Atzmueller, M.: Fast discovery of relevant subgroup patterns. In:
FLAIRS (2010)
17. Morishita, S., Sese, J.: Traversing itemset lattice with statistical metric pruning.
In: PODS (2000)
18. Nijssen, S., Guns, T., De Raedt, L.: Correlated itemset mining in roc space: a
constraint programming approach. In: KDD, pp. 647–656 (2009)
19. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient mining of association
rules using closed itemset lattices. Inf. Syst. 24(1), 25–46 (1999)
20. Russell, S.J., Norvig, P.: Artificial Intelligence: a modern approach, 2nd Interna-
tional edn. Prentice Hall, Englewood Cliffs (2003)
21. Uno, T., Asai, T., Uchida, Y., Arimura, H.: An efficient algorithm for enumerating
closed patterns in transaction databases. In: Discovery Science (2004)
22. Wrobel, S.: An algorithm for multi-relational discovery of subgroups. In: Ko-
morowski, J., Żytkow, J.M. (eds.) PKDD 1997. LNCS, vol. 1263. Springer, Heidel-
berg (1997)
Linear Discriminant Dimensionality Reduction

Quanquan Gu, Zhenhui Li, and Jiawei Han

Department of Computer Science,
University of Illinois at Urbana-Champaign
Urbana, IL 61801, US
qgu3@illinois.edu, zli28@uiuc.edu, hanj@cs.uiuc.edu

Abstract. Fisher criterion has achieved great success in dimensionality
reduction. Two representative methods based on Fisher criterion are
Fisher Score and Linear Discriminant Analysis (LDA). The former is
developed for feature selection while the latter is designed for subspace
learning. In the past decade, these two approaches have often been studied
independently. In this paper, based on the observation that Fisher score and
LDA are complementary, we propose to integrate Fisher score and LDA
in a unified framework, namely Linear Discriminant Dimensionality Re-
duction (LDDR). We aim at finding a subset of features, based on which
the learnt linear transformation via LDA maximizes the Fisher criterion.
LDDR inherits the advantages of Fisher score and LDA and is able to do
feature selection and subspace learning simultaneously. Both Fisher score
and LDA can be seen as the special cases of the proposed method. The
resultant optimization problem is a mixed integer program, which is
difficult to solve. It is relaxed into an L2,1-norm constrained least square
problem and solved by an accelerated proximal gradient descent algorithm.
Experiments on benchmark face recognition data sets illustrate that the
proposed method outperforms state-of-the-art methods.

1 Introduction

In many applications in machine learning and data mining, one is often con-
fronted with very high dimensional data. High dimensionality increases the time
and space requirements for processing the data. Moreover, in the presence of
many irrelevant and/or redundant features, learning methods tend to over-fit
and become less interpretable. A common way to resolve this problem is dimensionality reduction, which has attracted much attention in the machine learning community in the past decades. Generally speaking, dimensionality reduction
can be achieved by either feature selection [8] or subspace learning [12] [11] [25]
(a.k.a feature transformation). The philosophy behind feature selection is that
not all the features are useful for learning. Hence it aims to select a subset of
most informative or discriminative features from the original feature set. And
the basic idea of subspace learning is that the combination of the original fea-
tures may be more helpful for learning. As a result, it aims at transforming the
original features to a new feature space with lower dimensionality.


Fisher criterion [6] [22] [9] plays an important role in dimensionality reduction.
It aims at finding a feature representation by which the within-class distance is
minimized and the between-class distance is maximized. Based on Fisher crite-
rion, two representative methods have been proposed. One is Fisher Score [22],
which is a feature selection method. The other is Linear Discriminant Analy-
sis (LDA) [6] [22] [9], which is a subspace learning method. Although there are
many other feature selection methods [8] [10] [23], Fisher score is still among the
state of the art [29]. And LDA has received great success in face recognition [2],
which is known as Fisher Face. In the past decades, both Fisher score and LDA
have been studied extensively [10] [20] [26] [24] [4] [17] [5]. However, they study
Fisher score or LDA independently, ignoring the close relation between them.
In this paper, we propose to study Fisher score and LDA together. The key
motivation is that, although it is based on Fisher criterion, Fisher score is not
able to do feature combination such as LDA. The features selected by Fisher
score are a subset of the original features. However, as we mentioned before,
the transformed features may be more discriminative than the original features.
On the other hand, although LDA admits feature combination, it transforms
all the original features rather than only those useful ones as in Fisher score.
Furthermore, since LDA uses all the features, the resulting transformation is
often difficult to interpret. It can be seen that Fisher score and LDA are ac-
tually complementary to some extent. If we combine Fisher score and LDA in
a systematic way, they could mutually enhance each other. One intuitive way
is performing Fisher score before LDA as a two-stage approach. However, since
these two stages are conducted individually, the whole process is likely to be
suboptimal. This motivates us to integrate Fisher score and LDA in a principled
way to complement each other.
Based on the above motivation, we propose a unified framework, namely Lin-
ear Discriminant Dimensionality Reduction (LDDR), integrating Fisher score
and LDA. In detail, we aim at finding a subset of features, based on which
the learnt linear transformation via LDA maximizes the Fisher criterion. LDDR
performs feature selection and subspace learning simultaneously based on Fisher
criterion. It inherits the advantages of Fisher score and LDA to overcome their
individual disadvantages. Hence it is able to discard the irrelevant features and
transform the relevant ones simultaneously. Both Fisher score and LDA can
be seen as special cases of LDDR. The resulting optimization problem is a mixed integer program [3], which is difficult to solve. We relax it into an L2,1-norm constrained least square problem and solve it by an accelerated proximal gradient descent algorithm [18]. It is worth noting that the L2,1-norm has already
been successfully applied in Group Lasso [28], multi-task feature learning [1] [14],
joint covariate selection and joint subspace selection [21]. Experiments on bench-
mark face recognition data sets demonstrate the effectiveness of the proposed
approach.
The remainder of this paper is organized as follows. In Section 2, we briefly
review Fisher score and LDA. In Section 3, we present a framework for joint
feature selection and subspace learning. In Section 4, we review some related


works. Experiments on benchmark face recognition data sets are demonstrated
in Section 5. Finally, we draw a conclusion in Section 6.

1.1 Notations

Given a data set that consists of n data points {(xi, yi)}, i = 1, . . . , n, where xi ∈ R^d and yi ∈ {1, 2, . . . , c} denotes the class label of the i-th data point. The data matrix is denoted by X = [x1, . . . , xn] ∈ R^{d×n}, and the linear transformation matrix is denoted by W ∈ R^{d×m}, projecting the input data into an m-dimensional subspace. Given a matrix W ∈ R^{d×m}, we denote the i-th row of W by w^i and the j-th column of W by w_j. The Frobenius norm of W is defined as ||W||_F = (Σ_{i=1}^d ||w^i||_2^2)^{1/2}, and the L2,1-norm of W is defined as ||W||_{2,1} = Σ_{i=1}^d ||w^i||_2. 1 is a vector of all ones with an appropriate length, 0 is a vector of all zeros, and I is an identity matrix with an appropriate size. Without loss of generality, we assume that X has been centered to zero mean, i.e., Σ_{i=1}^n xi = 0.
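Since the L2,1-norm drives the row-structured sparsity used throughout the paper, a short NumPy illustration of the two norms may be helpful (W is stored with rows corresponding to features):

import numpy as np

def frobenius_norm(W):
    # ||W||_F: square root of the sum of all squared entries
    return np.sqrt((W ** 2).sum())

def l21_norm(W):
    # ||W||_{2,1}: sum of the Euclidean norms of the rows; penalizing it
    # couples the columns and drives entire rows of W to zero
    return np.linalg.norm(W, axis=1).sum()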

2 A Review of LDA and Fisher Score

In this section, we briefly introduce two representative dimensionality reduction methods: Linear Discriminant Analysis [6] [22] [9] and Fisher Score [22], both of which are based on Fisher criterion.

2.1 Linear Discriminant Analysis

Linear discriminant analysis (LDA) [6] [22] [9] is a supervised subspace learning
method which is based on Fisher criterion. It aims to find a linear transformation W ∈ R^{d×m} that maps xi in the d-dimensional space to an m-dimensional space, in which the between-class scatter is maximized while the within-class scatter is minimized, i.e.,

arg max_W tr((W^T Sw W)^{−1} (W^T Sb W)),    (1)

where Sb and Sw are the between-class scatter matrix and within-class scatter matrix respectively, which are defined as

Sb = Σ_{k=1}^c nk (µk − µ)(µk − µ)^T,   Sw = Σ_{k=1}^c Σ_{i∈Ck} (xi − µk)(xi − µk)^T,    (2)

where Ck is the index set of the k-th class, µk and nk are the mean vector and size of the k-th class respectively in the input data space X, and µ = (1/n) Σ_{k=1}^c nk µk is the overall mean vector of the original data. It is easy to show that Eq. (1) is
the overall mean vector of the original data. It is easy to show that Eq. (1) is
equivalent to
arg max_W tr((W^T St W)^{−1} (W^T Sb W)),    (3)

where St is the total scatter matrix, defined as follows,

St = Σ_{i=1}^n (xi − µ)(xi − µ)^T.    (4)

Note that St = Sw + Sb .
According to [6], when the total scatter matrix St is non-singular, the solution of Eq. (3) consists of the top eigenvectors of the matrix St^{−1} Sb corresponding to nonzero eigenvalues. When the total scatter matrix St does not have full rank, the solution of Eq. (3) consists of the top eigenvectors of the matrix St^† Sb corresponding to nonzero eigenvalues, where St^† denotes the pseudo-inverse of St [7]. Note that when St is nonsingular, St^† equals St^{−1}.
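As a concrete illustration of this eigendecomposition view, a NumPy sketch (ours, not the authors' code) that uses the pseudo-inverse throughout and assumes X is centered, as stated in Section 1.1:

import numpy as np

def lda(X, y, m):
    """Top-m LDA directions from the eigenvectors of pinv(S_t) @ S_b.
    X: d x n centered data matrix, y: length-n label array, m <= c - 1."""
    d, n = X.shape
    y = np.asarray(y)
    St = X @ X.T                                   # total scatter, Eq. (4)
    Sb = np.zeros((d, d))
    for k in np.unique(y):
        mu_k = X[:, y == k].mean(axis=1, keepdims=True)
        Sb += (y == k).sum() * (mu_k @ mu_k.T)     # Eq. (2) with mu = 0
    evals, evecs = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    top = np.argsort(-evals.real)[:m]              # largest eigenvalues first
    return evecs[:, top].real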
LDA has been successfully applied to face recognition [2]. Following LDA,
many incremental works have been done, e.g., Uncorrelated LDA and Orthogonal
LDA [26], Local LDA [24], Semi-supervised LDA [4] and Sparse LDA [17] [5].
Note that all these methods suffer from the weakness of using all the original
features to learn the subspace.

2.2 Fisher Score for Feature Selection


The key idea of Fisher score [22] is to find a subset of features, such that in the
data space spanned by the selected features, the distances between data points
in different classes are as large as possible, while the distances between data
points in the same class are as small as possible. In particular, given the selected
m features, the input data matrix X ∈ Rd×n reduces to Z ∈ Rm×n . Then the
Fisher Score is formulated as follows,

arg max_Z tr(S̃t^{−1} S̃b),    (5)

where S̃b and S̃t are defined as

S̃b = Σ_{k=1}^c nk (µ̃k − µ̃)(µ̃k − µ̃)^T,   S̃t = Σ_{i=1}^n (zi − µ̃)(zi − µ̃)^T,    (6)

where µ̃k and nk are the mean vector and size of the k-th class respectively in the reduced data space Z, and µ̃ = (1/n) Σ_{k=1}^c nk µ̃k is the overall mean vector of the reduced data. Note that there are C(d, m) candidate Z's out of X, hence Fisher score is a combinatorial optimization problem.
We introduce an indicator variable p, where p = (p1 , . . . , pd )T and pi ∈
{0, 1}, i = 1, . . . , d, to represent whether a feature is selected or not. In order to
indicate that m features are selected, we constrain p by pT 1 = m. Then the
Fisher Score in Eq. (5) can be equivalently formulated as follows,

arg max_p tr{(diag(p) St diag(p))^{−1} (diag(p) Sb diag(p))},
s.t. p ∈ {0, 1}^d, p^T 1 = m,    (7)



where diag(p) is a diagonal matrix whose diagonal elements are the pi's, and Sb and St are the between-class scatter matrix and total scatter matrix defined in Eq. (2) and Eq. (4).
As can be seen, like other feature selection approaches [8], Fisher score only performs binary feature selection. It does not admit feature combination as LDA does.
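Since searching over all C(d, m) subsets is intractable, Fisher score is in practice usually computed feature by feature. A sketch of this standard per-feature heuristic (our simplification, not the exact combinatorial problem of Eq. (7)), which scores feature j by the ratio of the diagonal scatter entries, again assuming centered X:

import numpy as np

def fisher_scores(X, y):
    """Score each feature by (S_b)_jj / (S_t)_jj; the m highest-scoring
    features are then selected. X: d x n centered data, y: labels."""
    y = np.asarray(y)
    St_diag = (X ** 2).sum(axis=1)                 # diagonal of S_t
    Sb_diag = np.zeros(X.shape[0])
    for k in np.unique(y):
        mu_k = X[:, y == k].mean(axis=1)
        Sb_diag += (y == k).sum() * mu_k ** 2      # diagonal of S_b
    return Sb_diag / St_diag

Selecting the top m features is then a one-liner, e.g. selected = np.argsort(-fisher_scores(X, y))[:m].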
Based on the above discussion, we can see that LDA suffers from the problem
which Fisher score does not have, while Fisher score has the limitation which
LDA does not have. Hence, if we integrate LDA and Fisher score in a systematic way, they could complement and benefit from each other. This motivates the proposed method in this paper.

3 Linear Discriminant Dimensionality Reduction


In this section, we will integrate Fisher score and Linear Discriminant Analysis
in a unified framework. The key idea of our method is to find a subset of features,
based on which the learnt linear transformation via LDA maximizes the Fisher
criterion. It can be mathematically formulated as follows,

arg max_{W,p} tr{(W^T diag(p) St diag(p) W)^{−1} (W^T diag(p) Sb diag(p) W)},
s.t. p ∈ {0, 1}^d, p^T 1 = m,    (8)

which is a mixed integer program [3]. Eq. (8) is called Linear Discriminant Dimensionality Reduction (LDDR) because it is able to do feature selection and subspace learning simultaneously. It inherits the advantages of Fisher score and LDA. That is, it is able to find a subset of useful original features, based on which it generates new features by feature transformation. Given p = 1, Eq. (8) reduces to LDA as in Eq. (3). Letting W = I, Eq. (8) degenerates to Fisher score as in Eq. (7). Hence, both LDA and Fisher score can be seen as the
special cases of the proposed method. In addition, the objective functions cor-
responding to LDA and Fisher score are lower bounds of the objective function
of LDDR.
Recent studies [9] [27] established the relationship between LDA and multi-
variate linear regression problem, which provides a regression-based solution for
LDA. This motivates us to solve the problem in Eq.(8) in a similar manner. In the
following, we present a theorem, which establishes the equivalence relationship
between the problem in Eq.(8) and the problem in Eq.(9).

Theorem 1. The optimal p that maximizes the problem in Eq. (8) is the same
as the optimal p that minimizes the following problem
arg min_{p,W} (1/2) ||X^T diag(p) W − H||_F^2,
s.t. p ∈ {0, 1}^d, p^T 1 = m,    (9)

where H = [h1, . . . , hc] ∈ R^{n×c}, and hk is a column vector whose i-th entry is given by

hik = √(n/nk) − √(nk/n), if yi = k;   hik = −√(nk/n), otherwise.    (10)
In addition, the optimal W1 of Eq. (8) and the optimal W2 of Eq. (9) have the following relation

W2 = [W1, 0] Q^T,    (11)

under the mild condition that

rank(St) = rank(Sb) + rank(Sw),    (12)

where Q is an orthogonal matrix.

Proof. Due to space limit, we only give the sketch of the proof. On the one hand,
given the optimal W, the optimization problem in Eq. (8) with respect to p is
equivalent to the optimization problem in Eq. (9) with respect to p. On the
other hand, for any feasible p, the optimal W that maximizes the problem in
Eq. (8) and the optimal W that minimizes the problem in Eq. (9) satisfy the
relation in Eq. (11) according to Theorem 5.1 in [27]. The detailed proof will be
included in the longer version of this paper.

Note that the above theorem holds under the condition that X is centered with
zero mean. Since rank(St ) = rank(Sb ) + rank(Sw ) holds in many applications
involving high-dimensional and under-sampled data, the above theorem can be
applied widely in practice.
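For concreteness, a small sketch of the class indicator matrix of Eq. (10), which is all the supervision the least square problem in Eq. (9) needs (the helper name is ours):

import numpy as np

def indicator_matrix(y):
    """Build H of Eq. (10): h_ik = sqrt(n/n_k) - sqrt(n_k/n) if y_i = k,
    and -sqrt(n_k/n) otherwise. y: length-n array of class labels."""
    y = np.asarray(y)
    n, classes = len(y), np.unique(y)
    H = np.empty((n, len(classes)))
    for k, c in enumerate(classes):
        n_k = (y == c).sum()
        H[:, k] = -np.sqrt(n_k / n)                        # default entry
        H[y == c, k] = np.sqrt(n / n_k) - np.sqrt(n_k / n)  # class members
    return H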
According to theorem 1, the difference between W1 and W2 is the orthogonal
matrix Q. Since the Euclidean distance is invariant to any orthogonal transfor-
mation, if a classifier based on the Euclidean distance (e.g., K-Nearest-Neighbor
and linear support vector machine [9]) is applied to the dimensionality-reduced
data obtained by W1 and W2 , they will achieve the same classification result.
In our experiments, we use K-Nearest-Neighbor classifier.
Suppose we find the optimal solution of Eq. (9), i.e., W∗ and p∗. Then p∗ is a binary vector, and diag(p∗)W∗ is a matrix in which the elements of many rows are all zeros. This motivates us to absorb the indicator variables p into W and use the L2,0-norm on W to achieve feature selection, leading to the following problem

arg min_W (1/2) ||X^T W − H||_F^2,
s.t. ||W||_{2,0} ≤ m.    (13)

However, the feasible region defined by ||W||_{2,0} ≤ m is not convex. We relax ||W||_{2,0} ≤ m to its convex hull [3] and obtain the following relaxed problem,

arg min_W (1/2) ||X^T W − H||_F^2,
s.t. ||W||_{2,1} ≤ m.    (14)

Note that Eq. (14) is no longer equivalent to Eq. (8) due to the relaxation.
However, the relaxation makes the optimization problem computationally much
easier. In this sense, the relaxation can be seen as a tradeoff between the strict
equivalence and computational tractability.
Eq. (14) is equivalent to the following regularized problem,

arg min_W (1/2) ||X^T W − H||_F^2 + μ ||W||_{2,1},    (15)

where μ > 0 is a regularization parameter. Given an m, we could find a μ such that Eq. (14) and Eq. (15) achieve the same solution. However, it is difficult to give an analytical relationship between m and μ. Fortunately, such a relationship is not crucial for our problem. Since it is easier to tune μ than an integer m, we consider Eq. (15) in the rest of this paper.
Eq. (15) is mathematically similar to the Group Lasso problem [28] and multi-task
feature selection [14]. However, the motivations of our method and those methods
are essentially different. Our method aims at integrating feature selection and
subspace learning in a unified framework, while Group Lasso and multi-task
feature selection aim at discovering common feature patterns among multiple
related learning tasks. The objective function in Eq. (15) is a non-smooth but
convex function. In the following, we will present an algorithm for solving Eq.
(15). Similar algorithm has been used for multi-task feature selection [14].

3.1 Proximal Gradient Descent


The most natural approach for solving the problem in Eq. (15) is the sub-gradient descent method [3]. However, its convergence rate is very slow, i.e., O(1/ε²) [19]. Recently, proximal gradient descent has received increasing attention in the machine learning community [13] [14]. It achieves the optimal convergence rate of O(1/ε) for a first-order method and is able to deal with large-scale non-smooth convex problems. It can be seen as an extension of gradient descent, where the objective function to minimize is the composite of a smooth part and a non-smooth part. As to our problem, let

f(W) = (1/2) ||X^T W − H||_F^2,
F(W) = f(W) + μ ||W||_{2,1}.    (16)
It is easy to show that f (W) is convex and differentiable, while μ||W||2,1 is
convex but non-smooth.
In each iteration of the proximal gradient descent algorithm, F(W) is linearized around the current estimate Wt, and the value of W is updated as the solution of the following optimization problem,

Wt+1 = arg min_W Gηt(W, Wt),    (17)

where Gηt(W, Wt) is called the proximal operator, which is defined as

Gηt(W, Wt) = ⟨∇f(Wt), W − Wt⟩ + (ηt/2) ||W − Wt||_F^2 + μ ||W||_{2,1}.    (18)

In our problem, ∇f(Wt) = X X^T Wt − X H. The philosophy behind this formulation is that if the optimization problem in Eq. (17) can be solved by exploiting the structure of the L2,1-norm, then the convergence rate of the resulting algorithm is the same as that of the gradient descent method, i.e., O(1/ε), since no approximation of the non-smooth term is employed. It is worth noting that the proximal gradient descent can also be understood from the perspective of auxiliary function optimization [15].
By ignoring the terms in Gηt(W, Wt) that are independent of W, the optimization problem in Eq. (17) boils down to

Wt+1 = arg min_W (1/2) ||W − (Wt − (1/ηt) ∇f(Wt))||_F^2 + (μ/ηt) ||W||_{2,1}.    (19)

For the sake of simplicity, we denote Ut = Wt − (1/ηt) ∇f(Wt); then Eq. (19) takes the following form

Wt+1 = arg min_W (1/2) ||W − Ut||_F^2 + (μ/ηt) ||W||_{2,1},    (20)

which can be further decomposed into d separate subproblems, one for each row,

w^i_{t+1} = arg min_{w^i} (1/2) ||w^i − u^i_t||_2^2 + (μ/ηt) ||w^i||_2,    (21)

where w^i_{t+1}, w^i and u^i_t are the i-th rows of Wt+1, W and Ut respectively. It has a closed form solution [14] as follows

w^{i∗} = (1 − μ/(ηt ||u^i_t||_2)) u^i_t, if ||u^i_t||_2 > μ/ηt;   w^{i∗} = 0, otherwise.    (22)

Thus, the proximal gradient descent in Eq. (17) has the same convergence rate of O(1/ε) as gradient descent for smooth problems.
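The row-wise shrinkage of Eq. (22) takes only a few lines of NumPy; a sketch, where U corresponds to Ut and tau to μ/ηt:

import numpy as np

def prox_l21(U, tau):
    """Closed form of Eq. (22): shrink every row u^i of U towards zero;
    rows whose Euclidean norm does not exceed tau are zeroed out entirely."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * U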

3.2 Accelerated Proximal Gradient Descent

To achieve more efficient optimization, we employ Nesterov's method [19] to accelerate the proximal gradient descent in Eq. (17), which achieves a convergence rate of O(1/√ε). More specifically, we construct a linear combination of Wt and Wt+1 to update Vt+1 as follows:

Vt+1 = Wt + ((αt − 1)/αt+1) (Wt+1 − Wt),    (23)

where the sequence {αt}, t ≥ 1, is conventionally set to αt+1 = (1 + √(1 + 4αt²))/2. For
more detail, please refer to [13]. Here we directly present the final algorithm for
optimizing Eq. (15) in Algorithm 1.
The convergence of this algorithm is stated in the following theorem.
Algorithm 1. Linear Discriminant Dimensionality Reduction

Initialize: η0, W1 ∈ R^{d×m}, α1 = 1;
repeat
   while F(Wt) > Gηt−1(Wt, Wt) do
      Set ηt−1 = γηt−1
   end while
   Set ηt = ηt−1
   Compute Wt+1 = arg min_W Gηt(W, Vt)
   Compute αt+1 = (1 + √(1 + 4αt²))/2
   Compute Vt+1 = Wt + ((αt − 1)/αt+1)(Wt+1 − Wt)
until convergence
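A compact NumPy sketch of this loop, using prox_l21 from the previous sketch. For simplicity it replaces the backtracking search over η by the fixed step 1/L, with L the Lipschitz constant of ∇f, and uses the standard Nesterov extrapolation around the newest iterate; all helper names and defaults are ours, so this is an illustration rather than the authors' implementation.

import numpy as np

def lddr(X, H, mu, iters=200):
    """Accelerated proximal gradient for Eq. (15) (sketch).
    X: d x n centered data, H: n x c indicator matrix of Eq. (10)."""
    d, c = X.shape[0], H.shape[1]
    W = V = np.zeros((d, c))
    alpha = 1.0
    L = np.linalg.norm(X @ X.T, 2)        # Lipschitz constant of grad f
    for _ in range(iters):
        grad = X @ (X.T @ V - H)          # gradient of the smooth part at V
        W_next = prox_l21(V - grad / L, mu / L)          # Eqs. (20)-(22)
        alpha_next = (1 + np.sqrt(1 + 4 * alpha ** 2)) / 2
        V = W_next + ((alpha - 1) / alpha_next) * (W_next - W)
        W, alpha = W_next, alpha_next
    return W                              # all-zero rows = discarded features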

Theorem 2. [19] Let {Wt} be the sequence generated by Algorithm 1; then for any t ≥ 1 we have

F(Wt) − F(W∗) ≤ 2γL ||W1 − W∗||_F^2 / (t + 1)²,    (24)

where L is the Lipschitz constant of the gradient of f(W) in the objective function and W∗ = arg min_W F(W).
Theorem 2 shows that the convergence rate of the accelerated proximal gradient descent method is O(1/√ε).

4 Related Work
In this section, we discuss some approaches which are closely related to our
method.
In order to pursue sparsity and interpretability in LDA, [17] proposed both exact and greedy algorithms for binary-class sparse LDA as well as its spectral bound. For the multi-class problem, [5] proposed a sparse LDA (SLDA) based on ℓ1-norm regularized Spectral Regression,

arg min_w ||X^T w − y||_2^2 + μ ||w||_1,    (25)

where y is the eigenvector of Sb y = λ St y. Due to the nature of the ℓ1 penalty, some entries of w will be shrunk to exact zero if μ is large enough, which results
in a sparse projection. However, SLDA does not lead to feature selection, because
each column of the linear transformation matrix is optimized one by one, and
their sparsity patterns are independent. In contrast, our method is able to do
feature selection.
On the other hand, [16] proposed another feature selection method based on Fisher criterion, namely Linear Discriminant Feature Selection (LDFS), which modifies LDA to admit feature selection as follows,

arg min_W tr((W^T Sw W)^{−1} (W^T Sb W)) + μ Σ_{i=1}^d ||w^i||_∞,    (26)


where Σ_{i=1}^d ||w^i||_∞ is the ℓ1/ℓ∞ norm of W. The optimization problem is convex and solved by a quasi-Newton method [3]. Although LDFS involves a structured sparse transformation matrix W as in our method, it uses this matrix only to select features rather than to do feature selection and transformation together. Hence it is fundamentally a feature selection method. In comparison, our method uses the structured sparse transformation matrix for both feature selection and combination.

5 Experiments
In this section, we evaluate the proposed method, i.e., LDDR, and compare
it with the state of the art subspace learning methods, e.g. PCA, LDA and
Locality Preserving Projection (LPP) [12], sparse LDA (SLDA) [5]. We also
compare it with the feature selection methods, e.g., Fisher score (FS) and Linear
Discriminant Feature Selection (LDFS) [16]. Moreover, we study Fisher score
followed by LDA (FS+LDA), which is the most intuitive way to conduct Fisher score and LDA together. We use the K-Nearest-Neighbor classifier with K = 1 as the baseline method. All the experiments were performed in Matlab on an Intel Core2 Duo 2.8GHz Windows 7 machine with 4GB memory.

5.1 Data Sets


We use two standard face recognition databases which are used in [11] [5].
ORL face database¹ contains 10 images for each of the 40 human subjects,
which were taken at different times, varying the lighting, facial expressions and
facial details. The original images (with 256 gray levels) have size 92×112, which
are resized to 32 × 32 for efficiency.
Extended Yale-B database² contains 16128 face images of 38 human subjects under 9 poses and 64 illumination conditions. In our experiment, we choose the frontal pose and use all the images under different illumination, thus we get 2414 images in total. All the face images are manually aligned and cropped. They are resized to 32 × 32 pixels, with 256 gray levels per pixel. Thus each face image is represented as a 1024-dimensional vector.

5.2 Parameter Settings


For the ORL data set, p = 2, 3, 4 images were randomly selected as training samples for each person, and for the Yale-B data set, p = 10, 20, 30 images were randomly selected as training samples for each person. The remaining images were used for testing. The training set was used to learn a subspace, and the recognition was performed in the subspace by the K-Nearest-Neighbor classifier with K = 1, according to [5].
Since the training set was randomly chosen, we repeated each experiment 20
times and calculated the average recognition accuracy. In general, the recog-
nition rate varies with the dimensionality of the subspace. The best average
¹ http://www.cl.cam.ac.uk/Research/DTG/attarchive:pub/data
² http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html

performance obtained as well as the corresponding dimensionality is reported. It is worth noticing that for LDDR, the dimensionality of the subspace is exactly the same as the number of classes, i.e., c, according to Eq. (9).
For LDA, as in [2], we first use PCA to reduce the dimensionality to n − c and
then perform LDA to reduce the dimensionality to c − 1. This is also known as
Fisher Face [2]. For FS+LDA, we first use Fisher Score to select 50% of the features and then perform LDA to reduce the dimensionality. For LPP, we use the cosine
distance to compute the similarity between xi and xj . For SLDA, we tune μ
by searching the grid {10, 20, . . . , 100} on the testing set according to [5]. For
LDFS and LDDR, the regularization parameter μ is tuned by searching the grid
{0.01, 0.05, 0.1, 0.2, 0.5} on the testing set. As we know, tuning parameters on the testing set could be biased. However, since we did this for all the methods that have parameters to tune, it is still a fair comparison.

5.3 Recognition Results

The experimental results are shown in Table 1 and Table 2. We can observe that (1) in some cases, Fisher score is better than LDA, while in more cases, LDA outperforms Fisher score. This implies feature transformation may be more essential than feature selection; (2) LDFS is worse than Fisher score on the ORL data set, while it is better than Fisher score on the Yale-B data set. This indicates that the performance gain from doing feature selection under a slightly different criterion is limited; (3) SLDA is better than LDA, which implies sparsity is able to improve the classification performance of LDA; (4) FS+LDA improves both FS and LDA in most cases. It is even better than SLDA in some cases. This implies the potential performance gain of combining Fisher score and LDA. However, in some cases, FS+LDA is not as good as FS or LDA. This is because Fisher score and LDA are conducted individually in FS+LDA. The selected features by Fisher score are not necessarily useful for LDA; (5) LDDR outperforms FS, LDA, SLDA and FS+LDA consistently and overwhelmingly, which indicates that by performing Fisher score and LDA simultaneously to maximize the Fisher
Table 1. Face recognition accuracy on the ORL data set

Data set 2 training 3 training 4 training


Acc Dim Acc Dim Acc Dim
Baseline 66.81±3.41 – 77.02±2.55 – 81.73±2.27 –
PCA 66.81±3.41 79 77.02±2.55 119 81.73±2.27 159
FS 69.06±3.04 197 79.07±2.71 200 84.42±2.41 199
LDFS 62.69±3.43 198 75.45±2.28 192 81.96±2.56 188
LDA 71.27±3.58 28 83.36±1.84 39 89.63±2.01 39
LPP 72.41±3.17 39 84.20±1.73 39 90.42±1.41 39
FS+LDA 71.81±3.36 28 84.13±1.35 39 88.56±2.16 39
SLDA 74.14±2.92 39 84.86±1.82 39 91.44±1.53 39
LDDR 76.88±3.49 40 86.89±1.91 40 92.77±1.61 40

Table 2. Face recognition accuracy on the Yale-B data set

Data set 10 training 20 training 30 training


Acc Dim Acc Dim Acc Dim
Baseline 53.44±0.82 – 69.24±1.19 – 77.39±0.98 –
PCA 52.41±0.89 200 67.04±1.18 200 74.57±1.07 200
FS 64.34±1.40 200 76.53±1.19 200 82.15±1.14 200
LDFS 66.86±1.17 182 80.50±1.17 195 83.16±0.90 197
LDA 78.33±1.31 37 85.75±0.84 37 81.19±2.05 37
LPP 79.70±2.96 76 80.24±5.49 75 86.40±1.45 78
FS+LDA 77.89±1.82 37 87.89±0.88 37 93.91±0.69 37
SLDA 81.56±1.38 37 89.68±0.85 37 92.88±0.68 37
LDDR 89.45±1.11 38 96.44±0.85 38 98.66±0.43 38

criterion, Fisher score and LDA can enhance each other greatly. The features selected by LDDR should thus be more useful than those selected by Fisher score. We will illustrate this point later.

5.4 Projection Matrices

To get a better understanding of our approach, we plot the linear transformation matrices of our method and related methods on the ORL and Yale-B data sets in Fig. 1 and Fig. 2 respectively. Clearly, the linear transformation matrix of LDA is very dense, which is not easy to interpret. Each column of the linear transformation matrix of SLDA is sparse. However, the sparsity patterns of the columns are not coherent. In other words, for different dimensions of the subspace, the features selected by SLDA are different. Therefore, it is unclear which features are useful for the whole transformation. In contrast, the rows of the linear transformation matrix of LDDR tend to be zero simultaneously, which leads to joint feature selection and transformation. This is exactly what we pursue. Note that the sparsity of the linear transformation matrix of LDDR is controlled by the regularization parameter μ. That is, the number of selected features in LDDR is indirectly controlled by μ. We will show later that the performance of LDDR is not sensitive to μ.

5.5 Selected Features

We are also interested in the features selected by LDDR. We plot the top 50 selected features (pixels) of our method and Fisher score on the ORL and Yale-B data sets in Fig. 3 and Fig. 4 respectively. It is shown that the distribution of the features (pixels) selected by Fisher score is highly skewed. Most features lie in only one or two regions, and many features even reside in the non-face region. This implies that the features selected by Fisher score are not discriminative. In contrast, the features selected by LDDR are distributed widely across the face region.
From another perspective, we can see that the features (pixels) selected by LDDR are asymmetric. In other words, if one pixel is selected, its axially symmetric
(a) LDA    (b) SLDA    (c) LDDR

Fig. 1. The linear transformation matrix learned by (a) LDA, (b) SLDA (μ = 50) and
(c) LDDR (μ = 0.5) with 3 training samples per person on the ORL database. For
better viewing, please see it in color pdf file.

(a) LDA    (b) SLDA    (c) LDDR

Fig. 2. The linear transformation matrix learned by (a) LDA, (b) SLDA (μ = 50) and
(c) LDDR (μ = 0.5) with 20 training samples per person on the Yale-B database. For
better viewing, please see it in color pdf file.

(a) Fisher Score (b) LDDR

Fig. 3. Selected features (marked by blue cross) by (a) Fisher score and (b) LDDR
(μ = 0.5) with 3 training samples per person on the ORL database. For better viewing,
please see it in color pdf file.

one will not be selected. This is because the face image is roughly axially symmetric, so one pixel of an axially symmetric pair is redundant given that the other one is selected. Moreover, the selected pixels are mostly around the eyebrows and the boundaries of the eyes, nose and cheeks, which are discriminative for distinguishing face images of different people. This accords with common sense.

(a) Fisher Score (b) LDDR

Fig. 4. Selected features (marked by blue cross) by (a) Fisher score and (b) LDDR (μ = 0.5) with 20 training samples per person on the Yale-B database. For better viewing, please see it in color pdf file.

(a) 2 training    (b) 3 training    (c) 4 training

Fig. 5. Recognition accuracy with respect to the regularization parameter μ on the ORL database

(a) 10 training    (b) 20 training    (c) 30 training

Fig. 6. Recognition accuracy with respect to the regularization parameter μ on the Yale-B database

5.6 Sensitivity to the Regularization Parameter


LDDR only has one parameter, the regularization parameter μ, which indirectly controls the number of selected features. Here we investigate the recognition accuracy with respect to μ. We vary the value of μ and plot the recognition accuracy with respect to μ on the ORL and Yale-B data sets in Fig. 5 and Fig. 6 respectively.

As can be seen, LDDR is not sensitive to the regularization parameter μ over a wide range. In detail, LDDR achieves consistently good performance with μ varying from 0.01 to 0.1 on the ORL data set. LDDR is even more stable on the Yale-B data set, where it obtains overwhelmingly good results with μ changing from 0.01 to 0.5. This shows that within a certain range, the number of useful features does not affect the performance of the jointly learnt linear transformation very much. This is an appealing property because we do not need to tune the regularization parameter painfully in applications.

6 Conclusion
In this paper, we propose to integrate Fisher score and LDA in a unified frame-
work, namely Linear Discriminant Dimensionality Reduction. We aim at finding
a subset of features, based on which the learnt linear transformation via LDA
maximizes the Fisher criterion. LDDR inherits the advantages of Fisher score
and LDA and is able to do feature selection and subspace learning simultane-
ously. Both Fisher score and LDA can be seen as the special cases of the pro-
posed method. The resultant optimization problem is relaxed into an L2,1-norm constrained least square problem and solved by an accelerated proximal gradient descent algorithm. Experiments on benchmark face recognition data sets illustrate
the efficacy of the proposed framework.

Acknowledgments. The work was supported in part by NSF IIS-09-05215,


U.S. Air Force Office of Scientific Research MURI award FA9550-08-1-0265,
and the U.S. Army Research Laboratory under Cooperative Agreement Num-
ber W911NF-09-2-0053 (NS-CTA). We thank the anonymous reviewers for their
helpful comments.

DB-CSC: A Density-Based Approach for
Subspace Clustering in Graphs with Feature
Vectors

Stephan Günnemann, Brigitte Boden, and Thomas Seidl

RWTH Aachen University, Germany


{guennemann,boden,seidl}@cs.rwth-aachen.de

Abstract. Data sources representing attribute information in combi-


nation with network information are widely available in today’s appli-
cations. To realize the full potential for knowledge extraction, mining
techniques like clustering should consider both information types si-
multaneously. Recent clustering approaches combine subspace clustering
with dense subgraph mining to identify groups of objects that are simi-
lar in subsets of their attributes as well as densely connected within the
network. While those approaches successfully circumvent the problem
of full-space clustering, their limited cluster definitions are restricted to
clusters of certain shapes.
In this work, we introduce a density-based cluster definition taking
the attribute similarity in subspaces and the graph density into account.
This novel cluster model enables us to detect clusters of arbitrary shape
and size. We avoid redundancy in the result by selecting only the most in-
teresting non-redundant clusters. Based on this model, we introduce the
clustering algorithm DB-CSC. In thorough experiments we demonstrate
the strength of DB-CSC in comparison to related approaches.

1 Introduction
In the past few years, data sources representing attribute information in com-
bination with network information have become more numerous. Such data
describes single objects via attribute vectors and also relationships between dif-
ferent objects via edges. Examples include social networks, where friendship
relationships are available along with the users’ individual interests (cf. Fig. 1);
systems biology, where interacting genes and their specific expression levels are
recorded; and sensor networks, where connections between the sensors as well
as individual measurements are given. There is a need to extract knowledge
from such complex data sources, e.g. for finding groups of homogeneous objects,
i.e. clusters. Throughout the past decades, a multitude of clustering techniques has been introduced that solely consider either attribute information or network information. However, simply applying one of these techniques misses the potential
given by such combined data sources. To detect more informative patterns it is
preferable to simultaneously consider relationships together with attribute in-
formation. In this work, we focus on the mining task of clustering to extract
meaningful groups from such complex data.


(a) Attribute view (2d subspace)   (b) Graph view (extract)

Fig. 1. Clusters in a social network with attributes age, TV consumption, and web consumption

Former techniques considering both data types aim at a combination of dense subgraph mining (regarding relationships) and traditional clustering (regarding attributes). The detected clusters are groups of objects showing high graph connectivity as well as similarity w.r.t. all of their attributes. However, using traditional, i.e. full-space, clustering approaches on such complex data mostly does not lead to meaningful patterns. Usually, for each object a multitude of different characteristics is recorded; yet not all of these characteristics are relevant for each cluster, and thus clusters are located only in subspaces of the attributes. In social networks, e.g., it is very unlikely that people are similar in all of their characteristics. In Fig. 1(b) the customers show similarity in two of their three attributes. In these scenarios, applying full-space clustering is futile or leads to very questionable clustering results, since irrelevant dimensions strongly obfuscate the clusters and distances are no longer discriminable [4]. Consequently, recent approaches [16,9] combine the paradigms of dense subgraph mining and subspace clustering, i.e. clusters are identified in locally relevant subspace projections of the attribute data.
Joining these two paradigms leads to clusters useful for many applications: In social networks, closely related friends with similar interests in some product-relevant attributes are useful for target marketing. In systems biology, functional modules with partly similar expression levels can be used for novel drug design. In sensor networks, long-distance reports of connected sensors sharing some similar measurements can be accumulated and transferred by just one representative to reduce energy consumption. Overall, by using subspace clustering the problems of full-space similarity are, in principle, circumvented for these complex data. However, the cluster models of the existing approaches [16,9] have a severe limitation: they are restricted to clusters of certain shapes. This holds for the properties a cluster has to fulfill w.r.t. the network information as well as w.r.t. the attribute information. While models like quasi-cliques are used for the graph structure, for the vector data the objects’ attribute values are merely allowed to deviate within an interval of fixed width that the user has to specify. Simply stated, the clustered objects have to be located in a rectangular hypercube of given width. For real-world data, such cluster definitions are usually too restrictive since clusters can exhibit more complex shapes. Consider, for example, the 2d attribute subspace of the objects in Fig. 1(a): The two clusters cannot be correctly detected by the rectangular hypercube model of [16,9]. Either both clusters are merged or some

objects are lost. In Fig. 1(b) an extract of the corresponding network structure
is shown. Since the restrictive quasi-clique property would only assign a low
density to this cluster, the cluster would probably be split by [9].
In this work, we combine dense subgraph mining with subspace clustering
based on a more sophisticated cluster definition; thus solving the drawbacks of
previous approaches. Established for other data types, density-based notions of
clusters have shown their strength in many cases. Thus, we introduce a density-
based clustering principle for the considered combined data sources. Our clusters
correspond to dense regions in the attribute space as well as in the graph. Based
on local neighborhoods taking the attribute similarity in subspaces as well as
the graph information into account, we model the density of single objects. By
merging all objects located in the same dense region, the overall clusters are
obtained. Thus, our model is able to detect the clusters in Fig. 1 correctly.
Besides the sound definition of clusters based on density values, we achieve a
further advantage. In contrast to previous approaches, the clusters in our model
are not limited in their size or shape but can show arbitrary shapes. For example,
the diameter of the clusters detected in [9] is a priori restricted by a parameter-
dependent value [17] leading to a bias towards clusters of small size and little
extent. Such a bias is avoided in our model, where the sizes and shapes of clusters
are automatically detected. Overall, our contributions are:

– We introduce a novel density-based cluster definition taking attributes in subspaces and graph information into account.
– We ensure an unbiased cluster detection since our clusters can have arbitrary shape and size.
– We develop the algorithm DB-CSC to determine such a clustering solution.

2 Related Work

Different clustering methods have been proposed in the literature. Clustering vector data is traditionally done by using all attributes of the feature space. Density-
based techniques [8,11] have shown their strength in contrast to other full-space
clustering approaches like k-means. They do not require the number of clusters
as an input parameter and are able to find arbitrarily shaped clusters. However,
full-space clustering does not scale to high dimensional data since locally irrele-
vant dimensions obfuscate the clustering structure [4,14]. As a solution, subspace
clustering methods detect an individual set of relevant dimensions for each clus-
ter [14]. Also for this mining paradigm, density-based clustering approaches [13,3]
have resolved the drawbacks of previous subspace cluster definitions. However,
none of the introduced subspace clustering techniques considers graph data.
Mining graph data can be done in various ways [1]. The task of finding groups
of densely connected objects in one large graph, as needed for our method, is
often referred to as “graph clustering” or “dense subgraph mining”. An overview of the various dense subgraph mining techniques is given in [1]. Furthermore,
several different notions of density are used for graph data including cliques,

γ-quasi-cliques [17], and k-cores [5,12]. However, none of the introduced dense
subgraph mining techniques considers attribute data annotated to the vertices.
There are some methods considering graph data and attribute data. In [6,15]
attribute data is only used in a post-processing step. [10] transforms the network
into a distance function and afterwards applies traditional clustering. In [18],
contrarily, the attribute information is transformed into a graph. [7] extends the
k-center problem by requiring that each group has to be a connected subgraph.
These approaches [10,18,7] perform full-space clustering on the attributes. The methods of [19,20] enrich the graph with further nodes based on the vertices’ attribute values and connect them to the vertices showing the respective value. The clustered objects are only pairwise similar, and no specific relevant dimensions can be defined.
Recently, two approaches [16,9] were introduced that deal with subspace clus-
tering and dense subgraph mining. However, both approaches use too simple
cluster definitions. Similar to grid-based subspace clustering [2], a cluster (w.r.t.
the attributes) is simply defined by taking all objects located within a given
grid cell, i.e. whose attribute values differ by at most a given threshold. The
methods are biased towards small clusters with little extent. This drawback is worsened further by the employed notions of dense subgraphs: e.g., by using quasi-cliques as in [9], the diameter is a priori constrained to a fixed threshold [17]. Very similar objects located just slightly outside a cluster are lost due to such restrictive models. Overall, finding meaningful patterns based on the cluster definitions of [16,9] is questionable. Furthermore, the method in [16] does not eliminate redundancy, which usually occurs in the analysis of subspace projections due to the exponential search space containing highly similar clusters.
Our novel model is the first approach using a density-based notion of clusters
for the combination of subspace clustering and dense subgraph mining. This allows arbitrarily shaped and arbitrarily sized clusters, hence leading to an unbiased definition of clusters. We remove redundant clusters induced by similar subspace projections, resulting in meaningful result sizes.

3 A Density-Based Clustering Model for Combined Data


In this section we introduce our density-based clustering model for the combined
clustering of graph data and attribute data. The clusters in our model correspond
to dense regions in the graph as well as in the attribute space. For simplicity,
we first introduce in Section 3.1 the cluster model for the case that only a single
subspace, e.g. the full-space, is considered. The extension to subspace clustering
and the definition of a redundancy model to confine the final clustering to the
most interesting subspace clusters is introduced in Section 3.2.
Formally, the input for our model is a vertex-labeled graph G = (V, E, l) with vertices V, edges E ⊆ V × V, and a labeling function l : V → R^D, where Dim = {1, . . . , D} is the set of dimensions. We assume an undirected graph without self-loops, i.e. (v, u) ∈ E ⇔ (u, v) ∈ E and (u, u) ∉ E. Furthermore, we use x[i] to refer to the i-th component of a vector x ∈ R^D.
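To make the following definitions concrete, we accompany them with small Python sketches; the data model below fixes our (purely illustrative) encoding of such a vertex-labeled graph, with all names and attribute values being assumptions rather than part of the paper.

```python
# A minimal sketch of the input: an undirected, vertex-labeled graph G = (V, E, l).
V = {"v1", "v2", "v3", "v4"}
E = {("v1", "v2"), ("v2", "v3"), ("v3", "v4")}       # undirected, no self-loops
l = {"v1": (25.0, 5.5), "v2": (27.0, 5.0),           # l : V -> R^D with D = 2
     "v3": (26.0, 4.5), "v4": (28.0, 4.0)}

# Symmetric adjacency view used by the later sketches.
adj = {v: set() for v in V}
for u, w in E:
    adj[u].add(w)
    adj[w].add(u)
```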

3.1 Cluster Model for a Single Subspace


In this section we introduce our density-based cluster model for the case of
a single subspace. The basic idea of density-based clustering is that clusters
correspond to connected dense regions in the dataspace that are separated by
sparse regions. Therefore, each clustered object x has to exceed a certain minimal
density, i.e. its local neighborhood has to contain a sufficiently high number of
objects that are also located in x’s cluster. Furthermore, in order to form a
cluster, a set of objects has to be connected w.r.t. their density, i.e. objects
located in the same cluster have to be connected via a chain of objects from the
cluster such that each object lies within the neighborhood of its predecessor.
Consequently, one important aspect of our cluster model is the proper definition
of a node’s local neighborhood.
If we simply determined the neighborhood based on the attribute data, as done in [8,11], we could define the attribute neighborhood of a vertex v by its ε-neighborhood, i.e. the set of all objects whose distances to v do not exceed a threshold ε. Formally,

N_ε^V(v) = {u ∈ V | dist(l(u), l(v)) ≤ ε}

with an appropriate distance function like the maximum norm dist(x, y) = max_{i∈{1,...,D}} |x[i] − y[i]|. Just considering attribute data, however, leads to the problem illustrated in Fig. 2. Considering the attribute space, the red triangles form a dense region. (The edges of the graph and the parameter ε are depicted in the figure.) However, this group is not a meaningful cluster, since in the graph this vertex set is not densely connected. Accordingly, we have to consider the graph data and attribute data simultaneously to determine the neighborhood.
Intuitively, taking the graph into account can be done by just using adjacent vertices for the density computation. The resulting simple combined neighborhood of a vertex v would be the intersection

N_{ε,adj}^V(v) = N_ε^V(v) ∩ {u ∈ V | (u, v) ∈ E}

In Fig. 2 the red triangles would not be dense anymore because their simple
combined neighborhoods are empty. However, using just the adjacent vertices
leads to a too restrictive cluster model, as the next example in Fig. 4 shows.
Assuming that each vertex has to contain 3 objects in its neighborhood (in-
cluding the object itself) to be dense, we get two densely connected sets, i.e. two
clusters, in Fig. 4(a). In Fig. 4(b), we have the same vertex set, the same set
of attribute vectors and the same graph density. The example only differs from
the first one by the interchange of the attribute values of the vertices v3 and
v4 , which both belong to the same cluster in the first example. Intuitively, this
set of vertices should also be a valid cluster in our definition. However, it is not, because the neighborhood of v2 contains just the vertices {v1, v2}. The vertex v4 is not considered since it is merely similar w.r.t. the attributes but not adjacent.
The missing tolerance w.r.t. interchanges of the attribute values is one problem
induced by using just adjacent vertices. Furthermore, this approach would not be tolerant w.r.t. small errors in the edge set.

Fig. 2. Dense region in the attribute space but sparse region in the graph

Fig. 3. Novel challenge by using local density computation and k-neighborhoods

Fig. 4. Robust cluster detection by using k-neighborhoods: (a) successful with k ≥ 1, minPts = 3; (b) successful with k ≥ 2, minPts = 3. Adjacent vertices (k = 1) are not always sufficient for correct detection (left: successful; right: fails)

For example, in social networks,
some friendship links are not present in the current snapshot although the people
are aware of each other. Such errors should not prevent a good cluster detection.
Thus, in our approach we consider all vertices that are reachable over at most k
edges to obtain a more error-tolerant model. Formally, the neighborhood w.r.t.
the graph data is given by:
Definition 1 (Graph k-neighborhood). A vertex u is k-reachable from a vertex v (over a set of vertices V) if

∃v1, . . . , vk ∈ V : v1 = v ∧ vk = u ∧ ∀i ∈ {1, . . . , k − 1} : (vi, vi+1) ∈ E

The graph k-neighborhood of a vertex v ∈ V is given by

N_k^V(v) = {u ∈ V | u is x-reachable from v (over V) ∧ x ≤ k} ∪ {v}
Please note that the object v itself is contained in its neighborhood N_k^V(v) as well. Overall, the combined neighborhood of a vertex v ∈ V considering graph and attribute data can be formalized by intersecting v’s graph k-neighborhood with its ε-neighborhood.

Definition 2 (Combined local neighborhood). The combined neighborhood of v ∈ V is:

N^V(v) = N_k^V(v) ∩ N_ε^V(v)
Using the combined neighborhood N^V(v) and k ≥ 2, we get the same two clusters in Fig. 4(a) and Fig. 4(b). In both examples v2’s neighborhood contains 3 vertices, e.g. N^V(v2) = {v1, v2, v4} in Fig. 4(b). So to speak, we “jump over” the vertex v3 to find further vertices that are similar to v2 in the attribute space.
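Definitions 1 and 2 translate directly into code on top of the data model above; a sketch with our own function names, where the ε-neighborhood already uses the maximum norm restricted to a subspace S (anticipating Section 3.2; choose S = Dim for the single-subspace case):

```python
def graph_k_neighborhood(v, O, adj, k):
    """Vertices of O reachable from v over at most k edges within O (Def. 1);
    v itself is included."""
    reached, frontier = {v}, {v}
    for _ in range(k):
        frontier = {w for u in frontier for w in adj[u] if w in O} - reached
        reached |= frontier
    return reached

def dist_S(x, y, S):
    """Maximum norm restricted to a subspace S (a set of dimension indices)."""
    return max(abs(x[i] - y[i]) for i in S)

def combined_neighborhood(v, O, adj, l, k, eps, S):
    """Combined local neighborhood (Def. 2): graph k-neighborhood
    intersected with the eps-neighborhood in subspace S."""
    return {u for u in graph_k_neighborhood(v, O, adj, k)
            if dist_S(l[u], l[v], S) <= eps}
```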

Local density calculation. As mentioned, the “jumping” principle is necessary to get meaningful clusters. However, it leads to a novel challenge not present in previous density-based clustering approaches. Considering Figure 3, the vertices on the right hand side form a combined cluster and are clearly separated from the remaining vertices by their attribute values. However, on the left hand side we have two separate clusters, one consisting of the vertices depicted as dots and one consisting of the vertices depicted as triangles. If we considered only attribute data for clustering, these two clusters would be merged into one big cluster, as the vertices are very similar w.r.t. their attribute values. If we just used adjacent vertices, this big cluster would not be valid, as there are no edges between the dot vertices and the triangle vertices. However, since we allow “jumps” over vertices, the two clusters could be merged, e.g. for k = 2.
This problem arises because the dots and triangles are connected via the
vertices on the right hand side, i.e. we “jump” over vertices that actually do not
belong to the final cluster. Thus, a “jump” has to be restricted to the objects
of the same cluster. In Fig. 4(b) e.g., the vertices {v2 , v3 , v4 } belong to the same
cluster. Thus, reaching v4 over v3 to increase the density of v2 is meaningful.
Formally, we have to restrict the k-reachability used in Def. 1 to the vertices
contained in the cluster O. In this case, the vertices on the right hand side of
Fig. 3 cannot be used for “jumping” and thus the dots are not in the triangles’
neighborhoods and vice versa. Thus, we are able to separate the two clusters.
Overall, for computing the neighborhoods and hence the densities, only ver-
tices v ∈ O of the same cluster O can be used. While previous clustering ap-
proaches calculate the densities w.r.t. the whole database (all objects in V ),
our model calculates the densities within the clusters (objects in O). Instead of
calculating global densities, we determine local densities based on the cluster O.
While this sounds nice in theory, it is difficult to solve in practice, as obviously
the set of clustered objects O is not known a priori but has to be determined.
The cluster O depends on the density values of the objects, while the density
values depend on O. So we get a cyclic dependency of both properties.
In our theoretical clustering model, we can solve this cyclic dependency by assuming a set of clustered objects O as given. The algorithmic solution is presented in Sec. 4. Formally, given the set of clustered objects O, three properties have to be fulfilled: First, each vertex v ∈ O has to be dense w.r.t. its local neighborhood. Second, the region spanned by these objects has to be (locally) connected, since otherwise the set would comprise two or more clusters (cf. Fig. 3). Last, the set O has to be maximal w.r.t. the previous properties, since otherwise some vertices of the cluster are lost. Overall,

Definition 3 (Density-based combined cluster). A combined cluster in a graph G = (V, E, l) w.r.t. the parameters k, ε, and minPts is a set of vertices O ⊆ V that fulfills the following properties:
(1) high local density: ∀v ∈ O : |N^O(v)| ≥ minPts
(2) locally connected: ∀u, v ∈ O : ∃w1, . . . , wl ∈ O : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : wi ∈ N^O(wi+1)
(3) maximality: ¬∃O′ ⊃ O : O′ fulfills (1) and (2)

Please note that the neighborhood calculation N^O(v) is always done w.r.t. the set O and not w.r.t. the whole graph V. Based on this definition and by using minPts = 3, k ≥ 2, we can e.g. detect the three clusters from Fig. 3. If the value of k is chosen too small (as, for example, k = 1 in Fig. 4(b)), clusters are often split or not even detected at all. On the other hand, if we choose k too high, we run the risk that the graph structure is not adequately considered any more. However, for the case depicted in Fig. 3 our model detects the correct clusters even for arbitrarily large k values, as the cluster on the right hand side is clearly
separated from the other clusters by its attribute values. The two clusters on
the left hand side are never merged by our model as we do not allow jumps over
vertices outside the cluster.
In this section we introduced our combined cluster model for the case of a single subspace. As shown in the examples, the model can detect clusters that are dense in the graph as well as in the attribute space and that often cannot be detected by previous approaches.
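Conditions (1) and (2) of Definition 3 can then be checked for a candidate vertex set O as sketched below, reusing combined_neighborhood from above; condition (3), maximality, is a property of the enumeration and is handled algorithmically in Section 4. The sketch assumes a non-empty O.

```python
def is_dense_and_connected(O, adj, l, k, eps, S, min_pts):
    """Check conditions (1) and (2) of Definition 3 for a candidate set O."""
    # (1) high local density: every vertex has >= min_pts objects in N^O(v).
    if any(len(combined_neighborhood(v, O, adj, l, k, eps, S)) < min_pts
           for v in O):
        return False
    # (2) locally connected: O forms one component w.r.t. the (symmetric)
    #     combined-neighborhood relation.
    start = next(iter(O))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for u in combined_neighborhood(v, O, adj, l, k, eps, S) - seen:
            seen.add(u)
            stack.append(u)
    return seen == O
```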

3.2 Overall Subspace Clustering Model

In this section we extend our cluster model to a subspace clustering model. Besides the adapted cluster definition, we have to take care of redundancy problems due to the exponentially many subspace projections. As mentioned in the last section, we are using the maximum norm in the attribute space. If we want to analyze subspace projections, we can simply define the maximum norm restricted to a subspace S ⊆ Dim as

dist_S(x, y) = max_{i∈S} |x[i] − y[i]|

In principle, any Lp norm can be restricted in this way and used within our model. Based on this distance function, we can define a subspace cluster which fulfills the cluster properties in just a subset of the dimensions:

Definition 4 (Density-based combined subspace cluster). A combined subspace cluster C = (O, S) in a graph G = (V, E, l) consists of a set of vertices O ⊆ V and a set of relevant dimensions S ⊆ Dim such that O forms a combined cluster (cf. Def. 3) w.r.t. the local subspace neighborhood N_S^O(v) = N_k^O(v) ∩ N_{ε,S}^O(v) with N_{ε,S}^O(v) = {u ∈ O | dist_S(l(u), l(v)) ≤ ε}.

As we can show, our subspace clusters have the anti-monotonicity property: For a subspace cluster C = (O, S) and every S′ ⊆ S, there exists a vertex set O′ ⊇ O such that (O′, S′) is a valid cluster. This property is used in our algorithm to find the valid clusters more efficiently.

Proof. For every two subspaces S, S′ with S′ ⊆ S and every pair of vertices u, v, it holds that dist_{S′}(l(u), l(v)) ≤ dist_S(l(u), l(v)). Thus for every vertex v ∈ O we get N_{S′}^O(v) ⊇ N_S^O(v). Accordingly, the properties (1) and (2) from Def. 3 are fulfilled by (O, S′). If (O, S′) is maximal w.r.t. these properties, then (O, S′) is a valid combined subspace cluster. Else, by definition there exists a vertex set O′ ⊃ O such that (O′, S′) is a valid subspace cluster.

Redundancy removal. Because of the density-based subspace cluster model, there cannot be an overlap between clusters in the same subspace. However, a vertex can belong to several subspace clusters in different subspaces; thus clusters from different subspaces can overlap. Due to the exponential number of possible subspace projections, an overwhelming number of (very similar) subspace clusters may exist. Whereas allowing overlapping clusters in general makes sense in many applications, allowing too much overlap can lead to highly redundant information. Thus, for a meaningful interpretation of the result, removing redundant clusters is crucial. Instead of simply reporting the highest-dimensional subspace clusters or the clusters of maximal size, we use a more sophisticated redundancy model to confine the final clustering to the most interesting clusters.
However, defining the interestingness of a subspace cluster is a non-trivial task
since we have two important properties “number of vertices” and “dimension-
ality”. Obviously, optimizing both measures simultaneously is not possible as
clusters with higher dimensionality usually consist of fewer vertices than their
low-dimensional counterparts (cf. anti-monotonicity). Thus we have to realize
a trade-off between the size and the dimensionality of a cluster. We define the
interestingness of a single cluster as follows:
Definition 5 (Interestingness of a cluster). The interestingness of a com-
bined subspace cluster C = (O, S) is computed as
Q(C) = |O| · |S|
For the selection of the final clustering based on this interestingness definition
we use the redundancy model introduced by [9]. This model defines a binary
redundancy relation between two clusters as follows:
Definition 6 (Redundancy between clusters). Given the redundancy parameters r_obj, r_dim ∈ [0, 1], the binary redundancy relation ≺red is defined by: For all combined clusters C = (O, S), C′ = (O′, S′):

C ≺red C′ ⇔ Q(C) < Q(C′) ∧ |O ∩ O′| / |O| ≥ r_obj ∧ |S ∩ S′| / |S| ≥ r_dim
Using this relation, we consider a cluster C as redundant w.r.t. another cluster C′ if C’s quality is lower and the overlap of the clusters’ vertices and dimensions exceeds a certain threshold. It is crucial to require both overlaps for the redundancy: e.g., two clusters that contain similar vertices but lie in totally different dimensions represent different information and thus should both be considered in the output.
Please note that the redundancy relation ≺red is non-transitive, i.e. we cannot just discard every cluster that is redundant w.r.t. any other cluster. Therefore we have to select the final clustering such that it does not contain two clusters that are redundant w.r.t. each other. At the same time, the result set has to be maximal with this property, i.e. we cannot just leave out a cluster that is non-redundant. Overall, the final clustering has to fulfill the following properties:
Definition 7 (Optimal density-based combined subspace clustering). Given the set of all combined clusters Clusters, the optimal combined clustering Result ⊆ Clusters fulfills
– the redundancy-free property: ¬∃Ci, Cj ∈ Result : Ci ≺red Cj
– the maximality property: ∀Ci ∈ Clusters\Result : ∃Cj ∈ Result : Ci ≺red Cj
Our clustering model enables us to detect arbitrarily shaped subspace clusters
based on attribute and graph densities without generating redundant results.
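The quality function of Def. 5, the redundancy relation of Def. 6, and one valid result in the sense of Def. 7 can be sketched as follows (our naming; clusters are pairs (O, S) of Python sets). Because clusters are visited in descending quality, no already selected cluster can be redundant w.r.t. a later candidate, so checking each candidate against the current result suffices for both properties of Def. 7.

```python
def quality(cluster):
    O, S = cluster
    return len(O) * len(S)                        # Q(C) = |O| * |S| (Def. 5)

def redundant(C, C2, r_obj, r_dim):
    """C is redundant w.r.t. C2 according to Definition 6."""
    (O, S), (O2, S2) = C, C2
    return (quality(C) < quality(C2)
            and len(O & O2) / len(O) >= r_obj
            and len(S & S2) / len(S) >= r_dim)

def select_result(clusters, r_obj, r_dim):
    """Greedy, quality-descending selection: redundancy-free and maximal."""
    result = []
    for C in sorted(clusters, key=quality, reverse=True):
        if not any(redundant(C, C2, r_obj, r_dim) for C2 in result):
            result.append(C)
    return result
```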

4 The DB-CSC Algorithm


In the following section we describe the DB-CSC (Density-Based Combined
Subspace Clustering) algorithm for detecting the optimal clustering result. In
Section 4.1 we present the detection of our density-based clusters in a single
subspace S and in Section 4.2 we introduce the overall processing scheme using
different pruning techniques to enhance the efficiency.

4.1 Finding Clusters in a Single Subspace


To detect the clusters in a single subspace S, we first introduce a graph trans-
formation that represents attribute and graph information simultaneously.
Definition 8 (Enriched subgraph). Given a set of vertices O ⊆ V, a subspace S, and the original graph G = (V, E, l), the enriched subgraph G_S^O = (V′, E′) is defined by V′ = O and E′ = {(u, v) | v ∈ N_S^O(u) ∧ v ≠ u} using the distance function dist_S.

(a) Original graph   (b) Enriched graph G_S^V   (c) G_S^{O_i} for subsets O_i

Fig. 5. Finding clusters by (minPts−1)-cores in the enriched graphs (k = 2, minPts = 3)

Two vertices are adjacent in the enriched subgraph iff their attribute values are
similar in S and the vertices are connected by at most k edges in the origi-
nal graph (using just vertices from O). In Fig. 5(b) the enriched subgraph for
the whole set of vertices V is computed while Fig. 5(c) just considers the sub-
set O1 and O2 respectively. In this graph we concentrate on the detection of
(minP ts−1)-cores, which are defined as maximal connected subgraphs Oi ⊆ V
in which all vertices have at least a degree of (minP ts−1). We show:
Theorem 1 (Equivalence of representations). Let O ⊆ V be a set of vertices and S ⊆ Dim a subspace. C = (O, S) fulfills properties (1) and (2) of Definition 3 if and only if the enriched subgraph G_S^O = (O, E′) contains a single (minPts−1)-core that covers all vertices of O.

Proof. ∗ For property (1) of Def. 3 we get: high local density ⇔ ∀v ∈ O : |N^O(v)| ≥ minPts ⇔ ∀v ∈ O : |N^O(v)\{v}| ≥ minPts − 1 ⇔ ∀v ∈ O : deg_{E′}(v) ≥ minPts − 1 ⇔ minimal vertex degree of minPts − 1 in G_S^O
∗ For property (2) of Def. 3 we get: locally connected ⇔ ∀u, v ∈ O : ∃w1, . . . , wl ∈ O : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : wi ∈ N_S^O(wi+1) ⇔ ∀u, v ∈ O : ∃w1, . . . , wl ∈ O : w1 = u ∧ wl = v ∧ ∀i ∈ {1, . . . , l − 1} : (wi, wi+1) ∈ E′ ⇔ G_S^O is connected

The theorem implies that our algorithm only has to analyze vertices that potentially lead to (minPts−1)-cores. The important observation is: If G_S^O is a (minPts−1)-core, then for each graph G_S^{O′} with O′ ⊇ O the set O will also be contained in a (minPts−1)-core. (This holds since N_S^{O′}(u) ⊇ N_S^O(u) and hence G_S^{O′} contains all edges of G_S^O.) Thus, each potential cluster (O, S) in particular has to be contained within a (minPts−1)-core of the graph G_S^V. Overall, we first extract all (minPts−1)-cores from G_S^V, since only these sets can lead to valid clusters. In Fig. 5(b) these sets are highlighted.
However, keep in mind that not all (minPts−1)-cores correspond to valid clusters. Theorem 1 requires that a single (minPts−1)-core covering all vertices is induced by the enriched subgraph. Figure 5(b), for example, contains two cores. As already discussed, the left set O1 is not a valid cluster but has to be split up. Thus, if the graph G_S^V contains a single (minPts−1)-core O1 with O1 = V, we get a valid cluster and the cluster detection is finished in this subspace. In the other cases, however, we recursively repeat the procedure for each (minPts−1)-core {O1, . . . , Om} detected in G_S^V, i.e. we determine the smaller graphs G_S^{O_i} and their contained (minPts−1)-cores. Since in each step the (maximal) (minPts−1)-cores are analyzed and refined, we ensure besides properties (1) and (2) – due to Theorem 1 – also property (3) of Definition 3.
Formally, the set of resulting clusters Clus = {C1, . . . , Cm} corresponds to a fixpoint of the function

f(Clus) = {(O′, S) | O′ is a (minPts−1)-core in G_S^{O_i} with Ci = (Oi, S) ∈ Clus}

This fixpoint can be reached by f(f(. . . f({(V, S)}))) = Clus. Overall, this procedure enables us to detect all combined clusters in the subspace S.
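A compact sketch of this single-subspace procedure, reusing combined_neighborhood from Section 3 (function names are ours): it builds the enriched subgraph of Definition 8, peels it down to its (minPts−1)-cores, and iterates until every candidate induces a single core covering all of its vertices.

```python
def enriched_subgraph(O, adj, l, k, eps, S):
    """Adjacency of the enriched subgraph G_S^O (Definition 8)."""
    return {v: combined_neighborhood(v, O, adj, l, k, eps, S) - {v} for v in O}

def cores(graph, min_deg):
    """Maximal connected subgraphs whose vertices all have degree >= min_deg."""
    g = {v: set(ns) for v, ns in graph.items()}
    weak = {v for v, ns in g.items() if len(ns) < min_deg}
    while weak:                                   # iteratively peel weak vertices
        v = weak.pop()
        for u in g.pop(v):
            if u in g:
                g[u].discard(v)
                if len(g[u]) < min_deg:
                    weak.add(u)
    comps, todo = [], set(g)                      # split into connected components
    while todo:
        comp, stack = set(), [todo.pop()]
        while stack:
            v = stack.pop()
            comp.add(v)
            stack.extend(g[v] - comp)
        todo -= comp
        comps.append(comp)
    return comps

def clusters_in_subspace(O, adj, l, k, eps, S, min_pts):
    """Fixpoint iteration of Section 4.1: all clusters (O', S) within O."""
    result, candidates = [], [set(O)]
    while candidates:
        Ox = candidates.pop()
        cs = cores(enriched_subgraph(Ox, adj, l, k, eps, S), min_pts - 1)
        if len(cs) == 1 and cs[0] == Ox:
            result.append((Ox, frozenset(S)))     # single core covering Ox
        else:
            candidates.extend(cs)                 # refine strictly smaller sets
    return result
```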

4.2 Finding Clusters in Different Subspaces


This section describes how we efficiently determine the clusters located in dif-
ferent subspaces. In principle, our algorithm has to analyze each subspace. We
enumerate these subspaces by a depth first traversal through the subspace lat-
tice. To avoid enumerating the same subspace several times we assume an order
d1 , d2 , . . . , dD on the dimensions. We denote the dimension with the highest in-
dex in subspace S by max{S} and extend the subspace only by dimensions that
are ordered behind max{S}. This principle has several advantages:
By using a depth-first search, the subspace S is analyzed before the subspace S′ = S ∪ {d} for d > max{S}. Based on the anti-monotonicity we know that

each cluster in S′ has to be a subset of a cluster in S. Thus, in subspace S′ we do not have to start with the enriched subgraph G_S^V; it is sufficient to start with the vertex sets of the known clusters, i.e. if the clusters of subspace S are given by Clus = {C1, . . . , Cm}, we determine the fixpoint f(f(. . . f({(Oi, S′)}))) for each cluster Ci = (Oi, S) ∈ Clus. This is far more efficient since the vertex sets are smaller. In Fig. 6 the depth-first traversal based on the clusters of the previous subspace is shown in lines 21, 23 and 29. The actual detection of clusters based on the vertices O is realized in lines 13–19, which corresponds to the fixpoint iteration described in the previous section.
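The duplicate-free lattice traversal itself can be sketched as below (our code; dimensions are 0-indexed, and the priority queue and redundancy pruning of Fig. 6 are deliberately omitted), with clusters_in_subspace being the fixpoint sketch from Section 4.1:

```python
def dfs_all_subspaces(V, adj, l, k, eps, min_pts, D):
    """Enumerate every subspace exactly once, restarting the fixpoint only
    from the vertex sets of the parent clusters (anti-monotonicity)."""
    found = []

    def traverse(S, candidates):
        clusters = []
        for O in candidates:
            clusters += clusters_in_subspace(O, adj, l, k, eps, S, min_pts)
        found.extend(clusters)
        for O, _ in clusters:                     # extend only behind max{S}
            for d in range(max(S) + 1, D):
                traverse(S | {d}, [O])

    for d in range(D):
        traverse({d}, [set(V)])
    return found
```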
Using a depth-first search enables us to store a set of parent clusters (detected beforehand in lower-dimensional subspaces) that a new cluster is based on (cf. line 22). Furthermore, given a set of vertices O in the subspace S, we know that by traversing the current subtree only clusters of the kind Creach = (Oreach, Sreach) with Oreach ⊆ O and S ⊆ Sreach ⊆ S ∪ {max{S} + 1, . . . , D} can be detected. This information together with the redundancy model allows a further speed-up of the algorithm. The overall aim is to stop the traversal of a subtree if each of the reachable (potential) clusters is redundant to one parent cluster, i.e. if there exists C ∈ Parents such that Creach ≺red C for each Creach. Traversing such a subtree is not worthwhile since the contained clusters would probably be excluded from the result later on due to their redundancy.
Redundancy of a subtree occurs if the three properties introduced in Def. 6 hold. The second property (object overlap) is always fulfilled since each Oreach is a subset of any cluster from Parents (cf. anti-monotonicity). The maximal possible quality of the clusters Creach can be estimated by Q_max = |O| · |S ∪ {max{S} + 1, . . . , D}|. By focusing on the clusters Cp = (Op, Sp) ∈ Parents with Q_max < Q(Cp), we ensure the first redundancy property. The third property (dimension overlap) is ensured if |Sp| ≥ |S ∪ {max{S} + 1, . . . , D}| · r_dim holds. In this case we get for each Creach: |Sp| ≥ |Sreach| · r_dim ⇔ |Sp ∩ Sreach| ≥ |Sreach| · r_dim ⇔ |Sp ∩ Sreach| / |Sreach| ≥ r_dim. Those parent clusters fulfilling all three properties are stored within Parents_red (line 24).
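The test of lines 24–25 then reduces to a few comparisons; a sketch under the same 0-indexed conventions as above, where quality is the function from Def. 5, S the current subspace, O the cluster's vertex set, and D the total number of dimensions:

```python
def subtree_redundant_parents(O, S, parents, r_dim, D):
    """Parents to which every cluster reachable in the subtree is redundant.

    Q_max bounds the quality of any reachable cluster; the object-overlap
    condition always holds since reachable vertex sets are subsets of O.
    """
    s_max = len(S) + (D - max(S) - 1)             # |S ∪ {max{S}+1, ..., D}|
    q_max = len(O) * s_max                        # Q_max
    return [(Op, Sp) for (Op, Sp) in parents
            if q_max < quality((Op, Sp)) and len(Sp) >= s_max * r_dim]
```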
If Parents_red is empty, we have to traverse the subtree (else case, line 28). If it is not empty (line 25), the current subtree is redundant to at least one parent cluster. We currently stop traversing this subtree. However, we must not directly prune the subtree ST: if the clusters from Parents_red themselves are not included in the result, clusters from the subtree would become interesting again. Thus, we do not finally reject the subtree ST but store the required information and add the subtree to a priority queue.
Processing this queue is the core of the DB-CSC algorithm. The priority queue contains clusters (lines 6, 20) and non-traversed subtrees (lines 9, 27). We successively take the object with the highest (estimated) quality from the queue. If it is a cluster that is non-redundant to the current result, we add it to the result (lines 7–8). If it is a subtree, we check whether some cluster from Parents_red is already included in the result: if so, we finally reject this subtree (line 10); otherwise, we have to restart traversing this subtree (line 11).

method: main()
1  Result = ∅  // current result set
2  queue = ∅  // priority queue with clusters and subtrees, descendingly sorted by quality
3  for d ∈ Dim do DFS traversal({d}, V, ∅)
4  while queue ≠ ∅ do
5     remove first (highest-quality) object Obj from queue
6     if Obj is cluster then  // check redundancy
7        for C ∈ Result do if (Obj ≺red C) goto line 4  // discard redundant cluster
8        Result = Result ∪ {Obj}  // cluster is non-redundant
9     else  // Obj is subtree ST = (S, O, Q_max, Parents, Parents_red)
10       if Parents_red ∩ Result ≠ ∅ then goto line 4  // discard whole subtree
11       else DFS traversal(S, O, Parents)  // subtree is non-redundant, restart traversal
12 return Result

method: DFS traversal(subspace S, candidate vertices O, parent clusters Parents)
13 foundClusters = ∅, prelimClusters = {O}
14 while prelimClusters ≠ ∅ do
15    remove first candidate Ox from prelimClusters
16    generate enriched subgraph G_S^{Ox}
17    determine (minPts − 1)-cores → Cores = {O′1, . . . , O′m}
18    if |Cores| = 1 ∧ O′1 = Ox then foundClusters = foundClusters ∪ {(O′1, S)}
19    else prelimClusters = prelimClusters ∪ Cores
20 add foundClusters to queue
21 for Ci = (Oi, S) ∈ foundClusters do
22    Parents_i = Parents ∪ {Ci}
23    for d ∈ {max{S} + 1, . . . , D} do
24       determine Parents_red ⊆ Parents_i to which the whole subtree is redundant
25       if Parents_red ≠ ∅ then
26          calc. subtree information ST = (S ∪ {d}, Oi, Q_max, Parents_i, Parents_red)
27          add ST to queue  // (currently) do not traverse subtree!
28       else
29          DFS traversal(S ∪ {d}, Oi, Parents_i)  // check only subsets of Oi

Fig. 6. DB-CSC algorithm

Overall, our algorithm efficiently determines the optimal clustering solution because only small vertex sets are analyzed for clusters and whole subtrees (i.e. sets of clusters) are pruned using the redundancy model.

5 Experimental Evaluation
Setup. We compare DB-CSC to GAMer [9] and CoPaM [16], two approaches
that combine subspace clustering and dense subgraph mining. In our experiments
we use real world data sets as well as synthetic data. By default the synthetic
datasets have 20 attribute dimensions and contain 80 combined clusters each
with 15 nodes and 5 relevant dimensions. Additionally we add random nodes
and edges to represent noise in the data. The clustering quality is measured
by the F1 measure [9], which compares the detected clusters to the “hidden”
clusters. The efficiency is measured by the algorithms’ runtime.
Varying characteristics of the data. In the first experiment (Fig. 7(a)) we
vary the database size of our synthetic datasets by varying the number of gener-
ated combined clusters. The runtime of all algorithms increases with increasing
database size (please note the logarithmic scale on both axes). For the datasets

(a) Varying database size   (b) Varying dimensionality   (c) Varying cluster size

Fig. 7. Quality (top row) and runtime (bottom row) w.r.t. varying data characteristics

with more than 7000 vertices, CoPaM is not applicable any more due to heap
overflows (4GB). While the runtimes of the different algorithms are very similar,
in terms of clustering quality DB-CSC obtains significantly better results than
the other approaches. The competing approaches tend to output only subsets of
the hidden clusters due to their restrictive cluster models. In the next experi-
ment (Fig. 7(b)) we vary the dimensionality of the hidden clusters. The runtime
of all algorithms increases for higher dimensional clusters. The clustering quali-
ties of DB-CSC and CoPaM slightly decrease. This can be explained by the fact
that for high dimensional clusters it is likely that additional clusters occur in
subsets of the dimensions. However, DB-CSC still has the best clustering quality
and runtime in this experiment. In Fig. 7(c) the cluster size (i.e. the number of
vertices per cluster) is varied. The runtimes of DB-CSC and GAMer are very
similar to each other, whereas the runtime of CoPaM increases dramatically
until it is not applicable any more. The clustering quality of DB-CSC remains
relatively stable, while the qualities of the other approaches decrease constantly.
For increasing cluster sizes the expansion of the clusters in the graph as well as
in the attribute space increases, thus the restrictive cluster models of GAMer
and CoPaM can only detect subsets of them.
Robustness. In Fig. 8(a) we analyze the robustness of the methods w.r.t. the number of “noise” vertices in the datasets. The clustering quality of all approaches decreases for noisy data; however, the quality of DB-CSC is still reasonably high even for 1000 noise vertices (which is nearly 50% of the overall dataset). In the next experiment (Fig. 8(b)) we vary the clustering parameter ε. For GAMer and CoPaM we vary the allowed width of a cluster in the attribute space instead of ε. As shown in the figure, by choosing ε too small we cannot find all clusters and thus get smaller clustering qualities. However, for ε > 0.05 the clustering quality of DB-CSC remains stable. The competing methods have lower quality. In the last experiment (Fig. 8(c)) we evaluate the robustness of

(a) Quality vs. noise   (b) Quality vs. ε   (c) Quality vs. minPts

Fig. 8. Robustness of the methods w.r.t. noise and parameter values

DB-CSC w.r.t. the parameter minPts. For too small values of minPts, many vertex sets are falsely detected as clusters, and thus we obtain small clustering qualities. However, for sufficiently high minPts values the quality remains relatively stable, similar to the previous experiment.
Overall, the experiments show that DB-CSC obtains significantly higher clus-
tering qualities. Even though it uses a more sophisticated cluster model than
GAMer and CoPaM, the runtimes of DB-CSC are comparable to (and in some
cases even better than) those of the other approaches.
Real world data. As real-world data sets we use gene data¹ and patent data² as also used in [9]. Since for real-world data there are no “hidden” clusters given against which we could compare our clustering results, we compare the properties of the clusters found by the different methods. For the gene data, DB-CSC detects 9 clusters with a mean size of 6.3 and a mean dimensionality of 13.2. In contrast, GAMer detects 30 clusters (mean size: 8.8 vertices, mean dim.: 15.5) and CoPaM 115581 clusters (mean size: 9.7 vertices, mean dim.: 12.2), which are far too many to be interpretable. In the patent data, DB-CSC detects 17 clusters with a mean size of 19.2 vertices and a mean dimensionality of 3. In contrast, GAMer detects 574 clusters with a mean size of 11.7 vertices and a mean dimensionality of 3. CoPaM did not finish on this dataset within two days. The clusters detected by DB-CSC are more expanded than the clusters of GAMer, which often are simply subsets of the clusters detected by DB-CSC.

6 Conclusion

We introduce the combined clustering model, which simultaneously considers graph data and attribute data in subspaces. Our novel model is the first approach
that exploits the advantages of density-based clustering in both domains. Based
on the novel notion of local densities, our clusters correspond to dense regions in
the graph as well as in the attribute space. To avoid redundancy in the result,
our model selects only the most interesting clusters for the final clustering. We
develop the algorithm DB-CSC to efficiently determine the combined clustering
¹ http://thebiogrid.org/ and http://genomebiology.com/2005/6/3/R22
² http://www.nber.org/patents/

solution. The clustering quality and the efficiency of DB-CSC are demonstrated
in the experimental section.

Acknowledgment. This work has been supported by the UMIC Research Cen-
tre, RWTH Aachen University, Germany, and the B-IT Research School.

References
1. Aggarwal, C., Wang, H.: Managing and Mining Graph Data. Springer, New York
(2010)
2. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: SIGMOD, pp.
94–105 (1998)
3. Assent, I., Krieger, R., Müller, E., Seidl, T.: EDSC: Efficient density-based subspace
clustering. In: CIKM, pp. 1093–1102 (2008)
4. Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is “nearest neighbor” meaningful? In: ICDT, pp. 217–235 (1999)
5. Dorogovtsev, S., Goltsev, A., Mendes, J.: K-core organization of complex networks.
Physical Review Letters 96(4), 40601 (2006)
6. Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale
social networks. In: WebKDD/SNA-KDD, pp. 16–25 (2007)
7. Ester, M., Ge, R., Gao, B.J., Hu, Z., Ben-Moshe, B.: Joint cluster analysis of
attribute data and relationship data: the connected k-center problem. In: SDM
(2006)
8. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov-
ering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
9. Günnemann, S., Färber, I., Boden, B., Seidl, T.: Subspace clustering meets dense
subgraph mining: A synthesis of two paradigms. In: ICDM, pp. 845–850 (2010)
10. Hanisch, D., Zien, A., Zimmer, R., Lengauer, T.: Co-clustering of biological net-
works and gene expression data. Bioinformatics 18, 145–154 (2002)
11. Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia
databases with noise. In: KDD, pp. 58–65 (1998)
12. Janson, S., Luczak, M.: A simple solution to the k-core problem. Random Struc-
tures & Algorithms 30(1-2), 50–62 (2007)
13. Kailing, K., Kriegel, H.P., Kroeger, P.: Density-connected subspace clustering for
high-dimensional data. In: SDM, pp. 246–257 (2004)
14. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A sur-
vey on subspace clustering, pattern-based clustering, and correlation clustering.
TKDD 3(1), 1–58 (2009)
15. Kubica, J., Moore, A.W., Schneider, J.G.: Tractable group detection on large link
data sets. In: ICDM, pp. 573–576 (2003)
16. Moser, F., Colak, R., Rafiey, A., Ester, M.: Mining cohesive patterns from graphs
with feature vectors. In: SDM, pp. 593–604 (2009)
17. Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: KDD, pp.
228–238 (2005)
18. Ulitsky, I., Shamir, R.: Identification of functional modules using network topology
and high-throughput data. BMC Systems Biology 1(1) (2007)
19. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute
similarities. PVLDB 2(1), 718–729 (2009)
20. Zhou, Y., Cheng, H., Yu, J.X.: Clustering large attributed graphs: An efficient
incremental approach. In: ICDM, pp. 689–698 (2010)
Learning the Parameters of Probabilistic Logic
Programs from Interpretations

Bernd Gutmann, Ingo Thon, and Luc De Raedt

Department of Computer Science, Katholieke Universiteit Leuven


Celestijnenlaan 200A, POBox 2402, 3001 Heverlee, Belgium
firstname.lastname@cs.kuleuven.be

Abstract. ProbLog is a recently introduced probabilistic extension of the logic programming language Prolog, in which facts can be annotated
with the probability that they hold. The advantage of this probabilistic
language is that it naturally expresses a generative process over inter-
pretations using a declarative model. Interpretations are relational de-
scriptions or possible worlds. This paper introduces a novel parameter
estimation algorithm LFI-ProbLog for learning ProbLog programs from
partial interpretations. The algorithm is essentially a Soft-EM algorithm.
It constructs a propositional logic formula for each interpretation that
is used to estimate the marginals of the probabilistic parameters. The
LFI-ProbLog algorithm has been experimentally evaluated on a number of data sets; the results justify the approach and show its effectiveness.

1 Introduction
Statistical relational learning [12] and probabilistic logic learning [5,7] have con-
tributed various representations and learning schemes. Popular approaches in-
clude BLPs [15], ICL [18], Markov Logic [19], PRISM [22], PRMs [11], and
ProbLog [6,13]. These approaches differ not only in the underlying representa-
tions but also in the learning settings they employ.
For learning knowledge-based model construction approaches (KBMC), such
as Markov Logic, PRMs, and BLPs, one normally uses relational state descrip-
tions as training examples. This setting is also known as learning from inter-
pretations. For training probabilistic programming languages one typically uses
learning from entailment [7,8]. PRISM and ProbLog, for instance, are probabilis-
tic logic programming languages that are based on Sato’s distribution seman-
tics [21]. They use training examples in the form of labeled facts, where the labels
are either the truth values of these facts or target probabilities.
In the learning from entailment setting, one usually starts from observations
for a single target predicate. In the learning from interpretations setting, how-
ever, the observations specify the value for some of the random variables in a
state-description. Probabilistic grammars and graphical models are illustrative
examples for each setting. Probabilistic grammars are trained on examples in the
form of sentences. Each training example states that a particular sentence was
derived or not, but it does not explain how it was derived. In contrast, Bayesian


networks are typically trained on partial or complete state descriptions, which specify the value for some random variables in the network. This also implies
that training examples for Bayesian networks can contain much more informa-
tion. These differences in learning settings also explain why the KBMC and PLP
approaches have been applied on different kinds of data sets and applications.
Entity resolution and link prediction are examples of domains where KBMC has
been successfully applied. This paper aims at bridging the gap between these two
types of approaches to learning. We study how the parameters of ProbLog pro-
grams can be learned from partial interpretations. The key contribution of the
paper is a novel algorithm, called LFI-ProbLog, that is used for learning ProbLog
programs from partial interpretations. We thoroughly evaluated the algorithm
on various standard benchmark problems. LFI-ProbLog is freely available as part
of the ProbLog system at http://dtai.cs.kuleuven.be/problog/ and within
YAP Prolog.
The paper is organized as follows: In Section 2, we review logic programming
concepts as well as the probabilistic programming language ProbLog. Section 3
formalizes the problem of learning the parameters of ProbLog programs from
interpretations. Section 4 introduces LFI-ProbLog. We report on experimental
results in Section 5. Before concluding, we discuss related work in Section 6.

2 Probabilistic Logic Programming Concepts

We start by reviewing the main concepts underlying ProbLog.


An atom is an expression of the form q(t1, . . . , tk) where q is a predicate of arity k and the ti are terms. A term is a variable, a constant, or a functor applied to terms. Definite clauses are universally quantified expressions of the form h :- b1, . . . , bn where h and the bi are atoms. A fact is a clause without a body. A substitution θ is an expression of the form {V1/t1, . . . , Vm/tm} where the Vi are different variables and the ti are terms. Applying a substitution θ to an expression e yields the instantiated expression eθ, where all variables Vi in e are simultaneously replaced by their corresponding terms ti in θ. An expression is called ground if it does not contain variables. The semantics of a set of definite clauses is given by its least Herbrand model, the set of all ground facts entailed by the theory. A set of definite clauses is called tight if its dependency graph is acyclic. An atom h depends on an atom b if b occurs in the body of a clause with head h.
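A tiny sketch of these notions (our encoding, not part of ProbLog): terms as nested tuples, substitutions as dictionaries, and substitution application as simultaneous variable replacement.

```python
# Variables are strings starting with an uppercase letter; constants are
# lowercase strings; compound terms are tuples (functor, arg1, ..., argn).
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def apply_subst(term, theta):
    """Apply substitution theta (a dict {var: term}) simultaneously."""
    if is_var(term):
        return theta.get(term, term)
    if isinstance(term, tuple):                   # compound term: map over args
        return (term[0],) + tuple(apply_subst(a, theta) for a in term[1:])
    return term                                   # constant

# calls(X){X/john} = calls(john)
assert apply_subst(("calls", "X"), {"X": "john"}) == ("calls", "john")
```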
A ProbLog theory (or program) T consists of a set of labeled facts F and
a set of definite clauses BK that express the background knowledge. As the
semantics of ProbLog is based on the distribution semantics, we require that
every atom fulfills the finite support condition. This means that the SLD-tree
for each ground atom is finite. The facts pn :: fn in F are annotated with the
probability pn that fn θ is true for all substitutions θ grounding fn . The re-
sulting facts fn θ are called atomic choices [18] and represent the elementary
random events; they are assumed to be mutually independent. Each non-ground
probabilistic fact represents a kind of template for random variables. Given a

finite number of possible substitutions {θn,1, . . . , θn,Kn} for each probabilistic fact pn :: fn (throughout the paper, we shall assume that F is finite; see [21] for the infinite case), a ProbLog program T = {p1 :: f1, · · · , pN :: fN} ∪ BK defines a probability distribution over total choices L (the random events), where L ⊆ LT = {f1θ1,1, . . . , f1θ1,K1, . . . , fNθN,1, . . . , fNθN,KN}:

P(L|T) = ∏_{fnθn,k ∈ L} pn · ∏_{fnθn,k ∈ LT\L} (1 − pn)

The following ProbLog theory states that there is a burglary with probability 0.1 and an earthquake with probability 0.2, and that if either of them occurs, the alarm will go off. If the alarm goes off, a person X will be notified and will therefore call with the probability of al(X), that is, 0.7.

F = {0.1 :: burglary, 0.2 :: earthquake, 0.7 :: al(X)}
BK = {person(mary)., person(john)., alarm :- burglary; earthquake., calls(X) :- person(X), alarm, al(X).}

The set of atomic choices in this program is {al(mary), al(john), burglary, earthquake}, and each total choice is a subset of this set. Each total choice L combined with the background knowledge BK defines a Prolog program. Consequently, the probability distribution at the level of atomic choices also induces a probability distribution over possible definite clause programs of the form L ∪ BK. Furthermore, each such program has a unique least Herbrand interpretation, that is, the set of all ground facts entailed by the program, which represents a possible world; e.g., the total choice {burglary} yields the interpretation {burglary, alarm, person(john), person(mary)}. Hence, the probability distribution at the level of total choices also induces a probability distribution at the level of possible worlds. The probability Pw(I) of this interpretation is 0.1 × (1 − 0.2) × (1 − 0.7)². We define the success probability of a query q as

Ps(q|T) = Σ_{L⊆LT : L∪BK|=q} P(L|T) = Σ_{L⊆LT} δ(q, BK ∪ L) · P(L|T)    (1)

where δ(q, BK ∪ L) = 1 if there exists a θ such that BK ∪ L |= qθ, and 0 other-


wise. It can be shown that the success probability corresponds to the probability
that the query is true in a randomly selected possible world (according to Pw ).
The success probability Ps (calls(john)|T ) is 0.196. Observe that ProbLog pro-
grams do not represent a generative model at the level of the individual facts
or predicates. Indeed, it is not the case that the sum of the probabilities of the
facts for a given predicate (here calls/1) must equal 1:

$$P_s(calls(X) \mid T) \;\neq\; P_s(calls(john) \mid T) + P_s(calls(mary) \mid T) \;\neq\; 1.$$

So, the predicates do not encode a probability distribution over their instances.
This differs from probabilistic grammars and their extensions such as stochastic
logic programs [4], where each predicate or non-terminal defines a probability dis-
tribution over its instances, which enables these approaches to sample instances
¹ Throughout the paper, we shall assume that F is finite; see [21] for the infinite case.

from a specific target predicate. Such approaches realize a generative process at


the level of individual predicates. Samples taken from a single predicate can then
be used as examples for learning the probability distribution governing the pred-
icate. In the literature this setting is known as learning from entailment. Sato
and Kameya’s well-known learning algorithm for PRISM [22] also assumes that
there is a generative process at the level of a single predicate and it is therefore
not applicable to learning from interpretations.
While the ProbLog semantics does not encode a generative process at the
level of individual predicates, it does encode one at the level of interpretations.
This process has been described above; it basically follows from the fact that
each total choice generates a unique possible world through its least Herbrand
interpretation. Therefore, it is much more natural to learn from interpretations
in ProbLog; this is akin to typical KBMC approaches.
A partial interpretation I specifies the truth value for some (but not all) atoms.
We represent partial interpretations as $I = (I^+, I^-)$, where $I^+$ contains
all true atoms and $I^-$ all false atoms. The probability of a partial interpretation
is the sum of the probabilities of all possible worlds consistent with the known
atoms. This is the success probability of the query $(\bigwedge_{a_j \in I^+} a_j) \wedge (\bigwedge_{a_j \in I^-} \neg a_j)$.
The probability of the following partial interpretation in the Alarm domain

I⁺ = {person(mary), person(john), burglary, alarm, al(john), calls(john)}
I⁻ = {calls(mary), al(mary)}

is $P_w((I^+, I^-)) = 0.1 \times 0.7 \times (1 - 0.7) \times (0.2 + (1 - 0.2))$.
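As a sanity check, this probability can be reproduced by brute-force enumeration of the total choices. The following Python sketch hard-codes the alarm program's clause logic by hand (all identifiers are ours and purely illustrative; this is not how ProbLog computes probabilities internally):

from itertools import product

facts = {"burglary": 0.1, "earthquake": 0.2, "al(john)": 0.7, "al(mary)": 0.7}

def consistent(choice):
    # Least Herbrand model of the total choice plus BK, computed by hand.
    alarm = choice["burglary"] or choice["earthquake"]
    calls_john = alarm and choice["al(john)"]
    calls_mary = alarm and choice["al(mary)"]
    # I+ requires burglary, alarm, al(john), calls(john);
    # I- requires calls(mary) and al(mary) to be false.
    return (choice["burglary"] and alarm and choice["al(john)"]
            and calls_john and not calls_mary and not choice["al(mary)"])

total = 0.0
for values in product([True, False], repeat=len(facts)):
    choice = dict(zip(facts, values))
    if consistent(choice):
        weight = 1.0
        for fact, is_true in choice.items():
            weight *= facts[fact] if is_true else 1 - facts[fact]
        total += weight

print(total)  # 0.1 * 0.7 * (1 - 0.7) * (0.2 + 0.8) = 0.021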

3 Learning from Interpretations


Learning from (possibly partial) interpretations is a common setting in statis-
tical relational learning that has not yet been studied in its full generality for
probabilistic programming languages. In a generative setting, one is typically
interested in the maximum likelihood parameters given the training data. This
can be formalized as follows.
Definition 1 (Max-Likelihood Parameter Estimation). Given a ProbLog
program T(p) containing the probabilistic facts F with unknown parameters
$p = \langle p_1, \ldots, p_N \rangle$ and background knowledge BK, and a set of (possibly partial)
interpretations $D = \{I_1, \ldots, I_M\}$ (the training examples), find maximum like-
lihood probabilities $\widehat{p} = \langle \widehat{p}_1, \ldots, \widehat{p}_N \rangle$ such that

$$\widehat{p} = \arg\max_{p} P(D \mid T(p)) = \arg\max_{p} \prod_{m=1}^{M} P_w(I_m \mid T(p))$$

Thus, we are given a ProbLog program and a set of partial interpretations, and
the goal is to find the maximum likelihood parameters. One has to consider
two cases when computing $\widehat{p}$. For complete interpretations where everything is
observable, one can obtain $\widehat{p}$ by counting (cf. Sect. 3.1). In the more complex
case of partial interpretations, one has to use an approach that is capable of
handling partial observability (cf. Sect. 3.2).

3.1 Full Observability


It is clear that in the fully-observable case the maximum likelihood estimators
$\hat{p}_n$ for the probabilistic facts $p_n :: f_n$ can be obtained by counting the number
of true ground instances in every interpretation, that is,

$$\hat{p}_n = \frac{1}{Z_n} \sum_{m=1}^{M} \sum_{k=1}^{K_n^m} \delta_{n,k}^m \quad \text{where} \quad \delta_{n,k}^m := \begin{cases} 1 & \text{if } f_n\theta_{n,k}^m \in I_m \\ 0 & \text{else} \end{cases} \qquad (2)$$

and $\theta_{n,k}^m$ is the k-th possible ground substitution for the fact $f_n$ in the interpre-
tation $I_m$ and $K_n^m$ is the number of such substitutions. The sum is normalized
by $Z_n = \sum_{m=1}^{M} K_n^m$, the total number of ground instances of the fact $f_n$ in all
training examples. If $Z_n$ is zero, i.e., no ground instance of $f_n$ is used, $\hat{p}_n$ is
undefined and one must not update $p_n$.
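In code, this estimator is a simple counting loop. The sketch below is our own illustrative Python (not part of the ProbLog implementation); it assumes the ground instances of each probabilistic fact have already been enumerated per interpretation:

def ml_estimate(groundings, interpretations, p_current):
    # groundings[n][m]: ground instances of fact n in interpretation m;
    # interpretations[m]: set of true ground atoms; p_current: current p_n.
    p_hat = dict(p_current)
    for n in groundings:
        true_count, z_n = 0, 0
        for m, atoms in enumerate(interpretations):
            for g in groundings[n][m]:
                z_n += 1                  # contributes to the normalizer Z_n
                if g in atoms:            # delta^m_{n,k} = 1
                    true_count += 1
        if z_n > 0:                       # Z_n = 0: p_n must not be updated
            p_hat[n] = true_count / z_n
    return p_hat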
Before moving on to the partially observable case, let us consider the issue of de-
termining the possible substitutions $\theta_{n,k}^m$ for a fact $p_n :: f_n$ and an interpretation
$I_m$. To resolve this, we assume that the facts $f_n$ are typed and that each inter-
pretation $I_m$ contains an explicit definition of the different types in the form of
fully-observable unary predicates. In the alarm example, the predicate person/1
can be regarded as the type of the (first) argument of al(X) and calls(X). This
predicate can differ between interpretations. One interpretation, for example, can
contain john and mary as persons, another one ann, bob and eve.

3.2 Partial Observability


In many applications the training examples are only partially observed. In the alarm
example, we may receive a phone call but we may not know whether an earth-
quake has in fact occurred. In the partially observable case – similar to Bayesian
networks – a closed-form solution of the maximum likelihood parameters is in-
feasible. Instead, one has to replace in (2) the term $\delta_{n,k}^m$ by $E_T[\delta_{n,k}^m \mid I_m]$, i.e., the
conditional expectation given the interpretation under the current model T,

$$\hat{p}_n = \frac{1}{Z_n} \sum_{m=1}^{M} \sum_{k=1}^{K_n^m} E_T[\delta_{n,k}^m \mid I_m]. \qquad (3)$$

As in the fully observable case, the domains are assumed to be given. Before
describing the Soft-EM algorithm for finding $\hat{p}_n$, we illustrate one of its cru-
cial properties using the alarm example. Assume that our partial interpretation
is I⁺ = {person(mary), person(john), alarm} and I⁻ = ∅. It is clear that
for calculating the marginal probability of all probabilistic facts – these are
the expected counts – only the atoms in {burglary, earthquake, al(john),
al(mary)} ∪ I are relevant. This is due to the fact that the remaining atoms
{calls(john), calls(mary)} cannot be used in any proof for the facts observed
in the interpretations. We call the atoms that are relevant for the distribution
of the ground atom x the dependency set of x. It is defined as $dep_T(x) :=
\{f \text{ ground fact} \mid \text{a ground SLD-proof in } T \text{ for } x \text{ contains } f\}$. Our goal is to re-
strict the probability calculation to the dependent atoms only. Hence we general-
ize this set to partial interpretations I as follows: $dep_T(I) := \bigcup_{x \in (I^+ \cup I^-)} dep_T(x)$,
and we introduce the notion of a restricted ProbLog theory.

Definition 2. Let T = F ∪ BK be a ProbLog theory and I = (I + , I − ) a partial


interpretation. Then we define T r (I) = F r (I) ∪ BKr (I), the interpretation-
restricted ProbLog theory, as follows. F r (I) = LT ∩ depT (I) and BKr (I) is
obtained by computing all ground instances of clauses in BK in which all atoms
appear in depT (I).

For the partial interpretation I = ({burglary, alarm}, ∅), for instance, BKr (I)
is {alarm :- burglary, alarm :- earthquake} and the restricted set of facts
F r (I) is {0.1 :: burglary, 0.2 :: earthquake}.
The restricted theory $T^r(I)$ cannot be larger than T. More importantly, it is
always finite, since we assume the finite support property and the evidence being a
finite conjunction of ground atoms. In many cases it will be much smaller, which
allows for learning in domains where the original theory does not fit in memory.
It can be shown, using the independence of probabilistic facts in ProbLog, that
the conditional probability of a ground instance of $f_n$ given I calculated in the
theory T is equivalent to the probability calculated in $T^r(I)$, that is,

$$E_T[\delta_{n,k}^m \mid I_m] = \begin{cases} E_{T^r(I_m)}[\delta_{n,k}^m \mid I_m] & \text{if } f_n\theta_{n,k}^m \in dep_T(I_m) \\ p_n & \text{otherwise} \end{cases} \qquad (4)$$

We exploit this property in the following section when developing the Soft-EM
algorithm for finding the maximum likelihood parameters p  defined in (3).

4 The LFI-ProbLog Algorithm

The algorithm starts by constructing a Binary Decision Diagram (BDD) [2] for
every training example $I_m$ (cf. Sect. 4.1), which is then used to compute the ex-
pected counts $E[\delta_{n,k}^m \mid I_m]$ (cf. Sect. 4.3). A BDD is a compact graphical represen-
tation of a Boolean formula. In our case, the Boolean formula (or, equivalently,
the BDD) represents the conditions under which the partial interpretation will
be generated by the ProbLog program and the variables in the formula are the
ground atoms in depT (Im ). Basically, any truth assignment to these facts that
satisfies the Boolean formula (or the BDD) will result in the partial interpreta-
tion. Given a fixed variable order, a Boolean function f can be represented as
a full Boolean decision tree where each node N on the ith level is labeled with
the ith variable and has two children called low l(N ) and high h(N ). Each path
from the root to a leaf represents a complete variable assignment. If variable
x is assigned 0 (1), the branch to the low (high) child is taken. Each leaf is
labeled with the value of f given the variable assignment represented by the cor-
responding path from the root. We use 1 to denote the true terminal and 0 to denote the false terminal.
Starting from such a tree, one obtains a BDD by merging isomorphic subgraphs
and deleting redundant nodes until no further reduction is possible. A node is
redundant if and only if the subgraphs rooted at its children are isomorphic. In
Fig. 1, dashed edges indicate 0’s and lead to low children, solid ones indicate 1’s
and lead to high children.

[Figure 1 here; its panels show: (1) the computed dependency sets dep_T(alarm) = {alarm, earthquake, burglary} and dep_T(calls(john)); (2) the restricted theory; (3) Clark's completion; (4) the propagated evidence (burglary ∨ earthquake) ∧ ¬al(john); and (5) the BDD, built and evaluated, with each node annotated with its upward (α) and downward (β) probabilities.]

Fig. 1. The different steps of the LFI-ProbLog algorithm for the training example I + =
{alarm}, I − = {calls(john)}. Normally the alarm node in the BDD is propagated
away in Step 4, but it is kept here for illustrative purposes. The nodes are labeled with
their probability and the up- and downward probabilities.

4.1 Computing the BDD for an Interpretation

The LFI-ProbLog algorithm generates the BDD that encodes a partial interpre-
tation I. Due to the usage of Clark’s completion in Step 3 the algorithm requires
a tight ProbLog program as input. Clark’s completion allows one to propagate
values from the head to the bodies of clauses and vice versa. It states that the
head is true if and only if at least one of its bodies is true, which captures the
least Herbrand model semantics of tight definite clause programs. The algorithm
works as follows (cf. Fig. 1):

1. Compute depT (I) . This is the set of ground atoms that may have an influence
on the truth value of the atoms with known truth value in the partial in-
terpretation I. This is realized by applying the definition of depT (I) directly
using a tabled meta-interpreter in Prolog. We use tabling to store subgoals
and avoid recomputation.
2. Use depT (I) to compute BKr (I), the background theory BK restricted to the
interpretation I (cf. Definition 2 and (4)).
3. Compute clark(BK^r(I)), which denotes Clark's completion of BK^r(I); it
is computed by replacing all clauses with the same head, h :- body_1, ...,
h :- body_n, by the corresponding formula h ↔ body_1 ∨ ... ∨ body_n (a code
sketch of this step is given after the list).
4. Simplify clark(BKr (I)) by propagating known values for the atoms in I.
This step eliminates ground atoms with known truth value in I. That is,
we simply fill out their value in the theory clark(BKr (I)), and then we
588 B. Gutmann, I. Thon, and L. De Raedt

propagate these values until no further propagation is possible. This is akin


to the first steps of the Davis-Putnam algorithm.
5. Construct the BDD_I, which compactly represents the Boolean formula con-
sisting of the resulting set of clauses. This BDD_I is used by Algorithm 1,
outlined in Section 4.3, to compute the expected counts.
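The code sketch announced in Step 3 follows. It is illustrative Python with ground clauses given as (head, body) pairs, a representation of our own choosing:

from collections import defaultdict

def clarks_completion(ground_clauses):
    # Group the ground clauses by head; each entry (head, [b_1, ..., b_n])
    # stands for head <-> b_1 v ... v b_n, each b_i being the conjunction
    # of its atoms.
    bodies = defaultdict(list)
    for head, body in ground_clauses:
        bodies[head].append(body)
    return list(bodies.items())

completion = clarks_completion([("alarm", ["burglary"]),
                                ("alarm", ["earthquake"])])
# -> [("alarm", [["burglary"], ["earthquake"]])],
#    i.e. alarm <-> burglary v earthquake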
In Step 4 of the algorithm, atoms fn with known truth values vn are removed
from the formula and in turn from the BDD. This has to be taken into account
both when calculating the probability of the interpretation and the expected
counts of these variables. The probability of the partial interpretation I given
the ProbLog program T (p) can be calculated as:

$$P_w(I \mid T(p)) = P(BDD_I) \cdot \prod_{f_n \text{ known in } I} P(f_n = v_n) \qquad (5)$$

where $v_n$ is the value of $f_n$ in I and $P(BDD_I)$ is the probability of the BDD as


defined in the following subsection. The probability calculation is implemented
using (5). For ease of notation, however, we shall act as if BDDI included the
variables corresponding to random facts with known truth value in I.
In addition, for computing the expected counts, we also need to consider the
nodes and atoms that have been removed from the Boolean formula when
the BDD has been computed in a compressed form. See, for example, the
probabilistic fact burglary in Fig. 1 (5): it only occurs on the left path to the
1-terminal, but it is also true with probability 0.1 on the right path. Therefore,
we treat missing atoms at a particular level as if they were there and simply go
to the next node, independent of whether the missing atom has the value true
or false.

4.2 Automated Theory Splitting


For large ground theories the naively constructed BDDs are too big to fit in
memory. BDD tools use heuristics to find a variable order that minimizes the size
of the BDD. The runtime of this step is exponential in the size of the input, which
is prohibitive for parameter learning. We propose an algorithm that identifies
independent parts of the grounded theory clark(BK^r(I)) (the output of Step 4).
The key observation is that the BDD for the Boolean formula A ∧ B can be
decomposed into two BDDs, one for A and one for B, if
A and B do not share a common variable. Since each variable is contained in
at most one BDD, the expected counts of variables can be computed as the
union of the expected count calculations on both BDDs. In order to use the
automatic theory splitting, one has to replace Step 5 of the BDD construction
(cf. Section 4.1) with the following algorithm. The idea is to identify sets of
independent formulae in a theory by mapping the theory onto a graph as follows
(a code sketch follows the list).

1. Add one node per clause in clark(BKr (I)).


2. Add an edge between two nodes if the corresponding clauses share an atom.
3. Identify the connected components in the resulting graph.
4. Build for each of the connected components one BDD representing the con-
junction of the clauses in the component.
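The code sketch announced above follows. It assumes each ground formula is represented by the set of atoms it mentions, and it uses a small union-find in place of an explicit clause graph, which yields the same connected components:

def split_theory(formulas):
    # formulas: list of sets of atoms; returns lists of formula indices,
    # one list per connected component of the clause graph.
    parent = list(range(len(formulas)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    first_seen = {}                          # atom -> first formula using it
    for i, atoms in enumerate(formulas):
        for atom in atoms:
            if atom in first_seen:           # shared atom: merge components
                parent[find(i)] = find(first_seen[atom])
            else:
                first_seen[atom] = i

    components = {}
    for i in range(len(formulas)):
        components.setdefault(find(i), []).append(i)
    return list(components.values())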

The resulting set of BDDs is used by the algorithm outlined in the next section
to compute the expected counts.

4.3 Calculate Expected Counts


One can calculate the expected counts $E[\delta_{n,k}^m \mid I_m]$ by a dynamic programming
approach on the BDD. The algorithm is akin to the forward/backward algorithm
for HMMs or the inside/outside probability of PCFGs. We use $p_N$ as the proba-
bility that the node N will be left using the branch to the high-child and $1 - p_N$
otherwise. For a node N corresponding to a probabilistic fact $f_i$ this probability
is $p_N = p_i$, and $p_N = 1$ otherwise. We use the indicator function $\pi_N$, which is 1
if N corresponds to a probabilistic fact and 0 if N is deterministic. For every
node N in the BDD we compute:
1. The upward probability α(N ) represents the probability that the logical for-
mula encoded by the sub-BDD rooted at N is true. For instance, in Fig. 1 (5),
the upward probability of the leftmost node for alarm represents the prob-
ability that the formula alarm ∧ burglary is true.
2. The downward probability β(N ) represents the probability of reaching the
current node N on a random walk starting at the root, where at deterministic
nodes both paths are followed in parallel. If all random walkers take the same
decisions at the remaining nodes it is guaranteed that only one reaches the
½-terminal. This is due to the fact that the values of all deterministic nodes
are fixed given the values for all probabilistic facts. For instance, in Fig. 1 (5),
the downward probability of the left alarm node is equal to the probability
of ¬earthquake ∧ ¬al(john), which is (1 − 0.2) · (1 − 0.7).
Due to their definition, combining the α and β values at any level n in the BDD
yields the BDD probability, that is, $P(BDD) = \sum_{N \text{ node at level } n} \alpha(N)\beta(N)$.
Each path from the root to the ½-terminal corresponds to an assignment of
values to the variables that satisfies the Boolean formula underlying the BDD.
The probability that such a path passes through the node N can be computed
as α(N ) · β(N ) · (P (BDD))−1 . The upward and downward probabilities are
computed using the following formulae (cf. Fig. 2):
$$\alpha(\mathbf{0}) = 0 \qquad \alpha(\mathbf{1}) = 1 \qquad \beta(Root) = 1$$
$$\alpha(N) = \alpha(h(N)) \cdot p_N^{\pi_N} + \alpha(l(N)) \cdot (1 - p_N)^{\pi_N}$$
$$\beta(N) = \sum_{N = h(M)} \beta(M) \cdot p_M^{\pi_M} + \sum_{N = l(M)} \beta(M) \cdot (1 - p_M)^{\pi_M}$$

where $\pi_N$ is 0 for deterministic nodes and 1 otherwise. Due
to the definition of α and β, the probability of the BDD is returned both at the
root and at the 1-terminal, that is, $P(BDD) = \alpha(Root) = \beta(\mathbf{1})$. Given these
values, one can compute the expected counts $E[\delta_{n,k}^m \mid I_m]$ as

$$E[\delta_{n,k}^m \mid I_m] = \sum_{N \text{ represents } f_n\theta_{n,k}^m} \beta(N) \cdot p_N \cdot \alpha(h(N)) \cdot (P(BDD))^{-1}.$$

One computes the upward probability α from the leaves to the root and the
downward probability β from the root to the leaves. Intermediate results are stored
and reused when nodes are revisited. Both parts are sketched in Algorithm 1.

[Figure 2 here: diagrams of the propagation step between a node N and its children T1 and T2, for the upward probability α (left) and the downward probability β (right).]

Fig. 2. Propagation step for the upward probability (left) and the downward probability (right). The indicator function $\pi_N$ is 1 if N is a probabilistic node and 0 otherwise.

Algorithm 1. Calculating α and β. The nodes l(n) and h(n) are the low and
high child of the node n, respectively.

function Alpha(BDD node n)
    if n is the 1-terminal then return 1
    if n is the 0-terminal then return 0
    if n is a probabilistic fact then
        return p_n · Alpha(h(n)) + (1 − p_n) · Alpha(l(n))
    return Alpha(h(n)) + Alpha(l(n))

function Beta(BDD node n)
    q := priority queue using the BDD's order
    enqueue(q, n)
    Beta := array of 0's of length size(BDD)
    Beta[root(n)] := 1
    while q not empty do
        n := dequeue(q)
        Beta[h(n)] += Beta[n] · p_n^{π_n}
        Beta[l(n)] += Beta[n] · (1 − p_n)^{π_n}
        enqueue(q, h(n)) if not yet in q
        enqueue(q, l(n)) if not yet in q
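The following Python rendering of Algorithm 1 is a sketch under our own node representation (prob, is_probabilistic, high, low) with string constants for the terminals; it is not the SimpleCUDD interface. The β pass visits nodes in BDD order so a node's value is final before it is propagated to its children, and the last function computes the expected count of one ground fact from the α and β values as in the formula above:

ONE, ZERO = "1-terminal", "0-terminal"

def alpha(node, cache=None):
    # Upward pass; caches results so shared subgraphs are visited once.
    cache = {} if cache is None else cache
    if node is ONE:
        return 1.0
    if node is ZERO:
        return 0.0
    if id(node) in cache:
        return cache[id(node)]
    p, probabilistic, high, low = node
    if probabilistic:
        a = p * alpha(high, cache) + (1 - p) * alpha(low, cache)
    else:                                    # deterministic node: pi_N = 0
        a = alpha(high, cache) + alpha(low, cache)
    cache[id(node)] = a
    return a

def beta(root, order):
    # Downward pass; 'order' lists the internal nodes by BDD level, root
    # first, so each node's value is complete before being propagated.
    b = {ONE: 0.0, ZERO: 0.0}
    for node in order:
        b[id(node)] = 0.0
    b[id(root)] = 1.0
    for node in order:
        p, probabilistic, high, low = node
        w_hi = p if probabilistic else 1.0
        w_lo = (1 - p) if probabilistic else 1.0
        for child, w in ((high, w_hi), (low, w_lo)):
            key = child if child in (ONE, ZERO) else id(child)
            b[key] += b[id(node)] * w
    return b                                 # b[ONE] equals P(BDD)

def expected_count(fact_nodes, b, p_bdd):
    # Expected count of one ground fact from the nodes that represent it:
    # sum of beta(N) * p_N * alpha(h(N)) / P(BDD).
    return sum(b[id(n)] * n[0] * alpha(n[2]) for n in fact_nodes) / p_bdd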

5 Experiments
We implemented LFI-ProbLog in YAP Prolog and use SimpleCUDD for the
BDD operations. We used two datasets to evaluate LFI-ProbLog. The WebKB
benchmark serves as a test case to compare with state-of-the-art systems. The
Smokers dataset is used to test the algorithm in terms of the learned model,
that is, how close are the parameters to the original ones. The experiments were
run on an Intel Core 2 Quad machine (2.83 GHz) with 8GB RAM.

5.1 WebKB
The goal of this experiment is to answer the following questions:
Q1. Is LFI-ProbLog competitive with existing state-of-the-art frameworks?
Q2. Is LFI-ProbLog insensitive to the initial probabilities?
Q3. Is the theory splitting algorithm capable of handling large data sets?
In this experiment, we used the WebKB [3] dataset. It contains four folds, each
describing the link structure of pages from one of the following universities:
Cornell, Texas, Washington, and Wisconsin. WebKB is a collective classification
task, that is, one wants to predict the class of a page depending on the classes
of the pages that link to it and depending on the words being used in the

[Figure 3 here: the left panel plots the area under the ROC curve against learning time, the right panel the test-set log-likelihood against the EM iteration, for LFI-ProbLog initialized in [0.0001-0.0003], [0.1-0.3], and [0.1-0.9], and for MLNs.]

Fig. 3. Area under the ROC curve against the learning time (left) and test set log
likelihood for each iteration of the EM algorithm (right) for WebKB

text. To allow for an objective comparison with Markov Logic networks and
the results of Domingos and Lowd [9], we used their slightly altered version of
WebKB. In their setting each page is assigned exactly one of the classes “course”,
“faculty”, “other”, “researchproject”, “staff”, or “student”. Furthermore, the
class “person”, present in the original version, has been removed. We use the
following model that contains one non-ground probabilistic fact for each pair of
Class and Word. To account for the link structure, it contains one non-ground
probabilistic fact for each pair of Class1 and Class2.
P :: pfWoCla(Page, Class, Word).
P :: pfLiCla(Page1, Page2, Class1, Class2).
The probabilities P are unknown and have to be learned by LFI-ProbLog. As
there are 6 classes and 771 words, our model has 6×771+6×6 = 4662 parameters.
In order to combine the probabilistic facts and predict the class of a page we
add the following background knowledge.
cl(Pa, C) :- hasWord(Pa, Word), pfWoCla(Pa, C, Word).
cl(Pa, C) :- linksTo(Pa2, Pa), pfLiCla(Pa2, Pa, C2, C), cl(Pa2, C2).
We performed a 4-fold cross validation, that is, we trained the model on three
universities and then tested it on the fourth one. We repeated this for all four
universities and averaged the results. We measured the area under the precision-
recall curve (AUC-PR), the area under the ROC curve (AUC-ROC), the log
likelihood (LLH), and the accuracy after each iteration of the EM algorithm.
Our model does not express that each page has exactly one class. To account
for this, we normalize the probabilities per page. Figure 3 (left) shows the AUC-
ROC plotted against the average training time. The initialization phase, that is
running steps 1-4 of LFI-ProbLog, takes ≈ 330 seconds, and each iteration of the
EM algorithm takes ≈ 62 seconds. We initialized the probabilities of the model
randomly with values sampled from the uniform distribution between 0.1 and
0.9, which is shown as the graph for LFI-ProbLog [0.1-0.9]. After 10 iterations
(≈ 800 s) the AUC-ROC is 0.950 ± 0.002, the AUC-PR is 0.828 ± 0.006, and the
accuracy is 0.769 ± 0.010.
We compared LFI-ProbLog with Alchemy [9] and LeProbLog [13]. Alchemy
is an implementation of Markov Logic networks. We use the model suggested by

Domingos and Lowd [9] that uses the same features as our model, and we train
it according to their setup.² The learning curve for AUC-ROC is shown in Fig-
ure 3 (left). After 943 seconds Alchemy achieves an AUC-ROC of 0.923 ± 0.016,
an AUC-PR of 0.788 ± 0.036, and an accuracy of 0.746 ± 0.032. LeProbLog is a
regression-based parameter learning algorithm for ProbLog. The training data
has to be provided in the form of queries annotated with the target probability.
It is not possible to learn from interpretations. For WebKB, however, one can
map one interpretation to several training examples of the form
P(class(URL, Class)) = P per page, where P is 1 if the class of URL is Class
and 0 otherwise. This is possible due to the existence of a target predicate. We
used the standard settings of LeProbLog and limited the runtime to 24 hours.
Within this limit, the algorithm performed 35 iterations of gradient descent.
The final model obtained an
AUC-PR of 0.419 ± 0.014, an AUC-ROC of 0.738 ± 0.014, and an accuracy of
0.396 ± 0.020. These results affirmatively answer Q1.
We tested how sensitive LFI-ProbLog is to the initial fact probabilities by
repeating the experiment with values sampled uniformly between 0.1 and 0.3
and sampled uniformly between 0.0001 and 0.0003 respectively. As the graphs
in Figure 3 indicate, the convergence is initially slower and the initial LLH values
differ. This is due to the fact that the ground truth probabilities are small, and
if the initial fact probabilities are small too, one obtains a better initial LLH. All
settings converge to the same results, in terms of AUC and LLH. This suggests
that LFI-ProbLog is insensitive to the start values (cf. Q2).
The BDDs for the WebKB dataset are too large to fit in memory and the
automatic variable reordering is unable to construct the BDD in a reasonable
amount of time. We used two different approaches to resolve this. In the first
approach, we manually split each training example, that is, the grounded theory
together with the known class for each page, into several training examples. The
results shown in Figure 3 are based on this manual split. In the second approach,
we used the automatic splitting algorithm presented in Section 4.2. The resulting
BDDs are identical to the manual split setting, and the subsequent runs of the
EM algorithm converge to the same results. Hence when plotting against the it-
eration, the graphs are identical. The resulting ground theory is much larger and
the initialization phase therefore takes 247 minutes. However, this is mainly due
to the overhead for indexing, database access and garbage collection in the un-
derlying Prolog system. Grounding and Clark’s completion take only 6 seconds
each, the term simplification step takes roughly 246 minutes, and the final split-
ting algorithm runs in 40 seconds. As we did not optimize the implementation
of the term simplification, we see a big potential for improvement, for instance
by tabling intermediate simplification steps. This affirmatively answers Q3.

5.2 Smokers
We set up an experiment on an instance of the Smokers dataset (cf. [9]) to
answer the question
² Daniel Lowd provided us with the original scripts for the experiment setup. We report on the evaluation based on a rerun of the experiment.

[Figure 4 here: three panels of KL divergence against the number of training examples (40-200), for 3, 4, and 5 smokers; the graphs correspond to 0%, 10%, 20%, 30%, 40%, and 50% missing data.]

Fig. 4. Result for KL-divergence in the smokers domain. The plots are for left to right
3, 4, 5 smokers. Different graphs correspond to different amounts of missing data.

Q4. Is LFI-ProbLog able to recover the parameters of the original model with a
reasonable amount of data?

Missing or incorrect values are two different types of noise that can occur in
real-world data. While incorrect values can be compensated by additional data,
missing values cause local maxima in the likelihood function. In turn, they cause
the learning algorithm to yield parameters different from the ones used to gener-
ate the data. LFI-ProbLog computes the maximum likelihood parameters given
some evidence. Hence the algorithm should be capable of recovering the param-
eters used to generate a set of interpretations. We analyze how the amount of
required training data increases as the size of the model increases. Furthermore,
we test for the influence of missing values on the results. We assess the quality
of the learned model, that is, the difference to the original model parameters,
by computing the Kullback-Leibler (KL) divergence. ProbLog allows for an effi-
cient computation of this measure due to the independence of the probabilistic
facts. In this experiment, we use a variant of the “Smokers” model which can be
represented in ProbLog as follows:

p_si :: smokes_i(X, Y).   % person influenced by a smoking friend
p_sp :: smokes_p(X).      % person starts smoking without external reason
p_cs :: cancer_s(X).      % cancer is caused by smoking
p_cp :: cancer_p(X).      % cancer without external reason
smokes(X) :- (friend(X, Y), smokes(Y), smokes_i(X, Y)); smokes_p(X).
cancer(X) :- (smokes(X), cancer_s(X)); cancer_p(X).

Due to space restrictions, we omit the details on how to represent this such that
the program is tight. We set the number of persons to 3, 4, and 5, respectively, and
sampled from the resulting models up to 200 interpretations each. From these
datasets we derived new instances by randomly removing 10-50% of the atoms.
The size of an interpretation grows quadratically with the number of persons.
The model, as described above, has an implicit parameter tying between ground

instances of non-ground facts. Hence the number of model parameters does not
change with the number of persons. To measure the influence of the model
size, we therefore trained grounded versions of the model, where the grounding
depends on the number of persons. For each dataset we ran LFI-ProbLog for
50 iterations of EM. Manual inspection showed that the probabilities stabilized
after a few, typically 10, iterations. Figure 4 shows the KL divergence for 3, 4
and 5 persons respectively. The closer the KL divergence is to 0, the closer the
learned model is to the original parameters. As the graphs show, the learned
parameters approach the parameters of the original model as the number of
training examples grows. Furthermore, the amount of missing values has little
influence on the distance between the true and the learned parameters. Hence
LFI-ProbLog is capable of recovering the original parameters and is robust
against missing values. This affirmatively answers Q4.

6 Related Work

Most of the existing parameter learning approaches for ProbLog [6], PRISM [22],
and SLPs [17] are based on learning from entailment. For ProbLog, there exists a
learning algorithm based on regression where each training example is a ground
fact together with the target probability [13]. In contrast to LFI-ProbLog, this
approach does not assume an underlying generative process; neither at the level
of predicates nor at the level of interpretations. Sato and Kameya have con-
tributed various interesting and advanced learning algorithms that have been
incorporated in PRISM. Ishihata et al. [14] consider a parameter learning set-
ting based on Binary Decision Diagrams (BDDs) [2]. In contrast to our work,
they assume the BDDs to be given, whereas LFI-ProbLog, constructs them in
an intelligent way from evidence and a ProbLog theory. Ishihata et al. suggest
that their approach can be used to perform learning from entailment for PRISM
programs. This approach has been recently adopted for learning CP-Logic pro-
grams (cf. [1]). The BDDs constructed by LFI-ProbLog are a compact represen-
tation of all possible worlds that are consistent with the evidence. LFI-ProbLog
estimates the marginals of the probabilistic facts in a dynamic programming
manner on the BDDs. While this step is inspired by [14], we tailored it to-
wards the specifics of LFI-ProbLog, that is, we allow deterministic nodes to be
present in the BDDs. This extension is crucial, as the removal of deterministic
nodes can result in an exponential growth of the Boolean formulae underlying
the BDD construction. Riguzzi [20] uses a transformation of ground ProbLog
programs to Bayesian networks in order to learn ProbLog programs from inter-
pretations. Such a transformation is also employed in the learning approaches
for CP-logic [24,16]. Thon et al. [23] studied how CPT-L, a sequential variant
of CP-Logic, can be learned from sequences of interpretations. CPT-L is closely
related to LFI-ProbLog. However, CPT-L is targeted towards the sequential as-
pect of the theory, whereas we consider a more general setting with arbitrary
theories. Thon et al. assume full observability, which allows them to split the
sequence into separate transitions. They build one BDD per transition, which

is much easier to construct than one large BDD per sequence. Our splitting al-
gorithm is capable of exploiting arbitrary independence. LFI-ProbLog can also
be related to knowledge-based model construction approaches in statistical rela-
tional learning such as BLPs, PRMs and MLNs [19]. While the setting explored
in this paper is standard for the aforementioned formalisms, our approach has
significant representational and algorithmic differences from the algorithms used
in those formalisms. In BLPs, PRMs and CP-logic, each training example is
typically used to construct a ground Bayesian network on which a standard
learning algorithm is applied. Although the representation generated by Clark’s
completion is quite close to the representation of Markov Logic, there are subtle
differences. While Markov Logic uses weights on clauses, we use probabilities
attached to single facts.

7 Conclusions
We have introduced a novel parameter learning algorithm from interpretations
for the probabilistic logic programming language ProbLog. This has been mo-
tivated by the differences in the learning settings and applications of typical
knowledge-based model construction approaches and probabilistic logic program-
ming approaches. The LFI-ProbLog algorithm tightly couples logical inference
with a probabilistic EM algorithm at the level of BDDs. Possible directions of
future work include using d-DNNF representations instead of BDDs [10] and a
transformation to Boolean formulae that does not require tight programs.

Acknowledgments. Bernd Gutmann is supported by the Research Foundation-


Flanders (FWO-Vlaanderen). This work is supported by the GOA project 2008/08
Probabilistic Logic Learning and by the European Community under contract
number FP7-248258-First-MM. We thank Vítor Santos Costa and Paulo Moura
for their help with YAP Prolog.

References
1. Bellodi, E., Riguzzi, F.: EM over binary decision diagrams for probabilistic logic
programs. Tech. Rep. CS-2011-01, Università di Ferrara, Italy (2011)
2. Bryant, R.E.: Graph-based algorithms for boolean function manipulation. IEEE
Trans. Computers 35(8), 677–691 (1986)
3. Craven, M., Slattery, S.: Relational learning with statistical predicate invention:
Better models for hypertext. Machine Learning 43(1/2), 97–119 (2001)
4. Cussens, J.: Parameter estimation in stochastic logic programs. Machine Learn-
ing 44(3), 245–271 (2001)
5. De Raedt, L., Frasconi, P., Kersting, K., Muggleton, S. (eds.): Probabilistic Induc-
tive Logic Programming — Theory and Applications. LNCS (LNAI), vol. 4911.
Springer, Heidelberg (2008)
6. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: A probabilistic Prolog and its
application in link discovery. In: Veloso, M. (ed.) IJCAI, pp. 2462–2467 (2007)
7. De Raedt, L.: Logical and Relational Learning. Springer, Heidelberg (2008)

8. De Raedt, L., Kersting, K.: Probabilistic inductive logic programming. In: Ben-
David, S., Case, J., Maruoka, A. (eds.) ALT 2004. LNCS (LNAI), vol. 3244, pp.
19–36. Springer, Heidelberg (2004)
9. Domingos, P., Lowd, D.: Markov Logic: An Interface Layer for Artificial Intelli-
gence. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan
& Claypool Publishers, San Francisco (2009)
10. Fierens, D., Van den Broeck, G., Thon, I., Gutmann, B., De Raedt, L.: Inference
in probabilistic logic programs using weighted CNF's. In: The 27th Conference on
Uncertainty in Artificial Intelligence, UAI 2011 (to appear, 2011)
11. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational
models. In: Džeroski, S., Lavrač, N. (eds.) Relational Data Mining, pp. 307–335.
Springer, Heidelberg (2001)
12. Getoor, L., Taskar, B. (eds.): An Introduction to Statistical Relational Learning.
MIT Press, Cambridge (2007)
13. Gutmann, B., Kimmig, A., De Raedt, L., Kersting, K.: Parameter learning in
probabilistic databases: A least squares approach. In: Daelemans, W., Goethals,
B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp.
473–488. Springer, Heidelberg (2008)
14. Ishihata, M., Kameya, Y., Sato, T., Minato, S.: Propositionalizing the EM algo-
rithm by BDDs. In: ILP (2008)
15. Kersting, K., De Raedt, L.: Bayesian logic programming: Theory and tool. In:
Getoor, L., Taskar, B. (eds.) [12]
16. Meert, W., Struyf, J., Blockeel, H.: Learning ground CP-logic theories by leveraging
Bayesian network learning techniques. Fundam. Inform. 89(1), 131–160 (2008)
17. Muggleton, S.: Stochastic logic programs. In: De Raedt, L. (ed.) Advances in In-
ductive Logic Programming. Frontiers in Artificial Intelligence and Applications,
vol. 32. IOS Press, Amsterdam (1996)
18. Poole, D.: The independent choice logic and beyond. In: De Raedt, L., et al. (eds.) [5]
19. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62, 107–
136 (2006)
20. Riguzzi, F.: Learning ground ProbLog programs from interpretations. In: Proceed-
ings of the 6th Workshop on Multi-Relational Data Mining, MRDM 2007 (2007)
21. Sato, T.: A statistical learning method for logic programs with distribution se-
mantics. In: Sterling, L. (ed.) Proceedings of the 12th International Conference on
Logic Programming, pp. 715–729. MIT Press, Cambridge (1995)
22. Sato, T., Kameya, Y.: Parameter learning of logic programs for symbolic-statistical
modeling. Journal of Artificial Intelligence Research 15, 391–454 (2001)
23. Thon, I., Landwehr, N., De Raedt, L.: A simple model for sequences of relational
state descriptions. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD
2008, Part I. LNCS (LNAI), vol. 5211, pp. 506–521. Springer, Heidelberg (2008)
24. Vennekens, J., Denecker, M., Bruynooghe, M.: Representing causal information
about a probabilistic process. In: Fisher, M., van der Hoek, W., Konev, B., Lisitsa,
A. (eds.) JELIA 2006. LNCS (LNAI), vol. 4160, pp. 452–464. Springer, Heidelberg
(2006)
Feature Selection Stability Assessment Based on
the Jensen-Shannon Divergence

Roberto Guzmán-Martínez¹ and Rocío Alaiz-Rodríguez²

¹ Servicio de Informática y Comunicaciones,
Universidad de León, 24071 León, Spain
roberto.guzman@unileon.es
² Dpto. de Ingeniería Eléctrica y de Sistemas,
Universidad de León, 24071 León, Spain
rocio.alaiz@unileon.es

Abstract. Feature selection and ranking techniques play an important


role in the analysis of high-dimensional data. In particular, their stability
becomes crucial when the feature importance is later studied in order to
better understand the underlying process. The fact that a small change
in the dataset may affect the outcome of the feature selection/ranking
algorithm has been long overlooked in the literature. We propose an
information-theoretic approach, using the Jensen-Shannon divergence to
assess this stability (or robustness). Unlike other measures, this new met-
ric is suitable for different algorithm outcomes: full ranked lists, partial
sublists (top-k lists) as well as the least studied partial ranked lists. This
generalized metric attempts to measure the disagreement among a whole
set of lists with the same size, following a probabilistic approach and be-
ing able to give more importance to the differences that appear at the
top of the list. We illustrate and compare it with popular metrics like
the Spearman rank correlation and the Kuncheva’s index on feature se-
lection/ranking outcomes artificially generated and on a spectral fat
dataset with different filter-based feature selectors.

Keywords: Feature selection, feature ranking, stability, robustness,


Jensen-Shannon divergence.

1 Introduction
Feature selection techniques play an important role in classification problems
with high dimensional data [6]. Reducing the data dimensionality is a key step
in these applications, as the size of the training data set needed to calibrate
a model grows exponentially with the number of dimensions (the curse of
dimensionality) and the process of knowledge discovery from the data
is simplified if the instances are represented with fewer features.
Feature selection techniques measure the importance of the features according
to the value of a given function. These algorithms can be basically divided in
★ This work has been partially supported by the Spanish MEC project DPI2009-08424.


three types [7]: filter, wrapper and embedded approaches. The filter methods
select the features according to a reasonable criterion computed directly from the
data and that is independent of the classification model. The wrapper approaches
make use of the predictive performance of the classification machine in order to
determine the value of a given feature subset and the embedded techniques
are specific for each model since they are intrinsically defined in the inductive
algorithm. Regarding the outcome of the feature selection technique, the output
format may be a full ranked list (or weighting-score) or a subset of features.
Obviously representation changes are possible and thus, a feature subset can be
extracted from a full ranked list by selecting the most important features, and a
partial ranked list can also be derived directly from the full ranking by removing
the least important features.
A problem that arises in many practical applications, in particular when the
available dataset is small and the feature dimensionality is high, is that small
variations in the data lead to different outcomes of the feature selection algo-
rithm. Perhaps the disparity among different research findings has made the
study of the stability (or robustness) of feature selection a topic of recent in-
terest. Fields like biomedicine, bioinformatics or chemometrics require not only
accurate classification models, but a feature ranking or a subset of the most
important features in order to better understand the data and the underlying
process. The fact that under small variations in the available training data, the
top-k feature list (or the ranked feature list) varies, makes this task not straight-
forward and the conclusions derived from it quite unreliable.
The assessment of the robustness of feature selection/ranking methods be-
comes an important issue [11,9,3,1], specially when the aim is to gain insight
into the underlying process by analyzing the most relevant features. Neverthe-
less, this is a topic that has received little attention and it has been only during
the last decade that several works address this analysis. In order to measure
the stability, suitable metrics for each output format of the feature selection
algorithms are required.
The Spearman’s rank correlation coefficient [10,11,19] and Canberra distance
[9] have been proposed to measure the similarity when the outcome represen-
tation is a full ranked list. When the goal is to measure the similarity between
top-k lists (partial lists), a wide variety of measures have been proposed: Jac-
card distance [11,19], an adaptation of the Tanimoto distance [11], Kuncheva’s
stability index [13], Relative Hamming distance [5], Consistency measures [20],
Dice-sorense’s index [15], Ochiai’s index [22] or Percentage of overlapping fea-
tures [8]. An alternative that lies between full ranked lists (all features with
ranking information) and partial lists (a subset with the top-k features, where
all of them are given the same importance) is the use of partial ranked lists,
that is, a list with the top-k features and the relative ranking among them.
This approach has been used in the information retrieval domain [2] to evaluate
queries and it seems more natural when the goal is to analyze a subset of fea-
tures. Providing information of the feature importance is fundamental to carry

out a subsequent analysis of the data, but stability measures have not yet been
proposed for these partial ranked lists.
In our context, the evaluation of the robustness of feature selection techniques,
two ranked lists would be considered much less similar if their differences oc-
curred at the “top” rather than at the “bottom” of the lists. Unlike metrics such
as the Kendall’s tau and the Spearman’s rank correlation coefficient that do not
capture this information, we propose a stability measure based on information
theory that takes this into consideration. Our proposal is based on mapping each
ranked list into a probability distribution and then, measuring the dissimilar-
ity among these distributions using the information-theoretic Jensen-Shannon
divergence. This single metric, SJS (Similarity based on the Jensen-Shannon
divergence) applies to full ranked lists, partial ranked lists as well as top-k lists.
The rest of this paper is organized as follows: Next, Section 2 describes the
stability problem and common approaches to deal with it. The new metric based
on the Jensen-Shannon divergence SJS is presented in Section 3. Experimental
evaluation is shown in Section 4 and finally Section 5 summarizes the main
conclusions.

2 Problem Formulation

In this section, we formulate the problem mathematically and present two com-
mon metrics to evaluate the stability of a feature selection and a feature ranking
algorithm.

2.1 Feature Selection and Ranking

Consider a training dataset $D = \{(x_i, d_i), i = 1, \ldots, M\}$ consisting of M in-
stances and a target d associated with each sample. Each instance $x_i$ is an l-
dimensional vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{il})$ where each component $x_{ij}$ represents
the value of a given feature $f_j$ for that example i, that is, $f_j(x_i) = x_{ij}$.
Consider now a feature selection algorithm whose output is a vector s

$$s = (s_1, s_2, s_3, \ldots, s_l), \quad s_i \in \{0, 1\} \qquad (1)$$

where 1 indicates the presence of a feature and 0 its absence, and $\sum_{i=1}^{l} s_i = k$
for a top-k feature list.
for a top-k feature list.
When the algorithm performs feature ranking, its output is a ranking vector
r with components

$$r = (r_1, r_2, r_3, \ldots, r_l) \qquad (2)$$

where $1 \leq r_i \leq l$ and 1 is considered the highest rank.
Converting a ranking output into a top-k list is conducted according to

$$s_i = \begin{cases} 1 & \text{if } r_i \leq k \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
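In code, (3) is a one-liner; the helper name below is ours and purely illustrative:

def ranking_to_topk(r, k):
    # r: ranking vector with 1-based ranks; returns the 0/1 selection vector.
    return [1 if r_i <= k else 0 for r_i in r]

ranking_to_topk([2, 5, 1, 4, 3], k=2)   # -> [1, 0, 1, 0, 0]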

2.2 Similarity Measures


Generally speaking, the dissimilarity among ranked lists can be measured at
different levels:
– Among full ranked lists
– Among partial sublists (top-k lists)
– Among partial ranked lists (top-k ranked lists)
Ideally, this metric should be bounded by constants that do not depend on the
size of the sublist k or on the total number of features l. Additionally, it should
have a constant value for randomly generated subsets/rankings.
The Spearman’s rank correlation coefficient is the most frequent metric to
measure the similarity between two full ranking outputs [11,19,16,8], but it is not
suitable for partial lists. Among the wide variety of metrics proposed to measure
the similarity between partial lists with k features, such as the Jaccard distance,
Relative Hamming distance, Consistency measures, Dice-sorense’s index [15],
Ochiai’s index [22], Percentage of overlapping features, the Kuncheva’s stability
index appears to be the most widely accepted [13,1,8]. These metrics only apply
to top-k lists, though. Finally, other measures can be applied to both full ranked lists
and top-k lists: the Canberra distance, Spearman's footrule and Spearman's rho;
however, they do not fulfil the desirable properties mentioned above: they are not
bounded by constants independent of k and l, and they do not take a constant
value for randomly extracted subsets or random rankings.
Let r and r′ be the outcomes of a feature ranking algorithm applied to two
different subsamples of D. The Spearman's rank correlation coefficient ($S_R$) is
defined as

$$S_R(r, r') = 1 - 6\sum_{i=1}^{l} \frac{(r_i - r_i')^2}{l(l^2 - 1)} \qquad (4)$$
where l is the number of features. The SR metric takes values in the interval
[−1, 1], being -1 for exactly inverse orders, 0 if there is no correlation between
the rankings and 1 when the two rankings are identical.
Let also s and s′ be the outcomes of a feature selection algorithm
applied to two different subsamples of D. The Kuncheva's index (KI) is given by

$$KI(s, s') = \frac{ol - k^2}{k(l - k)} \qquad (5)$$

where l is the original whole number of features, o is the number of features that
are present in both lists simultaneously, and k is the length of the sublists, that is,
$\sum_{i=1}^{l} s_i = \sum_{i=1}^{l} s_i' = k$. The KI satisfies $-1 < KI \leq 1$, achieving its maximum
when the two lists are identical (o = k) and its minimum for independently
drawn lists s and s′.
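Both pairwise measures translate directly into code. The following is an illustrative Python sketch of (4) and (5); function names are ours:

def spearman(r, r2):
    # (4): r, r2 are full rankings of the same l features.
    l = len(r)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(r, r2)) / (l * (l * l - 1))

def kuncheva(s, s2, k):
    # (5): s, s2 are 0/1 top-k indicator vectors of length l.
    l = len(s)
    o = sum(a & b for a, b in zip(s, s2))   # features shared by both sublists
    return (o * l - k * k) / (k * (l - k))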

2.3 The Stability for a Set of Lists


When we have a set of outputs from a feature ranking algorithm, A = {r1 , r2 , . . .
rN }, with size N , the most common way to evaluate the stability of the set is

to compute pairwise similarities and average the results, which leads to a single
scalar value:

$$S(A) = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} S_M(r_i, r_j) \qquad (6)$$

where $S_M$ represents any similarity metric (the Kuncheva's stability index KI or
the Spearman rank correlation coefficient $S_R$, for example).
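In code, (6) is simply the mean similarity over all unordered pairs; the sketch below accepts any pairwise metric, such as the two above:

from itertools import combinations

def set_stability(outputs, sim):
    # Average sim over the N(N-1)/2 unordered pairs, as in (6).
    pairs = list(combinations(outputs, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)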

3 Stability Based on the Jensen-Shannon Divergence


We propose a stability measure based on the Jensen-Shannon Divergence able to
measure the diversity either among several full ranked lists, among partial ranked
lists or among top-k lists. When the ranking is taken into account, the differences
at the top of the list would be considered more important than differences at
the bottom part, regardless it is a full or a partial list. When we focus on top-k
lists, all the features would be given the same importance.
Our approach to measure the stability of feature selection/ranking techniques
is based on mapping the output of the feature selection/ranking algorithm into
a probability distribution. Then, the “distance” among these distributions is
measured with the Jensen-Shannon divergence [14].
Next, we present our proposal for full ranked lists and then, Section 3.1 and
Section 3.2 cover its extension to partial ranked lists and top-k lists, respectively.
Given the output of a feature ranking algorithm, features at the top of the
list should be given the highest probability (or weight) and it should smoothly
decrease according to the rank. Thus, following [2], the ranking vector
$r = (r_1, r_2, r_3, \ldots, r_l)$ is mapped into the probability vector
$p = (p_1, p_2, p_3, \ldots, p_l)$ where

$$p_i = \frac{1}{2l}\left(1 + \frac{1}{r_i} + \frac{1}{r_i + 1} + \ldots + \frac{1}{l}\right) \qquad (7)$$

and $\sum_{i=1}^{l} p_i = 1$.
This way, we assess the dissimilarity between two ranked lists r and r′ by mea-
suring the divergence between the distributions p and p′ associated with them.
When it comes to measuring the difference between two probability distribu-
tions, the Kullback-Leibler (KL) divergence $D_{KL}$ [12] is the most widely
used option. The KL divergence between probability distributions p and p′ is
given by

$$D_{KL}(p \| p') = \sum_{i} p_i \log \frac{p_i}{p_i'} \qquad (8)$$

This measure is always non-negative, taking values from 0 to ∞, and $D_{KL}(p \| q) = 0$
if p = q. The KL divergence, however, has two important drawbacks: (a)
in general it is asymmetric ($D_{KL}(p \| q) \neq D_{KL}(q \| p)$) and (b) it does not gener-
alize to more than two distributions. For this reason, we use the related Jensen-
Shannon divergence [14], which is a symmetric version of the Kullback-Leibler
divergence and is given by

$$D_{JS}(p \| p') = \frac{1}{2}\left(D_{KL}(p \| \bar{p}) + D_{KL}(p' \| \bar{p})\right) \qquad (9)$$

where $\bar{p}$ is the average of the two distributions.
Given a set of N distributions $\{p_1, p_2, \ldots, p_N\}$, where each one corresponds
to a run of a given feature ranking algorithm, we can use the Jensen-Shannon di-
vergence to measure the similarity among the distributions produced by the different
runs, which can be expressed as

$$D_{JS}(p_1, \ldots, p_N) = \frac{1}{N} \sum_{i=1}^{N} D_{KL}(p_i \| \bar{p}) \qquad (10)$$

or, alternatively, as

$$D_{JS}(p_1, \ldots, p_N) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{l} p_{ij} \log \frac{p_{ij}}{\bar{p}_i} \qquad (11)$$

with $p_{ij}$ being the probability assigned to feature i in the ranking output j and
$\bar{p}_i$ the average probability assigned to feature i.
We look for a stability measure based on the Jensen-Shannon divergence
($S_{JS}$) that fulfills some constraints:
– It falls in the interval [0, 1].
– It takes the value zero for completely random rankings.
– It takes the value one for stable rankings.
The stability metric $S_{JS}$ (Stability based on the Jensen-Shannon divergence) is
given by

$$S_{JS}(p_1, \ldots, p_N) = 1 - \frac{D_{JS}(p_1, \ldots, p_N)}{D_{JS}^*(p_1, \ldots, p_N)} \qquad (12)$$
where $D_{JS}$ is the Jensen-Shannon divergence among the N ranking outcomes
and $D_{JS}^*$ is the divergence value for a ranking generation that is completely
random.
In a random setting, $\bar{p}_i = 1/l$, which leads to a constant value $D_{JS}^*$:

$$D_{JS}^*(p_1, \ldots, p_N) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{l} p_{ij} \log(p_{ij} l) = \frac{1}{N} N \sum_{i=1}^{l} p_i \log(p_i l) = \sum_{i=1}^{l} p_i \log(p_i l) \qquad (13)$$
where pi is the probability assigned to a feature with rank ri . Note that this
maximum value depends exclusively on the number of features and it can be
computed beforehand with the mapping provided by (7).
It is easy to check that:
– For a completely stable ranking algorithm, $p_{ij} = \bar{p}_i$ in (11); that is, the rank
of feature i is the same in any run j of the feature ranking algorithm. This
leads to $D_{JS} = 0$ and a stability metric $S_{JS} = 1$.
– A random ranking will lead to $D_{JS} = D_{JS}^*$ and therefore $S_{JS} = 0$.
– For any ranking neither completely stable nor completely random, the sim-
ilarity metric $S_{JS} \in (0, 1)$; the closer to 1, the more stable the algorithm
is.
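The whole measure for full ranked lists fits in a short routine. The sketch below is our own illustrative Python: it maps each ranking through (7), evaluates (11) against the average distribution, and normalizes by (13). Zero probabilities are skipped under the convention 0 log 0 = 0, so the same routine also accepts the partial-list mappings of the next subsections:

import math

def rank_to_prob(r):
    # Mapping (7); r holds 1-based ranks.
    l = len(r)
    return [(1 + sum(1.0 / j for j in range(ri, l + 1))) / (2 * l) for ri in r]

def sjs(rankings, to_prob=rank_to_prob):
    dists = [to_prob(r) for r in rankings]
    n, l = len(dists), len(dists[0])
    p_bar = [sum(d[i] for d in dists) / n for i in range(l)]
    d_js = sum(p[i] * math.log(p[i] / p_bar[i])          # (11)
               for p in dists for i in range(l) if p[i] > 0) / n
    d_star = sum(p * math.log(p * l)                     # (13)
                 for p in to_prob(list(range(1, l + 1))) if p > 0)
    return 1 - d_js / d_star

For N identical rankings, d_js is 0 and sjs returns 1; for a large set of random rankings, d_js approaches d_star and sjs approaches 0, matching the properties listed above.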

3.1 Extension to Partial Ranked Lists


The similarity between partial ranked lists, that is, partial lists that contain the
top-k features with relative ranking information, can also be measured with the
$S_{JS}$ metric. In this case, the probability is assigned to the top-k ranked features
according to

$$p_i = \begin{cases} \frac{1}{2k}\left(1 + \frac{1}{r_i} + \frac{1}{r_i + 1} + \ldots + \frac{1}{k}\right) & \text{if } r_i \leq k \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

The $S_{JS}$ is computed according to (12), with the normalizing factor $D_{JS}^*$ given
by (13) and the probability $p_i$ assigned to a feature with rank $r_i$ computed as
stated in (14).

3.2 Extension to Top-k Lists


When it comes to partial lists with a given number k of top features, a uniform
probability is assigned to the selected features according to

$$p_i = \begin{cases} \frac{1}{k} & \text{if } r_i \leq k \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$

The $S_{JS}$ is computed according to (12), with the probability $p_i$ assigned to a
feature according to (15) and the normalizing factor $D_{JS}^*$ given by

$$D_{JS}^*(p_1, \ldots, p_N) = \sum_{i=1}^{l} p_i \log(p_i l) = \sum_{i=1}^{k} \frac{1}{k} \log\left(\frac{l}{k}\right) = \log\left(\frac{l}{k}\right) \qquad (16)$$

where k is the length of the sublist and l the total number of features.
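Under the sjs() sketch given earlier, the two partial-list mappings are drop-in replacements for the full-list mapping; (14) and (15) become the following illustrative helpers:

def partial_ranked_to_prob(r, k):
    # Mapping (14): the top-k features keep their relative ranking.
    return [(1 + sum(1.0 / j for j in range(ri, k + 1))) / (2 * k)
            if ri <= k else 0.0 for ri in r]

def topk_to_prob(r, k):
    # Mapping (15): uniform weight over the selected features; the
    # normalizer then evaluates to log(l / k), as in (16).
    return [1.0 / k if ri <= k else 0.0 for ri in r]

# Example: S_JS over top-600 lists.
# sjs(rankings, to_prob=lambda r: topk_to_prob(r, k=600))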

4 Empirical Study
4.1 Illustration on Artificial Outcomes
In this experiment we evaluate the stability metric SJS for the outcomes of
hypothetical feature ranking algorithms. We generate sets of N = 100 rankings
of l = 2000 features. We simulate several feature ranking (FR) algorithms:
– FR-0 with 100 random rankings, that is, a completely random FR algorithm
– FR-1 with one fixed output, and 99 random rankings.
– FR-2 with two identical fixed outputs, and 98 random rankings.

[Figure 1 here: plot of the stability metrics against the simulated FR techniques, from FR-0 to FR-100, for the Jensen-Shannon measure $S_{JS}$ and the Spearman rank correlation $S_R$.]

Fig. 1. $S_{JS}$ metric and Spearman rank correlation for Feature Ranking (FR) techniques that vary from completely random (FR-0 on the left) to completely stable (FR-100 on the right)

– FR-i with i identical fixed outputs, and 100 − i random rankings.


– FR-100 with 100 identical rankings, that is, a completely stable FR technique.

Fig. 1 shows S_JS and S_R for Feature Ranking (FR) techniques that vary from completely random (FR-0, on the left) to completely stable (FR-100, on the right). For the FR-0 method, the stability metric based on the Jensen-Shannon divergence S_JS takes the value 0, while its value is 1 for the stable FR-100 algorithm. Note that S_JS takes values similar to the Spearman rank correlation coefficient S_R.
Assume we now have some Feature Selection (FS) techniques whose stability needs to be assessed. These FS methods (FS-0, FS-1, ..., FS-100) have been obtained from the corresponding FR techniques described above by extracting the top-k features (k = 600). In the same way, they vary smoothly from a completely random FS algorithm (FS-0) to a completely stable one (FS-100). The Jensen-Shannon metric S_JS together with the Kuncheva Index (KI) are depicted for top-600 lists in Fig. 2. Note that the S_JS metric applied to top-k lists provides values similar to the KI metric. The Jensen-Shannon based measure S_JS can be applied to full ranked lists and partial lists, while the KI is only suitable for top-k lists and S_R only for full ranked lists.
Generating partial ranked feature lists is an intermediate option between: (a) generating and comparing full ranked feature lists, which are, in general, very long, and (b) extracting sublists with the top-k features, but with no information about the importance of each feature. The S_JS metric based on the Jensen-Shannon divergence also allows comparing these partial ranked lists (as well as top-k lists and full ranked lists). Consider sets of sublists with the 600 most important features out of 2000 features. We generated several sets of lists:
Fig. 2. S_JS metric and the KI for Feature Selection (FS) techniques that vary from completely random (FS-0 on the left) to completely stable (FS-100 on the right). The metrics work on top-k lists with k = 600.

some of them show high differences in the lowest ranked features, while others show high differences in the highest ranked features. The same sublist can come either with the ranking information (partial ranked lists) or with no information about the feature importance (top-k lists). The overlap among the lists is around 350 features. Fig. 3 shows the value of S_JS (partial ranked lists), S_JS (top-k lists) and the Kuncheva index (top-k lists) for these sets.
Even though the lists have the same average overlap (350 features), some of them show more discrepancy about which are the top features (Fig. 3, on the right), while other sets show more differences at the bottom of the list. The KI cannot handle this information since it only works with top-k lists and therefore assigns the same value to these very different situations. When S_JS works at this level (top-k lists), it also gives the same measure for all the scenarios. However, S_JS can also handle the information provided in partial ranked lists, taking into account the importance of the features, and therefore assigns a lower stability value to those sets of lists with high differences at the top of the lists, that is, with high discrepancy about the most important features. Likewise, it assigns a higher stability value to those sets where the differences appear in the least important features but there is more agreement about the most important ones. Fig. 3 illustrates this fact: S_JS (partial ranked lists) varies according to the location of the differences in the list, while S_JS (top-k lists) and the KI assign the same value regardless of where the discrepancies appear.
Consider also the following situation, where the 600 most important features out of 2000 have been extracted and the overlap among the top-600 lists is 100%. We have evaluated several scenarios:
Fig. 3. S_JS (partial ranked lists), S_JS (top-k lists) and the Kuncheva index (top-k lists) for Feature Selection (FS) techniques that extract the top-600 features out of 2000. The overlap among the lists is around 350 common features. The situations vary smoothly from sets of partial lists with differences at the bottom of the list (left) to sets of lists that show high differences at the top of the list (right).

– The feature ranks are identical in all the lists (Identical).
– The ranking of a given feature is assigned randomly (Random).
– Neither completely random nor completely identical.

Working with top-k lists, the stability metric (KI) provides a value of 1, which is somewhat misleading considering the different scenarios that may appear. It seems natural that, even though all runs agree about the 600 most important features, the stability metric should be lower than 1 when there is low agreement about which are the most important features. The S_JS measure allows working with partial ranked lists and can therefore establish differences between these scenarios. Fig. 4 shows S_JS (partial ranked lists) and S_JS, KI (top-k lists), highlighting this fact. S_JS (partial ranked lists) takes a value slightly higher than 0.90 in a situation where there is complete agreement about which are the 600 most important features, but complete discrepancy about their importance. Its value increases to 1 as the randomness in the feature ranking assignment decreases. In contrast, KI would assign a value of 1, which may be misleading when studying the stability issue.

4.2 Evaluation on a Spectral Dataset

The new measure has been used to experimentally assess the stability of four standard feature selectors based on a filter approach: χ2, Information Gain Ratio (GR) [4], Relief, and one based on the parameter values of an independent classifier (Decision Rule 1R) [21].
Fig. 4. S_JS (top-k lists) and S_JS (partial ranked lists) for Feature Selection (FS) techniques that extract the top-600 features out of 2000. The overlap among the sublists with 600 features is complete. The ranking assigned to each feature varies from FS techniques for which it is random (left) to FS techniques for which each feature ranking is identical in each sublist (right).

We have conducted experiments on a real dataset of omental fat samples collected from carcasses of suckling lambs [18]. The whole dataset has 134 instances: 66 from lambs fed with a milk replacer (MR), while the other 68 were reared on ewe milk (EM). Authentication of the type of feeding is a key issue in the certification of suckling lamb carcasses, with the rearing system being responsible for differences in price and quality. The use of spectroscopy for the discrimination of fat samples according to the rearing system provides several advantages, mainly its speed and versatility. Determining which regions of the spectrum have more discriminant power is also fundamental for veterinary professionals. All FTIR spectra were recorded from 4000 to 750 cm−1 with a resolution of 4 cm−1, which leads to a total of 1687 features. The average spectra for both classes are shown in Fig. 5.
The dataset was randomly split into ten folds, and the feature ranking algorithm was launched on nine out of the ten folds, in a consecutive way. Five runs of this process resulted in a total of N = 50 rankings. Feature ranking was carried out with WEKA [21] and the computation of the stability with MATLAB [17].
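This resampling protocol is easy to reproduce; the sketch below assumes a generic interface rank_fn(X, y) that returns one ranking per call (our assumption; the paper used WEKA's selectors):

    import numpy as np

    def collect_rankings(X, y, rank_fn, n_runs=5, n_folds=10, seed=0):
        """5 runs x 10 folds = 50 rankings, each obtained on nine of the
        ten folds, as described above."""
        rng = np.random.default_rng(seed)
        rankings = []
        for _ in range(n_runs):
            idx = rng.permutation(len(y))
            folds = np.array_split(idx, n_folds)
            for i in range(n_folds):
                train = np.concatenate([f for j, f in enumerate(folds) if j != i])
                rankings.append(rank_fn(X[train], y[train]))
        return rankings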
The S_JS (full ranked list) measure gives an overall view of the stability. The results (Table 1) indicate that, in the case of the spectral data, the most stable methods seem to be Relief and GR, while 1R appears to be the one with the least global stability.
The metric S_JS also enables an analysis focused on the top ranked or selected features. Fig. 6 depicts S_JS for a given number of the top-k selected features (continuous line) and S_JS for the top-k ranked features (dashed line).
Fig. 5. Average FT-IR spectrum of omental fat samples for Milk Replacer (MR) and Ewe Milk (EM)

Table 1. Stability of several feature selectors evaluated with the similarity measure based on the Jensen-Shannon divergence (S_JS) on a set of 50 rankings

                          1R    χ2    GR    Relief
S_JS (full ranked list)   0.87  0.92  0.94  0.94

The differences between S_JS for top-k lists and top-k ranked lists are explained by the fact that in the latter, differences/similarities in the lowest ranks receive less importance than differences/similarities in the highest ranks. The results show that the four feature selectors share a common trend: S_JS (top-k) assigns a lower stability value, which may sometimes be substantially different. Thus, for the 1R feature selector, S_JS (ranked top-400) is 0.82, but it drops to 0.70 when all features are given a uniform weight. This is explained by the fact that many differences appear at the bottom of the list, and when they are given the same importance as differences at the top of the list, the stability measure drops considerably.
When we focus on the top-k (selected/ranked) features and the value of k is low, the feature selectors are quite stable. For example, for k = 10, S_JS takes the value 0.92 for χ2, 0.73 for 1R, 0.92 for GR and 0.91 for Relief.
The plots in Fig. 6 show that stability decreases as the cardinality of the feature subset increases for the feature selection techniques 1R, χ2 and GR, while Relief shows a profile with high stability regardless of the size of the sublist. While in global terms GR is as stable as Relief, when we focus on lists with the most important features Relief's robustness does not decrease as the feature subset size increases.
The proposed metric S_JS can be compared with the Spearman rank correlation coefficient (S_R) when it comes to measuring the stability of full ranked lists. Likewise, it can be compared with the Kuncheva stability index (KI) if partial lists are considered. Note, however, that S_JS is suitable for any of these settings.
Fig. 6. Feature selection methods 1R, χ2, GR and Relief applied on the Omental Fat Spectra Dataset. Stability measure S_JS for feature subsets with different cardinality.

Table 2. Stability of several feature selectors evaluated with the similarity measure based on the Spearman rank correlation coefficient (S_R) on a set of 50 rankings

                         1R    χ2    GR    Relief
S_R (full ranked list)   0.79  0.85  0.90  0.94

Measuring the robustness with S_R and KI requires the computation of 50(50 − 1)/2 pairwise similarities for each algorithm, which are then averaged as stated in Eq. (6). According to the S_R values recorded in Table 2, Relief appears as the most stable (0.94) ranking algorithm, whereas 1R is quite unstable (0.79). When S_JS works on the full ranked lists, it gives a stability indication similar to S_R, and the findings derived from them are not contradictory. When S_JS works on the top-k lists, its value is similar to that provided by KI (see Fig. 7), which allows S_JS to be seen as a generalized stability metric that can work not only with full ranked lists or top-k lists, but also with top-k ranked lists, while the other metrics are restricted to a particular list format.
Fig. 7. Feature selection methods 1R, χ2, GR and Relief applied on the Omental Fat Spectra Dataset. Stability measures S_JS and KI for top-k lists with different cardinality.

5 Conclusions

The robustness of the feature ranking techniques used for knowledge discovery is an issue of recent interest. In this work, we consider the problem of feature selection/ranking stability and propose a metric based on the Jensen-Shannon divergence (S_JS) able to capture the disagreement among the lists generated in different runs by a feature ranking algorithm from different perspectives: (a) considering the full ranked feature lists, (b) focusing on the top-k features, that is to say, lists that contain the k most relevant features giving a uniform importance to all of them, and (c) considering partial ranked lists that retain the most relevant features together with the ranking information.
The new metric S_JS shows the relative amount of randomness of the ranking/selection algorithm, independently of the sublist size, and unlike other metrics that evaluate pairwise similarities, S_JS evaluates the whole set of lists (of the same size) directly. To the best of our knowledge, no metrics have been proposed so far to measure the similarity between partial ranked feature lists. Moreover, the new measure accepts any representation of the feature selection output, and its behavior is: (i) close to the Spearman rank correlation coefficient for full ranked lists and (ii) similar to the Kuncheva index for top-k lists. If the ranking is taken into account, differences at the top of the list are considered more important than differences that appear at the bottom part.
The stability of feature selection algorithms opens a wide area of research that includes the development of more robust feature selection techniques, the study of their correlation with classifier performance, and different approaches to analyze robustness. The proposal of visual techniques to ease the stability analysis and the exploration of committee-based feature selectors are our immediate future research.

References
1. Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., Saeys, Y.: Robust biomarker
identification for cancer diagnosis with ensemble feature selection methods. Bioin-
formatics 26(3), 392 (2010)
2. Aslam, J., Pavlu, V.: Query Hardness Estimation Using Jensen-Shannon Diver-
gence Among Multiple Scoring Functions. In: Amati, G., Carpineto, C., Romano,
G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 198–209. Springer, Heidelberg (2007)
3. Boulesteix, A.-L., Slawski, M.: Stability and aggregation of ranked gene lists. Briefings in Bioinformatics 10(5), 556–568 (2009)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley and Sons,
Chichester (2001)
5. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to instability problems with sequential wrapper-based approaches to feature selection. Trinity College Dublin Computer Science Technical Report TCD-CS-2002-28 (2002)
6. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach.
Learn. Res. 3, 1157–1182 (2003)
7. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature Extraction: Foundations
and Applications. Studies in Fuzziness and Soft Computing. Springer-Verlag New
York, Inc., Secaucus (2006)
8. He, Z., Yu, W.: Stable feature selection for biomarker discovery. Technical Report
arXiv:1001.0887 (January 2010)
9. Jurman, G., Merler, S., Barla, A., Paoli, S., Galea, A., Furlanello, C.: Algebraic
stability indicators for ranked lists in molecular profiling. Bioinformatics 24(2), 258
(2008)
10. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms. In:
Fifth IEEE International Conference on Data Mining, p. 8. IEEE, Los Alamitos
(2005)
11. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a
study on high-dimensional spaces. Knowledge and Information Systems 12, 95–116
(2007), doi:10.1007/s10115-006-0040-8
12. Kullback, S., Leibler, R.: On information and sufficiency. The Annals of Mathe-
matical Statistics 22(1), 79–86 (1951)
13. Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, pp. 390–395. ACTA Press (2007)
14. Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions
on Information Theory 37(1), 145–151 (1991)
612 R. Guzmán-Martı́nez and R. Alaiz-Rodrı́guez

15. Loscalzo, S., Yu, L., Ding, C.: Consensus group stable feature selection. In: Proceed-
ings of the 15th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD 2009, pp. 567–576 (2009)
16. Lustgarten, J.L., Gopalakrishnan, V., Visweswaran, S.: Measuring Stability of Fea-
ture Selection in Biomedical Datasets. In: AMIA Annual Symposium Proceedings,
vol. 2009, p. 406. American Medical Informatics Association (2009)
17. MATLAB. version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts
(2010)
18. Osorio, M.T., Zumalacárregui, J.M., Alaiz-Rodríguez, R., Guzmán-Martínez, R., Engelsen, S.B., Mateo, J.: Differentiation of perirenal and omental fat quality of suckling lambs according to the rearing system from Fourier transform mid-infrared spectra using partial least squares and artificial neural networks. Meat Science 83(1), 140–147 (2009)
19. Saeys, Y., Abeel, T., Peer, Y.: Robust Feature Selection Using Ensemble Feature
Selection Techniques. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML
PKDD 2008, Part II. LNCS (LNAI), vol. 5212, pp. 313–325. Springer, Heidelberg
(2008)
20. Somol, P., Novovicova, J.: Evaluating stability and comparing output of feature
selectors that optimize feature subset cardinality. IEEE Transactions on Pattern
Analysis and Machine Intelligence 32, 1921–1939 (2010)
21. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques with Java Implementations. Morgan Kaufmann, San Francisco (1999)
22. Zucknick, M., Richardson, S., Stronach, E.A.: Comparing the characteristics of gene
expression profiles derived by univariate and multivariate classification methods.
Statistical Applications in Genetics and Molecular Biology 7(1), 7 (2008)
Mining Actionable Partial Orders in Collections of Sequences

Robert Gwadera, Gianluca Antonini, and Abderrahim Labbi

IBM Zurich Research Laboratory, Rüschlikon, Switzerland

Abstract. Mining frequent partial orders from a collection of sequences was introduced as an alternative to mining frequent sequential patterns in order to provide a more compact/understandable representation. The motivation was that a single partial order can represent the same ordering information between items in the collection as a set of sequential patterns (a set of totally ordered sets of items). However, in practice, a discovered set of frequent partial orders is still too large for effective usage. We address this problem by proposing a method for ranking partial orders with respect to significance that extends our previous work on ranking sequential patterns. In experiments conducted on a collection of visits to the website of a multinational technology and consulting firm, we show the applicability of our framework to discover partial orders of frequently visited webpages that can be actionable in optimizing the effectiveness of web-based marketing.

1 Introduction

Mining subsequence patterns (gaps between the symbols are allowed) in a collection of sequences is one of the most important data mining frameworks, with many applications including analysis of time-related processes, telecommunications, bioinformatics, business, software engineering, Web click-stream mining, etc. [1]. The framework was first introduced in [2] as the problem of sequential pattern mining and defined as follows. Given a collection of itemset-sequences (a sequence database of transactions of variable lengths) and a minimum frequency (support) threshold, the task is to find all subsequence patterns, occurring across the itemset-sequences in the collection, whose relative frequency is greater than the minimum frequency threshold. Although state-of-the-art mining algorithms can efficiently derive a complete set of frequent sequential patterns under certain constraints, including mining closed sequential patterns [3] and maximal sequential patterns [4], the set of discovered sequential patterns is still too large for practical usage [1], usually containing a large fraction of non-significant and redundant patterns [5]. As a solution to this problem, a method for ranking sequential patterns with respect to significance was presented in [6].
Another line of research addressing the limitations of sequential pattern mining is partial order mining [7],[8],[9], where a partial order on a set of items (poset) is an ordering relation between the items in the set.

Fig. 1. A partial order: a → b, a → c, d → c

Fig. 2. A corresponding set of total orders: [a, b, d, c], [a, d, b, c], [a, d, c, b], [d, a, b, c], [d, a, c, b]

The relation is called a partial order to reflect the fact that not every pair of elements of a poset is related. Thus, in general, the relation can be of three types: (I) empty, meaning there is no ordering information between the items; (II) partial; and (III) total, corresponding to a sequence. A partial order can be represented as a Directed Acyclic Graph (DAG), where the nodes correspond to items and the directed edges represent the ordering relation between the items. Figure 1 presents an example partial order for items {a, b, c, d}. The main appeal of partial orders is to provide a more compact representation, where a single partial order can represent the same ordering information between co-occurring items in the collection of sequences as a set of sequential patterns. As an example of this property, consider the partial order in Figure 1 and the corresponding set of total orders that it summarizes in Figure 2. Now imagine that this set of total orders is the input to algorithm PrefixSpan [10] for sequential pattern mining. Setting the minimum relative support threshold minRelSup = 0.2, we obtain twenty-three frequent sequential patterns, while only one partial order is required to summarize the ordering information expressed by the set of sequences in Figure 2. However, in practice, even a discovered set of frequent closed partial orders (using algorithm Frecpo [9]) is still too large for effective usage. Therefore, we address this problem by proposing a method for ranking partial orders with respect to significance that extends our previous work on ranking sequential patterns [6]. Using the ranking framework we can discover a small set of frequent partial orders that can be actionable (informative) for a domain expert of the analyzed data who would like to turn the knowledge of the discovered patterns into an action in his domain (e.g., to optimize web-based marketing for a marketing expert).

1.1 Overview of the Method

Our ranking framework is based on building a probabilistic reference model for the input collection of sequences and then deriving an analytical formula for the expected relative frequency of partial orders. Given such a model, we discover partial orders that are over-represented with respect to the reference model, meaning they are more frequent in the input collection of sequences than expected in the reference model. The frequency of occurrence alone is not enough to determine significance, i.e., an infrequent partial order can be more significant than a frequent one. Furthermore, an occurrence of a subsequence pattern may be meaningless [11] if it occurs in a sequence of sufficiently large size.

The main thrust of our approach is that the expected relative frequency of a partial order in a given reference model can be computed from an exact formula in one step. In this paper we are interested only in over-represented partial orders, and our algorithm for ranking partial orders with respect to significance works as follows: (I) we find the complete set of frequent closed partial orders for a given minimum support threshold using a variant of algorithm Frecpo [9]; (II) we compute their expected relative frequencies and variances in the reference model; and (III) we rank the frequent closed partial orders with respect to significance by computing the divergence (Z-score) between the observed and the expected relative frequencies. Given the reference model, a significant divergence between an observed and an expected frequency indicates that there is a dependency between occurrences of items in the corresponding partial order. Note that the only reason we use a minimum relative support threshold is that we are interested in over-represented patterns. In particular, we set a value of the threshold that is low enough to discover low-support significant patterns and high enough that the discovered significant patterns are actionable for marketing specialists. Given the rank values, we discover actionable partial orders by first pruning non-significant and redundant partial orders and then ordering the remaining partial orders with respect to significance.
As a reference model for the input collection of sequences we use an independence mixture model, which corresponds to a generative process that generates a collection of sequences as follows: (I) it first generates the size w of the sequence to be generated from the distribution of sizes in the input collection of sequences and (II) it generates a sequence of items of size w from the distribution of items in the input collection of sequences. For such a reference model, the expected relative frequency refers to the average relative frequency in randomized collections of sequences, where the marginal distribution of symbols and the sequence lengths are the same as in the input collection of sequences. Note that the reference model can easily be extended to Markov models in the spirit of [12]. The reasons we take the reference model to be the independence model in this paper are as follows: (I) it has an intuitive interpretation as a method for discovering dependencies; (II) it leads to a polynomial algorithm for computing the expected relative frequencies of partial orders in the reference model and (III) it is a reasonable model for our real data set, as explained in Section 4.

1.2 Related Work and Contributions

Our ranking framework builds on the work in [6] by extending it to the case of arbitrary partial orders, where [6] provided a formula for the expected relative frequency of a sequential pattern, i.e., a pattern that occurs across a collection of itemset-sequences of variable lengths. This paper also fills the important gap between [11] and [13] by providing a formula for the expected frequency of an arbitrary partial order in a sequence of size w, where [11] analyzed the expected frequency of a serial pattern and [13] analyzed the expected frequency of sets of subsequence patterns, including the special case of the parallel pattern (all permutations of a set of symbols).

In finding the formula for the expected frequency of a partial order we were inspired by the works [14],[15],[16] that considered the problem of enumerating all linear extensions of a poset, where a linear extension is a total order satisfying the partial order. For example, given the partial order in Figure 1, the set of sequences in Figure 2 is the set of all its linear extensions. The challenges in analyzing partial orders in comparison to the previous works are as follows: (I) analytic challenge: arbitrary partial orders can have complex dependencies between the items that complicate the probabilistic analysis; and (II) algorithmic challenge: an arbitrary poset may correspond to a large set of linear extensions, and computing the probability of such a set is more computationally demanding than for a single pattern.
The contributions of this paper are as follows: (I) it provides the first method for ranking partial orders with respect to significance, which leads to an algorithm for mining actionable partial orders, and (II) it is the first application of significant partial orders in web usage mining.
In experiments conducted on a collection of visits to the website of a multinational technology and consulting firm, we show the applicability of our framework to discover partial orders of frequently visited webpages that can be actionable in optimizing the effectiveness of web-based marketing.
The paper is organized as follows. Section 2 reviews theoretical foundations, Section 3 presents our framework for mining actionable partial orders in collections of sequences, Section 4 presents experimental results and finally Section 5 presents conclusions.

2 Foundations

A = {a_1, a_2, ..., a_{|A|}} is an alphabet of size |A|. S = {s^(1), s^(2), ..., s^(n)} is a set of sequences of size n = |S|, where s^(i) = [s^(i)_1, s^(i)_2, ..., s^(i)_{n^(i)}] is the i-th sequence of length n^(i) and s^(i)_t ∈ A.
A sequence s = [s_1, s_2, ..., s_m] is a subsequence of a sequence s' = [s'_1, s'_2, ..., s'_{m'}], denoted s ⊑ s', if there exist integers 1 ≤ i_1 < i_2 < ... < i_m ≤ m' such that s_1 = s'_{i_1}, s_2 = s'_{i_2}, ..., s_m = s'_{i_m}. We also say that s' is a supersequence of s and that s is contained in s'.
The support (frequency) of a sequence s in S, denoted by sup_S(s), is defined as the number of sequences in S that contain s as a subsequence. The relative support (relative frequency) rsup_S(s) = sup_S(s)/|S| is the fraction of sequences that contain s as a subsequence. Given a set of symbols {s_1, ..., s_m} we distinguish the following two extreme types of occurrence as a subsequence of another sequence s': (I) the serial pattern, denoted s = [s_1, ..., s_m], meaning that the symbols must occur in the given order, and (II) the parallel pattern, denoted s = {s_1, ..., s_m}, meaning that the symbols may occur in any order. We use the term sequential pattern for a serial pattern in the sequential pattern mining framework, where the pattern occurs across a collection of input sequences.
2.1 Reference Model of a Collection of Sequences

Given an input collection of sequences S = {s^(1), s^(2), ..., s^(n)}, its independence reference model corresponds to a generative process that generates a randomized version of S, called R = {r^(1), r^(2), ..., r^(n)}, as follows [6].
Let α = [α_1, α_2, ..., α_M] be the distribution of sizes of the sequences in S (α_m = P(|s^(i)| = m), M = max_{i=1,...,n} |s^(i)|), where α_m = N_n(|s^(i)| = m)/n and N_n(|s^(i)| = m) is the number of sequences of size m in S.
Let θ = [θ_1, θ_2, ..., θ_|A|] be the distribution of items in S (θ_j = P(s^(i)_t = a_j) for a_j ∈ A), where θ_j = N_n(a_j)/n_S, n_S is the number of items in S, and N_n(a_j) is the number of occurrences of item a_j in S.
Then r^(i) ∈ R for i = 1, 2, ..., n is generated as follows:
1. Generate the size w of the sequence from distribution α.
2. Generate sequence r^(i) of size w from distribution θ.
Then P(r^(i), w) = α_w · P(r^(i)|w), where P(r^(i)|w) = \prod_{t=1}^{w} P(r^(i)_t) is the probability that sequence r^(i) is generated given w.
Given the reference model and a sequential pattern s, Ω^n(s) is the random variable corresponding to the relative frequency of s, with E[Ω^n(s)] = P^∃(s) and Var[Ω^n(s)] = (1/n) P^∃(s)(1 − P^∃(s)), where P^∃(s) is the probability that s occurs (exists) in a sequence of R; the superscript ∃ denotes at least one occurrence as a subsequence. P^∃(s) can be expressed as

    P^{\exists}(s) = \sum_{w=|s|}^{M} \alpha_w \cdot P^{\exists}(s|w),    (1)

where P^∃(s|w) is the probability that pattern s occurs in a sequence of size w [6]. Thus, P^∃(s) is expressed as a discrete mixture model, where the mixing coefficients [α_|s|, ..., α_M] model the fact that an occurrence of s as a subsequence of a sequence s' depends on the size of s' and may occur in any sequence s' ∈ R for which |s'| ≥ |s|. The significance rank can be expressed as

    sigRank(s) = \frac{\sqrt{n}\,(rsup_S(s) - P^{\exists}(s))}{\sqrt{P^{\exists}(s)\,(1 - P^{\exists}(s))}}.

2.2 Probability of a Serial and a Parallel Pattern

Let W^∃(e|w) be the set of all distinct sequences of length w containing at least one occurrence of a serial/parallel pattern e as a subsequence. Then P^∃(e|w) = \sum_{s ∈ W^∃(e|w)} P(s), where P(s) is the probability of sequence s in a given reference model (e.g., in a k-order Markov model). W^∃(e|w) can be enumerated using a Deterministic Finite Automaton (DFA) recognizing an occurrence of e.
The DFA for a serial pattern (sequential pattern) e = [e_1, e_2, ..., e_m] is called G_<(e); an example is presented in Figure 3 for e = [a, b, c].
Fig. 3. G_<(e) for serial pattern e = [a, b, c], where A is a finite alphabet of size |A| > 3 and \overline{a_j} = A \ {a_j} denotes the set complement of element a_j ∈ A

It has the following components [11]. The initial state is 0, the accepting state is m, and the states other than the initial state correspond to the indexes of the symbols in e. The powers n_i for i = 1, 2, ..., m denote the number of times the i-th self-loop is used on the path from state 0 to m. Thus, P^∃(e|w) is equal to the sum of the probabilities of all distinct paths of length w from state 0 to m in G_<(e). For a serial pattern in a 0-order Markov reference model, P^∃(e|w) can be expressed as

    P^{\exists}(e|w) = P(e) \sum_{i=0}^{w-m} \; \sum_{n_1 + \cdots + n_m = i} \; \prod_{k=1}^{m} (1 - P(e_k))^{n_k},    (2)

where P(e) = \prod_{i=1}^{m} P(e_i) and P(e_i) is the probability of symbol e_i in the reference model; (2) can be evaluated in O(w^2) using a dynamic programming algorithm [11].
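The same quantity can also be obtained with a plain O(mw) forward pass over the states of G_<(e), propagating the probability mass of being in each prefix state; this linear-scan formulation is our own sketch, equivalent to evaluating (2) under the 0-order model:

    def p_exists_serial(probs, w):
        """P_exists(e|w) for a serial pattern e under the 0-order model,
        where probs[k] = P(e_{k+1}). State k of G_<(e) means the first k
        symbols of e have been matched; state m is absorbing."""
        m = len(probs)
        state = [1.0] + [0.0] * m          # all mass starts in state 0
        for _ in range(w):
            nxt = [0.0] * (m + 1)
            nxt[m] = state[m]              # accepting state is absorbing
            for k in range(m):
                nxt[k] += state[k] * (1.0 - probs[k])   # self-loop on a miss
                nxt[k + 1] += state[k] * probs[k]       # advance on the next symbol
            state = nxt
        return state[m]

For example, p_exists_serial([0.1, 0.2, 0.1], 20) gives the probability that the pattern occurs as a subsequence of a random sequence of 20 symbols.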
The DFA for a parallel pattern e = {e_1, e_2, ..., e_m}, called G_∥(e), an example of which is presented in Figure 4 for e = {a, b, c}, has the following components [13]. The initial state is {∅}, the accepting state is {1, 2, ..., m}, and the states other than the initial state correspond to the non-empty subsets of indexes of symbols in e. Let E(e) be the set of serial patterns corresponding to all permutations of the symbols in e. Let X be the set of all distinct simple paths (i.e., without self-loops) from the initial state to the accepting state, and let Edges(path) be the sequence of edge labels on a path path ∈ X. Then clearly E(e) = {Edges(path) : path ∈ X in G_∥(e)}. For a parallel pattern in a 0-order Markov reference model, P^∃(e|w) can be computed in O(m! w^2) by evaluating (2) for each member of E(e), where in place of 1 − P(e_k) we use the probability of the complement of the corresponding self-loop labels of the states on the path in G_∥(e).

2.3 Partially Ordered Sets of Items

A poset is a set P equipped with an irreflexive and transitive ordering relation R(P) (partial order) on its ground set A(P). An ordered pair (a, b) ∈ R(P) is denoted a < b, where a, b ∈ A(P). A linear extension of P is a permutation [p_1, p_2, ..., p_m] of A(P) such that p_i < p_j in P implies i < j. E(P) = {s^(1), ..., s^(n)} is the set of all linear extensions of P. Elements a, b are incomparable, denoted a ∥ b, if P contains neither a < b nor a > b. If no pair of elements in A(P) is incomparable in P, then P is a totally ordered set. G(P) is a DAG whose nodes correspond to the items and whose directed edges correspond to the relations, so that an edge a → b means a < b. As an example of the introduced terminology, consider poset P = {a < b, a < c, d < c}, represented by G(P) in Figure 1.

Fig. 4. G_∥(e) for parallel pattern e = {a, b, c}, where A is a finite alphabet of size |A| > 3 and \overline{a_j} = A \ {a_j} denotes the set complement of element a_j ∈ A

Here A(P) = {a, b, c, d}, R(P) = {(a, b), (a, c), (d, c)}, and E(P) is presented in Figure 2. We use the terms poset and partial order interchangeably in the case where all elements of the poset take part in the relation (e.g., as in Figure 1). Clearly, a serial pattern is a total order and a parallel pattern is a trivial order. A graph G is said to be transitive if, for every pair of vertices u and v, the existence of a directed path from u to v in G implies that (u, v) is an edge of G. The transitive closure G^T of G is the least subset of V × V which contains G and is transitive. A graph G^t is a transitive reduction of a directed graph G whenever the following two conditions are satisfied: (I) there is a directed path from vertex u to vertex v in G^t if and only if there is a directed path from u to v in G, and (II) there is no graph with fewer arcs than G^t satisfying condition (I).
A partial order s is contained in partial order s', denoted s ⊑ s', if s ⊆ G^T(s'). We also say that s' is a super partial order of s. The relative support rsup_S(s) = sup_S(s)/|S| is the fraction of sequences in S that contain s. Given a relative support threshold minRelSup, a partial order s is called a frequent partial order if rsup_S(s) ≥ minRelSup. The problem of mining frequent partial orders is to find all frequent partial orders in S given minRelSup. The support has an anti-monotonic property, meaning that sup_S(s) ≥ sup_S(s') if s ⊑ s'. A partial order s is called a closed frequent partial order if there exists no frequent partial order s' such that s ⊑ s' and rsup_S(s) = rsup_S(s').
Given a set of symbols A' = {a_1, ..., a_m}, let s_serial be a serial pattern over A', s_partial a partially ordered pattern over A', and s_parallel the parallel pattern over A'. Then the following property holds in the independence reference model:

    P^{\exists}(s_{serial}|w) \le P^{\exists}(s_{partial}|w) \le P^{\exists}(s_{parallel}|w).    (3)
Thus, (3) follows from the fact that in a random sequence of size w a serial pattern is the least likely to occur, since it corresponds to only one linear extension, while the parallel pattern is the most likely to occur, since it corresponds to m! linear extensions. The probability of existence of a partially ordered pattern, depending on the size of the ordering relation, is therefore at least equal to P^∃(s_serial|w) and at most equal to P^∃(s_parallel|w).

3 Mining Actionable Partial Orders

The problem of mining actionable partial orders in collections of sequences can be defined as follows. Given an input collection of sequences S and a minimum relative support threshold minRelSup, the task is to discover actionable partial orders by first ranking the discovered partial orders with respect to significance and then pruning non-significant and redundant partial orders.
Note that in our method the only purpose of the support threshold for partial orders is to discover patterns that are actionable (exhibited by a sufficiently large number of users) for marketing experts.

3.1 Expected Frequency of a Poset

We derive a computational formula for the expected relative frequency of a poset, E[Ω^n(P|w)] = P^∃(P|w), as follows. Since a poset P occurs in a sequence of size w if at least one member of its set of linear extensions E(P) occurs in that sequence, P^∃(P|w) is equivalent to P^∃(E(P)|w) and can be expressed as

    P^{\exists}(\mathcal{P}|w) = \sum_{s \in W^{\exists}(E(\mathcal{P})|w)} P(s),    (4)

where W^∃(E(P)|w) is the set of all distinct sequences of length w containing at least one occurrence of at least one member of E(P) as a subsequence, and P(s) is the probability of sequence s in a given reference model (e.g., in a k-order Markov model). Our approach to finding W^∃(E(P)|w) is to construct a DFA for E(P), called G_≤(P). Figure 5 contains a formal definition of G_≤(P), where the states are labeled with sequences [i^(1), ..., i^(n)] and i^(j) denotes the prefix length for serial pattern s^(j) ∈ E(P). Given E(P) = {s^(1), ..., s^(n)}, G_≤(P) can be viewed as a trie (prefix tree) of the members of E(P) with the addition of properly labeled self-loops on the nodes. Figure 6 presents an example of G_≤(P) for P = {a → b, c}. Thus, given G_≤(P), the set of sequences of size w containing at least one occurrence of any of its linear extensions, W^∃(E(P)|w), can be expressed as

    W^{\exists}(E(\mathcal{P})|w) = \{Edges(path) : path \in L_w \text{ in } G_{\le}(\mathcal{P})\},    (5)

where L_w is the set of all distinct paths of length w, including self-loops, from the initial state to any of the accepting states.
– the initial state is [0, ..., 0], and each of the n = |E(P)| accepting states corresponds to a member of E(P)
– each non-initial state is [i^(1), ..., i^(n)], denoting the prefix length for serial pattern s^(j) ∈ E(P) in a prefix tree of the members of E(P)
– a self-loop from state [i^(1), ..., i^(n)] to itself exists and has label equal to
  • A if ∃ i^(j) = |A(P)|
  • A − \bigcup_{j} \{ s^{(j)}_{i^{(j)}+1} \} if ∀ i^(j) < |A(P)|

Fig. 5. Formal definition of G_≤(P)

Fig. 6. G_≤(P) for partial order P = {a → b, c}, where A is a finite alphabet of size |A| > 3 and \overline{a_j} = A \ {a_j} denotes the set complement of element a_j ∈ A

3.2 Algorithm for Computing the Probability of a Poset

The main idea for computing P^∃(P|w) efficiently is to build G_≤(P) on-the-fly in a depth-first search manner, so that at any time at most |A(P)| nodes are unfolded, corresponding to one member of E(P). In order to build G_≤(P) on-the-fly we take inspiration from an algorithm for enumerating linear extensions that can be summarized as follows [16]. Given a poset P the algorithm proceeds as follows: (I) select the subset N of nodes that have no predecessors in G(P) and erase all edges (relations) involving them; (II) output the set of all permutations of the nodes in N; (III) recursively consider the set N' of currently existing nodes that have no predecessors; and (IV) while backtracking, combine the generated sets of nodes using the Cartesian product operation to obtain the final set of linear extensions. Let X be the set of all distinct simple paths (i.e., without self-loops) from the start-vertex to any end-vertex in G_≤(P). We build G_≤(P) in lexicographic order of the paths from the start-vertex to all end-vertices, i.e., we enumerate the members of E(P) in lexicographic order. Thus, the computation of P^∃(P|w) corresponds to a depth-first search traversal of the prefix tree of linear extensions as follows:
1. Initialize P^∃(P|w) = 0.
2. For every generated path ∈ X in G_≤(P), traversed in a DFS manner, such that the path corresponds to a member of E(P), compute the probability P^∃(path|w) from (2), where in place of 1 − P(e_k) we use the probability of the complement of the corresponding self-loop labels of the vertices on the path.
3. Set P^∃(P|w) = P^∃(P|w) + P^∃(path|w).
The time complexity of the algorithm is O(|E(P)|w^2) and the space complexity is O(|A(P)|).
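For the independence model there is also an equivalent formulation that avoids enumerating linear extensions explicitly: a forward pass over the down-sets (order ideals) of P. The key observation is that greedily absorbing any item whose predecessors have all been seen never hurts (a larger down-set dominates a smaller one for the remaining suffix), so the down-set automaton decides the occurrence of at least one linear extension. The sketch below is ours, not the paper's prefix-tree traversal; it can be preferable when |E(P)| is large but the number of order ideals is small.

    from itertools import combinations

    def p_exists_poset(preds, theta, w):
        """P_exists(P|w): probability that at least one linear extension of
        the poset occurs as a subsequence of a random i.i.d. sequence of
        length w. preds: item -> set of (immediate) predecessors; theta:
        item -> symbol probability under the 0-order model (layouts ours)."""
        items = list(preds)
        # enumerate down-sets: subsets closed under predecessors
        ideals = [frozenset(c) for r in range(len(items) + 1)
                  for c in combinations(items, r)
                  if all(preds[x] <= set(c) for x in c)]
        full = frozenset(items)
        prob = {d: 0.0 for d in ideals}
        prob[frozenset()] = 1.0
        for _ in range(w):
            nxt = {d: 0.0 for d in ideals}
            for d, p in prob.items():
                if d == full:
                    nxt[d] += p        # accepting state is absorbing
                    continue
                addable = [x for x in items if x not in d and preds[x] <= d]
                nxt[d] += p * (1.0 - sum(theta[x] for x in addable))
                for x in addable:
                    nxt[d | {x}] += p * theta[x]
            prob = nxt
        return prob[full]

For the poset of Figure 1, preds = {'a': set(), 'b': {'a'}, 'c': {'a', 'd'}, 'd': set()}.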

3.3 Pruning Non-significant and Redundant Patterns

We prune under-represented and non-significant partial orders using significance pruning, removing a partial order s if:
1. sigRank(s) < 0, or
2. pvalue(s) > significanceThreshold,
where sigRank(s) < 0 holds for under-represented partial orders and pvalue(s) = P(sigRank > sigRank(s)); we set significanceThreshold = 0.05.
We prune redundant partial orders (partial orders having the same semantic information and similar significance rank) as follows:
1. Bottom-up pruning: remove a partial order s if there exists another partial order s' such that s ⊑ s' and sigRank(s') > sigRank(s), meaning s is contained in s' and has a lower rank than s', so it is redundant.
2. Top-down pruning: remove a partial order s if there exists another partial order s' such that s ⊑ s', sigRank(s) > sigRank(s') and

    \frac{sigRank(s) - sigRank(s')}{sigRank(s')} < pruningThreshold,    (6)

where we set pruningThreshold = 0.05, meaning s is contained in s' and the rank of s is not significantly greater than the rank of s', so s is redundant [5].
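Both redundancy rules compare each pattern against its super partial orders; a direct sketch (the containment predicate contains(s, t) ⇔ s ⊑ t is assumed given, e.g., computed from transitive closures):

    def prune_redundant(ranks, contains, pruning_threshold=0.05):
        """Bottom-up and top-down pruning of this section; ranks maps each
        (significant) pattern to its sigRank, contains(s, t) tests s ⊑ t."""
        keep = set(ranks)
        for s in list(keep):
            for t in ranks:
                if s == t or not contains(s, t):
                    continue
                if ranks[t] > ranks[s]:
                    keep.discard(s)    # bottom-up: a higher-ranked super pattern exists
                elif ranks[s] > ranks[t] and \
                        (ranks[s] - ranks[t]) / ranks[t] < pruning_threshold:
                    keep.discard(s)    # top-down: the rank gain over t is negligible
        return keep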

3.4 Algorithm for Mining Actionable Partial Orders

Since in practice sequences may contain duplicates, in order to provide proper input to the algorithm for discovering closed partial orders we map the original item alphabet A to a unique alphabet A', where repeated occurrences of the same symbol within a sequence are mapped to distinct symbols [7],[8].
Given an input collection of sequences S = {s^(1), s^(2), ..., s^(n)}, where s^(i) = [s^(i)_1, s^(i)_2, ..., s^(i)_{n^(i)}], the ranking algorithm proceeds as follows:
1. Map the collection to the unique alphabet A'.
2. Given minRelSup, obtain the set F of frequent closed partial orders.
3. Compute α = [α_1, α_2, ..., α_M], where α_w = N_n(|s^(i)| = w)/n and N_n(|s^(i)| = w) is the number of sequences of size w in S.
4. Compute θ = [θ_1, θ_2, ..., θ_|A|], where θ_j = N_n(a_j)/n_S, n_S is the number of items in S and N_n(a_j) is the number of occurrences of item a_j in S.
5. For every frequent closed partial order P ∈ F do the following:
   (a) compute P^∃(P|w) from (4) using the algorithm from Section 3.2;
   (b) compute P^∃(P) from (1) and compute the significance rank as

       sigRank(P) = \frac{\sqrt{n}\,(rsup_S(P) - P^{\exists}(P))}{\sqrt{P^{\exists}(P)\,(1 - P^{\exists}(P))}}.

6. Perform the reverse mapping of the discovered patterns from A' to A.
7. Perform pruning:
   (a) remove partial orders containing cycles and redundant edges (keeping only transitive reductions);
   (b) remove non-significant and redundant partial orders using the top-down and bottom-up pruning.
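Steps 3 and 4 are plain frequency counts; a sketch over a list of Python sequences (the dict-based representation is ours):

    from collections import Counter

    def estimate_reference_model(S):
        """Steps 3-4: size distribution alpha and item distribution theta of
        the independence reference model, estimated from the collection S."""
        n = len(S)
        n_items = sum(len(s) for s in S)
        alpha = {w: c / n for w, c in Counter(len(s) for s in S).items()}
        theta = {a: c / n_items
                 for a, c in Counter(x for s in S for x in s).items()}
        return alpha, theta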

4 Experiments

We conducted the experiments on a collection of Web server access logs of a website of a multinational technology and consulting firm. The access logs consist of two fundamental data objects:
1. pageview: (the most basic level of data abstraction) an aggregate representation of a collection of Web objects (frames, graphics, and scripts) contributing to the display on a user's browser resulting from a single user action (such as a mouse-click). Viewable pages generally include HTML, text/pdf files and scripts, while image, sound and video files do not.
2. visit (session): an interaction by an individual with a web site consisting of one or more requests for a page (pageviews). If an individual has not taken another action, typically additional pageviews on the site, within a specified time period, the visit terminates. In our data set the visit timeout value is 30 minutes, and a visit may not exceed a length of 24 hours. Thus, a visit is a sequence of pageviews (a click-stream) by a single user.
The purpose of the experiments was to show that the discovered partial orders of visited pages can be actionable in optimizing the effectiveness of web-based marketing. The data contains access logs to the website of the division of the company that deals with Business Services (BS) and spans the period from 1-01-2010 to 1-08-2010. The main purpose of web-based marketing for that division is to advertise business services to potential customers arriving at the web site from a search engine by using relevant keywords in their query strings. Therefore, we considered the following subset of all visits: visits referred from a search engine (Google, Yahoo, Bing, etc.), having a valid query string and at least four distinct pages among their pageviews. As a result we obtained a collection of sequences

of size 13000, where every sequence corresponds to the series of identifiers of pages viewed during a visit. The alphabet A contains 1501 webpages and the average visit length is 8.7 views, which implies that we should expect to discover partial orders of rather small cardinality. To our surprise, not only the first views in visits but also subsequent views are often referred from a search engine, including the internal search engine of the company. This behavior can be explained by the fact that many users who enter the BS site, when the first page does not satisfy their information needs, get impatient and resort to using either external search engines or the internal search engine. As a result, they land on a different page of the BS site. Such behavior means that consecutive page views in a single visit are often not directly connected by HTML links. This phenomenon suggests that the independence model is a reasonable reference model for the collection of visits to the BS site.
We set minRelSup = 0.01 and obtained 2399 frequent closed partial orders using an implemented variant of algorithm Frecpo [9]; after ranking and pruning only 185 of them were left. The discovered partial orders were subsequently shown to a domain expert, who selected the following partial orders. As a simple baseline we used the method for ranking sequential patterns with respect to significance [6].
Figure 7 presents the most significant partial order (rank 1), which is a total order. The pattern consists of the following pages: Business Analytics (BA) main page, BA people, BA work and BA ideas. As expected, this pattern is also the most significant sequential pattern for the input collection of sequences using the ranking method from [6]. The most interesting fact about this pattern is that users looking for BA solutions, after arriving at the BA main page, choose BA people as the second page. Given that knowledge, the company should enhance the content of the BA people page in order to convince users that competent and experienced people stand behind the BA solutions. Furthermore, in order to increase the search engine traffic to the BA products, the BA people should have personal web pages with links pointing to the BA product pages.
Figure 8 presents the partial order at rank 6, which is the first one in decreasing order of rank that contains the BS main page. This fact clearly confirms that there is a group of users who, after arriving at the BS/BA main page from a search engine, look for the people (executive team) who stand behind the BS/BA services. This partial order thus reinforces the need for appropriate personal/company pages for the BS/BA people.
Figure 9 presents the partial order at rank 7, which may correspond to users who, after arriving at the BS/BA main page, look for the BS/BA people but, because of an under-exposed link to the BA people page, tend to perform a random walk including the BA work and BA ideas pages until they arrive at the BA people page. This pattern suggests that by emphasizing the link to the BA people page on the BA/BS main pages, users would get to the BA people page directly.
Fig. 7. Partial order at rank 1 (ba_main → ba_people → ba_work → ba_ideas), where sigRank = 2.30e+02 and the relative frequency Ω^n = 3.09e−02
Fig. 8. Partial order at rank 6 (over pages bs_main, ba_main, ba_people and ba_work), where sigRank = 4.74e+01 and the relative frequency Ω^n = 1.56e−02

Fig. 9. Partial order at rank 7 (over pages bs_main, ba_main, ba_people, ba_work and ba_ideas), where sigRank = 4.51e+01 and the relative frequency Ω^n = 1.02e−02

Fig. 10. Partial order at rank 12 (over pages bs_main, financial_management, human_capital_management and sales_marketing_services), where sigRank = 3.52e+01 and the relative frequency Ω^n = 1.03e−02

Figure 10 presents the partial order at rank 12, which is the first one in decreasing order of rank to contain the financial management and the human capital management pages. This partial order suggests that those two pages should be accessible directly from the BS main page in order to increase their visibility.
Figure 11 presents the partial order at rank 24, which is the first one in decreasing order of rank that contains the strategy planning page. As it turns out, it significantly co-occurs with the financial management, the human capital management and the sales marketing services pages. This pattern suggests that the three pages co-occurring with the strategy planning page should contain well-exposed links to the strategy planning page to increase its visibility.
Figure 12 presents a comparison of P^∃(P) against the baselines P^∃(s_serial) and P^∃(s_parallel), where s_serial is a serial pattern and s_parallel is the parallel pattern over the symbols of P; a proper value of P^∃(s_partial) = P^∃(P) should satisfy (3). In order to compute P^∃(s_serial) and P^∃(s_parallel) for the discovered partial orders, we "serialized"/"parallelized" them appropriately.

Fig. 11. Partial order at rank 24 (over pages bs_main, strategy_planning, financial_management, human_capital_management and sales_marketing_services), where sigRank = 2.73e+01 and the relative frequency Ω^n = 1.02e−02

Partial order   P^∃(s_serial)   P^∃(s_partial)   P^∃(s_parallel)
Figure 7        2.2e-04         2.2e-04          1.9e-03
Figure 8        8.2e-04         1.2e-03          5.7e-03
Figure 9        1.2e-04         5.7e-04          1.7e-03
Figure 10       5.3e-04         8.9e-04          4.0e-03
Figure 11       1.9e-04         1.3e-03          2.6e-03

Fig. 12. P^∃(s_partial) = P^∃(P) for the discovered partial orders in comparison to their serialized/parallelized versions P^∃(s_serial)/P^∃(s_parallel); as expected for a valid value of P^∃(s_partial), P^∃(s_serial) ≤ P^∃(s_partial) ≤ P^∃(s_parallel) is satisfied in all cases

Thus, for the total order from Figure 7, P^∃(s_serial) = P^∃(s_partial), and for the remaining partial orders, depending on their structure, their probability is either closer to P^∃(s_serial) or to P^∃(s_parallel). For example, for the partial orders from Figures 7 and 10, P^∃(s_partial) is closer to P^∃(s_serial), while for the partial order from Figure 11, P^∃(s_partial) is closer to P^∃(s_parallel).
Figure 13 compares the number of frequent closed partial orders, the number of final partial orders obtained from our algorithm and the effectiveness of the pruning methods as a function of the minimum support threshold.
The results can be summarized as follows: (I) the number of final partial orders is equal to 7.8% of the frequent closed partial orders on average; (II) the significance pruning prunes 63.7% of the frequent closed partial orders on average; (III) the bottom-up pruning prunes 77.4% of the partial orders left after the significance pruning; and (IV) the top-down pruning prunes 4% of the partial orders left after the bottom-up pruning. Furthermore, the variance of the pruning results is rather small, suggesting that the pruning methods exhibit similar behavior across different values of minRelSup.
minRelSup     Frequent closed   Final partial   Pruning effectiveness ([%] pruned)
              partial orders    orders          Significance   Bottom-up   Top-down
0.02          471               44              53.9           78.3        6.4
0.015         821               85              59.8           73.3        3.4
0.014         983               93              61.3           74.2        5.1
0.013         1163              106             62.7           74.0        6.2
0.012         1450              124             65.0           74.4        4.6
0.011         1819              150             66.2           74.3        5.1
0.01          2399              185             67.6           75.3        3.6
0.009         3124              232             68.2           76.0        2.5
0.008         4566              300             68.7           78.4        2.9
0.007         7104              439             67.6           80.3        2.9
0.006         12217             683             65.4           83.4        3.0
0.005         26723             1439            58.0           86.9        2.4
Average [%]                     7.8             63.7           77.4        4.0

Fig. 13. Number of frequent closed partial orders, number of final partial orders obtained from our algorithm and effectiveness of the pruning methods as a function of minRelSup. The pruning methods are applied in the following order: significance pruning, bottom-up pruning and top-down pruning.

5 Conclusions

We presented a method for ranking partial orders with respect to significance and an algorithm for mining actionable partial orders. In experiments conducted on a collection of visits to the website of a multinational technology and consulting firm, we showed the applicability of our framework to discover partially ordered sets of frequently visited webpages that can be actionable in optimizing the effectiveness of web-based marketing.

A Game Theoretic Framework for Data Privacy Preservation in Recommender Systems

Maria Halkidi¹ and Iordanis Koutsopoulos²

¹ Dept. of Digital Systems, University of Piraeus (mhalk@unipi.gr)
² Dept. of Computer and Communication Engineering, University of Thessaly and CERTH (jordan@uth.gr)

Abstract. We address the fundamental tradeoff between privacy preservation and high-quality recommendation stemming from a third party. Multiple users submit their ratings to a third party about items they have viewed. The third party aggregates the ratings and generates personalized recommendations for each user. The quality of recommendations for each user depends on the submitted rating profiles of all users, including the user to whom the recommendation is destined. Each user would like to declare a rating profile so as to preserve data privacy as much as possible, while not causing deterioration in the quality of the recommendation he would get, compared to the one he would get if he revealed his true private profile.

We employ game theory to model and study the interaction of users, and we derive conditions and expressions for the Nash Equilibrium Point (NEP). This consists of the rating strategy of each user, such that no user can benefit in terms of improving his privacy by unilaterally deviating from that point. User strategies converge to the NEP after an iterative best-response strategy update. For a hybrid recommendation system, we find that the NEP strategy for each user in terms of privacy preservation is to declare a false rating for only one item, the one that is highly ranked in his private profile and least correlated with the items for which he anticipates a recommendation. We also present various modes of cooperation by which users can mutually benefit.

Keywords: privacy preservation, recommendation systems, game theory.

1 Introduction
The need for expert recommender systems becomes ever increasing in our days,
due to the massive amount of information and abundance of choices available
for virtually any human action involving decision making, which is connected
explicitly or implicitly to the Internet. Recommender systems arise with various
contexts, from providing personalized search results and targeted advertising,
to making social network related suggestions, up to providing personalized sug-
gestions on various goods and services. Internet users become more and more


dependent on efficient recommender systems in order to expedite purchase of


goods, selection of movies, places to dine and spend their vacation, and even to
decide whom to socialize with and date. In general, users rely on recommenda-
tion systems so as to obtain quick and accurate personalized expert advice and
suggestions, which will aid them in decision making. While the final decision
at the user end depends on various psychological and subjective factors, it is
beyond doubt that recommendation systems will enjoy accelerating penetration
to users.
The efficiency of a recommender system amounts to high-quality personal-
ized recommendations it generates for different users. Recommendation systems
are fundamentally user-participatory. The quality of recommendations for an
individual user relies on past experience and participation of other users in the
rating process. Furthermore, even if not immediately apparent, each individual
user can to a certain extent affect the quality of recommendations for himself by
his own ratings about what he has experienced.
To see this, consider the following simple example. Suppose there exist two
users, 1 and 2. User 1 has viewed and rated item A, while user 2 has viewed
and rated item B. Suppose that the recommendation system recommends item
B to user 1 if a certain metric exceeds a threshold, otherwise it does not. This
metric will depend on (i) how high the rating of user 2 for item B was, (ii) how similar item B is to A, and (iii) how high the rating of user 1 for item A was. Clearly, whether or not item B will be recommended to user 1 depends on the ratings of both users for the items they have viewed.
Since recommendation systems involve data exchange between the users and
the third party that performs recommendations, privacy concerns of users are
inescapable. Users prefer to preserve their privacy by not revealing much infor-
mation to the third party about their private personal preferences and ratings.
On the other hand, users would like to receive high-quality recommendation re-
sults. Namely, the recommendation they would get as a result of not declaring
their true private ratings should be as close as possible to the one they would get
if they revealed their private data. In this paper, we attempt to understand this
fundamental tradeoff between privacy preservation and good recommendation
quality. Towards this end, we explicitly capture the mode of user interaction
towards shaping the tradeoff above. Specifically we pose and attempt to answer
the following questions:
– How can we quantify privacy preservation and recommendation quality?
– What is the resulting degree of privacy preservation of users if each user
determines his strategy in terms of revealing his private ratings in a selfish
way that takes into account only his personal objective?
– How can we characterize the stable operating points in such a system in
terms of rating profile revelation?
– Can users coordinate and jointly determine their rating revelation strategy
so as to have mutual benefit?

1.1 Related Work

Recommender systems automate the generation of recommendations based on


data analysis techniques [9]. Recommendations for movies on Netflix or books
on Amazon are some real-world examples of recommender systems. The ap-
proaches that have been proposed in the literature can be classified as follows:
(i) Collaborative filtering, (ii) Content-based ones, and (iii) hybrid approaches.
In collaborative filtering (CF) systems, a user is recommended items based on
past ratings of other users. Specifically, neighborhood-based CF approaches as-
sume that users with correlated interests will most likely like similar items. The
Pearson’s correlation coefficient is the most widely used measure of similarity
between ratings of two users [14]. However there exist several other measures
that are used in the literature [17]. Based on a certain similarity measure, these
approaches select k users (referred to as a user’s neighbors) which have the
highest similarity with the user considered for recommendation. Then, a predic-
tion is computed by properly aggregating the ratings of selected neighbors. An
extension to neighborhood-based CF is the item-to-item collaborative filtering
approach [7], [15]. This approach matches a user’s rated items to similar items
rather than similar users.
On the other hand, content-based approaches provide recommendations by comparing the content of an item to the content of items of potential interest to a user. There are several approaches that treat the content-based recommendation
problem as an information retrieval task. Balabanovic et al. [1] consider that user
preferences can be treated as a query, and unrated objects are scored based on
their similarity to this query. An alternative approach is proposed in [11], which
treats recommendation as a classification problem. Hybrid approaches aim to lever-
age advantages of both content-based and collaborative filtering ones. Cotter
et al. [4] propose a simple approach that collects results of both content-based
and collaborative filtering approaches, and merges these results to produce the
final recommendation. Melville et al. [8] propose a framework that uses content-
based predictions to convert a sparse user ratings matrix into a full ratings matrix,
and subsequently it employs a CF method to provide recommendations.
Since recommender servers need to have access to user preferences in order to
predict other items that may be of interest to users, privacy of users is put at risk.
A number of different techniques have been proposed to address privacy issues
in recommender systems. Polat and Du [13] propose a randomized perturbation
technique to protect user privacy in CF approaches. Although randomized per-
turbation techniques modify the original data to prevent the data collector from
learning user profiles, the proposed scheme turns out to provide recommenda-
tions with decent accuracy. Another category of works refers to approaches that
store user profiles locally and run the recommender system in a distributed fash-
ion. Miller et al. [10] propose the PocketLens algorithm for CF in a distributed
environment. Their approach requires only the transmission of similarity mea-
sures over network, and thus it protects user privacy by keeping their profiles
secret. The work [2] addresses the problem of protecting user privacy through
substituting the centralized CF system by a virtual peer-to-peer one. Also, user

profiles are partially modified by adding some degree of uncertainty. Although


these methods almost eliminate user privacy losses, they require a high degree of
cooperation among users so as to achieve accurate recommendations.
Lathia et al. [6] introduce a new measure to estimate the similarity between
two users without breaking user privacy. A randomly generated set of ratings is
shared between two users, and then users estimate the number of concordant,
discordant and tied pairs of ratings between their own profiles and the randomly
generated one. An alternative method for preserving privacy is presented in [3].
Users create communities, and each user seeks recommendations from the most
appropriate community. Each community computes a public aggregation of user
profiles without violating individual profile privacy, based on distributed singular
value decomposition (SVD) of the user rating matrix. A distributed mechanism
that focuses on obfuscating user-item connection is proposed in [16]. Each user
arbitrarily selects to contact other users over time and modifies his local pro-
file off-line through an aggregation process. Users periodically synchronize their
profiles at the server (online) with their local ones.
In our work, we develop a game theoretic framework for addressing the pri-
vacy preserving challenge in recommender systems. Game theory has recently
emerged as a mathematical tool for modeling the interaction of multiple self-
ish rational agents with conflicting interests, and for predicting stable system
points (equilibrium points) from which no agent can obtain additional benefit
by unilaterally moving away from them. While game theory has been extensively
used in various contexts [12], very few works have used game theory for privacy
related issues. The work [5] proposes a formulation of the privacy preserving
data mining (PPDM) problem as a multi-party game. It relaxes many of the
assumptions made by existing PPDM approaches, thus aiming to develop new
robust algorithms for preserving privacy in data mining.

1.2 Our Contribution

In this work we address the fundamental tradeoff between privacy preservation


and high-quality recommendation. We assume that multiple users submit their
ratings to a third party about the items they have viewed. The third party
aggregates these ratings and generates personalized recommendations for each
user. The quality of recommendations for each user depends on submitted rat-
ing profiles from all users, including the user to which the recommendation is
destined. Each user would like to declare a rating profile so as to preserve data
privacy as much as possible, while not causing deterioration in the quality of the
recommendation he would get, compared to the one he would get if he revealed
his true private profile.
The contributions of our work to the literature are as follows: (i) We develop
a mathematical framework for quantifying the goal of privacy preservation and
that of good quality recommendations, and we define a user’s strategy in terms
of deciding about his declared rating profile to the third party; (ii) we employ
game theory to model and study the interaction of multiple users and we derive
conditions and expressions for the Nash Equilibrium Point (NEP). This consists

of the rating strategy of each user, such that no user can benefit in terms of
improving its privacy by unilaterally deviating from that strategy. User strategies
converge to the NEP after an iterative best-response strategy update; (iii) for
a hybrid recommendation system, we find that the NEP strategy for each user
in terms of privacy preservation is to declare false rating only for one item, the
one that is highly ranked in his private profile and less correlated with items
for which he anticipates a recommendation; (iv) we present various modes of
cooperation by which users can mutually benefit. To the best of our knowledge,
this is the first work that applies the framework of game theory to address the
arising user interaction in privacy preserving recommendation systems.
The rest of the paper is organized as follows. In Section 2 we present the model and assumptions for our approach. Section 3 elaborates on the case of a hybrid recommendation system. In Section 4 we obtain valuable insights into conflict and cooperation in user interaction by analyzing the case of two users. Section 5 includes numerical results and Section 6 concludes our study.

2 Model and Problem Definition


2.1 Ratings and Recommendation
Consider a set U of N users and a set I of items available for recommendation. Each user i has already viewed, purchased, or in general obtained experience for a small subset of items S_i ⊂ I; usually |S_i| ≪ |I|, where
|A| denotes the cardinality of set A. Denote by pi = (pik : k ∈ Si ) the vector
of ratings of user i for the items it has viewed, where pik is the rating of user
i for item k ∈ Si . Without loss of generality, we assume that pik takes positive
values in a continuous set which is upper bounded, i.e. it is 0 ≤ pik ≤ P . The
vector of ratings pi is private information for each user i, and we refer to that as
the private profile or private ratings vector of user i. Clearly, the private profile
consists of the identities of viewed items and their ratings.
After viewing or experiencing items k ∈ Si , user i has to submit a rating to
a third party, which will be a recommendation server. Let qi = (qik : k ∈ Si ) be
the vector of declared ratings from user i to the server. This can in general be
different from pi . We refer to qi as the declared profile or the declared ratings
vector of user i. The declared profile consists of the identities of viewed items and
their declared rating. In this work, we assume that the user will always declare
all items it has viewed. Hence, qi will include only items k ∈ Si and only these,
and the user may only alter the ratings for these items.
The recommendation server is the repository of all ratings submitted by all
users. It collects declared user profiles and is responsible for issuing the different,
personalized recommendations to different users. Let P = (pi : i ∈ U) be the
ensemble of private ratings of users. Let Q = (qi : i ∈ U) be the ensemble of
declared ratings of all users to the server. When the server receives a recommen-
dation request from a user i, it takes into account the ensemble of ratings Q to
compute a recommendation vector with ratings for items that user i has not yet
viewed. Let r_i = (r_iℓ : ℓ ∉ S_i) be the recommendation vector for user i.

We will assume that the recommendation server employs a generic mapping


fi (·) to compute the recommendation vector for each user i. Hence, we denote
the dependence of the recommendation for user i on declared ratings of all users
as ri = fi (Q) = fi (q1 , . . . , qN ). Here, we have implicitly assumed that rat-
ings of all users in the system are taken into account. However, in general the
recommendation server may take into account just a subset of users and their
ratings in order to compute the recommendation for a user i. In this work, we
are not concerned with designing a recommendation mapping; we will assume
that a given mapping is employed by the server, and this mapping is known
to all users. Note that ri depends on rating vector qi that user i has provided
about the items he has viewed. A hint about that dependence was provided in
the introduction, and it will be revisited in the sequel.
Next, the recommendation vector is fed back to user i in some way that is
intrinsic in the specific recommendation system. In general, a part of vector ri
is returned to user i. For instance, the server may return just one item, the one
with the highest rating out of those in the set {ℓ : ℓ ∉ S_i}, or in general it may
return the L highest rated items from the set above. In this work, without loss
of generality, we will assume that the entire vector of ratings ri is returned to
user i, possibly reordered, such that the highest rated components appear first.

2.2 Privacy Metric


For each user i we define a metric that quantifies the degree to which privacy is
preserved for user i. Intuitively, the degree of privacy preservation depends on the
private profile and the declared profile of user i. We denote this dependence by a
continuous function gi (pi , qi ). In general, different users may value their privacy
differently, hence functions gi (·) in general are different for different users. Here,
without loss of generality, we assume that all users are characterized by the same
privacy preservation function g(·). Thus, the privacy preservation for user i is
quantified as g(pi , qi ).

2.3 Recommendation Quality


The users would like to get good quality recommendations for items that have not
been viewed yet. The recommendation that each specific user receives depends
on declared profiles of other users to the server, but also on the declared profile
of this specific user. Even if a user declares his true private profile, the ratings
he would get would still depend on the declared profiles of other users. Thus the
user may still receive suboptimal ratings, while at the same time compromising
its privacy. The problem for each user i is to specify its declared profile so
as to maximize the degree of preserved privacy, while at the same time not
affecting the quality of the recommendation much. The latter means the user wants to calibrate his declared profile so as to receive recommendations close to the ones he would receive if he had declared his true private profile, regardless of the declaration policy of the other users.

Let us denote by q−i = (q1 , . . . , qi−1 , qi+1 , . . . , qN ) the declared rating vector
of all users except user i. Thus, ri = fi (qi , q−i ). Now, let r̃i = fi (pi , q−i ) be
the resulting recommendation vector if user i declared its true profile. Then, the
goal above is quantified by the following constraint for user i:
$$(r_i - \tilde{r}_i)^2 \le D \;\Longleftrightarrow\; \left[ f_i(q_i, q_{-i}) - f_i(p_i, q_{-i}) \right]^2 \le D, \qquad (1)$$
where D is an upper bound that denotes the maximum distortion that can
be tolerated in the recommendation by user i. We assume that all users are
characterized by the same such maximum tolerable distortion amount D.

2.4 Problem Formulation


Intuitively, the user would like to submit a rating profile that is sufficiently far away from his real private profile, so as to preserve as much privacy as possible by hiding the private profile. On the other hand, he would like to make the declaration such that the recommendation to him is not affected too much; in that sense he would like to keep the recommendation vector within distance at most D of the one he would get if he declared the true private profile. The challenge arises because the constraint (1) above
includes the strategies q−i of other users. The objective above can be formulated
from the point of view of each user i as follows:
$$\max_{q_i} \; g(p_i, q_i) \qquad (2)$$

subject to:

$$\left[ f_i(q_i, q_{-i}) - f_i(p_i, q_{-i}) \right]^2 \le D. \qquad (3)$$
In other words, user i has to select its declared profile vector out of a set of
feasible profile vectors which satisfy (1). Nevertheless, this set of feasible vectors
is determined by declared profiles q−i of other users. Denote by F (q−i ) this
feasible set of vectors.
Notice that the problem stated above involves only the point of view of user i, who behaves in a selfish but rational manner. That is, he cares only about his own maximum privacy preservation, subject to keeping the quality of the
recommendation good enough, and he does not take into account the objectives
of other users. In his effort to optimally address and resolve this tradeoff, and in
particular to ensure that the recommendation vector will be close enough to the
one he would get under full privacy compromise, other users’ strategies matter.
These other users also act in the same rational way since they strive to fulfill
their own privacy preservation objectives, while trying to maintain good quality
recommendation for themselves.

Definition of NEP: A strategy profile Q* = (q*_1, ..., q*_N) is called a Nash Equilibrium Point (NEP) for the privacy preservation problem above if for each user i = 1, ..., N the following property holds:

$$g(p_i, q_i^*) \ge g(p_i, q_i) \qquad \forall\, q_i \in F(q_{-i}^*),\ q_i \ne q_i^*. \qquad (4)$$

The NEP denotes the point that comprises strategies of all users, from which
no user will benefit if it deviates from his strategy unilaterally. In our problem,
at the NEP (q*_1, ..., q*_N), no user i can further increase his privacy preservation metric g(·) by altering his declared profile to some q_i ≠ q*_i, provided that all other users stay with their NEP declared profiles.

Cooperative user Strategies: Agents may coordinate among themselves in


an effort to mutually benefit from a cooperative approach. A global objective G(P, Q) needs to be defined for the system first. For instance, $G(P, Q) = \sum_{i \in \mathcal{U}} g(p_i, q_i)$ denotes the total amount of preserved privacy in the system. Or, $G(P, Q) = \min_{i \in \mathcal{U}} g(p_i, q_i)$, denoting the user with the least-preserved privacy. In a coordinated approach, users act jointly so as to optimize the global objective. A feasible cooperation regime for the N users in U is a joint profile declaration strategy Q^0 = (q^0_1, ..., q^0_N) such that:

$$q_i^0 \in F(q_{-i}^0), \quad\text{and}\quad g(p_i, q_i^0) \ge g(p_i, q_i^*), \quad \forall\, i \in \mathcal{U}, \qquad (5)$$

where Q* is the NEP. Namely, a cooperation regime is feasible if (i) it belongs to the set of feasible vectors as specified by constraint (1) for all users, and (ii) each
user preserves at least as much privacy as he receives at the NEP. This latter requirement renders cooperation meaningful for the user and provides him the incentive to participate in the coordinated effort.
A first goal of cooperation is to jointly find the set of feasible cooperation
regimes, call it F_c. If F_c ≠ ∅, there exists at least one joint strategy Q^0 such
that all users are privacy-wise better off compared to the NEP, and this strategy
can be found by solving the set of inequalities in (5). Out of the set of feasible
cooperation regimes, a further goal could be to select one that maximizes the
global privacy objective G(P, Q) or one that guarantees certain properties of
the privacy preservation vector (g(p1 , q1 ), . . . , g(pN , qN )).

3 The Case of a Hybrid Recommendation System


We consider a specific instance of a recommendation system as a case study to demonstrate our game theoretic model and analysis, and to derive various important insights.

3.1 Model Specifics


First, we present the specifics of our model in terms of the recommendation
metric computed by the server, the specific privacy preservation and recommen-
dation quality metrics.

Recommendation Metric: In this subsection, we discuss the model we adopt


for functions fi (·) that signify the recommendation metrics that are computed
for each user i. Each user declares its profile qi for items k ∈ Si . For each user i,

the recommendation server applies the following measure to compute the metric r_iℓ for items ℓ ∉ S_i with ℓ ∈ S_j for some j ≠ i, so as to rate them and include them in the recommendation vector that is sent to each user i:

$$r_{i\ell} = \frac{1}{N-1} \sum_{j \ne i:\, \ell \in S_j} q_{j\ell} \;\cdot\; \frac{1}{|S_i|} \sum_{k \in S_i} \rho_{k\ell}\, q_{ik}, \qquad (6)$$

where ρ_kℓ ∈ [0, 1] is the correlation between items k and ℓ. The server computes the metric above for all ℓ ∉ S_i and forms the vector r_i. We will assume that the
|I| × |I| correlation matrix that contains the pairwise correlations between any
two items in the system is computed a priori, it is fixed, it is preloaded to the
server and is known by user agents. For example, if the items are movies, the
correlation between two movies could be directly related to the common theme
of the movie, common starring actors, the director or other attributes.
The recommendation metric above pertaining to user i can be viewed as an
instance of a hybrid recommendation. Indeed, the first term above implies a
collaborative filtering approach, in which, for each item  under tentative rec-
ommendation to user i, the ratings of all other users are aggregated. On the
other hand, the second term can be viewed as representative of a content-based
recommendation approach, since it involves a correlation metric that connects
item  (candidate for recommendation) with other items that user i has viewed.
Here, the aggregation function in the first part is taken to be simply the mean rating of all other users j ≠ i who have already viewed the item. Clearly, various modes of aggregating the ratings of other users can be employed. For example, different weights may be applied in the aggregation. Or, only the ratings from a subset of users are taken into account, e.g., K users who have viewed common items with user i, where K is a parameter of the recommendation server. These K users are denoted by the set U_i. In this case the first term would be equal to:

$$\frac{1}{K} \sum_{j \in U_i:\, |U_i| = K,\ S_i \cap S_j \ne \emptyset} q_{j\ell}.$$

In our model, we adopt (6) as the recommendation metric in order to have


analytical tractability and expose our approach. We note that a similar treatment
and game theoretic results hold for other types of functions fi (·).
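To make (6) concrete, here is a minimal sketch (ours, not from the paper) of how the server-side computation could look; the dense N × |I| rating array Q with np.nan marking unrated items, and all function and variable names, are our own assumptions:

```python
import numpy as np

def recommendation_vector(i, Q, rho):
    """Evaluate Eq. (6) for user i: Q holds declared ratings (np.nan = unrated),
    rho is the precomputed |I| x |I| item-correlation matrix."""
    N, num_items = Q.shape
    viewed = ~np.isnan(Q[i])                      # S_i: items user i has rated
    r = np.full(num_items, np.nan)
    for l in np.flatnonzero(~viewed):             # candidate items l not in S_i
        others = np.delete(Q[:, l], i)            # declared ratings of users j != i
        if np.all(np.isnan(others)):
            continue                              # nobody else has viewed item l
        collab = np.nansum(others) / (N - 1)                 # collaborative part
        content = (rho[viewed, l] * Q[i, viewed]).mean()     # content-based part
        r[l] = collab * content
    return r
```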

Privacy Preservation: The function g(·) that quantifies privacy preservation


for user i is taken to be equal to:

$$g(p_i, q_i) = \sum_{k \in S_i} p_{ik}\,(p_{ik} - q_{ik})^2. \qquad (7)$$

The metric above reflects the intuitive fact that privacy preservation increases as the squared Euclidean distance $\sum_{k \in S_i} (p_{ik} - q_{ik})^2$ between the declared and the private profiles increases. This distance is weighted by the private rating p_ik so as to capture the fact that, among items whose private and declared ratings have the

same distance, it is preferable from a privacy preservation perspective to change


the rating of items that are higher rated in reality. Note also that other types
of metrics that include various measures of distance between vectors other than
the Euclidean one can be used.

Recommendation Quality: Since users modify their private ratings when


they declare them to the server in an effort to increase their privacy, they affect
the quality of the recommendation they get from the server. For user i, we
measure this effect in terms of the difference between the recommendation user
i gets if he declares profile qi and the one he would get if he declared the real
private ratings p_i, regardless of what other users do. Other users j ≠ i in general make declarations q_j. Thus, the constraint that needs to be fulfilled for acceptable recommendation quality for user i is derived by using (3) and (6):

$$\frac{1}{|S_i|} \sum_{k \in S_i} \sum_{\ell \notin S_i} \Big( \frac{1}{N-1} \sum_{j \ne i:\, \ell \in S_j} q_{j\ell} \Big)\, \rho_{k\ell}\, (q_{ik} - p_{ik})^2 \;\le\; D. \qquad (8)$$
∈Sj
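Continuing the same sketch (ours; identical assumed data layout), the privacy objective (7) and the left-hand side of (8) can be evaluated for any candidate declaration q_i:

```python
import numpy as np

def privacy(p_i, q_i, viewed):
    # Eq. (7): rating-weighted squared distance over the items in S_i
    d2 = (p_i[viewed] - q_i[viewed]) ** 2
    return float(np.sum(p_i[viewed] * d2))

def distortion(i, p_i, q_i, Q, rho):
    # Left-hand side of Eq. (8); Q holds the other users' current declarations
    N = Q.shape[0]
    viewed = ~np.isnan(Q[i])
    lhs = 0.0
    for l in np.flatnonzero(~viewed):             # items l not in S_i
        agg = np.nansum(np.delete(Q[:, l], i)) / (N - 1)
        lhs += agg * np.dot(rho[viewed, l], (q_i - p_i)[viewed] ** 2)
    return lhs / viewed.sum()                     # acceptable quality: <= D
```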

Information Exchange between Users and the Server: An iterative process of data exchange between users and the recommendation server takes place. We envision a software agent at the side of each user i which acts on behalf of the user. The agent is responsible for preserving the privacy of user i and for delivering good recommendation results to him. The agent continuously sends queries for recommendation to the server. The steps of data exchange are summarized as follows:

– STEP 0: An initial or default rating vector q_i^(init) is used by each user.
– For each user i = 1, ..., N:
– STEP 1: At each iteration cycle t, the server passes to each agent i the ratings from other users that refer to items that user i has not viewed yet. These ratings are based on the information that the agents of other users have sent to the server at the same iteration cycle. That is, the server passes to user i the quantity $\frac{1}{N-1} \sum_{j \ne i} q_{j\ell}^{(t)}$, which is an aggregated version of the ratings $\{q_{j\ell}^{(t)}\}$, for each item ℓ ∉ S_i and users j ≠ i.
– STEP 2: The agent of each user i gets to observe the ratings of other users, namely the first part of (8). It then solves the optimization problem (P):

$$\max_{q_i^{(t)}} \; g(p_i, q_i^{(t)}) = \sum_{k \in S_i} p_{ik}\, (p_{ik} - q_{ik}^{(t)})^2, \qquad (9)$$

subject to constraint (8), which includes $\{q_j^{(t-1)}\}_{j \ne i}$ from the previous iteration, and thus it computes its own declared rating vector q_i^(t) for the current iteration t.
– STEP 3: Each agent declares its rating q_i^(t) to the server.

Fig. 1. Overview of the system architecture and of the data exchange process at each iteration: each user i holds a private rating vector p_i, computes and sends a declared rating vector q_i to the recommendation server, which aggregates the ratings and returns the recommendation vectors r_i, i = 1, ..., N, to the users

– STEP 4: The server uses these ratings to compose the aggregated quantities for items that have not been viewed by the respective users. Go to STEP 1 and repeat until convergence.
The system is depicted in Figure 1. Agents in general update their rating vectors in subsequent iterations. At each iteration cycle t, each user i solves his own optimization problem based on the ratings $\{q_j^{(t-1)}\}_{j \ne i}$ of the other users, which were passed to him at the end of the previous iteration cycle. User i declares his ratings to the server. The server collects the rating vectors from all users and announces the relevant parts to the different users in order for the new iteration to start. The procedure continues in this manner at each iteration.

Convergence to the NEP: The procedure described above involves a Linear Programming problem that is solved by each user; the submitted ratings of the other users appear in the constraint (8). The iterative procedure above is an instance of an iterative best-response update by each user. It is known from fundamental game theory that the sequence of best-response updates for linear problems converges to the NEP, starting from any initial vector q_i^(init).

4 Game Theoretic Analysis


By setting $x_{ik} = (p_{ik} - q_{ik})^2$ and $x_i = (x_{ik} : k \in S_i)$, it can be seen that the problem (P) solved by each user i at each iteration is written as:

$$\max_{x_i} \; \sum_{k \in S_i} p_{ik}\, x_{ik}, \quad \text{subject to:} \quad \sum_{k \in S_i} \beta_{ik}\, x_{ik} \le D(N-1), \qquad (10)$$

with

$$\beta_{ik} = \frac{1}{|S_i|} \sum_{\ell \notin S_i} \sum_{j \ne i:\, \ell \in S_j} q_{j\ell}\, \rho_{k\ell}, \qquad (11)$$

and it is a Linear Programming problem. The solution to this problem is found among the extreme points of the feasible set. Each user finds the item

$$k^* = \arg\min_{k \in S_i} \frac{\beta_{ik}}{p_{ik}}, \qquad (12)$$

and sets

$$x_{ik^*} = D(N-1)\,|S_i|\, \frac{p_{ik^*}}{\beta_{ik^*}}. \qquad (13)$$

For all other items k ≠ k*, it is x_ik = 0. Moving back to the initial variables, we deduce that for item k* the rating declaration should be:

$$q_{ik^*} = p_{ik^*} \pm \sqrt{\frac{D(N-1)\,|S_i|\, p_{ik^*}}{\beta_{ik^*}}}, \qquad (14)$$

while q_ik = p_ik for the other items k ≠ k*. It can be observed that agent i maximizes its preserved privacy if it declares its true private profile for all viewed items except one, k*, for which the quantity

$$\frac{\beta_{ik}}{p_{ik}} = \frac{1}{|S_i|} \cdot \frac{\sum_{\ell \notin S_i} \rho_{k\ell} \sum_{j \ne i:\, \ell \in S_j} q_{j\ell}}{p_{ik}} \qquad (15)$$

is the smallest among the items it has viewed. The denominator implies that an item has more chances of being the selected item k* if it is highly rated in the user's private rating vector. The numerator implies that this item should have low correlation with the items the user has not viewed, weighted by the declared ratings of the other users for those items. This is clearly meaningful for privacy preservation. We note that the above result about privacy maximization emerges because we quantify privacy and recommendation quality with a Euclidean distance metric; a metric based on vector norms other than the Euclidean one would alter the nature of the solution.
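Putting (11)–(14) together gives a compact closed-form best response; a sketch (ours; same assumed array layout as in the earlier snippets, where the choice of the '+' branch of (14) and the clipping of the declared rating to [0, P] are our own additions, since the paper leaves the ± branch open):

```python
import numpy as np

def best_response(i, p_i, Q, rho, D, P_max=5.0):
    """Closed-form best response of user i per Eqs. (11)-(14)."""
    N = Q.shape[0]
    viewed = np.flatnonzero(~np.isnan(Q[i]))
    unviewed = np.flatnonzero(np.isnan(Q[i]))
    beta = np.array([sum(rho[k, l] * np.nansum(np.delete(Q[:, l], i))
                         for l in unviewed) / len(viewed)
                     for k in viewed])                       # Eq. (11)
    a = int(np.argmin(beta / p_i[viewed]))                   # Eq. (12): pick k*
    k_star = viewed[a]
    q_i = np.where(np.isnan(Q[i]), np.nan, p_i)              # truthful everywhere...
    shift = np.sqrt(D * (N - 1) * len(viewed) * p_i[k_star] / beta[a])
    q_i[k_star] = np.clip(p_i[k_star] + shift, 0.0, P_max)   # ...except k*, Eq. (14)
    return q_i
```

Sweeping this update over all users implements the iterative best response described in Section 3.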

4.1 Special Case: N=2 Users


In order to demonstrate properties of the equilibrium, consider the simplest nontrivial case of N = 2 users, each of whom has viewed two items. User 1 has viewed the items in I_1 = {A, B} and has private rating vector (p_1A, p_1B), while user 2 has viewed the items in I_2 = {B, C} with private rating vector (p_2B, p_2C). Let $x_{ik} = (p_{ik} - q_{ik})^2$ for i = 1, 2 and k ∈ I_i.

Game Theoretic Interaction: If users 1 and 2 act autonomously and without coordination, each user will attempt to maximize his own privacy. Thus, the problem faced by user 1 is:

$$\max_{x_{1A}, x_{1B}} \; p_{1A} x_{1A} + p_{1B} x_{1B}, \quad \text{subject to:} \quad \rho_{AC}\, x_{1A} + \rho_{BC}\, x_{1B} \le \frac{2D}{q_{2C}}, \qquad (16)$$

where the factor of 2 comes from the 1/|S_1| factor in the left-hand side of the inequality. Similarly, for user 2, the problem is:

$$\max_{x_{2B}, x_{2C}} \; p_{2B} x_{2B} + p_{2C} x_{2C}, \quad \text{subject to:} \quad \rho_{AB}\, x_{2B} + \rho_{AC}\, x_{2C} \le \frac{2D}{q_{1A}}. \qquad (17)$$

The NEP is the point $(x_1^*, x_2^*) = (x_{1A}^*, x_{1B}^*, x_{2B}^*, x_{2C}^*)$ that solves the two problems above. Depending on the private rating vectors p_1 = (p_1A, p_1B), p_2 = (p_2B, p_2C) and the correlations ρ_AC, ρ_BC, ρ_AB, we distinguish four cases:

$$\frac{\rho_{AC}}{p_{1A}} \lessgtr \frac{\rho_{BC}}{p_{1B}}, \quad\text{and}\quad \frac{\rho_{AB}}{p_{2B}} \lessgtr \frac{\rho_{AC}}{p_{2C}}. \qquad (18)$$
Observe that user 1 will declare the true private profile for the item for which the fraction above is larger, and a different rating for the item for which the fraction is smaller. Thus, he prefers to declare a different rating for the item that is higher ranked and less correlated with item C, the candidate for recommendation to him. For instance, if ρ_AC/p_1A < ρ_BC/p_1B, user 1 will maximize his privacy by setting x_1B = 0, thus declaring q_1B = p_1B, while x_1A = 2D/(ρ_AC q_2C).
For each of the four cases above, we have a respective NEP. For example, if

$$\frac{\rho_{AC}}{p_{1A}} < \frac{\rho_{BC}}{p_{1B}} \quad\text{and}\quad \frac{\rho_{AB}}{p_{2B}} < \frac{\rho_{AC}}{p_{2C}}, \qquad (19)$$

then the NEP is:

$$x_1 = \Big( \frac{2D}{\rho_{AC}\, p_{2C}},\; 0 \Big), \quad\text{and}\quad x_2 = \Big( \frac{2D}{\rho_{AB} \big( p_{1A} \pm \sqrt{\frac{2D}{\rho_{AC}\, p_{2C}}} \big)},\; 0 \Big), \qquad (20)$$

or, equivalently,

$$q_1^* = \Big( p_{1A} \pm \sqrt{\frac{2D}{\rho_{AC}\, p_{2C}}},\; p_{1B} \Big), \qquad q_2^* = \Big( p_{2B} \pm \sqrt{\frac{2D}{\rho_{AB} \big( p_{1A} \pm \sqrt{\frac{2D}{\rho_{AC}\, p_{2C}}} \big)}},\; p_{2C} \Big), \qquad (21)$$

and the privacy metrics are P_1 = p_1A x_1A, P_2 = p_2B x_2B.
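As a quick numeric illustration (our own example values, chosen so that case (19) holds; not from the paper), the fixed point of the two best responses can be checked directly:

```python
import math

# Hypothetical ratings/correlations: rho_AC/p1A < rho_BC/p1B, rho_AB/p2B < rho_AC/p2C
p1A, p1B, p2B, p2C = 4.0, 2.0, 5.0, 3.0
rho_AC, rho_BC, rho_AB = 0.2, 0.4, 0.3
D = 1.0

q2C = p2C                                        # user 2 stays truthful on item C
q1A = p1A + math.sqrt(2 * D / (rho_AC * q2C))    # Eq. (21), '+' branch
q2B = p2B + math.sqrt(2 * D / (rho_AB * q1A))

# At the NEP, each user's quality constraint, (16) resp. (17), is tight:
assert abs(rho_AC * (q1A - p1A) ** 2 - 2 * D / q2C) < 1e-9
assert abs(rho_AB * (q2B - p2B) ** 2 - 2 * D / q1A) < 1e-9
```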

5 Numerical Results

In this section, we evaluate the performance of our game theoretic approach in


terms of privacy preservation and recommendation quality. To assess the privacy
preservation in our system, we have adopted the privacy metric presented in (7).
We consider a recommender system consisting of N = 50 users and |I| = 20
items. The content correlation of items is calculated a priori and announced to
the agents that represent the users. The preferences of users for items are ran-
domly selected for the sake of performance evaluation, and we assume that user

Fig. 2. Privacy preservation versus maximum distortion tolerance in recommendation (D), shown for users 5, 10, 20, and 50

Fig. 3. Convergence of the iterative best-response strategy for different users (users 5, 10, and 20; maximum distortion tolerance D = 2)

ratings lie in the value interval [1, 5]. In Figure 2, we depict the privacy preservation metric for different users as a function of the maximum distortion tolerance D in recommendation. We observe that as the users' tolerance to recommendation distortion increases, the privacy preservation metric also increases. This confirms the tradeoff between privacy preservation and quality of recommendation: users that are less tolerant to errors in the recommendations they receive from the server have to reveal more information about their preferences.

Our approach is based on the iterative best-response process of data exchange between users and the recommendation server that was discussed in Section 3.

Fig. 4. Convergence of the iterative best-response strategy for different values of D (user 10; D = 2, 4, 6)

Figure 3 depicts and verifies the convergence of the privacy preservation iteration to the NEP for different users as the rating vector exchange process progresses. We then consider a specific user who varies the value of his/her maximum distortion tolerance in recommendation (D). Figure 4 shows that the privacy preservation iteration of a user converges to the NEP for different values of D as well. It is clear that after a small number of iterations, usually no more than 2–3, the system converges and the privacy preservation metric of a user at the NEP is determined. This fast convergence of the best-response update is a direct consequence of the linear programming type of problem that each user solves.

6 Conclusion
In this work, we took a first step towards characterizing the fundamental tradeoff
between privacy preservation and good quality recommendation. We introduced
a game theoretic framework for capturing the interaction and conflicting inter-
ests of users in the context of privacy preservation in recommendation systems.
Viewed abstractly from the perspective of each user, the privacy preservation
problem that arises in the process of deciding about the declared profile reduces
to that of placing the declared rating vector sufficiently far away from the actual,
private vector. The requirement that recommendation quality stay close to what would be achieved if the true profile were revealed places a constraint on the meaningful distance between the actual and declared profiles. Nevertheless, the key challenge is that the extent to which this constraint is satisfied depends on the declared profiles of the other users as well, who in turn face a similar profile vector placement problem. We attempted to capture this interaction, characterized the Nash Equilibrium Points, and proposed various modes of user cooperation.

References
1. Balabanovic, M., Shoham, Y.: Fab: Content-based collaborative recommendation.
Communications of the Association for Computing Machinery 40(3) (1997)
2. Berkovsky, S., Eytani, Y., Kuflik, T., Ricci, F.: Enhancing privacy and preserving
accuracy of a distributed collaborative filtering. In: Proc. of ACM RecSys (2007)
3. Canny, J.: Collaborative filtering with privacy. In: IEEE Symposium on Security
and Privacy (2002)
4. Cotter, P., Smyth, B.: PTV: Intelligent personalized tv guides. In: Proc. of
AAAI/IAAI (2002)
5. Kargupta, H., Das, K., Liu, K.: A game theoretic approach toward multi-party
privacy-preserving distributed data mining. In: Proc. of PKDD (2007)
6. Lathia, N., Hailes, S., Capra, L.: Private distributed collaborative filtering using
estimated concordance measures. In: Proc. of ACM RecSys (2007)
7. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item col-
laborative filtering. IEEE Internet Computing 7(1) (2003)
8. Melville, P., Mooney, R.J., Nagarajan, R.: Content-boosted collaborative filtering
for improved recommendations. In: Proc. of the National Conference on Artificial
Intelligence (2002)
9. Melville, P., Sindhwani, V.: Recommender Systems, Encyclopedia of Machine
Learning. Springer, Heidelberg (2010)
10. Miller, B., Konstan, J.A., Riedl, J.: Pocketlens: Toward a personal recommender
system. ACM Transactions on Information Systems 22(3) (2004)
11. Mooney, R.J., Roy, L.: Content-based book recommending using learning for text
categorization. In: Proc. of ACM Conf. on Digital Libraries (2000)
12. Nisan, N., Roughgarden, T., Tardos, E., Vazirani, V.V.: Algorithmic Game Theory,
Cambridge University Press (2007)
13. Polat, H., Du, W.: Privacy-preserving collaborative filtering using randomized per-
turbation techniques. In: Proc. of Inter. Conf. on Data Mining, ICDM (2003)
14. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An
open architecture for collaborative filtering of netnews. In: Proc. of the Computer
Supported Cooperative Work Conference (1994)
15. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering
recommendation algorithms. In: Proc. of the Inter. Conf. of WWW (2001)
16. Shokri, R., Pedarsani, P., Theodorakopoulos, G., Hubaux, J.P.: Preserving privacy
in collaborative filtering through distributed aggregation of offline profiles. In: Proc.
of ACM RecSys (2009)
17. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Ad-
vances in Artificial Intelligence (2009)
